vertopal.com_Python Seaborn Tutorial for Beginners v2
What is Seaborn?
• Python Data Visualization Library - based on Matplotlib (see previous tutorials)
• Used for plotting statistical graphs, identifying trends, relationships & outliers
• In my opinion, Seaborn is easier & faster to use (less code) than Matplotlib
# Visual representation of Seaborn
import os
from IPython.display import Image
PATH = "F:\\Github\\Python tutorials\\Introduction to Seaborn\\"
Image(filename = PATH + "Seaborn.png", width=900, height=900)
Tutorial Overview
• What is Seaborn and how/why it's used
• Trend Plots:
– Line Plots
• Summary Plots:
– Bar Plots
• Distribution of Data:
– Histogram
– Box Plots
• Relationship Plots
– Scatter Plots
– lmplot (combo of regplot() and FacetGrid)
• Holistic views / Combo:
– Sub Plots
– Pair Plots
– Joint Plots
• Correlation / Relationships:
– Heat Maps
Video 2:
1. Box Plots
2. Scatter Plots
3. lmplot (combo of regplot() and FacetGrid)
Video 3:
1. Sub Plots
2. Pair Plots
3. Joint Plots
4. Heat Maps
print(raw_data.shape)
(182, 11)
2. Line Graph
# Example 1 - Simple 1 line graph
# Assuming we want to investigate the Revenue by Date
# Example 2 - By Promo
ax = sns.lineplot(x='Week_ID', y='Revenue', hue = 'Promo', data = raw_data)
# Example 3 - By Promo with style
ax = sns.lineplot(x='Week_ID', y='Revenue', hue = 'Promo', style = 'Promo', data = raw_data)
# Example 4 - By Promo with style & Increase the size & Remove error bars
3. Bar Plots
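# Example 1 - Total Revenue by Month
# Notes:
# 1 - the lines on the bars signify the confidence interval (95% by default)
# 2 - barplot takes the mean by default, so each bar shows the mean (not the total) Revenue for the month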
   Month_ID       Revenue
0        11  11255.454545
1        12  11667.806452
2        13   9588.516129
3        14  10683.892857
4        15  10555.354839
5        16  10806.500000
6        17   7636.000000
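To make the "mean by default" note concrete, here is a minimal pandas sketch of the aggregation behind a table like the one above. The data frame is a made-up stand-in for raw_data, not the tutorial's data. To plot totals instead of means, sns.barplot accepts an estimator argument: estimator='sum' in seaborn 0.12+, or a callable such as numpy.sum.

```python
import pandas as pd

# Synthetic stand-in for raw_data (values are made up for illustration)
demo = pd.DataFrame({
    "Month_ID": [11, 11, 12, 12, 12],
    "Revenue": [100.0, 300.0, 50.0, 150.0, 400.0],
})

# Mean Revenue per month: the value sns.barplot draws by default
mean_by_month = demo.groupby("Month_ID", as_index=False)["Revenue"].mean()

# Total Revenue per month: what the bars would show with a sum estimator
total_by_month = demo.groupby("Month_ID", as_index=False)["Revenue"].sum()

print(mean_by_month)
print(total_by_month)
```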
4. Histogram
x = raw_data['Revenue'].values
# Example 3 - Investigating the distribution of Visitors, adding the mean
x = raw_data['Visitors'].values
5. Box Plots
# Example 1 - Investigating the distribution of Revenue
x = raw_data['Revenue'].values
ax = sns.boxplot(x=x)
# Notes:
# The line inside the box signifies the median
# The box in the middle spans from Q1 (25th percentile) to Q3 (75th percentile)
# The whiskers (left - right) extend to the lowest and highest values that lie within 1.5 x IQR of the box (the default), not to quartiles
# The dots on the right, beyond the whisker, are "outliers"
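The numbers behind a box plot can be sketched with NumPy alone. The helper below is illustrative (box_stats is a hypothetical name, not a seaborn function) and assumes the common 1.5 x IQR whisker rule, which is the matplotlib/seaborn default.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Return the quantities a box plot draws (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1                      # height of the box
    lower_fence = q1 - whis * iqr      # points below this are outliers
    upper_fence = q3 + whis * iqr      # points above this are outliers
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),   # whiskers stop at the last point inside the fences
        "whisker_high": inside.max(),
        "outliers": outliers.tolist(),
    }

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```

In this small example the value 50 lies above the upper fence, so it would be drawn as a dot while the right whisker stops at 9.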
6. ScatterPlots
raw_data.columns
7. lmplot
# Notes:
# What is Linear Regression: It is a predictive statistical method for modelling the relationship between x (independent variable) & y (dependent variable).
# How it works (cost function MSE): https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220
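As a rough illustration of the notes above, the sketch below fits a straight line by least squares with NumPy and computes the mean squared error (MSE) that the fit minimises. The x and y values are made up; this is not how seaborn is implemented internally, only the same idea in a few lines.

```python
import numpy as np

# Made-up data, roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Least-squares straight line: y ~ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# MSE of the fitted line (the cost function being minimised)
predictions = slope * x + intercept
mse = np.mean((y - predictions) ** 2)

# MSE of another candidate line, for comparison
mse_other = np.mean((y - (2.0 * x + 1.0)) ** 2)

print(slope, intercept, mse, mse_other)
```

The fitted line always has an MSE no larger than any other straight line on the same data, which is what "cost function MSE" refers to.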
# Example 2 - Relationship between Marketing Spend and Revenue - changing color, hue & Style
8. Subplots
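a = raw_data['Revenue'].values
b = raw_data['Visitors'].values
c = raw_data['Marketing Spend'].values
# Create a 2 x 2 grid of axes (assumes matplotlib.pyplot is imported as plt; the figsize is an assumption)
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Note: distplot is deprecated since seaborn 0.11; sns.histplot is its replacement
# plot 1 - distribution of Revenue
sns.distplot(a, color = 'blue', ax=axes[0,0])
# plot 2 - distribution of Visitors
sns.distplot(b, color = 'blue', ax=axes[0,1])
# plot 3 - distribution of Marketing Spend
sns.distplot(c, color = 'blue', ax=axes[1,0])
# plot 4 - Revenue by day: box plot with the individual points drawn on top
sns.boxplot(x="Revenue", y="Day_Name", data=raw_data, color = '#52F954', ax=axes[1,1])
sns.swarmplot(x="Revenue", y="Day_Name", data=raw_data, color="blue", ax=axes[1,1])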
9. Pairplots
# Example 1 - running on the whole dataframe - green color
g = sns.pairplot(raw_data, plot_kws={'color':'green'})
# Example 2 - running on specific columns - green color
g = sns.pairplot(raw_data[['Revenue','Visitors','Marketing Spend']], plot_kws={'color':'#0EDCA9'})
10. JointPlots
Draw a plot of two variables with bivariate and univariate graphs.
# Example 1 - Revenue vs marketing Spend Relationship with
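# x and y must be passed as keywords in recent seaborn; the old size argument is now called height
g = sns.jointplot(x="Revenue", y="Marketing Spend", data=raw_data, kind="reg", color = 'green', height = 10)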
11. Heat Map
# First we need to create a "Dataset" to display on a Heatmap - we will use a correlation dataset
# .corr() is used to find the pairwise correlation of all columns in the dataframe. Any null values are automatically excluded
# The closer to 1 or -1, the stronger the linear relationship. As one variable increases, the other variable tends to also increase (positive) / decrease (negative)
# More Info here: https://statisticsbyjim.com/basics/correlations/
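pc = raw_data[['Revenue','Visitors','Marketing Spend','Promo']].corr(method ='pearson')
# cols is not defined in this section; assumption: it holds the column names of pc, in order
cols = pc.columns.tolist()
ax = sns.heatmap(pc, annot=True,
                 yticklabels=cols,
                 xticklabels=cols,
                 annot_kws={'size': 50})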
ax = sns.heatmap(pc, annot=True,
yticklabels=cols,
xticklabels=cols,
annot_kws={'size': 50},
cmap="BuPu")
# Examples:
# cmap="YlGnBu"
# cmap="Blues"
# cmap="BuPu"
# cmap="Greens"
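To show what the correlation matrix behind the heat map contains, the sketch below computes one Pearson coefficient by hand and compares it with pandas' .corr(). The data frame is made up for illustration ("Returns" is a hypothetical column added to show a negative correlation), and pearson is a hypothetical helper, not a pandas or seaborn function.

```python
import numpy as np
import pandas as pd

# Made-up data; "Returns" is a hypothetical column that falls as Visitors rise
demo = pd.DataFrame({
    "Visitors": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Revenue": [12.0, 24.0, 33.0, 46.0, 55.0],
    "Returns": [9.0, 7.0, 6.0, 3.0, 1.0],
})

def pearson(u, v):
    """Pearson correlation: covariance divided by the product of the spreads."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    du = u - u.mean()
    dv = v - v.mean()
    return (du * dv).sum() / np.sqrt((du ** 2).sum() * (dv ** 2).sum())

# Pairwise correlation matrix, as passed to sns.heatmap
pc = demo.corr(method="pearson")

# The same coefficient for one pair, computed by hand
manual = pearson(demo["Visitors"], demo["Revenue"])

print(pc)
print(manual)
```

Values near 1 mean the two columns rise together, values near -1 mean one falls as the other rises, and the diagonal is always 1 because each column is compared with itself.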