Data Visualization in Python With Libraries
Data visualization is the process of finding trends and correlations in data by
representing it pictorially. To perform data visualization in Python, we can use
various Python data visualization modules such as Matplotlib, Seaborn, and Plotly.
In this article, The Complete Guide to Data Visualization in Python, we will discuss
how to work with some of these modules and cover the following topics in detail:
Line Charts
Bar Graphs
Histograms
Scatter Plots
Heat Maps
Data visualization is a field in data analysis that deals with the visual
representation of data. It plots data graphically and is an effective way to
communicate inferences drawn from data.
Using data visualization, we can get a visual summary of our data. With pictures,
maps and graphs, the human mind has an easier time processing and understanding
any given data. Data visualization plays a significant role in the representation of
both small and large data sets, but it is especially useful when we have large data
sets, in which it is impossible to see all of our data, let alone process and understand
it manually.
Python offers several plotting libraries, such as Matplotlib and Seaborn, along with
many other data visualization packages. Each has different features for creating
informative, customized, and appealing plots that present data in the simplest and
most effective way.
Matplotlib and Seaborn are Python libraries that are used for data visualization. They
have built-in modules for plotting different graphs. While Matplotlib is used to embed
graphs into applications, Seaborn is primarily used for statistical graphs.
But when should we use either of the two? Let’s understand this with the help of a
comparative analysis. The table below provides a comparison between Python’s two
well-known visualization packages, Matplotlib and Seaborn.
Matplotlib                                  | Seaborn
It mainly works with datasets and arrays.   | It works with entire datasets.
Let's consider the apple yield (tons per hectare) in Kanto. Let's plot a line graph
using this data and see how the yield of apples changes over time. We start by
importing Matplotlib and Seaborn.
Using Matplotlib
To better understand the graph and its purpose, we can add the x-axis values too.
Figure 4: Axis values
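A minimal sketch of such a line chart follows. The article's own yield figures are not reproduced here, so the values in `years` and `yield_apples` are hypothetical placeholders, and the axis labels are just one reasonable choice.

```python
import matplotlib.pyplot as plt

# Hypothetical data for illustration; not the article's actual figures
years = [2010, 2011, 2012, 2013, 2014, 2015]
yield_apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

# Passing two sequences puts the years on the x-axis and the yields on the y-axis
lines = plt.plot(years, yield_apples)

# Label both axes so the reader knows what each one shows
plt.xlabel("Year")
plt.ylabel("Yield (tons per hectare)")
```

In a script, calling plt.show() afterwards displays the figure; notebooks usually render it automatically.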
To plot multiple datasets on the same graph, just use the plt.plot function once for
each dataset. Let's use this to compare the yields of apples vs. oranges on the same
graph.
Figure 6: Plotting multiple graphs
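As a sketch of this idea, the snippet below calls plt.plot once per dataset. The apple and orange yields are hypothetical numbers chosen only to make the example run.

```python
import matplotlib.pyplot as plt

# Hypothetical yields (tons per hectare), for illustration only
years = range(2000, 2006)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# Each call to plt.plot adds another line to the same axes
plt.plot(years, apples)
plt.plot(years, oranges)
plt.xlabel("Year")
plt.ylabel("Yield (tons per hectare)")
```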
We can add a legend which tells us what each line in our graph means. To
understand what we are plotting, we can add a title to our graph.
Figure 7: Plotting multiple graphs
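A legend and a title could be added roughly as follows. The data and the title text are hypothetical; plt.legend receives the labels in the same order in which the lines were plotted.

```python
import matplotlib.pyplot as plt

# Hypothetical yields (tons per hectare), for illustration only
years = range(2000, 2006)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

plt.plot(years, apples)
plt.plot(years, oranges)

# The title says what the chart is about; the legend names each line
plt.title("Crop Yields in Kanto")
legend = plt.legend(["Apples", "Oranges"])
```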
To show each data point on our graph, we can highlight them with markers using the
marker argument. Matplotlib provides many different marker shapes, such as a circle,
cross, square, and diamond.
Figure 8: Using markers
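For example, the sketch below marks the apple data with circles ("o") and the orange data with crosses ("x"). The yield values are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical yields (tons per hectare), for illustration only
years = range(2000, 2006)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# The marker argument draws a symbol at every data point
apple_line, = plt.plot(years, apples, marker="o")   # circles
orange_line, = plt.plot(years, oranges, marker="x")  # crosses
plt.legend(["Apples", "Oranges"])
```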
You can use the plt.figure function to change the size of the figure.
Figure 9: Changing graph size
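A short sketch: plt.figure is called before plotting, and its figsize argument takes the width and height in inches. The size (12, 6) and the data are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical yields (tons per hectare), for illustration only
years = range(2000, 2006)
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# Create a 12 x 6 inch figure first; later plotting calls draw into it
fig = plt.figure(figsize=(12, 6))
plt.plot(years, oranges, marker="o")
plt.title("Yield of Oranges (tons per hectare)")
```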
Using Seaborn
An easy way to make your charts look beautiful is to use some default styles from
the Seaborn library. These can be applied globally using the sns.set_style function.
Figure 10: Using Seaborn
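As an illustration, the sketch below applies Seaborn's "whitegrid" style before drawing an ordinary Matplotlib line chart. The choice of "whitegrid" and the data are assumptions made for this example.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Apply a Seaborn style globally; it affects plots created afterwards
sns.set_style("whitegrid")

# Hypothetical yields (tons per hectare), for illustration only
years = range(2000, 2006)
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]

plt.plot(years, apples, marker="o")
plt.title("Yield of Apples (tons per hectare)")
```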
We can also use the darkgrid option to change the background color to a darker
shade.
Figure 11: Using darkgrid in Seaborn
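The same pattern works with the "darkgrid" style, sketched here with hypothetical data.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# "darkgrid" gives the axes a shaded background with grid lines
sns.set_style("darkgrid")

# Hypothetical yields (tons per hectare), for illustration only
years = range(2000, 2006)
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

plt.plot(years, oranges, marker="o")
plt.title("Yield of Oranges (tons per hectare)")
```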
Bar Graphs
When you have categorical data, you can represent it with a bar graph. A bar graph
shows each category on the x-axis as a bar whose height represents that category's
value on the y-axis.
Figure 12: Plotting Bar graphs
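A minimal sketch using plt.bar, with years as the categories and hypothetical orange yields as the values:

```python
import matplotlib.pyplot as plt

# Hypothetical yields (tons per hectare), for illustration only
years = [2000, 2001, 2002, 2003, 2004, 2005]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# One bar per year; the bar height is the yield for that year
bars = plt.bar(years, oranges)
plt.xlabel("Year")
plt.ylabel("Yield (tons per hectare)")
```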
We can also stack bars on top of each other. Let's plot the data for apples and
oranges.
Figure 13: Plotting stacked bar graphs
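One way to stack bars in Matplotlib is the bottom argument of plt.bar, which sets where each bar starts. In the sketch below the orange bars start at the top of the apple bars; the data is hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical yields (tons per hectare), for illustration only
years = [2000, 2001, 2002, 2003, 2004, 2005]
apples = [0.895, 0.91, 0.919, 0.926, 0.929, 0.931]
oranges = [0.962, 0.941, 0.930, 0.923, 0.918, 0.908]

# Draw the apples first, then start each orange bar where the apple bar ends
apple_bars = plt.bar(years, apples)
orange_bars = plt.bar(years, oranges, bottom=apples)
plt.legend(["Apples", "Oranges"])
```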
Let’s use the tips dataset in Seaborn next. The dataset consists of:
Time of day
Total bill
We can draw a bar chart to visualize how the average bill amount varies across
different days of the week. We can do this by computing the day-wise averages and
then using plt.bar. The Seaborn library also provides a barplot function that can
automatically compute averages.
Figure 15: Plotting averages of each bar
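The sketch below illustrates both approaches. To stay self-contained it uses a tiny hand-made stand-in for the tips data (the real dataset is normally obtained with sns.load_dataset("tips"), which downloads it); the rows and the name `tips_df` are hypothetical.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the tips dataset, for illustration only
tips_df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 15.0, 25.0, 30.0, 40.0],
})

# Manual approach: compute the day-wise averages yourself
day_means = tips_df.groupby("day", sort=False)["total_bill"].mean()

# Seaborn approach: barplot aggregates each category with the mean by default
ax = sns.barplot(x="day", y="total_bill", data=tips_df)
```

The `day_means` series could equally be passed to plt.bar; sns.barplot additionally draws an error bar on each bar.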
If you want to compare bar plots side-by-side, you can use the hue argument. The
comparison will be done based on the third feature specified in this argument.
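A sketch of the hue argument, again on a small hypothetical stand-in for the tips data. Here the "sex" column is assumed as the third feature, so each day gets one bar per value of that column.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the tips dataset, for illustration only
tips_df = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Thur", "Fri", "Fri", "Fri", "Fri"],
    "sex": ["Male", "Female", "Male", "Female"] * 2,
    "total_bill": [12.0, 10.0, 18.0, 14.0, 22.0, 16.0, 26.0, 20.0],
})

# hue splits every day into side-by-side bars, one per value of "sex"
ax = sns.barplot(x="day", y="total_bill", hue="sex", data=tips_df)
```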
Histograms
A histogram is a bar representation of data that varies over a range. It divides the
range of values into intervals (bins) along the x-axis and plots the number of data
points falling within each interval along the y-axis. Let's again use the ‘Iris’
data, which contains information about flowers, to plot histograms.
Figure 18: Iris dataset
We can change the number and size of bins using NumPy too.
Figure 21: Changing number and size of bins
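As a sketch, np.arange can generate the bin edges that are passed to plt.hist. The sepal widths below are a few hypothetical values standing in for the Iris data, and the bin width of 0.25 is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sepal widths (cm) standing in for the Iris data
sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 2.9, 2.3, 2.8, 3.3, 2.7]

# Bin edges from 2.0 up to (but not including) 5.0 in steps of 0.25
bin_edges = np.arange(2, 5, 0.25)

# plt.hist returns the count per bin, the edges used, and the bar patches
counts, edges, patches = plt.hist(sepal_width, bins=bin_edges)
plt.xlabel("Sepal width")
```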
Similar to line charts, we can draw multiple histograms in a single chart. We can
reduce each histogram's opacity so that one histogram's bars don't hide the others'.
Let's draw separate histograms for each species of flowers.
Figure 23: Multiple histograms
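A sketch with two species: each call to plt.hist uses the same bins, and the alpha argument makes the bars semi-transparent. The measurements are hypothetical stand-ins for the Iris data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sepal widths (cm) for two species, for illustration only
setosa = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4]
versicolor = [3.2, 3.2, 3.1, 2.3, 2.8, 2.8, 3.3, 2.4]

bin_edges = np.arange(2, 5, 0.25)

# alpha=0.4 keeps overlapping bars visible through one another
_, _, setosa_bars = plt.hist(setosa, alpha=0.4, bins=bin_edges)
_, _, versicolor_bars = plt.hist(versicolor, alpha=0.4, bins=bin_edges)
plt.legend(["Setosa", "Versicolor"])
```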
Multiple histograms can be stacked on top of one another by setting the stacked
parameter to True.
Figure 24: Stacking histograms
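For stacking, the datasets are passed to a single plt.hist call as a list. The sketch below uses hypothetical measurements for two species.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sepal widths (cm) for two species, for illustration only
setosa = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4]
versicolor = [3.2, 3.2, 3.1, 2.3, 2.8, 2.8, 3.3, 2.4]

# stacked=True draws the second dataset's bars on top of the first's
counts, edges, _ = plt.hist(
    [setosa, versicolor],
    bins=np.arange(2, 5, 0.25),
    stacked=True,
)
plt.legend(["Setosa", "Versicolor"])
```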
Scatter Plots
Scatter plots are used to plot two or more variables against each other, with each
observation drawn as a point at its coordinates. The points are scattered across the
graph rather than confined to a range, and a further variable, such as a category,
can be represented by color. Let's use the ‘Iris’ dataset to plot a scatter plot.
Let’s try plotting the data with the help of a line chart.
This is not very informative. We cannot figure out the relationship between different
data points.
Figure 28: Scatter plot
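A scatter plot like this could be drawn with Seaborn's scatterplot function, as sketched below. The small `flowers_df` frame is a hypothetical stand-in for the Iris data; its column names mirror those of Seaborn's iris dataset.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the Iris data, for illustration only
flowers_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1],
    "species": ["setosa"] * 3 + ["versicolor"] * 3,
})

# Each row becomes one dot at (sepal_length, sepal_width)
ax = sns.scatterplot(x="sepal_length", y="sepal_width", data=flowers_df)
```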
This is much better. But we still cannot distinguish data points that belong to
different categories. We can color the dots using the flower species as a hue.
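In a sketch, this only requires passing the species column as hue. The data frame is again a hypothetical stand-in for the Iris data.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the Iris data, for illustration only
flowers_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1],
    "species": ["setosa"] * 3 + ["versicolor"] * 3,
})

# hue assigns one color per species and adds a matching legend
ax = sns.scatterplot(
    x="sepal_length", y="sepal_width", hue="species", data=flowers_df
)
```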
Since Seaborn uses Matplotlib's plotting functions internally, we can use functions
like plt.figure and plt.title to modify the figure.
Figure 30: Changing dimensions of scatter plot
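As a sketch, the figure is created and titled with Matplotlib first, and Seaborn then draws into it. The figure size, the title text, the dot size `s=100`, and the data are all illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the Iris data, for illustration only
flowers_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1],
    "species": ["setosa"] * 3 + ["versicolor"] * 3,
})

# Matplotlib controls the figure; Seaborn plots onto the current axes
plt.figure(figsize=(12, 6))
plt.title("Sepal Dimensions")
ax = sns.scatterplot(
    x="sepal_length", y="sepal_width", hue="species", s=100, data=flowers_df
)
```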
Heat Maps
Heatmaps are used to see changes in behavior or gradual changes in data. They use
different colors to represent different values, and the way these colors vary in
hue and intensity tells us how the phenomenon varies. Let's use heatmaps to
visualize monthly passenger footfall at an airport over 12 years from the flights
dataset in Seaborn.
Figure 31: Flights dataset
The above dataset, flights_df, shows us the monthly footfall in an airport for each
year, from 1949 to 1960. The values represent the number of passengers (in
thousands) that passed through the airport. Let’s use a heatmap to visualize the
above data.
Figure 32: Plotting heatmap
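A heatmap of this kind can be sketched with sns.heatmap, which expects a table with one row per month and one column per year. The small `flights_df` below is a hypothetical three-month, three-year stand-in, not the full dataset.

```python
import pandas as pd
import seaborn as sns

# Hypothetical passenger counts (in thousands), for illustration only:
# rows are months, columns are years
flights_df = pd.DataFrame(
    {1949: [112, 118, 132], 1950: [115, 126, 141], 1951: [145, 150, 178]},
    index=["Jan", "Feb", "Mar"],
)

# Each cell is colored according to its value
ax = sns.heatmap(flights_df)
ax.set_title("No. of Passengers (1000s)")
```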
The brighter the color, the higher the footfall at the airport. By looking at the graph,
we can infer that:
1. The annual footfall for any given year is highest around July and August.
2. The footfall grows annually. Any month in a year will have a higher footfall
than the same month in previous years.
Let's display the actual values in our heatmap and change the hue to blue.
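A sketch of both changes: annot=True writes the value into each cell, fmt="d" formats it as an integer, and cmap="Blues" selects a blue color map. The data is the same hypothetical stand-in for the flights table, not the full dataset.

```python
import pandas as pd
import seaborn as sns

# Hypothetical passenger counts (in thousands), for illustration only
flights_df = pd.DataFrame(
    {1949: [112, 118, 132], 1950: [115, 126, 141], 1951: [145, 150, 178]},
    index=["Jan", "Feb", "Mar"],
)

# annot shows the numbers, fmt="d" prints them as integers,
# and cmap="Blues" switches the palette to shades of blue
ax = sns.heatmap(flights_df, annot=True, fmt="d", cmap="Blues")
```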