Data Visualization - U5
Seaborn helps you explore and understand your data. Its plotting functions operate on
dataframes and arrays containing whole datasets and internally perform the necessary
semantic mapping and statistical aggregation to produce informative plots. Its
dataset-oriented, declarative API lets you focus on what the different elements of your
plots mean, rather than on the details of how to draw them.
Overview - Seaborn
Seaborn is a Python data visualisation library that simplifies the process of
constructing attractive and informative statistical graphics. Simply put, it helps make
your data both understandable and visually appealing. It is built on Matplotlib, another
well-known plotting package, but offers a higher-level interface that makes meaningful
statistical visualisations easier to create.
Seaborn's appeal comes from its capacity to generate complex visualizations with only
a few lines of code. Its concise syntax lets you produce elaborate statistical graphs
without delving into the low-level details of plotting.
It goes beyond creating visually appealing graphs; it also incorporates statistical
estimates into the visualizations. In a scatter plot, for example, it can automatically
fit and draw a linear regression line, revealing the underlying trend in your data. This
combination of visualization and statistical analysis is what Seaborn is known for, and
it is why the library suits anyone who wants to extract useful information from their data.
Installing Seaborn is simple on every major operating system. Here is a brief guide to
get you started. The same pip command is used on each platform.
Windows: Open the command prompt and type the command:
pip install seaborn
macOS: Open the terminal and enter the installation command:
pip install seaborn
Linux: Open the terminal and type:
pip install seaborn
Barplot
A bar plot gives an estimate of the central tendency for a numeric variable with the
height of each rectangle. It provides some indication of the uncertainty around that
estimate using error bars. To build this plot, you usually choose a categorical column on
the x-axis and a numerical column on the y-axis.
In the above plot, the barplot() function takes the cylinder (cyl) column for the x-axis
and the carburetors (carb) column for the y-axis. The code depicted below is another way
to create the same bar plot.
Here you explicitly define the x and y-axis columns and also pass the name of the data
frame using the data argument.
Seaborn also lets you assign colors to the bars. The bar chart below draws all the bars
in yellow.
Seaborn also has the palette parameter, which you can use to give the bars different
colors. In the example below, the bar plot uses palette = 'rocket'.
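The bar plot variants described above can be sketched as follows. This is a minimal illustration, not the original notebook code: the small `cars` frame is a hand-typed stand-in with only the cyl and carb columns, and the `Agg` backend line is an assumption so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-typed stand-in for mtcars: only cyl and carb, a few illustrative rows
cars = pd.DataFrame({
    "cyl":  [4, 4, 4, 6, 6, 6, 8, 8, 8],
    "carb": [1, 2, 2, 4, 4, 1, 2, 4, 3],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# x, y and data passed explicitly; bar height is the mean of carb within each cyl group
sns.barplot(x="cyl", y="carb", data=cars, ax=axes[0])

# color= applies one color to every bar
sns.barplot(x="cyl", y="carb", data=cars, color="yellow", ax=axes[1])

# palette= draws from a named palette; hue="cyl" ties one palette color to each bar
sns.barplot(x="cyl", y="carb", data=cars, hue="cyl", palette="rocket",
            dodge=False, ax=axes[2])

fig.tight_layout()
```

The error bars drawn on each bar come from Seaborn's default bootstrap estimate, so they can differ slightly between runs.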
Countplot
The countplot() function in the Seaborn library shows the number of observations in
each category using bars. The count plot below shows the number of vehicles for each
category of cylinders.
The next count plot shows the number of cars for each carburetor count.
Seaborn also allows you to create horizontal count plots, where the feature column is
on the y-axis and the count is on the x-axis. The visualization below shows the count of
cars for each category of gear.
From the above plot, you can see that we have 15 vehicles with 3 gears, 12 vehicles with
4 gears, and 5 vehicles with 5 gears. You can also create a grouped count plot using
the hue parameter. The hue parameter accepts the column name for color encoding. In the
count plot below, you have the count of cars for each category of gears, grouped by the
number of cylinders.
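A short sketch of the three count plot forms discussed above (vertical, horizontal, and grouped with hue). The `cars` frame here is a small hand-typed stand-in, so its counts are illustrative and do not match the 15/12/5 split of the full dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative rows only, not the full mtcars table
cars = pd.DataFrame({
    "cyl":  [4, 4, 6, 6, 8, 8, 8, 4],
    "gear": [4, 4, 4, 3, 3, 3, 5, 5],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Vertical count plot: one bar per cylinder category
sns.countplot(x="cyl", data=cars, ax=axes[0])

# Horizontal count plot: the feature goes on y, the count appears on x
sns.countplot(y="gear", data=cars, ax=axes[1])

# Grouped count plot: bars for each gear value are split by cylinder count
sns.countplot(x="gear", hue="cyl", data=cars, ax=axes[2])

fig.tight_layout()
```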
Distribution Plot
The Seaborn library supports the distplot() function, which plots the distribution of
continuous data. Note that distplot() is deprecated in recent Seaborn releases in favour
of histplot() and displot(). The example below plots the distribution of miles per gallon
(mpg) for the different vehicles. The mpg metric measures the distance a car can travel
per gallon of fuel.
Heatmap
Heatmaps in the Seaborn library let you visualize matrix-like data. The values of the
variables are contained in a matrix and are represented as colors. Below is an example
of a heatmap that shows the correlation between each pair of variables in the
mtcars dataset.
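A correlation heatmap of this kind can be sketched as below. The frame holds only three columns and six hand-typed rows for illustration, and the `coolwarm` colormap and the fixed -1 to 1 color range are choices made for this sketch, not something the text prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative subset: miles per gallon, horsepower, weight (1000 lbs)
cars = pd.DataFrame({
    "mpg": [21.0, 22.8, 21.4, 18.7, 18.1, 14.3],
    "hp":  [110, 93, 110, 175, 105, 245],
    "wt":  [2.620, 2.320, 3.215, 3.440, 3.460, 3.570],
})

# Pairwise Pearson correlations form the matrix that the heatmap colors
corr = cars.corr()

fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```

With annot=True each cell also prints its correlation value, which helps when colors alone are hard to compare.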
Scatterplot
The Seaborn scatterplot() function helps you create plots that show the relationship
between two continuous variables.
The scatter plot and the plotting functions that follow use the Iris flower dataset,
so go ahead and load it.
The scatter plot below shows the relationship between sepal length and petal length for
different species of iris flowers.
Now, you can distinguish the different species of flowers by passing "species" as the
hue parameter of the function. From the plot below, you can easily differentiate the
three types of iris flowers based on their sepal length and petal length.
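The hue-based scatter plot can be sketched as follows. To stay self-contained, the example types in a handful of iris rows by hand; in practice sns.load_dataset("iris") returns the full dataset but needs network access on first use.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# A few hand-typed iris rows (three per species), for illustration only
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

fig, ax = plt.subplots()

# hue="species" colors each point by its species and adds a legend
sns.scatterplot(x="sepal_length", y="petal_length", hue="species", data=iris, ax=ax)
```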
Pairplot
The Seaborn library lets you visualize data using pair plots, which produce a matrix
of relationships between each pair of variables in the dataset.
In the plot below, the diagonal panels are histograms that represent the distribution of
each feature, and the remaining panels are scatter plots.
When you pass the hue parameter, the diagonal panels become KDE plots while the rest stay
scatter plots, colored by category. This makes it easier to tell each type of flower apart.
Linear Regression Plot
The lmplot() function in the Seaborn library draws the linear relationship between two
continuous variables as determined through regression. The plot below shows the
relationship between petal length and petal width for the different species of iris flowers.
The hue parameter differentiates between the species, and you can set a different marker
for each species.
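A regression plot with hue and per-species markers might look like the sketch below. The `flowers` frame is synthetic (the linear relation and noise levels are made up), and the marker shapes are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic petal measurements with a roughly linear relation (not real data)
rng = np.random.default_rng(1)
frames = []
for name, center in {"setosa": 1.5, "versicolor": 4.3, "virginica": 5.6}.items():
    petal_length = rng.normal(center, 0.3, 25)
    frames.append(pd.DataFrame({
        "petal_length": petal_length,
        "petal_width": 0.4 * petal_length + rng.normal(0, 0.1, 25),
        "species": name,
    }))
flowers = pd.concat(frames, ignore_index=True)

# One regression line per species; markers must list one shape per hue level
g = sns.lmplot(x="petal_length", y="petal_width", hue="species",
               markers=["o", "s", "D"], data=flowers)
```

lmplot() is a figure-level function, so it creates its own figure and returns a FacetGrid rather than drawing on an existing axes.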
Boxplot
A boxplot, also known as a box and whisker plot, depicts the distribution of quantitative
data. The box represents the quartiles of the dataset. The whiskers show the rest of the
distribution, except for the outlier points. The boxplot below shows the distribution of
sepal width for the three species of iris flowers.
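A minimal boxplot sketch for the sepal-width comparison described above. The rows are a few hand-typed iris measurements (five per species) used for illustration, so the boxes will not match a plot of the full dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Five illustrative sepal-width values per species
iris = pd.DataFrame({
    "species": ["setosa"] * 5 + ["versicolor"] * 5 + ["virginica"] * 5,
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6,
                    3.2, 3.2, 3.1, 2.3, 2.8,
                    3.3, 2.7, 3.0, 2.9, 3.0],
})

fig, ax = plt.subplots()

# One box per species; the line inside each box marks the median
sns.boxplot(x="species", y="sepal_width", data=iris, ax=ax)

# The same medians computed directly, for comparison with the plot
medians = iris.groupby("species")["sepal_width"].median()
```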
Matplotlib vs Seaborn

Functionality
Matplotlib: Utilized for making basic graphs. Datasets are visualized with bar graphs,
histograms, pie charts, scatter plots, lines, etc.
Seaborn: Contains several patterns and plots for data visualization. Uses fascinating
themes. Helps compile whole data into a single plot. Provides data distribution.

Dealing with Multiple Figures
Matplotlib: Can open and use multiple figures simultaneously, but they are closed
distinctly.
Seaborn: Creates figures for you as plots are drawn; figures that are left open can
accumulate and potentially lead to out-of-memory issues.
Closing figures: matplotlib.pyplot.close() (one figure) and
matplotlib.pyplot.close("all") (all figures).

Visualization
Matplotlib: Well-connected with NumPy and Pandas. Acts as a graphics package for data
visualization in Python. Pyplot provides similar features and syntax as in MATLAB.
Seaborn: More comfortable handling Pandas data frames. Uses basic methods to provide
beautiful graphics in Python.

Data Frames and Arrays
Matplotlib: Works efficiently with data frames and arrays. Treats figures and axes as
objects. Stateful APIs allow plot() methods to work without parameters.
Seaborn: More functional and organized, treats the whole dataset as a single unit. Less
stateful, requires parameters for methods like plot().
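The two close() calls mentioned in the comparison can be tried out with the short sketch below. It opens three empty figures and counts how many remain after each call; the `Agg` backend line is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

plt.close("all")  # start from a clean state

# Open three figures at once
fig1, ax1 = plt.subplots()
fig2, ax2 = plt.subplots()
fig3, ax3 = plt.subplots()
open_before = len(plt.get_fignums())

# Close a single figure (plt.close() with no argument closes the current one)
plt.close(fig1)
open_after_one = len(plt.get_fignums())

# Close every remaining figure
plt.close("all")
open_after_all = len(plt.get_fignums())

print(open_before, open_after_one, open_after_all)
```

Closing figures once they have been saved or displayed keeps memory use flat when plots are created in a loop.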
For example, a residential area in a city having more expensive properties will likely have
people with higher incomes, and they will spend higher amounts of money.
Applications and Uses of Spatial Analysis
● Spatial analysis can be used to map natural resources, track weather phenomena
like rainfall, snow, humidity, air pressure, etc.
● Commercial data such as sales can be plotted on maps to analyze the most
profitable locations and make better decisions.
● Urban planning and city planning can take the help of map-based analysis
techniques to understand the growing population’s electricity and water needs.
● With the data plotted on a map, one can determine which regions need an urgent
upgrade and more supply, and all aspects of urban planning can be done easily
with proper geospatial analysis.
Plotting locations of interest in folium is very easy when you know their map
coordinates.
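import folium

# Assumption: the notes do not show how `kol` was created, so a map centred on
# Kolkata is used here.
kol = folium.Map(location=[22.55790780507432, 88.35087264462007], zoom_start=13)

# Victoria Memorial and Eden Gardens
tooltip_1 = "This is Victoria Memorial"
tooltip_2 = "This is Eden Gardens"
folium.Marker(
    [22.54472, 88.34273], popup="Victoria Memorial", tooltip=tooltip_1
).add_to(kol)
folium.Marker(
    [22.56487826917627, 88.34336378854425], popup="Eden Gardens", tooltip=tooltip_2
).add_to(kol)
kol  # display the map in a notebook

# Indian Museum, shown with a red info icon
folium.Marker(
    location=[22.55790780507432, 88.35087264462007],
    popup="Indian Museum",
    icon=folium.Icon(color="red", icon="info-sign"),
).add_to(kol)
kol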
Here are the results of the above code.
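# Note: recent folium releases may no longer bundle the Stamen tile styles. If this
# call fails, try a built-in style such as tiles="CartoDB positron".
kol2 = folium.Map(location=[22.55790780507432, 88.35087264462007],
                  tiles="Stamen Toner", zoom_start=13)
kol2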
Output:
Markers label and identify places on the map, so any particular point of interest can
be marked with one.
#adding circle
folium.Circle(
location=[22.585728381244373, 88.41462932675563],
radius=1500,
popup="Salt Lake",
color="blue",
fill=True,
).add_to(kol2)
folium.Circle(
location=[22.56602918189088, 88.36508424354102],
radius=2000,
popup="Old Kolkata",
color="red",
fill=True,
).add_to(kol2)
kol2
Output:
The map can be panned and is interactive. With real-life data, circles can be used for
zoning and for marking zones.
# Create a map
india = folium.Map(location=[20.180862078886562, 78.77642751195584],
tiles='openstreetmap', zoom_start=5)
india
To choose any specific place on the map, we can change the coordinates and edit the
zoom_start parameter.
Output:
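# `loc` is assumed to be a list of [latitude, longitude] pairs defined earlier.
# PolyLine takes `opacity` for the line; `line_opacity` belongs to folium.Choropleth.
folium.PolyLine(locations=loc, opacity=0.5).add_to(india)
india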
Output:
In this way, we can plot some basic data based on coordinates.
Next, we work with a Kaggle dataset that contains the population centres of the Indian
states as per the 2011 census data. Let us proceed.
import pandas as pd

df_state = pd.read_csv("/kaggle/input/indian-census-data-with-geospatial-indexing/state wise centroids_2011.csv")
df_state.head()
Output:
The data has 35 entries, which we now plot.
#creating a new map for India, for all states population centres to be plotted
# Create a map
india2 = folium.Map(location=[20.180862078886562, 78.77642751195584],
tiles='openstreetmap', zoom_start=4.5)
#adding the markers
india2
Output:
The plot is generated, and the location of each of the markers is the population centre
for the respective state/UT.
Applications
● Supply Chain & Logistics – Tracks shipment routes, distribution hubs, and
delivery statuses for better route optimization.
● Public Health & Epidemiology – Maps disease outbreaks, vaccination rates, and
healthcare facility locations for informed decision-making.
● Urban Planning & Smart Cities – Helps city planners analyze infrastructure
projects, transportation networks, and urban development.
Now that you are familiar with folium, let us use it for our next case study, described
below:
Case Study: An e-commerce company wants to get into logistics with "Deliver4U". It wants
to know the pattern of maximum pickup calls from different areas of the city
throughout the day. This will help it:
i) Build the optimum number of stations where its pickup and delivery personnel will be
located.
ii) Ensure pickup personnel reach the pickup location at the earliest possible time.
For this, the company uses its existing customer data in Delhi to find the highest density
of probable pickup locations in the future.
Solution:
Let us now visualize the rides data using the HeatMap class from folium.plugins.
Code for reference:
from folium.plugins import HeatMap

# `df` (the rides data) and generateBaseMap() are assumed to be defined earlier
df_copy = df[df.month > 4].copy()
df_copy['count'] = 1
base_map = generateBaseMap()
heat_data = (
    df_copy[['pickup_latitude', 'pickup_longitude', 'count']]
    .groupby(['pickup_latitude', 'pickup_longitude'])
    .sum()
    .reset_index()
    .values.tolist()
)
HeatMap(data=heat_data, radius=8, max_zoom=13).add_to(base_map)
We can also animate our heat maps so that the displayed data changes along a time
dimension, such as the hour of the day. This can be done using HeatMapWithTime(). Use
the following code:
from folium.plugins import HeatMapWithTime

# Build one list of [latitude, longitude, count] rows per hour of the day
df_hour_list = []
for hour in df_copy.hour.sort_values().unique():
    df_hour_list.append(
        df_copy.loc[df_copy.hour == hour, ['pickup_latitude', 'pickup_longitude', 'count']]
        .groupby(['pickup_latitude', 'pickup_longitude']).sum()
        .reset_index().values.tolist()
    )

base_map = generateBaseMap(default_zoom_start=11)
HeatMapWithTime(
    df_hour_list, radius=5,
    gradient={0.2: 'blue', 0.4: 'lime', 0.6: 'orange', 1: 'red'},
    min_opacity=0.5, max_opacity=0.8, use_local_extrema=True,
).add_to(base_map)
base_map
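The shape of the per-hour list passed to HeatMapWithTime can be checked with a pandas-only sketch. The `rides` frame is hypothetical sample data; each element of `df_hour_list` is the list of [latitude, longitude, count] rows for one hour, in increasing hour order.

```python
import pandas as pd

# Hypothetical pickups tagged with the hour of the day
rides = pd.DataFrame({
    "hour": [8, 8, 8, 9, 9, 18],
    "pickup_latitude":  [28.6139, 28.6139, 28.7041, 28.5355, 28.5355, 28.6139],
    "pickup_longitude": [77.2090, 77.2090, 77.1025, 77.3910, 77.3910, 77.2090],
})
rides["count"] = 1

cols = ["pickup_latitude", "pickup_longitude", "count"]
df_hour_list = []
for hour in sorted(rides["hour"].unique()):
    # Keep only this hour's pickups, then sum the counts per location
    in_hour = rides.loc[rides["hour"] == hour, cols]
    per_location = in_hour.groupby(
        ["pickup_latitude", "pickup_longitude"], as_index=False
    ).sum()
    df_hour_list.append(per_location[cols].values.tolist())

print(len(df_hour_list))  # one entry per distinct hour
```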
Conclusion
Throughout the city, pickups are more probable from the central area, so it is better to
set up more pickup stations at these locations. By using maps, we can therefore highlight
trends, uncover patterns, and derive insights from the data.