Basic Applied Visualizations: Data Visualization
• Tableau
• Power BI
• Matplotlib
• Seaborn
• Plotly
• D3.js
Data visualization process:
• Pictogram
• Pie chart
• Bar charts
Characteristics of 1D techniques:
Types of pictograms:
Since each book picture represents two books and the
pictograph shows two book pictures for Student A,
Student A has 2 × 2 = 4 books in total. Similarly, Student B has
4 × 2 = 8 books, Student C has 3 × 2 = 6 books and Student
D has 4 × 2 = 8 books.
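The same reading can be sketched in Python. The picture counts are taken from the multiplications above; the names `pictures` and `BOOKS_PER_PICTURE` are illustrative and not part of the original example.

```python
# Each book picture in the pictograph stands for two books (the key).
BOOKS_PER_PICTURE = 2

# Number of book pictures shown for each student.
pictures = {"A": 2, "B": 4, "C": 3, "D": 4}

# Multiply the picture count by the key to recover the actual totals.
books = {student: count * BOOKS_PER_PICTURE for student, count in pictures.items()}

for student, total in books.items():
    print(f"Student {student}: {total} books")
```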
Example 2
Example 1 :
To read a pie chart, first look at the data it presents. If the
data is given in percentages, convert each percentage into an
actual count before analyzing and interpreting the data. Let's
take a look at an example in order to learn how to interpret
pie charts.
300 = 75 people.
300 = 15 people.
Number of people who like sci-fi = 20/100 × 300 = 60 people.
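A minimal sketch of this conversion, assuming a survey of 300 people as in the example. Only the sci-fi share (20%) comes from the text; the helper name `percent_to_count` is hypothetical.

```python
def percent_to_count(percent, total):
    """Convert a pie-chart percentage into an absolute count."""
    return percent * total / 100

TOTAL_PEOPLE = 300

# 20% of the pie is sci-fi, so 20/100 x 300 people like sci-fi.
sci_fi = percent_to_count(20, TOTAL_PEOPLE)
print(f"Sci-fi: {sci_fi:.0f} people")
```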
Bar chart
A bar chart is a type of one-dimensional (1D) data visualization
technique that displays categorical data using rectangular bars,
each representing a category’s value.
The length or height of each bar is proportional to the value it
represents, making it easy to compare different categories at a
glance.
They are called one-dimensional diagrams because only the
length of the bar matters, not its width. The width of every
bar stays the same within a diagram, but it may vary from
diagram to diagram depending on the space available and the
number of bars to be presented.
Solution:
Two-dimensional (2D)
Two-dimensional (2D) data visualization techniques are used to
display data with two variables or dimensions, i.e., length and
width, using visualizations such as charts, plots, and graphs.
• Histogram
• Line plot
• Frequency curves and polygons
• Ogive curves
• Scatter plot
Histogram in two-dimensional data visualization
A histogram is a type of two-dimensional (2D) data visualization
that displays the distribution of a continuous variable using
rectangular bars.
Key components:
Best practices:
For the example we have considered, a frequency curve looks like the
following,
And for the same example, a frequency polygon looks like the
following,
We can see that both the frequency polygon and the frequency
curve are plotted against the class marks (the midpoints of
the class intervals).
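Since both graphs are drawn from class marks, it may help to see how those are obtained. The class intervals and frequencies below are made-up illustrative values, not the example referred to above.

```python
# Hypothetical grouped data: class intervals and their frequencies.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [3, 7, 12, 5]

# The class mark is the midpoint of each class interval.
class_marks = [(lower + upper) / 2 for lower, upper in classes]

# Points to plot: joined by straight segments they form a frequency polygon,
# joined by a smooth freehand line they form a frequency curve.
points = list(zip(class_marks, frequencies))
print(points)
```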
The only difference between a frequency polygon and a
frequency curve is the following,
Characteristics:
• X-axis: Represents the continuous variable.
• Y-axis : Represents the cumulative frequency or percentage.
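The cumulative frequencies on the Y-axis of a cumulative frequency (ogive) curve are running totals of the class frequencies. A small sketch, where the class boundaries and counts are assumptions chosen for illustration:

```python
from itertools import accumulate

# Hypothetical upper class boundaries and their frequencies.
upper_boundaries = [10, 20, 30, 40]
frequencies = [3, 7, 12, 5]

# A running total of the frequencies gives the "less than" cumulative frequency.
cumulative = list(accumulate(frequencies))

# Cumulative percentage, the alternative Y-axis scale.
total = cumulative[-1]
cumulative_pct = [100 * c / total for c in cumulative]

for boundary, c, pct in zip(upper_boundaries, cumulative, cumulative_pct):
    print(f"< {boundary}: {c} ({pct:.1f}%)")
```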
Characteristics:
DV TECHNIQUES
Gantt chart
A Gantt chart is a type of bar chart used in two-dimensional
(2D) data visualization, primarily for project management and
scheduling.
It visually represents the timeline of tasks, activities, or
events within a project, showing their start and end dates,
duration, and sometimes dependencies.
On the left of the chart is a list of the activities and along
the top is a suitable time scale. Each activity is represented
by a bar; the position and length of the bar reflect the start
date, duration and end date of the activity.
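The relationship between a bar's position, its length and the activity dates can be sketched as follows. The activity names and dates are hypothetical, and the end date is computed as start plus duration (so it is the first day after the activity).

```python
from datetime import date, timedelta

# Hypothetical project plan: (activity, start date, duration in days).
tasks = [
    ("Design", date(2024, 1, 1), 5),
    ("Build", date(2024, 1, 6), 10),
    ("Test", date(2024, 1, 16), 4),
]

project_start = min(start for _, start, _ in tasks)

bars = []
for name, start, days in tasks:
    offset = (start - project_start).days  # bar position on the time scale
    end = start + timedelta(days=days)     # end date implied by the bar length
    bars.append((name, offset, days, end))
    # Text rendering of the bar, one character per day.
    print(f"{name:<8}{' ' * offset}{'#' * days}")
```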
Using Gantt charts to display timelines can be very helpful
and enables team members to keep track of every aspect of a
project. Even if you are not a project management
professional, familiarizing yourself with Gantt charts can
help you stay organized.
Purpose:
Heat maps
• Heat maps are two-dimensional techniques.
• A heat map is a graphical representation of data where
values are depicted using colors. The data is typically
arranged in a grid or matrix format, with each cell
assigned a color based on its value. Heat maps are
particularly useful for visualizing large datasets and
identifying areas of interest or concentration.
• Heat map data visualization uses color to represent
data values. The most common color schemes range from
warm colors (such as red) to cool colors (such as blue),
with warm colors typically representing higher values
and cool colors representing lower values.
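One simple way to realize such a scale is linear interpolation between blue and red. This is only a sketch; real tools ship carefully tuned color maps, and the function name `value_to_rgb` is an assumption.

```python
def value_to_rgb(value, vmin, vmax):
    """Map a value to an (R, G, B) tuple: blue at vmin, red at vmax."""
    if vmax == vmin:
        t = 0.5  # all values equal, so use the midpoint color
    else:
        t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp values that fall outside the range
    red = round(255 * t)
    blue = round(255 * (1 - t))
    return (red, 0, blue)

print(value_to_rgb(0, 0, 100))    # lowest value, cool color
print(value_to_rgb(100, 0, 100))  # highest value, warm color
```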
Key characteristics:
Best practices:
• Avoid 3D effects: Keep the heat map flat, as 3D effects can distort
the data.
How to Create a Heat Map:
• Prepare Data: Organize your data into a matrix or
table format, where each row and column represents
a category or variable, and the intersecting cells
contain the data values.
• Assign Colors: Determine a color scale that
represents the range of data values. Assign colors to
each cell based on the corresponding data value.
• Plot the Heat Map: Plot the heat map using software
or a visualization tool that supports heat map creation.
Tools like Excel, Python (with libraries like Seaborn or
Matplotlib), and R are commonly used.
• Add a Legend: Include a color legend that
indicates what the colors represent in terms of
data values.
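The data preparation steps above can be sketched in plain Python: arranging records into a matrix and finding the value range that the color scale and legend need to cover. The records are hypothetical, and the plotting itself would be done in one of the tools listed above.

```python
# Hypothetical records: (row category, column category, value).
records = [
    ("Mon", "Morning", 12), ("Mon", "Evening", 30),
    ("Tue", "Morning", 18), ("Tue", "Evening", 25),
]

# Prepare data: collect row and column labels in first-seen order.
rows = list(dict.fromkeys(r for r, _, _ in records))
cols = list(dict.fromkeys(c for _, c, _ in records))

# Build the matrix; cells with no record stay None.
matrix = [[None] * len(cols) for _ in rows]
for r, c, value in records:
    matrix[rows.index(r)][cols.index(c)] = value

# Assign colors / add a legend: both need the overall value range.
values = [v for line in matrix for v in line if v is not None]
vmin, vmax = min(values), max(values)

print(cols)
for label, line in zip(rows, matrix):
    print(label, line)
print("legend range:", vmin, "to", vmax)
```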
Example 1 :
• Box :
• The central part of the plot is a rectangle
(the “box”) that spans from the first
quartile (Q1) to the third quartile (Q3) of
the data.
• The height (or width, if the plot is
horizontal) of the box represents the
interquartile range (IQR), which is the middle
50% of the data.
• Median Line:
• A line inside the box represents the median
(Q2), which is the middle value of the
dataset. This divides the data into two equal
parts.
• Whiskers:
• The “whiskers” extend from the edges of the
box to the minimum and maximum values
within a specified range.
The following steps are involved in making Box and Whisker Plot:
• Gather Information:
• Collect the dataset and sort the data in ascending
order.
• Calculate Quartiles:
• Calculate the minimum, maximum, first
quartile (Q1), third quartile (Q3), and
median (Q2) from the data.
• Identify any outliers using the 1.5 times IQR rule.
• Draw the Box:
• Draw a rectangle from Q1 to Q3.
• Inside the box, draw a line at the median (Q2).
• Add Whiskers:
• Extend lines (whiskers) from Q1 to the
minimum value within the 1.5 times IQR
range and from Q3 to the maximum
value within this range.
• Identify Outliers:
• Plot any data points outside the
whiskers as individual points.
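The calculation steps above can be sketched as follows. Quartiles are computed here as the medians of the lower and upper halves of the sorted data, which is one common convention (others give slightly different values). The function name `box_plot_summary` and the sample data are hypothetical.

```python
from statistics import median

def box_plot_summary(data):
    """Quartiles, whisker ends and outliers using the 1.5 x IQR rule.

    Expects at least two values.
    """
    values = sorted(data)
    half = len(values) // 2
    # Lower and upper halves; the middle value is left out when n is odd.
    lower, upper = values[:half], values[-half:]
    q1, q2, q3 = median(lower), median(values), median(upper)
    iqr = q3 - q1
    low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_limit <= v <= high_limit]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in values if v < low_limit or v > high_limit],
    }

summary = box_plot_summary([2, 4, 5, 7, 8, 9, 10, 30])
print(summary)
```

In this made-up dataset the value 30 lies above Q3 + 1.5 × IQR = 17, so it is reported as an outlier and the upper whisker stops at 10.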
Solution:
Step 1: Collect Data
Dataset: 78, 85, 90, 92, 95, 96, 97, 98, 99, 100, 105, 110, 120
- Q1 (the first quartile) is the median of the lower half of the
data (78, 85, 90, 92, 95, 96) = 91
- Q2 (the median) is the median of the entire dataset = 97
- Q3 (the third quartile) is the median of the upper half of
the data (98, 99, 100, 105, 110, 120) = 102.5
Any data points that fall outside the whiskers are considered
outliers. Here IQR = 102.5 - 91 = 11.5, so the 1.5 times IQR
limits are 91 - 17.25 = 73.75 and 102.5 + 17.25 = 119.75. The
value 120 lies just above the upper limit, so it is plotted as
an outlier and the upper whisker ends at 110. This Box
and Whisker Plot gives a visual summary of the grades, showing
the median (Q2) at 97, the interquartile range (IQR) from Q1 to
Q3 (91 to 102.5), and the single outlier. It effectively
outlines the central tendency, spread, and distribution of the
dataset.
Waterfall Chart
A Waterfall Chart is a data visualization tool used to show
how an initial value is affected by a series of positive or
negative intermediate values, leading to a final value.
Key components:
• Initial value : Starting point of the waterfall.
• Positive values : Increases to the initial value.
Add bars for each intermediate value. Each bar will either
add to (positive value) or subtract from (negative value) the
previous total.
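The running totals behind the bars can be sketched as follows. All figures and labels are hypothetical; each bar starts at the previous total and ends at the new one.

```python
# Hypothetical figures: an opening balance followed by gains and losses.
initial = 1000
changes = [("Sales", 400), ("Refunds", -150), ("Costs", -300), ("Interest", 50)]

# Each floating bar starts at the previous running total and ends at the new one.
bars = []
running = initial
for label, delta in changes:
    start, running = running, running + delta
    bars.append((label, start, running))

final = running
for label, start, end in bars:
    direction = "increase" if end > start else "decrease"
    print(f"{label:<9}{start:>5} -> {end:>5} ({direction})")
print("Final value:", final)
```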
Area charts
An area chart is similar to a line chart, except the region below
the lines in an area chart is filled with color or shading, making
it simple to view the overall value across multiple data series.
Key components:
• X-axis: Typically represents time or another
continuous variable. This axis is used to plot the data points
in sequential order.
Best practices:
Key components:
Sub Plots
Subplots in dimensional data visualization are used to display
multiple plots in a single visual, allowing for:
Types of Subplots:
• Small Multiples: Multiple small plots displaying the
same variable across different dimensions.
Strengths of Subplots:
Limitations of Subplots:
import matplotlib.pyplot as plt
import numpy as np
Example Plots:
• Line Plot:
A line plot is used to display data points connected by
straight lines. It’s commonly used to visualize trends over
time.
import matplotlib.pyplot as plt
# initializing the data
x = [10, 20, 30, 40]
y = [20, 25, 35, 55]
# plotting the data
plt.plot(x, y)
plt.title("Line Chart")
# Adding label on the y-axis
plt.ylabel('Y-Axis')
# Adding label on the x-axis
plt.xlabel('X-Axis')
plt.show()
Output :
• Bar Chart:
Bar charts are used to represent categorical data with
rectangular bars. Each bar’s length or height corresponds
to the value it represents.
Code :
import matplotlib.pyplot as plt
import pandas as pd
# Reading the tips.csv file
data = pd.read_csv('tips.csv')
# initializing the data
x = data['day']
y = data['total_bill']
# plotting the data
plt.bar(x, y)
# Adding title to the plot
plt.title("Tips Dataset")
# Adding label on the y-axis
plt.ylabel('Total Bill')
# Adding label on the x-axis
plt.xlabel('Day')
plt.show()
Output :
Seaborn Styles
Seaborn is a well-known Python library for data visualization
that offers a user-friendly interface for producing visually
appealing and informative statistical graphics. It is designed
to work with Pandas dataframes, making it easy to visualize
and explore data quickly and effectively.
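import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("darkgrid")
tips = sns.load_dataset("tips")  # Seaborn's example tips dataset
sns.stripplot(x="day", y="total_bill", data=tips)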
Background Color
sns.set_style("dark")
sns.set_style("ticks")
sns.stripplot(x="day", y="total_bill", data=tips)
Grids
It's a good choice to use a grid when you want your audience to
be able to draw their own conclusions about data. A grid allows
the audience to read your chart and get specific information
about certain values. Research papers and reports are a good
example of when you would want to include a grid.
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
tips = sns.load_dataset("tips")  # Seaborn's example tips dataset
sns.stripplot(x="day", y="total_bill", data=tips)
Despine
In addition to changing the background color, you can also
control the use of spines. Spines are the borders of the figure
that contain the visualization. By default, a figure has four
spines.
You may want to remove some or all of the spines for various
reasons. A figure with only the left and bottom spines
resembles traditional graphs. You can automatically take away
the top and right spines using the sns.despine() function.
Note: this function must be called after you have created your
plot.
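import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style("white")
tips = sns.load_dataset("tips")  # Seaborn's example tips dataset
sns.stripplot(x="day", y="total_bill", data=tips)
# Remove the top and right spines after the plot has been drawn
sns.despine()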
Not including any spines at all may be an aesthetic decision.
You can also specify how many spines you want to include by
calling despine() and passing in the spines you want to get rid
of, such as: left, bottom, top, right.
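import seaborn as sns
sns.set_style("whitegrid")
tips = sns.load_dataset("tips")  # Seaborn's example tips dataset
sns.stripplot(x="day", y="total_bill", data=tips)
# Also remove the left and bottom spines
sns.despine(left=True, bottom=True)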
Box plot
Box Plot is a graphical method to visualize data distribution
for gaining insights and making informed decisions. Box plot
is a type of chart that depicts a group of numerical data
through their quartiles.
First Quartile (Q1) – 25% of the data lies below the First (lower)
Quartile.
Median (Q2) – It is the mid-point of the dataset. Half of the
values lie below it and half above.
Third Quartile (Q3) – 75% of the data lies below the Third (Upper)
Quartile.
Maximum – It is the maximum value in the dataset excluding the
outliers.
The area inside the box (50% of the data) is known as the
Inter Quartile Range. The IQR is calculated as –
IQR = Q3-Q1
Outliers are the data points below the lower limit or above
the upper limit. The lower and upper limits are calculated as:
Lower Limit = Q1 - 1.5*IQR
Upper Limit = Q3 + 1.5*IQR
Smooth Curve:
A density plot represents the distribution as a smooth curve,
which is generated using kernel density estimation (KDE). The
curve is continuous and shows the
probability density of the data across the entire range of the
variable.
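A minimal sketch of kernel density estimation with a Gaussian kernel, written in plain Python to show where the smooth curve comes from. The sample values and the bandwidth of 1.0 are arbitrary assumptions; plotting libraries typically choose the bandwidth automatically.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function giving the estimated probability density at x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump centred on every sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

samples = [1.0, 2.0, 2.5, 3.0, 7.0]  # hypothetical observations
density = gaussian_kde(samples, bandwidth=1.0)

# Evaluate the curve on a grid from -5.0 to 15.0 in steps of 0.1.
step = 0.1
grid = [i * step for i in range(-50, 151)]
curve = [density(x) for x in grid]

# The area under a probability density curve should be close to 1.
area = sum(curve) * step
print(f"area under curve: {area:.3f}")
```

The curve is highest where the samples cluster (around 2 to 3) and much lower in the gap between 3 and 7.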
Comparison of Distributions:
Tree map
A tree map is a type of data visualization that displays
hierarchical data using nested rectangles. Each branch of the
hierarchy is represented as a rectangle,
which is then subdivided into smaller rectangles that
represent sub-branches. The size and color of the rectangles
can be used to represent different variables.
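The nesting can be sketched with a simple slice-and-dice layout, one of several possible tree map layouts: each level splits its parent rectangle in proportion to the totals of its children, alternating between splitting the width and the height. The hierarchy, its values and the function names are hypothetical.

```python
def total(node):
    """Sum of the leaf values under a node (a number or a dict of children)."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node

def slice_and_dice(node, x, y, w, h, split_width=True, path=()):
    """Return {leaf path: (x, y, width, height)} for every leaf of the tree."""
    if not isinstance(node, dict):
        return {path: (x, y, w, h)}
    rects = {}
    node_total = total(node)
    offset = 0.0
    for name, child in node.items():
        share = total(child) / node_total
        if split_width:
            rects.update(slice_and_dice(
                child, x + offset, y, w * share, h, False, path + (name,)))
            offset += w * share
        else:
            rects.update(slice_and_dice(
                child, x, y + offset, w, h * share, True, path + (name,)))
            offset += h * share
    return rects

# Hypothetical two-level hierarchy: category -> product -> sales value.
sales = {
    "Hardware": {"Laptops": 40, "Phones": 20},
    "Software": {"Licenses": 30, "Support": 10},
}
rects = slice_and_dice(sales, 0.0, 0.0, 100.0, 100.0)
for path, (x, y, w, h) in rects.items():
    print(" / ".join(path), f"area={w * h:.0f}")
```

Each leaf's area is proportional to its value; color could then encode a second variable.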
Advantages:
• Visualizes complex hierarchical data
• Displays multiple variables and relationships
Common uses:
• Business intelligence
• Financial analysis
• Marketing and sales
Best practices:
• Use clear and consistent colors
• Label rectangles and provide tooltips
Graph Networks
Graph networks are a powerful tool in data visualization,
used to represent large numbers of entities and the
relationships between them as nodes and edges.
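A minimal sketch of the underlying data structure, assuming undirected relationships. The entity names are hypothetical; a drawing tool would place each key as a node and each pair as an edge.

```python
from collections import defaultdict

# Hypothetical relationships: each pair is an undirected edge between two entities.
edges = [("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Carol"), ("Carol", "Dave")]

# Adjacency list: node -> set of neighboring nodes.
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

# Degree (number of connections) is often mapped to node size in a drawing.
degree = {node: len(neighbors) for node, neighbors in graph.items()}
most_connected = max(degree, key=degree.get)
print(degree, most_connected)
```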
Biological Networks
Purpose: To visualize and analyze relationships
Fraud Detection