CT127-3-2 Programming for Data Analysis
DATA VISUALIZATION
Module Code & Module Title Slide Title SLIDE 1
TOPIC LEARNING OUTCOMES
At the end of this topic, you should be able to:
• Understand how to use the ggplot2 package for data visualization.
Contents & Structure
- Data Visualization
- ggplot2 package
- Line and path plots
- Bar chart
- Histogram
- Frequency Polygon
- Box plot
- Scatter plot
- Count plot
- Using colors and shapes in plots
- Axis and plot labels
- Facetting
1. Data Visualization using ggplot2 package
Data Visualization
• Data visualization is the graphic representation of data.
• Graphics are used in statistics primarily for two reasons:
exploratory data analysis (EDA) and presenting results.
• The ggplot2 package is widely used to perform data
visualization in R.
• To install ggplot2: install.packages("ggplot2")
• To load ggplot2: library(ggplot2)
ggplot2 package
• Unlike most other graphics packages, ggplot2 has a deep
underlying grammar.
• This grammar is made up of a set of independent
components that can be composed in many different ways.
This makes ggplot2 very powerful because you are not
limited to a set of pre-specified graphics, but you can create
new graphics that are precisely tailored for your problem.
• In brief, the grammar tells us that a statistical graphic is a mapping
from data to aesthetic attributes (colour, shape, size) of geometric
objects (points, lines, bars).
ggplot2 package
• Every ggplot2 plot has three key components:
1. data,
2. A set of aesthetic mappings between variables in the data and
visual properties, and
3. At least one layer which describes how to render each
observation.
• The basic structure for ggplot2 starts with the ggplot function,
which takes the data as its first argument. After that, layers can
be added using the + symbol.
ggplot2 package
Here’s a simple example:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
This produces a scatterplot defined by:
1. Data: mpg.
2. Aesthetic mapping: engine size (displ) mapped to x position, highway
miles per gallon (hwy) to y position.
3. Layer: points.
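The three components can also be built up step by step. The following is a minimal sketch, assuming ggplot2 is installed; the variable names base, scatter and both are illustrative and do not come from the slides.

```r
library(ggplot2)

# Data and aesthetic mappings only: with no layer, printing this shows empty axes
base <- ggplot(mpg, aes(x = displ, y = hwy))

# Adding one layer of points gives the scatterplot from the slide
scatter <- base + geom_point()

# Layers stack with +: here a line layer is drawn over the same data and mappings
both <- base + geom_point() + geom_line()

print(scatter)
print(both)
```

Storing the data and mappings once and reusing them is one way to see that layers are independent components of the grammar.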
2. Importance of Data Visualization
Anscombe's quartet
• Frank Anscombe constructed four datasets. Each
dataset consists of eleven (x,y) points. These data
sets have nearly identical simple descriptive
statistics.
• Anscombe's quartet demonstrates both the
importance of graphing data before analyzing it and
the effect of outliers and other influential
observations on statistical properties.
Frank Anscombe
https://en.wikipedia.org/wiki/Frank_Anscombe
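The claim about nearly identical descriptive statistics can be checked with the anscombe data frame that ships with R (columns x1 to x4 and y1 to y4). This is a minimal sketch; the helper name stats is illustrative.

```r
# anscombe is part of the built-in datasets package, so no extra install is needed
data(anscombe)

# One row of summary statistics per dataset in the quartet
stats <- data.frame(
  set    = 1:4,
  mean_x = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  var_x  = sapply(1:4, function(i) var(anscombe[[paste0("x", i)]])),
  cor_xy = sapply(1:4, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
)

# The means, variances and correlations agree closely across all four sets
print(round(stats, 3))
```

The summary rows are almost indistinguishable, which is why the four datasets have to be plotted before their differences become visible.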
Anscombe's quartet
https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Anscombe's quartet
1. Two variables that are correlated and follow the assumption of normality.
2. The data are not normally distributed, and the relationship is non-linear.
3. A perfect linear relationship, except for one outlier which lowers the
correlation coefficient from 1 to 0.816.
4. One outlier is enough to produce a high correlation coefficient,
even though the relationship between the two variables is not linear.
https://en.wikipedia.org/wiki/Anscombe%27s_quartet
3. Line and Path plots
Line and Path plots
• Line and path plots are typically used for time series data.
• Line plots join the points from left to right, while path plots join
them in the order that they appear in the dataset.
• Line plots usually have time on the x-axis, showing how a single
variable has changed over time. Path plots show how two variables
have simultaneously changed over time, with time encoded in the
way that observations are connected.
Line and Path plots
• The geom_line() function in the ggplot2 package is used to plot line
graphs.
• geom_line() connects the observations in order of the
variable on the x-axis.
ggplot(economics, aes(x=date, y=pop)) +
geom_line()
Line and Path plots
geom_path() connects observations in original order.
ggplot(economics, aes(unemploy / pop, uempmed, colour = date)) +
geom_path()
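The difference between the two geoms is easiest to see on data whose rows are not sorted by x. The sketch below uses a small made-up data frame (toy is hypothetical, not part of ggplot2) and assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical toy data: the rows are deliberately NOT in increasing x order
toy <- data.frame(x = c(1, 3, 2, 4),
                  y = c(1, 3, 1, 2))

# geom_line() joins the points from left to right: x = 1, 2, 3, 4
line_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_line()

# geom_path() joins the points in row order: x = 1, 3, 2, 4
path_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_path()

print(line_plot)
print(path_plot)
```

If the rows were already sorted by x, the two plots would look identical.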
4. Plotting a Categorical Variable using Bar chart
Bar chart
• A bar chart displays a categorical variable as bars whose heights are
proportional to the values they represent.
• ggplot() with geom_bar() in the ggplot2 package is used to
plot bar charts.
• geom_bar() makes the height of each bar proportional to the number of
cases in each group.
ggplot(diamonds, aes(x = cut)) +
geom_bar()
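The counts that geom_bar() draws can be reproduced with table(). The sketch below also shows geom_col(), which plots heights that are already computed; the names cut_counts, counts_df and bars are illustrative.

```r
library(ggplot2)

# geom_bar() counts the rows in each category; table() gives the same counts
cut_counts <- table(diamonds$cut)
print(cut_counts)

# Convert the table to a data frame with one row per category
counts_df <- as.data.frame(cut_counts)
names(counts_df) <- c("cut", "n")

# geom_col() uses the y values as given instead of counting rows
bars <- ggplot(counts_df, aes(x = cut, y = n)) +
  geom_col()
print(bars)
```

Both approaches should produce the same picture here; geom_col() is the usual choice when the data already contain one summary value per category.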
5. Plotting a Continuous Variable using Histogram/Frequency Polygon/Boxplot
Histogram
• A histogram shows the distribution of values for a single variable.
• Histograms break the data into buckets (bins), and the height of
each bar represents the number of observations that fall into
that bucket.
Histogram
ggplot() with geom_histogram() in the ggplot2 package are used to plot
histograms.
ggplot(diamonds, aes(x=carat)) +
geom_histogram()
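Called without arguments, geom_histogram() falls back to 30 bins and prints a message suggesting a better choice. The sketch below sets the bucket width explicitly; the value 0.1 is an illustrative assumption, not a recommendation from the slides.

```r
library(ggplot2)

# binwidth sets the width of each bucket in data units (here 0.1 carat)
hist_narrow <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

# Alternatively, bins sets the number of buckets instead of their width
hist_bins <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 50)

print(hist_narrow)
print(hist_bins)
```

Trying a few widths is worthwhile, because the apparent shape of the distribution can change with the bucket size.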
Frequency Polygon
• Like histograms, frequency polygons show the distribution of a single
numeric variable.
• Histograms use bars, whereas frequency polygons use lines.
ggplot(diamonds, aes(x=carat)) +
geom_freqpoly()
Boxplot
• A boxplot depicts groups of numerical data through their quartiles.
• The middle line in the boxplot represents the median and the
box is bounded by the first and third quartiles.
• The interquartile range (IQR), the distance between the first and
third quartiles, spans the middle 50% of the data.
https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51
Boxplot
• A series of hourly temperatures were measured throughout the
day in degrees Fahrenheit.
• The recorded values are listed in order as follows: 64, 64, 64,
65, 70, 73, 73, 74, 74, 75, 76, 77, 77, 77, 77, 79, 80, 82, 82,
83, 83, 85, 86, 88.
> summary(temp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
64.00 73.00 77.00 76.17 82.00 88.00
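The quantities a boxplot draws can be computed directly from the temperature values listed above. In this sketch the 1.5 × IQR rule for flagging outliers is an assumption: it is the common convention (and the default in geom_boxplot()), but it is not stated on the slide.

```r
# Hourly temperatures in degrees Fahrenheit, as listed on the slide
temp <- c(64, 64, 64, 65, 70, 73, 73, 74, 74, 75, 76, 77,
          77, 77, 77, 79, 80, 82, 82, 83, 83, 85, 86, 88)

# First quartile, median and third quartile: 73, 77 and 82
q <- quantile(temp, c(0.25, 0.5, 0.75))

# Interquartile range: 82 - 73 = 9
iqr <- IQR(temp)

# Assumed convention: points beyond 1.5 * IQR from the box count as outliers
lower_fence <- q[[1]] - 1.5 * iqr   # 59.5
upper_fence <- q[[3]] + 1.5 * iqr   # 95.5
outliers <- temp[temp < lower_fence | temp > upper_fence]

print(summary(temp))
print(outliers)   # empty: every reading lies inside the fences
```

Because the minimum (64) and maximum (88) both fall inside the fences, a boxplot of these readings would show whiskers reaching the extremes and no separate outlier points.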
Boxplot
ggplot() with geom_boxplot() in the ggplot2 package are used to produce
boxplots.
ggplot(diamonds, aes(y=carat, x=1)) +
geom_boxplot()
A single boxplot is one-dimensional and needs only a y aesthetic. Older versions of
ggplot2 also required an x aesthetic, so a constant (x = 1) is supplied here; recent
versions accept aes(y = carat) alone.
Boxplot
https://r4ds.had.co.nz/exploratory-data-analysis.html#missing-values-2
6. Visualize the covariation between two continuous variables using Scatterplot
Scatterplot
• A scatterplot is a diagram used to visualize the
covariation between two continuous variables. Covariation is
the tendency of two or more variables to vary together in a related way.
• Each point represents one observation: its x position gives
the value of one variable and its y position the value of the
other.
Scatterplot
• ggplot() with geom_point() in the ggplot2 package are used to create
scatterplots.
ggplot(diamonds, aes(x=carat, y=price)) +
geom_point()
7. Visualize the covariation between two categorical variables using Count plot
Count plot
The covariation between two categorical variables can be explored by counting the
number of observations for each combination.
ggplot(diamonds, aes(x = cut, y = color)) +
geom_count() +
labs(title="The co-variation between diamond's cut quality and color",
x="cut", y="color")
In the plot legend, n represents how many observations occurred
at each combination of values.
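The counts behind each point in the count plot can be inspected as a two-way table. This is a minimal sketch using base R's table(); the name combo_counts is illustrative.

```r
library(ggplot2)   # loaded only for the diamonds dataset

# Rows are cut levels, columns are color levels, cells are observation counts
combo_counts <- table(cut = diamonds$cut, color = diamonds$color)
print(combo_counts)
```

Each cell of the table corresponds to one point in the count plot, with larger counts drawn as larger points.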
8. Visualize the covariation between a categorical and a continuous variable using Box plot
Box plot
The covariation between a categorical and a continuous variable can be explored using
boxplots.
ggplot(diamonds, aes(x = color, y = price)) +
geom_boxplot() +
labs(title="The co-variation between diamond's color and price",
x="color", y="price")
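The middle line of each box is the median of that group, which can be computed separately as a check. This is a minimal sketch using base R's tapply(); the name median_price is illustrative.

```r
library(ggplot2)   # loaded only for the diamonds dataset

# Median price within each color group: one value per box in the plot
median_price <- tapply(diamonds$price, diamonds$color, median)
print(median_price)
```

Comparing these numbers with the plot is a quick way to confirm which group each box belongs to.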
9. More about plots
Using colors in plots
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point()
Using colors in plots
ggplot(data=diamonds, aes(x=carat)) +
geom_histogram(col="white", fill="blue")
Using colors in plots
ggplot(diamonds, aes(carat, colour = cut)) +
geom_freqpoly()
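The colour examples use two different mechanisms: mapping a variable to colour inside aes(), and setting a fixed colour as an argument to the geom. The following sketch puts them side by side; the names mapped and fixed are illustrative.

```r
library(ggplot2)

# Mapping: colour varies with the cut variable, so it goes inside aes()
# and ggplot2 adds a legend automatically
mapped <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()

# Setting: every point gets the same fixed colour, given outside aes()
fixed <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue")

print(mapped)
print(fixed)
```

A fixed value such as "blue" placed inside aes() would be treated as data rather than as a colour name, so constants belong outside aes().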
Using shapes in plots
ggplot(diamonds, aes(x=carat, y=price, color=cut, shape=cut)) +
geom_point()
Axis and plot labels
ggplot(data=diamonds, aes(carat)) +
geom_histogram(col="white", fill="blue") +
labs(title="Histogram for carat", x="Carat", y="Count")
Facetting
• Facetting creates tables of graphics by splitting the data into subsets and
displaying the same graph for each subset.
• To facet a plot, add a facetting specification with
facet_wrap(), which takes the name of a variable preceded by ~.
ggplot(diamonds, aes(x=carat)) +
geom_histogram() +
facet_wrap(~cut)
Facetting
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
facet_wrap(~cut)
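Facetting produces one panel per level of the facetting variable. The sketch below lists those levels and uses the ncol argument of facet_wrap() to control the panel layout; ncol = 2 and binwidth = 0.1 are illustrative choices.

```r
library(ggplot2)

# cut has five levels, so facet_wrap(~cut) draws five panels
print(levels(diamonds$cut))

# ncol arranges the panels in two columns instead of the default layout
faceted <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~cut, ncol = 2)

print(faceted)
```

All panels share the same axes by default, which makes the subsets directly comparable.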
Quick Review Questions
• Describe the following plots and how to use them in R:
– Line plot
– Path plot
– Bar chart
– Histogram plot
– Frequency polygon plot
– Box plot
– Scatter plot
– Count plot
• How do you use colors and shapes in plots?
• How do you add axis and plot labels in R?
• How do you facet a plot in R?
Summary of Main Teaching Points
• Line and path plots
• Bar chart
• Histogram
• Frequency Polygons
• Box plot
• Scatter plot
• Count plot
• Using colors and shapes in plots
• Axis and plot labels
• Facetting
What To Expect Next Week
In Class
• Data Manipulation
Preparation for Class
• Various manipulation functions