Module 2c - Exploratory Data Analysis
UP Open University
Introduction
In this module, you will learn about exploratory data analysis, descriptive
analytics, and various statistical methods. You will gain an understanding of the
measures of central tendency, measures of dispersion or variation, frequency
distributions, and measures of position.
Objectives:
There are two known methods of performing exploratory data analysis: numerical
methods and graphical methods.
The numerical method uses descriptive statistics such as the mean, standard
deviation, and variance to summarize the data, along with other quantitative
analyses such as regression analysis, principal component analysis (PCA), and
cross tabulation.
The graphical method is done to visualize the data distribution, detect the
presence of extremely high/low values or outliers, test the assumptions of the
data, identify important variables, and detect relationships between variables.
Some of the graphical methods include bar plots, box plots, pie charts,
histograms, and scatter plots.
The population refers to the whole or entire data set of the subject being
studied. For example, the population of enrolled students in the Open
University refers to all enrolled students.
The sample data are data taken from the population and used to represent the
population data set. For example, the students enrolled in BS Education and
Diploma in Computer Science are a sample of the UPOU students.
The measure of central tendency determines the value around which the data set
converges. This value is the central location of the distribution. The three
common measures of central tendency are the mean, median, and mode.
Process:
1. Compute the total of all elements.
Sum = $x_1 + x_2 + \dots + x_n = \sum_{i=1}^{n} x_i$
Where: $x_1$ - refers to the first element/value in the data set
$x_2$ - refers to the 2nd element/value in the data set
$x_n$ - refers to the last element/value in the data set
$n$ - the total number of elements in the data set
2. Compute the mean value, mean = Sum/n.
• The mode is the most common element or the element that has the
most count in the data set.
Process:
1. rearrange the elements in ascending order (this is not required, but
it makes the most frequently occurring value easier to find)
2. count the occurrence of each value in the dataset
3. the value with the highest count is the mode of the data set.
4. The data set may have one, two or more mode values.
• The median is the middle value or element of the data set. This can
be obtained by sorting or arranging the elements or values in either
descending or ascending order. Finding the median value depends on
the total number of elements (n) in the set: if n is odd, the median is
the single middle element; if n is even, it is the average of the two
middle elements.
Example 1: Determine the central values (mean, median, mode) of the given
data set.
Age of students = {19, 18, 25, 35, 21, 43, 20, 30, 30, 30}
Number of elements = 10
Age in sorted order = {18, 19, 20, 21, 25, 30, 30, 30, 35, 43}
Mean:
a) Get the sum of all elements, Sum = 19 + 18 + 25 + … + 30 = 271
b) Mean = sum/n = 271/10 = 27.10
Median:
a) Sort the elements in ascending or descending order.
Age in sorted order = {18, 19, 20, 21, 25, 30, 30, 30, 35, 43}
b) The total number of elements = 10
c) Since 10 is even, the two middle elements are in the 5th and 6th
positions in the list = {25, 30}
d) Median = (25+30)/2 = 27.5
Course code: COMP ED 20
Calag, VB (2021). Introduction to Analytics. Los Baños: University of the Philippines Open University
Mode:
a) Sort the elements in ascending or descending order.
Age in sorted order = {18, 19, 20, 21, 25, 30, 30, 30, 35, 43}
b) Count the number of times that each value occurs in the list
Age     18  19  20  21  25  30  35  43
Count    1   1   1   1   1   3   1   1
c) The value 30 has the highest count (3), so Mode = 30.
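As a cross-check on Example 1, the three processes can be sketched in Python. This is only an illustrative sketch using the ages listed in Example 1; the variable names are not part of the module, and Python stands in for MS Excel here.

```python
from collections import Counter
from statistics import median

ages = [19, 18, 25, 35, 21, 43, 20, 30, 30, 30]

# Mean: total of all elements divided by the number of elements.
total = sum(ages)
mean_age = total / len(ages)

# Median: middle value of the sorted list; for an even count,
# the average of the two middle values.
sorted_ages = sorted(ages)
n = len(sorted_ages)
if n % 2 == 1:
    median_age = sorted_ages[n // 2]
else:
    median_age = (sorted_ages[n // 2 - 1] + sorted_ages[n // 2]) / 2

# Mode: the value(s) with the highest count; there may be more than one.
counts = Counter(ages)
highest = max(counts.values())
modes = [value for value, count in counts.items() if count == highest]

# The standard library gives the same median as the manual process.
assert median_age == median(ages)

print("Sum:", total, "Mean:", mean_age, "Median:", median_age, "Mode:", modes)
```

The mode is kept as a list because a data set may have one, two, or more mode values.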
Gender M F
Count 6 4
Mode is M.
Mean = 3.4
Median = 4
Mode = 4
Mean, median and mode can be used for quantitative data but only mode can
be used for qualitative data.
An outlier is a data value that lies far from the rest of the data set. It can be at
the lower end or at the upper end of the data set.
Consider data sets 1 and 2 below and their computed mean, median, and mode.
The median and mode are not affected by the outlier value, but the mean
changes from 26.40 to 28.10. This is because the mean includes all of the
elements in the data set.
Data Set 1: 18, 19, 20, 21, 25, 30, 30, 30, 35, 36
Data Set 2: 18, 19, 20, 21, 25, 30, 30, 30, 35, 53 (53 is outlier)
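The comparison can be reproduced with a short Python sketch. It assumes the two data sets exactly as listed above and uses the standard `statistics` module rather than MS Excel.

```python
from statistics import mean, median, multimode

data_set_1 = [18, 19, 20, 21, 25, 30, 30, 30, 35, 36]
data_set_2 = [18, 19, 20, 21, 25, 30, 30, 30, 35, 53]  # 53 is the outlier

for name, data in (("Data Set 1", data_set_1), ("Data Set 2", data_set_2)):
    # Only the mean uses every value, so only the mean moves with the outlier.
    print(name, "mean:", mean(data), "median:", median(data),
          "mode:", multimode(data))
```

Running it shows the same median and mode for both data sets and a larger mean for Data Set 2.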
To determine the mean value of a data set using MS Excel, click the cell
where you want the mean value to appear and then type
=AVERAGE(SRC:ERC). SRC is the starting row-column cell and ERC is
the ending row-column cell.
To determine the median and mode, use the same approach but with
=MEDIAN(SRC:ERC) and =MODE(SRC:ERC), respectively. The figures below
illustrate how to perform these operations in MS Excel.
The measure of dispersion provides information on how far from or close to the
average or mean value the values are. The common measures of dispersion
include the following:
a) Range. Range is the simplest method and easy to calculate. It uses only two
values or elements in the data set, which are the maximum and minimum
values.
Process:
a. Determine the largest value (max) and smallest value (min)
b. Compute: Range = max - min
b) Standard Deviation. Unlike range, which uses only two values in the data
set, standard deviation uses all the elements in the data set. It measures the
variability of the elements in the data set with respect to the mean value.
Standard deviation is also used to describe what percentage of the data
elements fall within one, two, and three standard deviations of the mean.
Process:
1. Compute the mean value: Mean = $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$,
• where $n$ = total number of elements
2. Compute: Sum of Squares (SS) = $\sum_{i=1}^{n} (x_i - \bar{x})^2$
• $x_i$ is the individual element in the data set
3. Compute:
• Sample data: Standard Deviation (SD) = $\sqrt{SS/(n - 1)}$
• Population data: Standard Deviation (SD) = $\sqrt{SS/n}$
Process:
1. Compute the standard deviation
2. Compute Variance = (Standard Deviation)^2
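The range, standard deviation, and variance processes above can be sketched in Python as follows. The sketch reuses the ages from Example 1 as sample input; that choice of data and the variable names are assumptions for illustration only.

```python
import math
import statistics

ages = [19, 18, 25, 35, 21, 43, 20, 30, 30, 30]
n = len(ages)

# Range: difference between the largest and smallest values.
data_range = max(ages) - min(ages)

# Sum of squares of the deviations from the mean.
mean_age = sum(ages) / n
ss = sum((x - mean_age) ** 2 for x in ages)

# Standard deviation: divide SS by (n - 1) for a sample, by n for a population.
sample_sd = math.sqrt(ss / (n - 1))
population_sd = math.sqrt(ss / n)

# Variance is the square of the standard deviation.
sample_variance = sample_sd ** 2
population_variance = population_sd ** 2

# Cross-check the manual steps against the standard library.
assert math.isclose(sample_sd, statistics.stdev(ages))
assert math.isclose(population_sd, statistics.pstdev(ages))

print("Range:", data_range)
print("Sample SD:", round(sample_sd, 2), "Population SD:", round(population_sd, 2))
```

The sample formula always gives a slightly larger value than the population formula because it divides by n - 1 instead of n.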
To help you understand the above concept, let us consider the dataset below
with six variables or attributes (Obs, X, Y, Z, Gender, Opinion).
Using MS Excel and the variables X, Y, and Z, we can determine their respective
Mean, Range, Standard Deviation, Variance, and Coefficient of Variation
values. In the cells where these values will be displayed, define the following
formulas:
With reference to the result of the computation, X has a range of 20, a mean
of 40.97, a standard deviation of 6.55, a variance of 42.86, and a coefficient of
variation of 15.98%.
If the data are approximately normally distributed, the standard deviation tells
us that about 68% of the data falls within the range from (40.97 – 6.55) to
(40.97 + 6.55), or from 34.42 to 47.52.
The Coefficient of Variation of X means that the standard deviation is 15.98%
of the mean value of 40.97.
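A minimal Python sketch of these two interpretations is shown below. It assumes the usual definition of the coefficient of variation (standard deviation divided by the mean, expressed as a percentage) and takes the rounded mean and standard deviation of X from the text, so the last digit of the result may differ slightly from a value computed on the unrounded data.

```python
mean_x = 40.97  # mean of X as reported in the text
sd_x = 6.55     # standard deviation of X as reported in the text

# Coefficient of variation: standard deviation as a percentage of the mean.
cv_x = sd_x / mean_x * 100

# Interval of one standard deviation on either side of the mean.
lower = mean_x - sd_x
upper = mean_x + sd_x

print(f"CV = {cv_x:.2f}%")
print(f"Mean +/- 1 SD: {lower:.2f} to {upper:.2f}")
```

Because CV is a ratio, it has no unit and can be compared across variables measured on different scales.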
When Y is compared with X, Y's dispersion measures are higher because its
elements are more dispersed than those of X: its values range from 20 to 60.
We can use range, standard deviation, and variance to compare the
variabilities of the two variables because the data of X is a subset of Y.
For data sets that are mutually exclusive from the others, like Z, variability can
be compared using the Coefficient of Variation (CV).
As shown in the descriptive statistics, although Z has the smallest range,
standard deviation, and variance compared to X and Y, it has the most
dispersed or diverse data based on its CV value.
1. Make a table and count the number of times (or frequency) that each
distinct element occurs in the list. Assuming that we will create a
frequency distribution for Gender, our frequency table will be:
Element Count
F 17
M 13
2. Using MS Excel, select the cell addresses of the data to graph
3. Click Insert and choose the appropriate Chart format. Your frequency
distribution may look as follows:
[Figure: bar chart of Frequency (vertical axis, 0 to 18) for the elements F and M]
Element Count
1 7
2 6
3 8
4 5
5 4
[Figure: bar chart of Frequency (vertical axis, 0 to 9) for the elements 1 to 5]
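Counting the occurrences of each distinct element, as in step 1, can be sketched in Python with `collections.Counter`. The raw Gender column is not reproduced in this module, so the list below is a hypothetical column rebuilt from the counts in the Gender table (17 F and 13 M).

```python
from collections import Counter

# Hypothetical raw column, rebuilt from the counts in the Gender table.
gender = ["F"] * 17 + ["M"] * 13

# Count how many times each distinct element occurs.
frequency = Counter(gender)

print("Element  Count")
for element, count in sorted(frequency.items()):
    print(f"{element:<8} {count}")
```

The same call works for any categorical or discrete column, such as the elements 1 to 5 in the second table.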
Option 1: Adjust the class width to 5 to include 50 and retain the lower
value of class 1 at 30 (left table), or adjust both the class width to 5 and
the lower value to 28 (right table), as shown below.
Class   Interval Range      Class   Interval Range
1       30   34             1       28   32
2       35   39             2       33   37
3       40   44             3       38   42
4       45   49             4       43   47
5       50   54             5       48   52
Option 2: Maintain the class width at 4 and adjust the number of classes to 6.
In the second table below, the value of the lower limit in class 1 is adjusted
to 29.
Class   Interval Range      Class   Interval Range
1       30   33             1       29   32
2       34   37             2       33   36
3       38   41             3       37   40
4       42   45             4       41   44
5       46   49             5       45   48
6       50   53             6       49   52
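The class limits in both options follow mechanically from the lower limit of class 1, the class width, and the number of classes. A minimal Python sketch is given below; the function name `class_intervals` is hypothetical and assumes integer class limits that are inclusive at both ends.

```python
def class_intervals(lower_limit, class_width, number_of_classes):
    """Return (lower, upper) limits for consecutive integer classes."""
    intervals = []
    for k in range(number_of_classes):
        lower = lower_limit + k * class_width
        upper = lower + class_width - 1  # both limits are inclusive
        intervals.append((lower, upper))
    return intervals

# Option 1: class width 5, five classes, lower limit 30.
option_1 = class_intervals(30, 5, 5)
# Option 2: class width 4, six classes, lower limit 30.
option_2 = class_intervals(30, 4, 6)

for label, intervals in (("Option 1", option_1), ("Option 2", option_2)):
    print(label)
    for number, (lower, upper) in enumerate(intervals, start=1):
        print(number, lower, upper)
```

Changing the first argument to 28 or 29 produces the alternative tables with the adjusted lower limit.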
f. Then, find the frequency for each group using the MS Excel COUNTIFS()
function. Below is the frequency distribution table of X.
The data in the frequency column can be generated using the following
formulas in MS Excel:
Class   Interval Range   Frequency
1       30   34          =COUNTIFS(B$2:B$31,"<=34.5")
2       35   39          =COUNTIFS(B$2:B$31,">34.5",B$2:B$31,"<=39.5")
3       40   44          =COUNTIFS(B$2:B$31,">39.5",B$2:B$31,"<=44.5")
4       45   49          =COUNTIFS(B$2:B$31,">44.5",B$2:B$31,"<=49.5")
5       50   54          =COUNTIFS(B$2:B$31,">49.5")
Note: Cell B$2:B$31 is the cell address range of the data.
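For readers working outside MS Excel, the same counting logic can be sketched in Python. The function `class_frequencies` and the sample values are hypothetical (the 30 values of X are not reproduced here). The sketch mirrors the COUNTIFS conditions: class boundaries sit 0.5 beyond the class limits, the first class has no lower bound, and the last class has no upper bound.

```python
def class_frequencies(values, intervals):
    """Count the values in each class using boundaries 0.5 beyond the limits."""
    counts = []
    last = len(intervals) - 1
    for k, (lower, upper) in enumerate(intervals):
        # First class is open below and last class is open above,
        # matching "<=34.5" and ">49.5" in the COUNTIFS formulas.
        low_bound = float("-inf") if k == 0 else lower - 0.5
        high_bound = float("inf") if k == last else upper + 0.5
        counts.append(sum(1 for v in values if low_bound < v <= high_bound))
    return counts

intervals = [(30, 34), (35, 39), (40, 44), (45, 49), (50, 54)]
x_values = [30, 33.5, 35, 38, 41, 42, 44.5, 46, 50, 50]  # hypothetical data

frequencies = class_frequencies(x_values, intervals)
for (lower, upper), count in zip(intervals, frequencies):
    print(lower, upper, count)
```

Every value is counted exactly once, so the frequencies always add up to the number of records.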
In this activity, you will perform a descriptive analysis on a data set involving five
variables with 737 records. You can use MS Excel to perform this activity. You
can download the file from: https://bit.ly/3sSZRRr
This data set is from the USDA's commissioned study of women’s nutrition in
1985. Nutrient intake was measured for a random sample of 737 women aged
25-50 years. The following variables were measured:
• Calcium(mg)
• Iron(mg)
• Protein(g)
• Vitamin A(μg)
• Vitamin C(mg)
Kurtosis measures the height or flatness of the curve. The following is a
summary of the kurtosis of the distribution:
With MS Excel, we can determine the skewness and kurtosis of the data without
necessarily graphing their distribution. This can be done using the
SKEW.P(array) and KURT(array) functions, respectively. The KURT(array)
function returns the excess kurtosis, that is, the kurtosis value minus 3.
Therefore, if:
KURT(array) = 0, the distribution is mesokurtic
KURT(array) < 0, the distribution is platykurtic
KURT(array) > 0, the distribution is leptokurtic
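The idea behind these two functions can be sketched in Python using central moments. This is an illustrative sketch under stated assumptions, not a reimplementation of Excel: the skewness below is the population form, and the excess kurtosis is the plain population form, whereas Excel's KURT applies a sample-size correction, so the numbers can differ for small data sets. All function names and the sample data are hypothetical.

```python
def central_moment(values, order):
    """Average of (x - mean) raised to the given power."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)

def population_skewness(values):
    """Third central moment scaled by the population SD cubed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment scaled by the squared variance, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

def kurtosis_type(excess, tolerance=1e-9):
    """Classify a distribution from its excess kurtosis."""
    if abs(excess) <= tolerance:
        return "mesokurtic"
    return "platykurtic" if excess < 0 else "leptokurtic"

sample = [1, 2, 3, 4, 5]  # hypothetical, perfectly symmetric data

skewness = population_skewness(sample)
excess = excess_kurtosis(sample)
print("Skewness:", skewness)
print("Excess kurtosis:", round(excess, 2), "->", kurtosis_type(excess))
```

A skewness of zero indicates a symmetric distribution, a negative value a left skew, and a positive value a right skew.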
The measure of position determines the position of a specific value within the
data set. This gives us an idea of where the value falls in the distribution:
whether it is close to the mean value or at the extreme lower or upper end of
the data set.
Box and Whisker Plot or Box Plot. This provides a visualization of the
spread and center of a data set. The plot shows five numbers, namely the
minimum, 1st quartile, median, 3rd quartile, and maximum value, together with
the mean value of the data set.
In MS Excel, this can be done by selecting the cells of the data set, then
clicking Insert from the main menu, selecting the Histogram icon, and selecting
the Box and Whisker button.
With reference to X and Y in the previous data set, their Box and Whisker plots
are shown below:
[Figure: box and whisker plots of X and Y (vertical axis 0.00 to 70.00). Values labeled in the plots:]
     Min     Q1      Median  Mean    Q3      Max
X    30.00   34.75   42.00   40.97   46.00   50.00
Y    20.00   28.75   37.50   39.37   52.25   60.00
The box at the center represents the middle portion of the data set. The size
of this box gives us an idea of which data set is more dispersed.
The graphical presentations, and the values of the mean and median, could
give us an idea of the skewness of the distribution of the data sets. For
example, because the median value (42) of X is higher than the mean value
(40.97), the distribution is expected to be skewed to the left.
With reference to the second box plot in the graph, the minimum and maximum
values are 20 and 60, respectively. The second value (28.75) is the threshold
of the first quartile, while the middle line with a value of 37.5 represents the
median. The value at the center (39.37) is the average value, while 52.25 is
the threshold value of the 3rd quartile. The difference between 52.25 and
28.75, which is 23.5, is the interquartile range.
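The numbers drawn in a box plot can be computed directly. The Python sketch below uses the ages from Example 1 as sample input (an assumption for illustration) and the default "exclusive" method of `statistics.quantiles`. Quartile conventions vary between tools, so MS Excel may report slightly different quartiles for the same data. The interquartile range is the 3rd quartile minus the 1st quartile.

```python
from statistics import quantiles

ages = [19, 18, 25, 35, 21, 43, 20, 30, 30, 30]

# Three cut points that split the sorted data into four equal parts.
q1, q2, q3 = quantiles(ages, n=4)

summary = {
    "min": min(ages),
    "q1": q1,
    "median": q2,
    "q3": q3,
    "max": max(ages),
    "mean": sum(ages) / len(ages),
}

# Interquartile range: the height of the box in the plot.
iqr = q3 - q1

for name, value in summary.items():
    print(f"{name:<7}{value}")
print("IQR    ", iqr)
```

A taller box (larger interquartile range) indicates that the middle half of the data is more spread out.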
Outlier values are values that appear at the extreme lower end or extreme
upper end of the data set. These outlier values can be due to clerical errors,
incorrect data entry, and other factors.
Assuming that elements of the variable Y are changed to 15 and 90, the box
and whisker plot shows the outlier values as follows:
[Figure: box and whisker plot of the modified Y, with 90.00 shown as an outlier point above the upper whisker at 60.00]
The inclusion of outliers in the data set affects the mean value. An outlier
value can be corrected by validating it against the raw data, substituted with
a new value, or dropped from the data set.
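One common convention for flagging outliers, which this module does not state explicitly and is therefore an assumption here, treats any value more than 1.5 times the interquartile range below the 1st quartile or above the 3rd quartile as an outlier. The Python sketch below applies that rule using the quartiles of the original Y (28.75 and 52.25); the quartiles of the modified Y would shift slightly, so treat the result as indicative only. The function name `outlier_fences` is hypothetical.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which values are flagged."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of the original Y, as read from its box plot.
q1_y, q3_y = 28.75, 52.25
lower_fence, upper_fence = outlier_fences(q1_y, q3_y)

# Check the two changed values of Y against the fences.
flags = {value: not (lower_fence <= value <= upper_fence) for value in (15, 90)}

print("Fences:", lower_fence, "to", upper_fence)
print(flags)
```

Under this rule only 90 falls outside the fences, which agrees with the plot marking 90.00 as the outlier point.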
Use the same data set from the USDA's 1985 study of women’s nutrition.
a. Compute the skewness and kurtosis of the variable that you selected for
the frequency distribution in the previous activity.
b. Based on the values that you got, what is your interpretation of the
skewness and kurtosis? Is it consistent with your interpretation from
visual inspection?
c. Compute the skewness and kurtosis of the other variables and report the
results.
d. Create box and whisker plots of these five variables and determine which
one is the least dispersed variable by using all the statistical methods that
you applied to this variable.
Reading Activity:
1. Dugar, D. (2018). Skew and Kurtosis: 2 Important Statistics terms you need
to know in Data Science. URL:
https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa
2. NIST/SEMATECH e-Handbook of Statistical Methods. Accessed from:
https://doi.org/10.18434/M32189
3. Kallner, A. (2018). Formulas. Accessed from:
https://www.sciencedirect.com/topics/neuroscience/kurtosis#:~:text=A%20standard%20normal%20distribution%20has,recognized%20as%20leptokurtic%20and%20%3C3.
4. Statistics Canada. Constructing box and whisker plots.
https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch12/5214889-eng.htm#:~:text=A%20box%20and%20whisker%20plot%20is%20a%20way%20of%20summarizing,central%20value%2C%20and%20its%20variability.
5. Gomes, G. (2021). Descriptive Statistics: Expectations vs. Reality (Exploratory
Data Analysis).
https://towardsdatascience.com/descriptive-statistics-expectations-vs-reality-exploratory-data-analysis-eda-8336b1d0c60b.
Accessed: 21 January 2021.