COMP ED 20 – INTRODUCTION TO ANALYTICS

UP Open University

Module 2d – Exploratory Data Analysis

Introduction

In this module, you will learn what exploratory data analysis and descriptive
analytics are, and you will be introduced to various statistical methods. You will
gain an understanding of the measures of central tendency, measures of dispersion
or variation, frequency distributions, and measures of position.

Objectives:

At the end of this module, you should be able to:

• Understand and explain what exploratory data analysis is, what it is for,
and what its methods are;
• Apply the different descriptive statistics methods; and
• Perform exploratory data analysis using MS Excel software.

Key Concepts and Activities

1.4 Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach in data analytics that determines
the characteristics of a dataset by summarizing and describing it.

There are two known methods of performing exploratory data analysis: numerical
methods and graphical methods.

The numerical method uses descriptive statistics such as the mean, standard
deviation, and variance to summarize the data, along with other quantitative
analyses such as regression analysis, principal component analysis (PCA), and
cross tabulation.

The graphical method is used to visualize the data distribution, detect the
presence of extremely high or low values (outliers), test the assumptions made
about the data, identify important variables, and detect relationships between
variables. Some of the graphical methods include bar plots, boxplots, pie charts,
histograms, and scatterplots.

Exploratory data analysis is needed to understand the dataset under
investigation and to increase our confidence in the correctness of any further
analysis that will be done on the dataset.

1.5. What is Descriptive Statistics?

Statistics is the science of collecting, organizing, analyzing, and interpreting
large amounts of data. It can be divided into two general types: descriptive
statistics and inferential statistics.
Calag, VB (2021). Introduction to Analytics. Los Baños: University of the Philippines Open University

Descriptive statistics or univariate analysis is used to provide an overall view of
the dataset, determine the point around which the data converge (e.g. mean,
median, mode), and find out whether there are data anomalies and outliers.
Descriptive analysis is necessary before conducting further advanced analysis.

Inferential statistics is used to infer or draw conclusions about the population
based on the analysis done on the sample data. The samples are selected
from the population, and the selection can be done via simple random sampling,
purposive sampling, stratified sampling, cluster sampling and other sampling
techniques.

The population refers to the whole or entire data set of the subject being
studied. For example, the population of enrolled students in the Open
University refers to all enrolled students.

The sample data are the data taken from the population and used as
representative of the population data set. For example, the students enrolled in
the BS Education and Diploma in Computer Science programs are a sample of the
UPOU students.

Descriptive analytics uses descriptive statistics to summarize and understand
the characteristics of the data set and answer the question “what happened?”.

Descriptive statistics can be grouped into the following:

1. Measures of Central Tendency

2. Measures of Dispersion or Variation
3. Measures of Frequency
4. Measures of Skewness and Kurtosis
5. Measures of Position

1.5.1. Measures of Central Tendency

A measure of central tendency determines the value around which the data set
converges. This value is the central location of the distribution. The three
common measures of central tendency are the mean, median and mode.

• The mean is the average value of a data set. It is obtained by dividing the
sum of the elements in the data set by the total number of observations or
elements in the data set.

Process:
1. Compute the total of all elements.
Sum = x1 + x2 + … + xn = $\sum_{i=1}^{n} x_i$
Where: x1 – refers to the first element/value in the data set
x2 - refers to the 2nd element/value in the data set
xn - refers to the last element/value in the data set
n - the total number of elements in the data set
2. Compute the mean value, mean = Sum/n.

Course code: COMP ED 20 Page | 2


• The mode is the most common element, that is, the element with the
highest count in the data set.

Process:
1. Rearrange the elements in ascending order (this is not strictly
necessary, but it makes the most frequently occurring value easier to
find).
2. Count the occurrences of each value in the data set.
3. The value with the highest count is the mode of the data set.
4. The data set may have one, two or more mode values.

• The median is the middle value or element of the data set. This can
be obtained by sorting or arranging the elements or values either in
descending or ascending order. Finding the median value depends on
the total number of elements (n) in the set.

Process (if n is odd):

1. Sort the elements of the data set in ascending or descending order.

2. Determine the index of the middle element, index = (n+1)/2
3. The median is the element/value at position/location index.

Process (if n is even):

1. Sort the elements of the data set in ascending or descending order.

2. Determine the indices of the two middle values.
a. Index 1 = n/2
b. Index 2 = n/2+1
3. The median = (Data[index 1] + Data[index 2]) /2

Example 1: Determine the central values (mean, median, mode) of the given
data set.

Age of students = {19, 18, 25, 35, 21, 43, 20, 30, 30, 30}
Number of elements = 10
Age in sorted order = {18, 19, 20, 21, 25, 30, 30, 30, 35, 43}

Mean:
a) Get the sum of all elements, Sum = 19 + 18 + 25 + … + 30 = 271
b) Mean = Sum/n = 271/10 = 27.10

Median:
a) Sort the elements in ascending or descending order.
Age in sorted order = {18, 19, 20, 21, 25, 30, 30, 30, 35, 43}
b) The total number of elements = 10
c) Since 10 is even, the two middle elements are in the 5th and 6th
positions in the list = {25, 30}
d) Median = (25+30)/2 = 27.5

Mode:
a) Sort the elements in ascending or descending order.
Age in sorted order = {18, 19, 20, 21, 25, 30, 30, 30, 35, 43}
b) Count the number of times that each value occurs in the list

Age 18 19 20 21 25 30 35 43
Count 1 1 1 1 1 3 1 1

c) Mode = 30 (it occurs thrice)
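
The computations in Example 1 can also be checked outside MS Excel. The following is a minimal sketch in Python (not part of the original module, which uses MS Excel) that relies only on the standard library's statistics module; the variable names are illustrative.

```python
import statistics

# Ages of the students from Example 1
ages = [19, 18, 25, 35, 21, 43, 20, 30, 30, 30]

total = sum(ages)                      # sum of all elements
mean_age = statistics.mean(ages)       # sum divided by the number of elements
median_age = statistics.median(ages)   # average of the two middle values when n is even
modes = statistics.multimode(ages)     # list of every value that has the highest count

print("Sum   :", total)
print("Mean  :", mean_age)
print("Median:", median_age)
print("Mode  :", modes)
```

multimode() is used instead of mode() because, as noted in the process above, a data set may have more than one mode.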

Example 2: In this example, we will be dealing with nominal data such as
gender, which has the following values: M - Male, F - Female

Gender of students = {M, M, M, F, M, F, F, M, M, F}

Mean: Not applicable. Cannot be used for qualitative data.

Median: Not applicable. Cannot be used for qualitative data.
Mode: Count the frequency of each value

Gender M F
Count 6 4

Mode is M.

Example 3: Assume that we have the following values to represent the
opinion of customers on a certain product (1 - very poor, 2 - poor, 3 - average,
4 - above average, 5 - outstanding):

Mean = 3.4
Median = 4
Mode = 4

When to use Mean, Median and Mode?

Mean, median and mode can all be used for quantitative data, but only the mode
can be used for qualitative nominal data.

Data Type                    Mean   Median   Mode
Qualitative - Nominal        NA     NA       Yes
Qualitative - Ordinal        Yes    Yes      Yes
Quantitative - Discrete      Yes    Yes      Yes
Quantitative - Continuous    Yes    Yes      Yes


Strengths and Weaknesses

Mean (data type: quantitative)
Strengths: Includes all data in the computation.
Weaknesses: Sensitive to outlier values. Cannot be used for qualitative
nominal data (e.g. color).

Median (data type: quantitative)
Strengths: Best for data that are skewed or when outlier values are present.
Its value is not affected by the presence of outlier values.
Weaknesses: Does not include all data in the computation.

Mode (data type: quantitative and qualitative)
Strengths: Best for qualitative data. When used for quantitative data, its
value is not affected by outlier values.
Weaknesses: If applied to quantitative data, it does not include all data in
the computation.

An outlier is a data value that lies far from the rest of the data set. It can be at
the lower end or at the upper end of the data set.

Consider data sets 1 and 2 below and their computed mean, median and mode.
The result shows that the median and mode are not affected by the outlier
value, but the mean changes from 26.40 to 28.10. This is because the mean
includes all of the elements in the data set.

Data Set 1: 18, 19, 20, 21, 25, 30, 30, 30, 35, 36
Data Set 2: 18, 19, 20, 21, 25, 30, 30, 30, 35, 53 (53 is an outlier)

             Mean    Median   Mode
Data set 1   26.40   27.50    30.00
Data set 2   28.10   27.50    30.00
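
The effect of an outlier can be explored with a short Python sketch (an illustration that is not part of the original module). It computes the three measures for both data sets exactly as they are listed above, so that the change in each measure can be compared; the helper name summarize is hypothetical.

```python
import statistics

def summarize(data):
    """Return (mean, median, modes) of a list of numbers."""
    return (statistics.mean(data),
            statistics.median(data),
            statistics.multimode(data))

data_set_1 = [18, 19, 20, 21, 25, 30, 30, 30, 35, 36]
data_set_2 = [18, 19, 20, 21, 25, 30, 30, 30, 35, 53]  # 53 is the outlier

summary_1 = summarize(data_set_1)
summary_2 = summarize(data_set_2)

for name, (mean, median, modes) in (("Data set 1", summary_1),
                                    ("Data set 2", summary_2)):
    print(f"{name}: mean={mean:.2f} median={median:.2f} mode={modes}")

# Differences between the two data sets for each measure
print("Change in mean  :", round(summary_2[0] - summary_1[0], 2))
print("Change in median:", summary_2[1] - summary_1[1])
```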

Getting the Mean, Median and Mode values in MS Excel

To determine the mean value of a data set using MS Excel, click the cell
where you want the mean value to appear and then type =AVERAGE(SRC:ERC),
where SRC is the starting row-column cell and ERC is the ending row-column
cell of the data range.

To determine the median and mode, use the same approach but type
=MEDIAN(SRC:ERC) and =MODE(SRC:ERC), respectively. The figures below
illustrate how to perform these operations in MS Excel.


1.5.2. Measures of Dispersion

The measures of dispersion provide information on how far from or close to
the average or mean value the data values are. The common measures of
dispersion include the following:

a) Range. Range is the simplest method and easy to calculate. It uses only two
values or elements in the data set, which are the maximum and minimum
values.

Process:
a. Determine the largest value (max) and smallest value (min)
b. Compute: Range = max - min

b) Standard Deviation. Unlike the range, which uses only two values in the data
set, the standard deviation uses all the elements in the data set. It measures
the variability of the elements in the data set with respect to the mean value.
The standard deviation can also be used to estimate what percentage of the data
elements fall within one, two and three standard deviations of the mean.

Image Source: Free image from pixabay


In normally distributed data, approximately 68% of the elements in the data
set are within the mean ± 1 standard deviation (shaded portion), 95% are within
the mean ± 2 standard deviations, and 99.7% are within the mean ± 3 standard
deviations.
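
These percentages come from the normal distribution itself and can be reproduced numerically. The sketch below (an illustration added here, using Python's standard statistics.NormalDist class) computes the share of a standard normal distribution that lies within 1, 2 and 3 standard deviations of the mean; the function name share_within is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within mean ± k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within mean ± {k} SD: {share_within(k):.1%}")
```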

Process:
1. Compute the mean value: $\text{mean} = \frac{1}{n}\sum_{i=1}^{n} x_i$,
• where n = total number of elements
2. Compute: Sum of Squares (SS) $= \sum_{i=1}^{n} (x_i - \text{mean})^2$
• $x_i$ is the individual element in the data set
3. Compute:
• Sample data: Standard Deviation (SD) $= \sqrt{SS/(n-1)}$
• Population data: Standard Deviation (SD) $= \sqrt{SS/n}$

c) Variance. It is the square of the standard deviation. It measures the
variability of the data and provides a single value that tells us the average of
the squared differences between the individual elements and the mean value.

Process:
1. Compute the standard deviation
2. Compute Variance = (Standard Deviation)^2

d) Coefficient of Variation (CV). It is a measure of variability that is used
when we compare the variability of two or more data sets. It is the ratio of the
standard deviation to the mean, times 100%. If the value of the CV is high,
this means that the data has more variation with respect to the mean.

Process: Compute CV = (SD/mean) × 100%
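
The processes for the range, standard deviation, variance and coefficient of variation can be sketched in Python as follows. This is an illustrative sketch rather than part of the original module; the function name dispersion and the list of values are assumptions made for the example.

```python
import math

def dispersion(data, sample=True):
    """Return the range, SD, variance and CV following the processes above."""
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)   # Sum of Squares (SS)
    divisor = n - 1 if sample else n          # n - 1 for a sample, n for a population
    sd = math.sqrt(ss / divisor)
    return {
        "range": max(data) - min(data),
        "sd": sd,
        "variance": sd ** 2,                  # square of the standard deviation
        "cv": sd / mean * 100,                # expressed in percent
    }

# Hypothetical values used only to exercise the function
values = [2, 4, 4, 4, 5, 5, 7, 9]
population_stats = dispersion(values, sample=False)
sample_stats = dispersion(values, sample=True)

print("Population:", population_stats)
print("Sample    :", sample_stats)
```

The sample standard deviation is always slightly larger than the population standard deviation for the same values because it divides SS by n - 1 instead of n.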

To help you understand the above concepts, let us consider the dataset below
with six variables or attributes (Obs, X, Y, Z, Gender, Opinion).

Variable Obs or Observation is just a sequential number that represents the
record number, while X, Y and Z were randomly generated with values that
range from 30 to 50, 20 to 60, and 0 to 10, respectively.

Using MS Excel, we can determine the mean, range, standard deviation, variance
and coefficient of variation of X, Y and Z. In the cells where these values will
be displayed, enter the following formulas:

• Mean value, use =AVERAGE(SC:EC)
• Range value, use =MAX(SC:EC)-MIN(SC:EC)
• Standard Deviation, use =STDEV(SC:EC)
• Variance, use =VAR(SC:EC)
• Coefficient of Variation, use =STDEV(SC:EC)/AVERAGE(SC:EC)*100

Where: SC – Starting cell, EC – Ending Cell


Obs X Y Z Gender Opinion

1 33 32 6.386 M 1
2 46 28 2.364 M 1
3 37 54 8.119 F 1
4 39 41 5.122 F 2
5 45 60 7.68 F 3
6 49 30 2.622 F 4
7 48 51 4.036 M 5
8 32 24 0.999 F 3
9 40 44 3.193 M 2
10 35 23 5.909 F 5
11 42 34 4.568 M 2
12 50 30 6.237 F 4
13 46 31 7.216 F 5
14 44 29 1.947 F 2
15 30 52 8.267 M 1
16 42 41 8.614 M 3
17 34 46 1.662 M 1
18 42 45 6.44 F 3
19 36 53 2.941 M 4
20 44 20 3.755 F 5
21 30 46 7.294 F 3
22 50 28 9.914 F 4
23 31 29 6.561 M 2
24 39 58 0.748 F 1
25 44 30 7.471 M 2
26 46 22 6.368 F 3
27 49 56 8.502 M 1
28 49 57 5.203 F 3
29 45 27 4.406 M 4
30 32 60 4.071 F 3

This sample data set can be downloaded from https://bit.ly/3wpnmnl.

With reference to the result of the computation, X has a range of 20, a mean
of 40.97, a standard deviation of 6.55, a variance of 42.86, and a coefficient of
variation of 15.98%.

If X were normally distributed, the standard deviation would tell us that about
68% of the data fall within the range from (40.97 – 6.55) to (40.97 + 6.55), or
from 34.42 to 47.52.

The Coefficient of Variation of X means that the standard deviation is 15.98%
of the mean value of 40.97.


When Y is compared with X, Y’s dispersion measures are higher because its
elements are more spread out: its values range from 20 to 60, while those of X
range from 30 to 50. We can use the range, standard deviation and variance to
compare the variabilities of the two variables because the values of X fall
within the range of values of Y.

For a data set such as Z, whose values do not overlap with those of the other
variables, the variability can be compared to the others using the Coefficient of
Variation (CV).

As shown by the descriptive statistics, although Z has the smallest range,
standard deviation and variance compared to X and Y, it has the most dispersed
or diverse data based on its CV value.
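
The descriptive statistics of X, Y and Z can be reproduced with a short Python sketch, shown here as an illustration that is not part of the original module. It uses the sample data set listed earlier and the standard library's statistics module, whose stdev and variance functions use the sample (n - 1) formulas; the helper name describe is hypothetical.

```python
import statistics

X = [33, 46, 37, 39, 45, 49, 48, 32, 40, 35, 42, 50, 46, 44, 30,
     42, 34, 42, 36, 44, 30, 50, 31, 39, 44, 46, 49, 49, 45, 32]
Y = [32, 28, 54, 41, 60, 30, 51, 24, 44, 23, 34, 30, 31, 29, 52,
     41, 46, 45, 53, 20, 46, 28, 29, 58, 30, 22, 56, 57, 27, 60]
Z = [6.386, 2.364, 8.119, 5.122, 7.68, 2.622, 4.036, 0.999, 3.193, 5.909,
     4.568, 6.237, 7.216, 1.947, 8.267, 8.614, 1.662, 6.44, 2.941, 3.755,
     7.294, 9.914, 6.561, 0.748, 7.471, 6.368, 8.502, 5.203, 4.406, 4.071]

def describe(data):
    """Range, mean, sample SD, sample variance and CV (in percent)."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return {
        "range": max(data) - min(data),
        "mean": mean,
        "sd": sd,
        "variance": statistics.variance(data),
        "cv": sd / mean * 100,
    }

stats = {name: describe(data) for name, data in (("X", X), ("Y", Y), ("Z", Z))}

for name, values in stats.items():
    print(name, {key: round(value, 2) for key, value in values.items()})
```

Comparing the "sd" and "cv" entries side by side shows why the CV is the better yardstick when the variables are on different scales.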

1.5.3. Measures of Frequency

A frequency distribution is a tabular or graphical presentation showing the
number of times that each element occurs in the data set. Using the same
dataset, let us consider the variables Gender and Opinion. Opinion is ordinal
data coded with numeric values from 1 to 5.

Creating a Simple Frequency Distribution of Qualitative Data

1. Make a table and count the number of times (or frequency) that each
distinct element occurs in the list. Assuming that we will create a
frequency distribution for Gender, our frequency table will be:
Element Count
F 17
M 13
2. Using MS Excel, select the cell addresses of the data to graph
3. Click Insert and choose the appropriate Chart format. Your frequency
distribution may look as follows:
[Figure: bar chart of the frequency distribution of Gender; vertical axis:
Frequency; categories: F and M]


Creating a Simple Frequency Distribution (Discrete Data or Ordinal Data
Coded as Numbers)

a. Assume that the variable Opinion is discrete data or ordinal data coded
as numbers.
b. Following the same process above, the resulting frequency table will
look as follows:

Element Count
1 7
2 6
3 8
4 5
5 4

c. Using MS Excel, select the cell addresses of the data to graph,

d. Click Insert and choose the appropriate Chart format. Your frequency
distribution may look as follows:

[Figure: bar chart of the frequency distribution of Opinion; vertical axis:
Frequency; categories: 1, 2, 3, 4, 5]
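
The frequency tables for Gender and Opinion can be checked with a few lines of Python. The sketch below is an illustration outside the original MS Excel workflow; it applies collections.Counter from the standard library to the Gender and Opinion columns of the sample data set and prints a simple text bar for each element.

```python
from collections import Counter

# Gender and Opinion columns of the sample data set, in observation order (Obs 1 to 30)
gender = list("MMFFFFMFMF" "MFFFMMMFMF" "FFMFMFMFMF")
opinion = [1, 1, 1, 2, 3, 4, 5, 3, 2, 5,
           2, 4, 5, 2, 1, 3, 1, 3, 4, 5,
           3, 4, 2, 1, 2, 3, 1, 3, 4, 3]

gender_counts = Counter(gender)     # frequency of each distinct gender value
opinion_counts = Counter(opinion)   # frequency of each distinct opinion code

print("Element Count")
for element, count in sorted(gender_counts.items()):
    print(f"{element:<7} {count:>5} {'#' * count}")

print("Element Count")
for element, count in sorted(opinion_counts.items()):
    print(f"{element:<7} {count:>5} {'#' * count}")
```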

Creating Grouped Frequency Distribution

a. For data with many values such as X, Y, and Z, create a grouped
frequency distribution of these data.
b. Determine the range of the data set: Maximum value – minimum value.


c. Determine the number of classes or groups (5 to 20) of your data. You
may choose your desired number of classes or use the 2^k ≥ n rule, which
can be applied manually or mathematically. In this rule,
k = number of classes and
n = total number of elements.

Manual method of finding k by trial and error (here n = 30):

1. Assign k some value and compute the value of 2^k.
2. Compare the result with the value of n.
3. If 2^k ≥ n, then use the value of k for the number of classes.
k   2^k   Is 2^k ≥ n?
3   8     No
4   16    No
5   32    Yes

Mathematical method of determining the value of k:

2^k = n
k log (2) = log (n)
k = log (n)/log (2)
k = log (30)/log (2) = 1.477/0.301 = 4.91

Since we can’t have 4.91 classes, we have to round this up to the next higher
integer, which is 5.
d. Determine the class width = range/k. Using the above data set, the class
width of X = 4, Y = 8 and Z = 1.83 (rounded to 2).
e. Start the first class with a value that is less than or equal to the minimum
value in the data set. For X, the first class can start at 30; the lower limit
of the second class is then 30 + class width = 34. For X, if the class width
is 4, the highest value (50) in the data will not be counted in the
frequency table.
Class   Interval Range
1       30 - 33
2       34 - 37
3       38 - 41
4       42 - 45
5       46 - 49
We can address this problem with the following options:


Option 1: Adjust the class width to 5 to include 50 and retain the lower
limit of class 1 at 30, or adjust both the class width to 5 and the lower
limit to 28, as shown below.
Class   Interval Range (start at 30)   Interval Range (start at 28)
1       30 - 34                        28 - 32
2       35 - 39                        33 - 37
3       40 - 44                        38 - 42
4       45 - 49                        43 - 47
5       50 - 54                        48 - 52

Option 2: Maintain the class width at 4 and increase the number of classes
to 6. In the second table below, the lower limit of class 1 is adjusted
to 29.
Class   Interval Range (start at 30)   Interval Range (start at 29)
1       30 - 33                        29 - 32
2       34 - 37                        33 - 36
3       38 - 41                        37 - 40
4       42 - 45                        41 - 44
5       46 - 49                        45 - 48
6       50 - 53                        49 - 52

f. Then, find the frequency for each group using the MS Excel COUNTIFS()
function. Below is the frequency distribution table of X.

Class   Interval Range   Frequency
1       30 - 34          7
2       35 - 39          5
3       40 - 44          7
4       45 - 49          9
5       50 - 54          2

The data in the Frequency column can be generated using the following
formulas in MS Excel:

Class   Interval Range   Frequency
1       30 - 34          =COUNTIFS(B$2:B$31,"<=34.5")
2       35 - 39          =COUNTIFS(B$2:B$31,">34.5",B$2:B$31,"<=39.5")
3       40 - 44          =COUNTIFS(B$2:B$31,">39.5",B$2:B$31,"<=44.5")
4       45 - 49          =COUNTIFS(B$2:B$31,">44.5",B$2:B$31,"<=49.5")
5       50 - 54          =COUNTIFS(B$2:B$31,">49.5")
Note: B$2:B$31 is the cell range of the data.
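
The steps for a grouped frequency distribution can be sketched in Python as follows. This is an illustrative sketch, not the module's MS Excel procedure: it finds k with the 2^k ≥ n rule, computes the class width from the range, and then counts the elements of X per class using the adjusted class width of 5 starting at 30 (Option 1). The variable names are assumptions made for the example.

```python
import math

X = [33, 46, 37, 39, 45, 49, 48, 32, 40, 35, 42, 50, 46, 44, 30,
     42, 34, 42, 36, 44, 30, 50, 31, 39, 44, 46, 49, 49, 45, 32]

n = len(X)
k = math.ceil(math.log2(n))       # smallest integer k such that 2**k >= n
data_range = max(X) - min(X)
class_width = data_range / k      # range divided by the number of classes

# Option 1: widen the classes to 5 so that the maximum value (50) is included
adjusted_width = 5
lower_limits = [min(X) + i * adjusted_width for i in range(k)]

frequency_table = []
for lower in lower_limits:
    upper = lower + adjusted_width - 1
    count = sum(1 for x in X if lower <= x <= upper)
    frequency_table.append((lower, upper, count))

print("k =", k, "| class width =", class_width)
for class_number, (lower, upper, count) in enumerate(frequency_table, start=1):
    print(f"{class_number}  {lower} - {upper}  {count}")
```

Because X contains only whole numbers, counting with inclusive integer limits gives the same result as the half-unit boundaries (34.5, 39.5, ...) used in the COUNTIFS formulas.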

Activity: Descriptive Statistics

In this activity you will perform a descriptive analysis on a dataset involving five
variables with 737 records. You can use MS Excel to perform this activity. You
can download the file from: https://bit.ly/3sSZRRr

This data set is from the USDA's commissioned study of women’s nutrition in
1985. Nutrient intake was measured for a random sample of 737 women aged
25-50 years. The following variables were measured:

• Calcium(mg)
• Iron(mg)
• Protein(g)
• Vitamin A(μg)
• Vitamin C(mg)

1. Using MS Excel, determine the characteristics of each of the variables by
determining the range, mean, standard deviation, variance and coefficient of
variation.

Variable    Median   Mean   Standard Deviation   Variance   CV
Calcium
Iron
Protein
Vitamin A
Vitamin C

2. Based on the result of your descriptive analysis, which attribute is more
dispersed? Support your answer.
3. Select one variable, and create a grouped frequency distribution and a
graph/chart of this variable.
4. After creating the chart, evaluate by visual inspection the skewness of your
graph or chart: is it normal, positively skewed or negatively skewed?
5. Again, by visual inspection, what do you think is the kurtosis of your graph:
is it mesokurtic, leptokurtic or platykurtic?


1.5.4 Measures of Skewness and Kurtosis

Skewness can be defined as the measure of how symmetric or asymmetric the
data’s distribution is with respect to the mean. With skewness, we can determine
if the data is normally distributed, positively skewed or negatively skewed.

The skewness of a distribution can be summarized by the following
characteristics:

Skewness        Shape of the Distribution                    Value                Value of Mean
Normal          Symmetric with respect to the mean           -0.5 < Skew < 0.5    mean = mode = median
Positive Skew   Longer tail at the right side of the mean    Skew ≥ 0.50          mode < median < mean
Negative Skew   Longer tail at the left side of the mean     Skew ≤ -0.50         mode > median > mean

Source: Dugar, D. (2018).

Kurtosis measures the height or flatness of the curve. The kurtosis of a
distribution can be summarized as follows:

Mesokurtic
Shape of the distribution: Normal or medium peak. The distribution is symmetric.
Value of the kurtosis (k): k = 3
MS Excel function: KURT(array) = 0

Leptokurtic (lepto = thin)
Shape of the distribution: Higher/taller peak. The data is highly concentrated
at the center. The tails are heavier, so outliers are more likely.
Value of the kurtosis (k): k > 3
MS Excel function: KURT(array) > 0

Platykurtic (platy = broad)
Shape of the distribution: Lower peak. The data is only slightly more
concentrated at the center and is spread evenly over the rest of the distribution.
Value of the kurtosis (k): k < 3
MS Excel function: KURT(array) < 0


With MS Excel, we can determine the skewness and kurtosis of the data without
necessarily graphing their distribution. This can be done using the
SKEW.P(array) and KURT(array) functions, respectively. The KURT(array)
function returns the excess kurtosis, which is the kurtosis value minus 3.
Therefore, if:
KURT(array) = 0, the distribution is mesokurtic
KURT(array) < 0, the distribution is platykurtic
KURT(array) > 0, the distribution is leptokurtic
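
For readers who want to see what these two functions compute, the sketch below implements the population skewness and the sample excess kurtosis in Python. It assumes the formulas that Microsoft documents for SKEW.P and KURT; the function names skew_p, kurt and describe_kurtosis are illustrative, and the list of values is hypothetical.

```python
import math

def skew_p(data):
    """Population skewness: the average cubed z-score, using the population SD."""
    n = len(data)
    mean = sum(data) / n
    sd_population = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sum(((x - mean) / sd_population) ** 3 for x in data) / n

def kurt(data):
    """Sample excess kurtosis (needs at least 4 values), using the sample SD."""
    n = len(data)
    mean = sum(data) / n
    sd_sample = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    fourth_powers = sum(((x - mean) / sd_sample) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth_powers - correction

def describe_kurtosis(value, tolerance=1e-9):
    """Map an excess kurtosis value to its description."""
    if abs(value) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if value > 0 else "platykurtic"

values = [1, 2, 3, 4, 5]   # hypothetical, evenly spread and symmetric data
print("Skewness:", skew_p(values))
print("Excess kurtosis:", kurt(values), "->", describe_kurtosis(kurt(values)))
```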

1.5.5 Measures of Position

The measures of position determine the position of a specific value within the
data set. This gives us an idea of where the value falls in the distribution
– whether it is close to the mean value or at the extreme lower end or
higher end of the data set.

Box and Whiskers Plot or Box plot. This provides a visualization of the
spread and center of a data set. It shows the five-number summary (the
minimum, 1st quartile, median, 3rd quartile and maximum value) together with
the mean value of the data set.

In MS Excel, this can be done by selecting the cells of the data set, clicking
Insert from the main menu, selecting the Histogram icon and then selecting the
Box and Whisker button.

With reference to X and Y in the previous data set, their Box and Whisker plots
are shown below:

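[Figure: Box and whisker plots of X and Y. X: minimum 30.00, 1st quartile 34.75,
mean 40.97, median 42.00, 3rd quartile 46.00, maximum 50.00. Y: minimum 20.00,
1st quartile 28.75, median 37.50, mean 39.37, 3rd quartile 52.25, maximum 60.00.]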


The box at the center represents the middle portion of the data set. Comparing
the boxes gives us an idea of which data set is more dispersed.

The graphical presentation and the values of the mean and median can
give us an idea of the skewness of the distribution of the data sets. For
example, since the median (42) of X is higher than the mean (40.97),
the distribution is expected to be skewed to the left.

With reference to the second boxplot in the graph, the minimum and maximum
values are 20 and 60, respectively. The second value (28.75) is the 1st
quartile, while the middle line with a value of 37.5 represents the median. The
value at the center (39.37) is the mean, while 52.25 is the 3rd quartile. The
difference between the 3rd quartile (52.25) and the 1st quartile (28.75), which
is 23.5, is the interquartile range.
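
The values read from the box plot of Y can be reproduced in Python. The sketch below is an illustration that assumes the exclusive quartile method (the default of statistics.quantiles in the standard library), which yields the quartile values quoted above for Y; the variable names are illustrative.

```python
import statistics

Y = [32, 28, 54, 41, 60, 30, 51, 24, 44, 23, 34, 30, 31, 29, 52,
     41, 46, 45, 53, 20, 46, 28, 29, 58, 30, 22, 56, 57, 27, 60]

# Three cut points that divide the sorted data into four equal parts
q1, median, q3 = statistics.quantiles(Y, n=4, method="exclusive")

five_number_summary = (min(Y), q1, median, q3, max(Y))
iqr = q3 - q1   # interquartile range = 3rd quartile - 1st quartile

print("Minimum, Q1, median, Q3, maximum:", five_number_summary)
print("Mean:", round(statistics.mean(Y), 2))
print("Interquartile range:", iqr)
```

The interquartile range is the height of the box in the plot, which is why a taller box indicates a more dispersed middle half of the data.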

Outliers and their Effects.

Outlier values are values that appear at the extreme lower end or extreme
upper end of the data set. These outlier values can be due to clerical errors,
incorrect data entry, and other factors.

Assuming that one element of X is changed to 15 and one element of Y is changed
to 90, the box and whisker plots show the outlier values as follows:

[Figure: Box Plot of X and Y, showing outlier points at 15.00 and 90.00; the
plotted mean values are 40.40 and 40.37.]

The inclusion of outliers in the data set affects the mean value. An outlier
value can be corrected by validating it against the raw data, substituted with
a new value, or dropped from the data set.
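
One common way to flag outliers, stated here as an assumption because the module does not give the rule behind the plot, is Tukey's rule: values more than 1.5 × IQR below the 1st quartile or above the 3rd quartile are treated as outliers. The Python sketch below applies this rule to a hypothetical copy of Y in which one value of 60 is replaced by 90; the function name find_outliers is illustrative.

```python
import statistics

def find_outliers(data, factor=1.5):
    """Return the values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    lower_fence = q1 - factor * iqr
    upper_fence = q3 + factor * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

Y = [32, 28, 54, 41, 60, 30, 51, 24, 44, 23, 34, 30, 31, 29, 52,
     41, 46, 45, 53, 20, 46, 28, 29, 58, 30, 22, 56, 57, 27, 60]

# Hypothetical data-entry error: the first 60 in Y becomes 90
Y_modified = list(Y)
Y_modified[Y_modified.index(60)] = 90

print("Outliers in Y         :", find_outliers(Y))
print("Outliers in modified Y:", find_outliers(Y_modified))
print("Mean before:", round(statistics.mean(Y), 2))
print("Mean after :", round(statistics.mean(Y_modified), 2))
```

Dropping or correcting the flagged value and recomputing the mean is one way to see how much a single outlier shifts it.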


Activity: Skewness, Kurtosis and Box and Whisker Plot

Use the same data set from the USDA's 1985 study of women’s nutrition.

a. Compute the skewness and kurtosis of the variable that you selected for the
frequency distribution in the previous activity.
b. Based on the values that you got, what is your interpretation of the
skewness and kurtosis? Is it consistent with your interpretation done via
visual inspection?
c. Compute the skewness and kurtosis of the other variables and provide the
results in the table below.

Variable    Skewness value   Skewness description*   Kurtosis value   Kurtosis description**
Calcium
Iron
Protein
Vitamin A
Vitamin C
Note: * - normal, positively skewed, negatively skewed
** - mesokurtic, leptokurtic, platykurtic

d. Create box and whisker plots of these five variables and determine which
one is the least dispersed variable by using all the statistical methods that
you applied to these variables.

Reading Activity:

1. Dugar, D. (2018). Skew and Kurtosis: 2 Important Statistics terms you need
to know in Data Science. URL:
https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa
2. NIST/SEMATECH e-Handbook of Statistical Methods. Accessed from:
https://doi.org/10.18434/M32189
3. Kallner, A. (2018). Formulas. Accessed from:
https://www.sciencedirect.com/topics/neuroscience/kurtosis#:~:text=A%20standard%20normal%20distribution%20has,recognized%20as%20leptokurtic%20and%20%3C3.
4. Statistics Canada. Constructing box and whisker plots.
https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch12/5214889-eng.htm#:~:text=A%20box%20and%20whisker%20plot%20is%20a%20way%20of%20summarizing,central%20value%2C%20and%20its%20variability.
5. Gomes, G. (2021). Descriptive Statistics: Expectations vs. Reality (Exploratory
Data Analysis).
https://towardsdatascience.com/descriptive-statistics-expectations-vs-reality-exploratory-data-analysis-eda-8336b1d0c60b
Accessed: 21 January 2021.
6. National Institute of Standards and Technology. What is Exploratory Data Analysis.
https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm

References:

1. https://www.statisticshowto.com/measures-of-position/#:~:text=Measures%20of%20position%20give%20us,falls%20on%20some%20numerical%20scale.
2. Gordon, S. (2006). The Normal Distribution. Accessed from:
https://www.sydney.edu.au/content/dam/students/documents/mathematics-learning-centre/normal-distribution.pdf
3. Normal Distributions, Standard Deviations, Modality, Skewness and Kurtosis:
Understanding concepts. https://www.youtube.com/watch?v=HnMGKsupF8Q
4. Chen, J. (2021). Skewness.
https://www.investopedia.com/terms/s/skewness.asp
5. Meyer, P. (2015). Exploratory Data Analysis.
https://www.youtube.com/watch?v=zHcQPKP6NpM
6. Lecture 2 - Descriptive Statistics & Exploratory Data Analysis Flashcards Preview.
https://www.brainscape.com/flashcards/lecture-2-descriptive-statistics-amp-expl-6422027/packs/10091201

61 pages
E Book - Unit 4
No ratings yet
E Book - Unit 4
12 pages
Exercise 5 - MMW Statistics - For Asynch
No ratings yet
Exercise 5 - MMW Statistics - For Asynch
18 pages
Statistics and Its Types(v1.0)
No ratings yet
Statistics and Its Types(v1.0)
6 pages
Notes On Data Processing, Analysis, Presentation
No ratings yet
Notes On Data Processing, Analysis, Presentation
63 pages
4 2 Measure of Central Tendency
No ratings yet
4 2 Measure of Central Tendency
11 pages
Lecture 3-4Descriptive Statistics Measures of Central Tendency (1)
No ratings yet
Lecture 3-4Descriptive Statistics Measures of Central Tendency (1)
32 pages
Statistics 1
No ratings yet
Statistics 1
16 pages
Data Analysis: Mean, Median, Mode
No ratings yet
Data Analysis: Mean, Median, Mode
54 pages