• I dream of a digital India where quality education reaches the most
inaccessible corners driven by digital learning.
– Shri Narendra Modi
Unit 4
Univariate Analysis
Unit IV: Univariate Data Analysis
• Description and summary of data set.
• measure of central tendency – mean: Arithmetic, geometric and
harmonic mean – Raw and grouped data
• confidence limit of mean
• median for raw (odd and even numbers) and grouped data,
• mode for raw and grouped data,
• quartile and percentile,
• interpretation of quartile and percentile values.
• Introduction to spreadness of data,
• measure of dispersion,
• concepts on error, range, variance, standard deviation,
• confidence limit of variance and standard deviation,
• coefficient of variation,
• mean absolute deviation, mean deviation, quartile deviation,
interquartile range,
• concepts on symmetry of data, skewness and kurtosis
What is Univariate Analysis?
• Univariate analysis is the simplest form of analyzing data.
• “Uni” means “one”, so in other words your data has only one variable.
• It doesn’t deal with causes or relationships, and its main purpose is to describe: it
takes data, summarizes that data and finds patterns in the data.
What is a variable in Univariate Analysis?
• A variable in univariate analysis is just a condition or subset that your data falls
into.
• You can think of it as a “category.”
• For example, the analysis might look at a variable of “age” or it might look at
“height” or “weight”.
• However, it doesn’t look at more than one variable at a time otherwise it
becomes bivariate analysis (or in the case of 3 or more variables it would be
called multivariate analysis).
Median for raw (odd and even numbers) and grouped data:
• The median is the middle value.
• It is the value that splits the dataset in half.
• To find the median, order your data from smallest to largest, and then find
the data point that has an equal amount of values above it and below it.
• The method for locating the median varies slightly depending on whether
your dataset has an even or odd number of values.
Ungrouped or Raw data:
• Arrange the given values in ascending order. If the number of values
is odd, the median is the middle value.
• If the number of values is even, the median is the mean of the two middle values.
Arithmetic Mean:
• The arithmetic mean is calculated as the sum of the values divided by
the total number of values, referred to as N.
• Arithmetic Mean = (x1 + x2 + … + xN) / N
• A more convenient way to calculate the arithmetic mean is to
calculate the sum of the values and to multiply it by the reciprocal of
the number of values (1 over N); for example:
• Arithmetic Mean = (1/N) * (x1 + x2 + … + xN)
• The arithmetic mean is appropriate when all values in the data
sample have the same units of measure, e.g. all numbers are heights,
or dollars, or miles, etc.
• When calculating the arithmetic mean, the values can be positive,
negative, or zero.
• The arithmetic mean can be easily distorted if the sample of observations contains
outliers (a few values far away in feature space from all other values), or for data
that has a non-Gaussian distribution (e.g. multiple peaks, a so-called multi-modal
probability distribution).
• The arithmetic mean is useful in machine learning when summarizing a variable,
e.g. reporting the most likely value. This is more meaningful when a variable has a
Gaussian or Gaussian-like data distribution.
• The arithmetic mean can be calculated using the mean() NumPy function.
• The example below demonstrates how to calculate the arithmetic mean for a
list of 10 numbers.
from numpy import mean
# define the dataset
data = [1, 2, 3, 4, 5, 6, 7, 8, 9,10]
# calculate the mean
result = mean(data)
print('Arithmetic Mean:',result)
Geometric Mean
• The geometric mean is calculated as the N-th root of the product of all values,
where N is the number of values.
• Geometric Mean = N-root(x1 * x2 * … * xN)
• For example, if the data contains only two values, the square root of the product
of the two values is the geometric mean. For three values, the cube-root is used,
and so on.
• The geometric mean is appropriate when the data contains values with different
units of measure, e.g. some measure are height, some are dollars, some are miles,
etc.
• The geometric mean does not accept negative or zero values, i.e. all values must
be positive.
• One common example of the geometric mean in machine learning is in the
calculation of the G-Mean (geometric mean) metric that is a model evaluation
metric that is calculated as the geometric mean of the sensitivity and specificity
metrics.
• The geometric mean can be calculated using the gmean() SciPy function.
• The example below demonstrates how to calculate the geometric
mean for a list of 10 numbers.
from scipy.stats import gmean
# define the dataset
data = [1, 2, 3, 40, 50, 60, 0.7, 0.88, 0.9, 1000]
# calculate the mean
result = gmean(data)
print('Geometric Mean:',result)
Harmonic Mean
• The harmonic mean is calculated as the number of values N divided by the
sum of the reciprocal of the values (1 over each value).
• Harmonic Mean = N / (1/x1 + 1/x2 + … + 1/xN)
• If there are just two values (x1 and x2), a simplified calculation of the
harmonic mean can be calculated as:
• Harmonic Mean = (2 * x1 * x2) / (x1 + x2)
• The harmonic mean is the appropriate mean if the data is comprised of
rates.
• Recall that a rate is the ratio between two quantities with different
measures, e.g. speed, acceleration, frequency, etc.
• In machine learning, we have rates when evaluating models, such as the
true positive rate or the false positive rate in predictions.
• The harmonic mean does not take rates with a negative or zero value, i.e.
all rates must be positive.
• One common example of the use of the harmonic mean in machine
learning is in the calculation of the F-Measure (also the F1-Measure
or the Fbeta-Measure); that is a model evaluation metric that is
calculated as the harmonic mean of the precision and recall metrics.
• The harmonic mean can be calculated using the hmean() SciPy
function.
• The example below demonstrates how to calculate the harmonic
mean for a list of nine numbers.
from scipy.stats import hmean
# define the dataset
data = [0.11, 0.22, 0.33, 0.44, 0.55, 0.66, 0.77, 0.88, 0.99]
# calculate the mean
result = hmean(data)
print('Harmonic Mean:',result)
• How to Choose the Correct Mean?
• The arithmetic mean is the most commonly used mean, although it
may not be appropriate in some cases.
• Each mean is appropriate for different types of data; for example:
If values have the same units: Use the arithmetic mean.
If values have differing units: Use the geometric mean.
If values are rates: Use the harmonic mean.
• The exception is data that contains negative or zero values: in that case
the geometric and harmonic means cannot be used directly.
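To see how the three means differ on the same data, the following is a minimal sketch that implements the three formulas given above using only the Python standard library. The function names and the two-value data set are illustrative assumptions, not part of the original examples.

```python
import math

def arithmetic_mean(values):
    # (x1 + x2 + ... + xN) / N
    return sum(values) / len(values)

def geometric_mean(values):
    # N-th root of the product; all values must be positive
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs positive values")
    return math.prod(values) ** (1 / len(values))

def harmonic_mean(values):
    # N / (1/x1 + 1/x2 + ... + 1/xN); all values must be positive
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs positive values")
    return len(values) / sum(1 / v for v in values)

# hypothetical data set, chosen only for illustration
data = [2, 8]
am = arithmetic_mean(data)
gm = geometric_mean(data)
hm = harmonic_mean(data)
print('Arithmetic Mean:', am)
print('Geometric Mean:', gm)
print('Harmonic Mean:', round(hm, 4))

# two-value shortcut for the harmonic mean: (2 * x1 * x2) / (x1 + x2)
shortcut = (2 * data[0] * data[1]) / (data[0] + data[1])
```

For positive values the arithmetic mean is never smaller than the geometric mean, which in turn is never smaller than the harmonic mean; the three are equal only when all values are equal.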
Univariate Descriptive Statistics
• Different ways we can describe patterns found in univariate data include
central tendency (mean, mode and median)
and dispersion: range, variance, maximum, minimum, quartiles (including
the interquartile range), and standard deviation.
You have several options for presenting univariate data visually (Visual
Representation):
• Frequency Distribution Tables.
• Bar Charts.
• Histograms.
• Pie Charts.
• When n is odd, Median = ((n+1)/2)th value
• When n is even, Median = Average of the (n/2)th and (n/2+1)th values
Example1
• If the weights of some fruits are 45, 60,48,100,65 gms, calculate the
median
• Solution
• Here n = 5
• First arrange it in ascending order
• 45, 48, 60, 65, 100
• Median = (n+1)/2 value=(5+1)/2=3rd value=60
Example 2
• If the weights of some fruits are 5,48, 60, 65, 65, 100 gms, calculate
the median.
Solution
Here n = 6
• Median=Average of(n/2) and(n/2+1) value
• (n/2)=6/2=3rd value=60 and (n/2+1)=(6/2+1)=4th value=65
• Median=(60+65)/2=62.5 g
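The odd and even rules used in the two examples above can be sketched in a few lines of Python. The function name median_raw is an illustrative assumption; the data are the fruit weights from Example 1 and Example 2.

```python
def median_raw(values):
    """Median of ungrouped data using the (n+1)/2 and n/2 rules."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # odd n: the ((n+1)/2)th value; subtract 1 for 0-based indexing
        return ordered[(n + 1) // 2 - 1]
    # even n: average of the (n/2)th and (n/2 + 1)th values
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median_raw([45, 60, 48, 100, 65]))      # 60
print(median_raw([5, 48, 60, 65, 65, 100]))   # 62.5
```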
Grouped data
• In a grouped distribution, values are associated with frequencies.
• Grouping can be in the form of a discrete frequency distribution or a
continuous frequency distribution.
• Whatever may be the type of distribution, cumulative frequencies
have to be calculated to know the total number of items.
Cumulative frequency (cf):
The cumulative frequency of each class is the sum of the frequency of the
class and the frequencies of the previous classes, i.e. the frequencies
are added successively, so that the last cumulative frequency gives
the total number of items.
Discrete Series
• A discrete frequency distribution is a table that lists each number and
the number of times (frequency) that it occurs in a list.
Step1: Find cumulative frequencies.
Step2: Find (n+1)/2, where n = Σf
Step3: Locate the first cumulative frequency that is equal to or just greater
than (n+1)/2
Step4: Then the corresponding value of x is the median.
Find out the median value of the following
discrete series.
Item Value Frequency
2 5
4 7
6 12
8 18
10 11
12 6
14 4
Item Value Frequency Cumulative Frequency
2 5 5
4 7 12
6 12 24
8 18 42
10 11 53
12 6 59
14 4 63
M = Value of (n+1)/2 th item
= value of (63+1)/2 th item = value of 32nd item
The 32nd item is included in the cumulative frequency 42. The item value corresponding to the
cumulative frequency 42 is 8.
Hence, Median = 8.
Example 3:
• The following data pertain to the number of insects per plant. Find the
median number of insects per plant.
Number of insects per plant (x)   1 2 3 4 5 6 7 8 9 10 11 12
No. of plants (f)                 1 3 5 6 10 13 9 5 3 2 2 1
Form the cumulative frequency table
x    1 2 3 4 5 6 7 8 9 10 11 12
f    1 3 5 6 10 13 9 5 3 2 2 1    (Total = 60)
cf   1 4 9 15 25 38 47 52 55 57 59 60
Median = size of the (n+1)/2 th item, where n = Σf = 60
Here the number of observations is even. Therefore median = average of (n/2)th item and
(n/2+1)th item.
= (30th item + 31st item) / 2 = (6+6)/2 = 6 (both items fall under the cumulative frequency 38, where x = 6)
Hence the median size is 6 insects per plant.
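The steps for a discrete series can be sketched as follows. This is a minimal illustration, not a library routine: the function name median_discrete and its argument layout (values in ascending order with matching frequencies) are assumptions. It is checked against the two discrete examples above.

```python
from itertools import accumulate

def median_discrete(values, freqs):
    """Median of a discrete frequency distribution (values in ascending order)."""
    cum = list(accumulate(freqs))     # Step 1: cumulative frequencies
    n = cum[-1]                       # total number of items, n = sum of f

    def item(k):
        # value of the k-th item: first x whose cumulative frequency reaches k
        for x, cf in zip(values, cum):
            if cf >= k:
                return x

    if n % 2 == 1:
        return item((n + 1) // 2)     # odd n: the (n+1)/2 th item
    # even n: average of the (n/2)th and (n/2 + 1)th items
    return (item(n // 2) + item(n // 2 + 1)) / 2

# item values 2..14 with their frequencies (n = 63)
print(median_discrete([2, 4, 6, 8, 10, 12, 14], [5, 7, 12, 18, 11, 6, 4]))   # 8

# insects per plant (n = 60)
x = list(range(1, 13))
f = [1, 3, 5, 6, 10, 13, 9, 5, 3, 2, 2, 1]
print(median_discrete(x, f))   # 6.0
```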
Continuous Series
• Continuous Frequency distribution is an arrangement of the values that one or more variables take in a
sample. Each entry in the table contains the frequency or count of the occurrences of values within a
particular group or interval, and in this way, the table summarizes the distribution of values in the sample.
• The steps given below are followed for the calculation of median in continuous series.
Step1: Find cumulative frequencies.
Step2: Find n/2
Step3: Find the first cumulative frequency that is just greater than n/2. The corresponding class interval
is called the Median class. Then apply the formula
Median = l + [(n/2 − m)/f] × c
where l = Lower limit of the median class
m = cumulative frequency preceding the median class
c = width of the class
f =frequency in the median class.
n=Total frequency.
• A survey regarding the weight of 45 pets was conducted and
following data was obtained. Find the median weight.
Weight (in kg)   No. of pets (f)
20-25 2
25-30 5
30-35 8
35-40 10
40-45 7
45-50 10
50-55 3
Σf = 45
Weight (in kg)   No. of pets (f)   Cumulative frequency (m)
20-25 2 2
25-30 5 7
30-35 8 15
35-40 10 25
40-45 7 32
45-50 10 42
50-55 3 45
n=45
n = 45, n/2 = 45/2 = 22.5
Median class = 35-40 (the class whose cumulative frequency is just greater than n/2)
l = Lower limit of the median class = 35
n = Total number of observations = 45
m = Cumulative frequency of the class preceding the median class = 15
f = Frequency of the median class = 10
c = Class length = 5
Median = l + [(n/2 − m)/f] × c
= 35 + [(22.5 − 15)/10] × 5
= 35 + 37.5/10 = 35 + 3.75
= 38.75 kg
Calculate the median for the following data.
Class Interval Frequency
0-10 7
10-20 18
20-30 34
30-40 50
40-50 35
50-60 20
60-70 6
Class Interval Frequency(f) Cumulative frequency (m)
0-10 7 7
10-20 18 25
20-30 34 59
30-40 50 109
40-50 35 144
50-60 20 164
60-70 6 170
n=170
n = 170, n/2 = 170/2 = 85
Median class = 30-40 (the class whose cumulative frequency is just greater than n/2)
l = Lower limit of the median class = 30
n = Total number of observations = 170
m = Cumulative frequency of the class preceding the median class = 59
f = Frequency of the median class = 50
c = Class length = 10
Median = l + [(n/2 − m)/f] × c
= 30 + [(85 − 59)/50] × 10
= 30 + 5.2 = 35.2
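The continuous-series procedure, Median = l + [(n/2 − m)/f] × c, can be sketched as below. The function name median_grouped and its arguments (class lower limits, a common class width, and frequencies) are illustrative assumptions, and equal class widths are assumed. Both worked examples above are used as checks.

```python
from itertools import accumulate

def median_grouped(lower_limits, width, freqs):
    """Median of a continuous frequency distribution with equal class widths."""
    cum = list(accumulate(freqs))          # cumulative frequencies
    n = cum[-1]                            # total frequency
    half = n / 2
    # median class: first class whose cumulative frequency reaches n/2
    i = next(idx for idx, cf in enumerate(cum) if cf >= half)
    l = lower_limits[i]                    # lower limit of the median class
    m = cum[i - 1] if i > 0 else 0         # cumulative frequency before it
    f = freqs[i]                           # frequency of the median class
    return l + (half - m) / f * width

# weights of 45 pets, classes 20-25, 25-30, ..., 50-55
pets = median_grouped([20, 25, 30, 35, 40, 45, 50], 5, [2, 5, 8, 10, 7, 10, 3])
print(pets)                 # 38.75

# classes 0-10, 10-20, ..., 60-70
second = median_grouped([0, 10, 20, 30, 40, 50, 60], 10, [7, 18, 34, 50, 35, 20, 6])
print(round(second, 2))     # 35.2
```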
• When you have a symmetrical distribution for continuous data, the
mean, median, and mode are equal. In this case, analysts tend to use
the mean because it includes all of the data in the calculations.
• If you have a skewed distribution, the median is often the best
measure of central tendency.
• When you have ordinal data, the median or mode is usually the best
choice. For categorical data, you have to use the mode.
Mode for raw and grouped data
• The mode is the value in a distribution that occurs most
frequently.
• It is an actual value, which has the highest concentration of items in and
around it.
• It shows the centre of concentration of the frequency in and around a given
value.
• Therefore, it is preferred where the purpose is to know the point of the highest
concentration. It is, thus, a positional measure.
• It is very important in agriculture, for example to find the typical height of a crop
variety, the most common source of irrigation in a region, or the most disease-prone
paddy variety. Thus the mode is an important measure in the case of
qualitative data.
Computation of the mode:
1)Ungrouped or Raw Data
• For ungrouped data or a series of individual observations, mode is often found by mere inspection.
• Example 1: Find the mode for the following seed weights:
2, 7, 10, 15, 10, 17, 8, 10, 2 gms
Mode = 10 gms
• In some cases the mode may be absent, while in some cases there may be more than one mode.
• Example 2:
(1) 12, 10, 15, 24, 30 (no mode)
(2) 7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10
The modal values are 7 and 10, as both occur 3 times each.
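Finding the mode "by inspection" amounts to counting how often each value occurs. A minimal sketch, assuming the convention used in the examples above that a data set in which every value occurs only once has no mode (the function name modes is illustrative):

```python
from collections import Counter

def modes(values):
    """Return the list of modal values, or an empty list if there is no mode."""
    if not values:
        return []
    counts = Counter(values)               # value -> number of occurrences
    highest = max(counts.values())
    if highest == 1:
        return []                          # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 7, 10, 15, 10, 17, 8, 10, 2]))            # [10]
print(modes([12, 10, 15, 24, 30]))                        # []
print(modes([7, 10, 15, 12, 7, 14, 24, 10, 7, 20, 10]))   # [7, 10]
```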
2)Grouped Data
• For Discrete distribution, see the highest frequency and
corresponding value of x is mode.
• Find the mode for the following:
Weight of sorghum in gms (x) No. of ear head(f)
50 4
65 6
75 16
80 8
95 7
100 4
The maximum frequency is 16. The corresponding x value is 75.
mode = 75 gms.
Continuous distribution
• Locate the highest frequency , the class corresponding to that frequency is
called the modal class. Then apply the formula.
• Mode = l + [fs/(fp + fs)] × c
• Where l= lower limit of the mode class
fp= the frequency of the class preceding the mode class
fs= the frequency of the class succeeding the mode class
and c = class interval
For the frequency distribution of weights given in table below. Calculate the mode
Weights of ear heads (g) No of ear heads (f)
60-80 22
80-100 38 fp
100-120 45 f
120-140 35 fs
140-160 20
Total 160
Mode = l + [fs/(fp + fs)] × c
Here l=100, f = 45, c = 20, fp=38, fs =35
Mode = 100 + [35/(38 + 35)] × 20 = 100 + 9.589
= 109.589 g
Find out the mode from the following series:
Obtained Marks   No. of students
0-10             2
10-20            5
20-30            8
30-40            15
40-50            12
50-60            6
60-70            3
It is clear from the series that the frequency of the class 30-40 is the highest. Thus, the modal class is 30-40.
Here l=30, f = 15, c = 10, fp=8, fs =12
Mode = l + [fs/(fp + fs)] × c
= 30 + [12/(8+12)] × 10
= 36
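The calculation used in the two worked examples above, Mode = l + [fs/(fp + fs)] × c, can be sketched as follows. The function name mode_grouped and its arguments are illustrative assumptions, and equal class widths are assumed; the sketch follows the formula as applied in these examples only.

```python
def mode_grouped(lower_limits, width, freqs):
    """Mode of a continuous distribution via l + fs / (fp + fs) * c."""
    # modal class: the class with the highest frequency
    i = max(range(len(freqs)), key=lambda idx: freqs[idx])
    fp = freqs[i - 1] if i > 0 else 0                 # preceding class
    fs = freqs[i + 1] if i < len(freqs) - 1 else 0    # succeeding class
    if fp + fs == 0:
        raise ValueError("neighbouring frequencies are both zero")
    return lower_limits[i] + fs * width / (fp + fs)

# weights of ear heads, classes 60-80, ..., 140-160
ear_heads = mode_grouped([60, 80, 100, 120, 140], 20, [22, 38, 45, 35, 20])
print(round(ear_heads, 3))   # 109.589

# obtained marks, classes 0-10, ..., 60-70
marks = mode_grouped([0, 10, 20, 30, 40, 50, 60], 10, [2, 5, 8, 15, 12, 6, 3])
print(marks)                 # 36.0
```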
Percentiles
• A percentile is a measure used in statistics indicating the value below which a given percentage of
observations in a group of observations falls.
2,2,3,4,5,5,5,6,7,8,8,8,8,8,9,9,10,11,11,12
What is the percentile rank of 10?
Percentile rank of x = (No. of values below x)/n × 100
= (16/20) × 100
= 80%
• Percentiles indicate the percentage of scores that fall below a particular value. They tell you where a
score stands relative to other scores.
• Percentiles are a great tool to use when you need to know the relative standing of a value.
• Percentiles tell you how a value compares to other values. The general rule is that if value X is at the
kth percentile, then X is greater than k% of the values.
• For example, the 90th percentile is the value (or score) below which 90% of the observations may be
found.
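The percentile rank calculation shown above can be sketched in Python. The function name percentile_rank is an illustrative assumption; it counts only values strictly below x, as in the worked example.

```python
def percentile_rank(values, x):
    """Percentage of observations that fall strictly below x."""
    below = sum(1 for v in values if v < x)
    return below * 100 / len(values)

scores = [2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12]
print(percentile_rank(scores, 10))   # 80.0
```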
Quartiles
• Quartiles in statistics are values that divide your data into quarters.
• A quartile is a statistical term describing a division of observations into four defined intervals based
upon the values of the data and how they compare to the entire set of observations.
• A quartile is any of three numbers that separate a sorted data into four equal parts.
• When we talk about quartiles, we are dividing the data set into 4 quarters.
• Each quarter is 25% of the total number of data points.
• The first quartile or Q1 is the value in the data set such that 25% of the data points are less than
this value and 75% of the data set is greater than this value.
• The second quartile or Q2 is the value in the data set such that 50% of the data points are less than
this value and 50% of the data set are greater than this value.
• The third quartile or Q3 is the value such that 75% of the values are less than this value and 25% of
the values are greater than this value.
• The term Interquartile Range (IQR) refers to the difference between Q3 and Q1 (IQR = Q3 – Q1).
• Quartiles measure the spread of values above and below the median by dividing
the distribution into four groups.
• The quartiles divide the data at three points (a lower quartile, the median, and an upper
quartile) to form four groups of the data set.
• Quartiles are used to calculate the interquartile range, which is a measure of
variability around the median.
• Percentile is a fairly common word. Surprisingly, there isn’t a single
standard definition for it. Consequently, there are multiple methods
for calculating percentiles.
1. The smallest value that is greater than k percent of the values.
2. The smallest value that is greater than or equal to k percent of values.
3. An interpolated value between the two closest ranks.
• These methods are used by analysts to calculate percentiles when
looking at the actual data values in relatively small datasets. These
three definitions define the kth percentile.
• To calculate percentiles using these three approaches, start by
ranking your dataset from the lowest to highest values.
• Consider the following dataset (n=11) to find the 70th percentile.
Rank 1 2 3 4 5 6 7 8 9 10 11
Value 2 4 6 8 13 16 22 35 40 42 48
1. The smallest value that is greater than k percent of the values:
Using this definition, we need to find the value that is greater than 70% of the values, and there are 11 values.
Take 70% of 11, which is 7.7. Then, round 7.7 up to 8.
The value for the 70th percentile must be greater than eight values.
So, we pick the 9th ranked value in the dataset, which is 40.
2. The smallest value that is greater than or equal to k percent of values:
Here, we need to find the value that is greater than or equal to 70% of the values.
we can use the 8th ranked value, which is 35.
3. An interpolated value between the two closest ranks:
To calculate an interpolated percentile, perform the following steps:
1. Calculate the rank to use for the percentile.
Use: rank = k(n+1)/100, where k = the percentile and n = the sample size.
Example, to find the rank for the 70th percentile, we take 70*(11 + 1)/100 = 8.4.
2. If the rank in step 1 is an integer, find the data value that corresponds to that rank and use it for
the percentile.
3. If the rank is not an integer, you need to interpolate between the two closest observations.
Example, 8.4 falls between 8 and 9, which corresponds to the data values of 35 and 40.
4. Take the difference between these two observations and multiply it by the fractional portion of
the rank.
Example, this is: (40 – 35) × 0.4 = 2.
5. Take the lower-ranked value in step 3 and add the value from step 4 to obtain the interpolated
value for the percentile.
For our example, that value is 35 + 2 = 37.
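The three definitions can be compared side by side with a short sketch. The function names are illustrative assumptions; the sketch assumes the data are already sorted and contain no ties, and it clamps the interpolated rank to the ends of the data, which the text above does not specify.

```python
import math

def percentile_greater_than(sorted_values, k):
    """Definition 1: smallest value greater than k percent of the values."""
    n = len(sorted_values)
    count = math.ceil(k * n / 100)     # how many values must lie below
    if count >= n:
        raise ValueError("no value is greater than that many observations")
    return sorted_values[count]        # the (count + 1)th ranked value

def percentile_greater_or_equal(sorted_values, k):
    """Definition 2: smallest value greater than or equal to k percent."""
    n = len(sorted_values)
    count = max(math.ceil(k * n / 100), 1)
    return sorted_values[count - 1]    # the count-th ranked value

def percentile_interpolated(sorted_values, k):
    """Definition 3: interpolate at rank = k(n + 1)/100."""
    n = len(sorted_values)
    rank = k * (n + 1) / 100
    lower = math.floor(rank)
    fraction = rank - lower
    if lower < 1:                      # assumption: clamp to the smallest value
        return sorted_values[0]
    if lower >= n:                     # assumption: clamp to the largest value
        return sorted_values[-1]
    low, high = sorted_values[lower - 1], sorted_values[lower]
    return low + fraction * (high - low)

values = [2, 4, 6, 8, 13, 16, 22, 35, 40, 42, 48]
print(percentile_greater_than(values, 70))               # 40
print(percentile_greater_or_equal(values, 70))           # 35
print(round(percentile_interpolated(values, 70), 2))     # 37.0
```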
Example 1
• Consider the dataset given below:
11.5 10.2 8.00 8.25 9.00 9.15
9.75 7.5 8.00 12.5 13.00 11.25
10.75 9.5 9.25 9.45 7.75
a) What are the quartiles for this data set?
b) Arjun’s pay is at the 85th percentile for this group. What does the percentile mean? What is
Arjun’s hourly pay rate?
Note: rank=k(n+1)/100 where k is the percentile and n is the no. of observations.
• Arrange the data from smallest to largest:
Rank    1    2     3  4  5     6  7     8     9
Value   7.5  7.75  8  8  8.25  9  9.15  9.25  9.45
Rank    10   11    12    13     14     15    16    17
Value   9.5  9.75  10.2  10.75  11.25  11.5  12.5  13.0
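a) Q1 = 25th percentile: rank = k(n+1)/100 = 25(17+1)/100 = 4.5th observation = (8+8.25)/2 = 8.125
Q2 = 50th percentile: rank = 50(17+1)/100 = 9th observation = 9.45
Q3 = 75th percentile: rank = 75(17+1)/100 = 13.5th observation = (10.75+11.25)/2 = 11
b) 85% of the observations in the group are paid less than Arjun.
85th percentile: rank = 85(17+1)/100 = 15.3, which lies between the 15th observation (11.5) and the 16th observation (12.5).
Interpolating as for Q1 and Q3: 11.5 + 0.3 × (12.5 – 11.5) = 11.80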
The following cumulative frequency graph shows the distribution of marks scored by a class of 40 students in
a test.
• Use the given graph to estimate
a) the median b) the upper quartile
c) the lower quartile d) the interquartile range
a) Median corresponds to the 50th percentile i.e. 50% of the total
frequency.
50% of the total frequency = (50/100)*40=20
From the graph, 20 on the vertical axis corresponds to 44 on the
horizontal axis. The median mark is 44.
b) The upper quartile corresponds to the 75th percentile i.e. 75% of the total frequency.
75% of the total frequency = (75/100)*40=30
From the graph, 30 on the vertical axis corresponds to 52 on the horizontal axis. The upper quartile is 52.
c) The lower quartile corresponds to the 25th percentile i.e. 25% of the total frequency.
25% of the total frequency = (25/100)*40=10
From the graph, 10 on the vertical axis corresponds to 36 on the horizontal axis. The lower quartile is 36.
d) The interquartile range = upper quartile – lower quartile= 52 – 36 = 16
Calculate the IQR value for the given data
set.
• Formula required for the rank of a percentile:
• Rank = k(n+1)/100
where k = percentile, n = no. of observations
• Data Set: 5, 4, 2, 1, 7, 9, 8, 10, 12, 0, 15
• The first step is to put the data in increasing order, we get the following…
Sorted Data Set: 0, 1, 2, 4, 5, 7, 8, 9, 10, 12, 15
Q1=25(11+1)/100=25(12)/100=300/100=3
• So, the first quartile is the value that is located at the 3rd data point (Q1 = 2).
Q2=50(11+1)/100=50(12)/100=600/100=6
• The second quartile is the value that is located at the 6th data point (Q2 = 7).
Q3=75(11+1)/100=900/100=9
• The third quartile is the value that is located at the 9th data point (Q3 = 10).
• Hence, the IQR = Q3-Q1=10-2 = 8.
• Alternate Method:
Q1 is the value of the data set located at the (N+1)/4th location, Q2 is
the value of the data set located at the (N+1)/2nd location, and Q3 is
the value of the data set that is located at the 3*(N+1)/4th location.
1) Find the quartiles of the following data: 4, 6, 7, 8, 10, 23, 34.
2) Find the Quartiles of the following age:-
23, 13, 37, 16, 26, 35, 26, 35
Question 1: Find the quartiles of the following data: 4, 6, 7, 8, 10, 23, 34.
Solution: Here the numbers are arranged in the ascending order and number of items, n = 7
Lower quartile, Q1 = [(n+1)/4]th item
Q1 = (7+1)/4 = 2nd item = 6
Median, Q2 = [(n+1)/2]th item
Q2 = (7+1)/2 = 4th item = 8
Upper Quartile, Q3 = [3(n+1)/4]th item
Q3 = 3(7+1)/4 = 6th item = 23
• Question 2: Find the Quartiles of the following age:-
• 23, 13, 37, 16, 26, 35, 26, 35
• Solution:
• First, we need to arrange the numbers in increasing order.
• Therefore, 13, 16, 23, 26, 26, 35, 35, 37
• Number of items, n = 8
• Lower quartile, Q1 = [(n+1)/4] th item
• Q1 = (8+1)/4 = 9/4 = 2.25th term
• From the quartile formula we can write;
• Q1 = 2nd term + 0.25(3rd term-2nd term)
• Q1= 16+0.25(23-16) = 17.75
• Similarly,
• Median, Q2 = [(n+1)/2]th item
• Q2 = (8+1)/2 = 9/2 = 4.5th term
• Q2 = 4th term+0.5 (5th term-4th term)
• Q2= 26+0.5(26-26) = 26
• And,
• Upper Quartile, Q3 = [3(n+1)/4]th item
• Q3 = 3(8+1)/4 = 6.75th term
• Q3 = 6th term + 0.75(7th term-6th term)
• Q3 = 35+0.75(35-35) = 35
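The (n+1)/4, (n+1)/2 and 3(n+1)/4 positions used in the solutions above, including the interpolation for fractional positions, can be sketched as follows. The function names value_at_rank and quartiles are illustrative assumptions, and at least three observations are assumed so that every position falls inside the data.

```python
import math

def value_at_rank(ordered, rank):
    """Value at a 1-based rank; fractional ranks are interpolated linearly."""
    lower = math.floor(rank)
    fraction = rank - lower
    if fraction == 0:
        return ordered[lower - 1]
    # lower term + fraction * (next term - lower term)
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

def quartiles(values):
    """Return (Q1, Q2, Q3) using the (n + 1) position rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("this sketch assumes at least three observations")
    q1 = value_at_rank(ordered, (n + 1) / 4)
    q2 = value_at_rank(ordered, (n + 1) / 2)
    q3 = value_at_rank(ordered, 3 * (n + 1) / 4)
    return q1, q2, q3

print(quartiles([4, 6, 7, 8, 10, 23, 34]))             # (6, 8, 23)
print(quartiles([23, 13, 37, 16, 26, 35, 26, 35]))     # (17.75, 26.0, 35.0)

# interquartile range for the earlier 11-value data set
q1, q2, q3 = quartiles([5, 4, 2, 1, 7, 9, 8, 10, 12, 0, 15])
print(q3 - q1)                                         # 8
```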
Measures of the Spread of Data
• An important characteristic of any set of data is the variation in the data.
• Measures of spread(or dispersion) summarize the data in a way that shows how scattered the
values are and how much they differ from the mean value.
• In some data sets, the data values are concentrated closely near the mean; in other data sets, the
data values are more widely spread out from the mean.
• Common examples of measures of dispersion are the variance, standard deviation, and interquartile
range.
• The most common measure of variation, or spread, is the standard deviation.
• The standard deviation is a number that measures how far data values are from their mean.
• The standard deviation provides a numerical measure of the overall amount of variation in a data
set, and can be used to determine whether a particular data value is close to or far from the mean.
• The standard deviation provides a measure of the overall variation in a data set.
• The standard deviation is always positive or zero. The standard deviation is small when the data
are all concentrated close to the mean, exhibiting little variation or spread. The standard deviation
is larger when the data values are more spread out from the mean, exhibiting more variation.
• The range is the difference between the highest and lowest scores in a data set and is the simplest
measure of spread.
Calculating the Standard Deviation
• If x is a number, then the difference “x – mean” is called its deviation.
• In a data set, there are as many deviations as there are items in the data set.
• The deviations are used to calculate the standard deviation.
• If the numbers belong to a population, in symbols a deviation is x – μ. For sample data, in
symbols a deviation is x – x̄.
• To calculate the standard deviation, we need to calculate the variance first.
• The variance is the average of the squares of the deviations
(the x – x̄ values for a sample, or the x – μ values for a population).
• The symbol σ² represents the population variance; the population standard deviation σ is
the square root of the population variance. The symbol s² represents the sample variance;
the sample standard deviation s is the square root of the sample variance.
• If the numbers come from a census of the entire population and not a sample, when we
calculate the average of the squared deviations to find the variance, we divide by N, the
number of items in the population. If the data are from a sample rather than a
population, when we calculate the average of the squared deviations, we divide by n – 1,
one less than the number of items in the sample.
Steps to Calculate Variance and Standard
Deviation:
1. List elements of data set.
2. Calculate the mean.
3. Find the deviation from the mean for each data point.
4. Square Deviation obtained in Step3.
5. The average of all squared deviations is the variance. To find it, add all squared deviations and
divide the sum by the number of elements in the data set (n).
6. Standard deviation is calculated by taking the square root of variance.
Note:
1) If the numbers come from a census of the entire population and not a sample, when we
calculate the average of the squared deviations to find the variance, we divide by N, the number
of items in the population.
2) If the data are from a sample rather than a population, when we calculate the average of the
squared deviations, we divide by n – 1, one less than the number of items in the sample.
• The following are the ages of students pursuing an Advanced Data Science course:
Data=[28,25,26,27,31,32,24]. Find the mean, sample variance and standard deviation.
Step1:List elements of data set.
Data=[28,25,26,27,31,32,24]
Step2: Calculate the mean(͞x)
(28 + 25 +26 +27 +31 +32 + 24) / 7 = 193/7=27.57
Step3: Find the deviation from the mean for each data point and square it.
Data point (x)   Deviation from mean (x – x̄)   Squared deviation (x – x̄)²
28 (28-27.57)=0.43 0.1849
25 (25-27.57)=-2.57 6.6049
26 (26-27.57)=-1.57 2.4649
27 (27-27.57)=-0.57 0.3249
31 (31-27.57)=3.43 11.7649
32 (32-27.57)=4.43 19.6249
24 (24-27.57)=-3.57 12.7449
• The variance is the average of the squared deviations.
• To find the sample variance, add all squared deviations and divide the sum by the number of elements in the
data set minus 1.
Sample variance = (0.1849 + 6.6049 + 2.4649 + 0.3249 + 11.7649 + 19.6249 + 12.7449) / (7-1)
= 53.7143/(7-1) = 53.7143/6 = 8.9524
• Population variance = total/n = 53.7143/7 = 7.6735
• Sample standard deviation = square root of sample variance = 2.9921
• Population standard deviation = square root of population variance = 2.77
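The hand calculation above can be reproduced step by step with a minimal sketch that uses only the standard library. The function name variance_of and the sample flag are illustrative assumptions. Because the code does not round the mean to 27.57, its results agree with the hand calculation only up to rounding in the last digit.

```python
import math

def variance_of(values, sample=True):
    """Average squared deviation; divides by n - 1 for a sample, n otherwise."""
    n = len(values)
    mean_value = sum(values) / n                            # Step 2
    squared = [(x - mean_value) ** 2 for x in values]       # Steps 3 and 4
    divisor = n - 1 if sample else n                        # Step 5
    return sum(squared) / divisor

ages = [28, 25, 26, 27, 31, 32, 24]
sample_var = variance_of(ages)
population_var = variance_of(ages, sample=False)
sample_sd = math.sqrt(sample_var)                           # Step 6
population_sd = math.sqrt(population_var)

print("Sample variance:", round(sample_var, 4))
print("Population variance:", round(population_var, 4))
print("Sample standard deviation:", round(sample_sd, 4))
print("Population standard deviation:", round(population_sd, 4))
```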
Find the mean, sample variance and standard deviation for the following data set.
data = [9,9.5,9.5,10,10,10,10,10.5,10.5,10.5,10.5,11,11,11,11,11,11,11.5,11.5,11.5]
• Mean = x̄ = [9 + 9.5(2) + 10(4) + 10.5(4) + 11(6) + 11.5(3)] / 20 = 210.5/20 = 10.525
Data   Freq.   Deviation               Deviation²   (Freq.)(Deviation²)
x      f       (x – x̄)                 (x – x̄)²     (f)(x – x̄)²
9      1       9-10.525 = -1.525       2.3256       2.3256
9.5    2       9.5-10.525 = -1.025     1.0506       2.1012
10     4       10-10.525 = -0.525      0.2756       1.1024
10.5   4       10.5-10.525 = -0.025    0.000625     0.0025
11     6       11-10.525 = 0.475       0.2256       1.3536
11.5   3       11.5-10.525 = 0.975     0.9506       2.8518
Total  20                                           9.7371
• The sample variance, s², is equal to the total (9.7371) divided by the total number of data values
minus one (20 – 1): s² = 9.7371/(20−1)
= 0.5125
• Sample standard deviation: square root of the sample variance
= 0.7159
• Population variance = Total/N = 9.7371/20
= 0.4869
• Population standard deviation: square root of the population variance = 0.6977
• The Python statistics module provides four functions for these measures:
• variance(): calculates the sample variance, a measure of how far the data values are spread from
their mean; the larger the variance, the more spread out the data values are. The data is assumed to be
a sample from a larger population. If fewer than two values are passed, StatisticsError
(a subclass of ValueError) is raised.
• pvariance(): computes the variance of the entire population. The data is interpreted
as the whole population. If the passed argument is empty, StatisticsError is raised.
• stdev(): returns the sample standard deviation (square root of the sample variance) of the data.
If fewer than two values are passed, StatisticsError is raised.
• pstdev(): returns the population standard deviation (square root of the population
variance) of the data. If the passed argument is empty, StatisticsError is raised.
import statistics as st
data = [9,9.5,9.5,10,10,10,10,10.5,10.5,10.5,10.5,11,11,11,11,11,11,11.5,11.5,11.5]
# using variance to calculate variance of data
print ("The sample variance of data is : ")
print (st.variance(data))
# using pvariance to calculate population variance of data
print ("The population variance of data is : ",end="")
print (st.pvariance(data))
# using stdev to calculate standard deviation of data
print ("The standard deviation of data is : ",end="")
print (st.stdev(data))
# using pstdev to calculate population standard deviation of data
print ("The population standard deviation of data is : ",end="")
print (st.pstdev(data))
Topics to be covered
• Mean Deviation
• Mean Absolute Deviation
• Quartile Deviation
• Skewness
• Kurtosis
• EDA on house Pricing Dataset(Univariate Analysis)
Mean Deviation
• Deviation is a measure of the difference between the observed value of a variable and some other value, often
that variable’s mean.
• The mean deviation is a measure that tells how much the observations in the data set deviate from the mean
value of the observations in the data set.
• To calculate the mean deviation you first have to calculate the deviation of each observation from the mean
value of all observations in the data set.
• Consider the following dataset:
5, 2, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10
• The sum of all observations in this data set is 66.
• The data set contains 12 observations.
• The mean value of the observations of this data set is 66/12 = 5.5.
• Once we have calculated the mean value of the observations in the data set, we can calculate the deviation of
each observation from the mean value.
• For instance, the deviation of the value 5 (first observation in the data set) from the mean value 5.5 is -0.5.
The deviations of all observations are:
-0.5, -3.5, -3.5, -2.5, -1.5, -0.5, -0.5, 0.5, 1.5, 2.5, 3.5, 4.5
• If we sum all these values the result will be 0, because the positive and negative deviations from the
mean always cancel each other out. Therefore it does not make sense to
calculate the mean deviation based on the signed deviations alone. The result would always be 0.
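A short check of this point, as a minimal sketch on the data set above: the signed deviations sum to zero, while their absolute values do not. The variable names are illustrative.

```python
data = [5, 2, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10]

mean_value = sum(data) / len(data)               # 66 / 12 = 5.5
deviations = [x - mean_value for x in data]      # signed deviations

signed_total = sum(deviations)                   # always 0 for deviations from the mean
absolute_total = sum(abs(d) for d in deviations)
mean_abs_dev = absolute_total / len(data)

print("Sum of signed deviations:", signed_total)        # 0.0
print("Sum of absolute deviations:", absolute_total)    # 25.0
print("Mean of absolute deviations:", round(mean_abs_dev, 4))
```

Taking absolute values before averaging is what makes the measure informative.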
Mean Absolute Deviation
• Mean absolute deviation is a statistical measure of the average deviation of values
from the mean in a sample.
• This analysis is used to calculate how observations are scattered from the mean.
Formula to find the mean absolute deviation is:
• Mean Absolute Deviation = Σ|x − μ|/N , where
✔ x is each value in the data set.
✔ μ is the mean
✔ N is the number of values
Steps to find the mean absolute deviation:
1. Find the mean of all values in the data set.
2. Find the distance of each value from that mean (subtract the mean from each
value, ignore minus signs)
3. Then find the mean of those distances.
Find the Mean Absolute Deviation for the data:
3, 6, 6, 7, 8, 11, 15, 16
Step 1: Find the mean:
Mean = (3 + 6 + 6 + 7 + 8 + 11 + 15 + 16)/8 = 72/8 = 9
Step 2: Find the distance of each value from that mean (absolute value):
Value                   3  6  6  7  8  11  15  16
Distance from mean (9)  6  3  3  2  1   2   6   7
Step 3: Find the mean of those distances:
Mean Absolute Deviation = (6 + 3 + 3 + 2 + 1 + 2 + 6 + 7)/8 = 30/8 = 3.75
Note:
• The mean absolute deviation tells us how far, on average, the values are from the mean.
• In this example the values are, on average, 3.75 away from the mean.
# Mean absolute deviation using NumPy
from numpy import mean, absolute
data = [3, 6, 6, 7, 8, 11, 15, 16]
# Find mean value of the sample
M = mean(data)
print("Sample Mean Value = ", M)
# Calculate the absolute deviation of each value from the mean
print("Data - Mean = deviation")
total = 0  # running sum of absolute deviations (avoids shadowing the built-in sum)
for value in data:
    dev = absolute(value - M)
    total = total + dev
    print(value, "-", M, "=", round(dev, 2))
print("Total Deviation:", total)
print("Mean Absolute Deviation: ", total / len(data))
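• We can also compute it directly with NumPy functions. The list must first be converted to an array,
because a number cannot be subtracted from a plain Python list.
from numpy import array, mean, absolute
x = mean(absolute(array(data) - mean(data)))
print(x)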
• Mean Absolute Deviation Using mad() Function In Pandas:
• Mean Absolute Deviation (MAD) is computed as the mean of the absolute
deviations of the data points from their mean.
• Older pandas releases provide a DataFrame method mad() that computes the Mean
Absolute Deviation for the rows or columns of a DataFrame object. The method was
deprecated in pandas 1.5 and removed in pandas 2.0; on current versions the same
result is obtained with (df - df.mean()).abs().mean().
• With axis=0, the Mean Absolute Deviation is calculated for the columns.
With axis=1, it is calculated for the rows.
• NaN values are handled by the optional Boolean parameter skipna.
skipna=True ignores the NaN values. skipna=False keeps them, so any row or
column that contains a NaN yields NaN.
• The default value for skipna is True.
import pandas as pd
dataSet = {"C1": (6.5, 5.1, 5.6, 7.0, 7.1, 7.45, 7.75, 8),
           "C2": (7, 7.1, 7.2, 6, 6.1, 6.3, 5.1, 5.2)}
dataFrame = pd.DataFrame(data=dataSet)
print("DataFrame:")
print(dataFrame)
# DataFrame.mad() was removed in pandas 2.0, so the MAD is computed explicitly here.
# Calculate Mean Absolute Deviation of DataFrame columns
mad = (dataFrame - dataFrame.mean()).abs().mean()
print("Mean absolute deviation of columns:")
print(mad)
# Calculate Mean Absolute Deviation of DataFrame rows
mad = dataFrame.sub(dataFrame.mean(axis=1), axis=0).abs().mean(axis=1)
print("Mean absolute deviation of rows:")
print(mad)
import pandas as pd
dataValues = [(2, 3, 5, 7),
              (11, 13, 17, 19),
              (23, 29, None, None),
              (31, 37, None, None)]
dataFrameObject = pd.DataFrame(data=dataValues)
print("DataFrame:")
print(dataFrameObject)
# DataFrame.mad() was removed in pandas 2.0, so an equivalent helper is defined here.
def mad(df, axis=0, skipna=True):
    # Mean along the requested axis, then the mean absolute distance from it
    center = df.mean(axis=axis, skipna=skipna)
    return df.sub(center, axis=1 - axis).abs().mean(axis=axis, skipna=skipna)
# Calculate Mean absolute deviation for the DataFrame columns, skipping NaNs
print("Mean absolute deviation of columns(NaNs skipped):")
print(mad(dataFrameObject, axis=0))
# Calculate Mean absolute deviation for the DataFrame columns, keeping NaNs
print("Mean absolute deviation of columns(NaNs included):")
print(mad(dataFrameObject, axis=0, skipna=False))
# Calculate Mean absolute deviation for the DataFrame rows, skipping NaNs
print("Mean absolute deviation of rows(NaNs skipped):")
print(mad(dataFrameObject, axis=1))
# Calculate Mean absolute deviation for the DataFrame rows, keeping NaNs
print("Mean absolute deviation of rows(NaNs included):")
print(mad(dataFrameObject, axis=1, skipna=False))
Quartile Deviation
• The Quartile Deviation (half the interquartile range) is a simple way to estimate the spread of a
distribution about a measure of its central tendency (usually the median).
• It gives you an idea about the range within which the central 50% of your sample data lies.
• Based on the quartile deviation, the Coefficient of Quartile Deviation can be defined, which
makes it easy to compare the spread of two or more different distributions.
Quartiles
• A median divides a given dataset (which is already sorted) into two equal halves; similarly, the
quartiles divide a sorted dataset into four equal parts. There are therefore three quartiles for a
given distribution, and the second quartile (Q2) is the median.
• The first quartile or the lower quartile or the 25th percentile, also denoted by Q1, corresponds to
the value that lies halfway between the median and the lowest value in the distribution (when it is
already sorted in ascending order). Hence, 25% of the data lie at or below it.
• Similarly, the third quartile or the upper quartile or 75th percentile, also denoted
by Q3, corresponds to the value that lies halfway between the median and the highest value in the
distribution (when it is already sorted in ascending order). Hence, 75% of the data lie at or
below it and the remaining 25% lie above it.
• Interquartile Range = Q3 - Q1
The number of vehicles sold by a major automobile showroom in a day
was recorded for 10 working days as follows:
Day            1   2   3   4   5   6   7   8   9   10
Vehicles sold  20  15  18  5   10  17  21  19  25  28
Find the Quartile Deviation and its coefficient for the given distribution.
Sort the data:
5, 10, 15, 17, 18, 19, 20, 21, 25, 28
n= 10
To find the quartiles, we use the logic that the first quartile lies halfway between the lowest value and the
median; and the third quartile lies halfway between the median and the largest value.
Position of the first quartile = (n+1)/4 = (10+1)/4 = 2.75, so Q1 is the 2.75th term.
Q1 = 2nd term + 0.75 × (3rd term - 2nd term)
= 10 + 0.75 × (15 - 10)
= 10 + 3.75 = 13.75
Position of the third quartile = 3(n+1)/4 = 3(10+1)/4 = 8.25, so Q3 is the 8.25th term.
Q3 = 8th term + 0.25 × (9th term - 8th term)
= 21 + 0.25 × (25 - 21)
= 21 + 1 = 22
• Using the values for Q1 and Q3, now we can calculate the Quartile Deviation and its coefficient as follows:
Quartile Deviation = (Q3 - Q1)/2
= (22 - 13.75)/2 = 4.125
Coefficient of Quartile Deviation = (Q3 - Q1)/(Q3 + Q1) × 100
= (22 - 13.75)/(22 + 13.75) × 100
= (8.25/35.75) × 100
≈ 23.08
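The hand calculation above can be reproduced in a few lines of Python. This is a minimal sketch, not a general-purpose quantile routine: the helper value_at_position is a hypothetical name introduced here, and it assumes the requested position lies between 1 and n. The standard library function statistics.quantiles (Python 3.8+) uses the same (n+1)-based positions with its default method, so it serves as a cross-check.

```python
import statistics as st

sales = [20, 15, 18, 5, 10, 17, 21, 19, 25, 28]

def value_at_position(sorted_values, position):
    """Linearly interpolate the value at a 1-based fractional position.

    Assumes 1 <= position <= len(sorted_values).
    """
    lower = int(position)           # whole part: the term just below the position
    fraction = position - lower     # fractional part: how far towards the next term
    below = sorted_values[lower - 1]
    if fraction == 0:
        return below
    above = sorted_values[lower]
    return below + fraction * (above - below)

ordered = sorted(sales)
n = len(ordered)

q1 = value_at_position(ordered, (n + 1) / 4)        # 2.75th term
q3 = value_at_position(ordered, 3 * (n + 1) / 4)    # 8.25th term

quartile_deviation = (q3 - q1) / 2
coefficient = (q3 - q1) / (q3 + q1) * 100

print("Q1 =", q1, "Q3 =", q3)
print("Quartile deviation =", quartile_deviation)
print("Coefficient of quartile deviation =", round(coefficient, 2))

# Cross-check with the standard library (default method is 'exclusive').
print("statistics.quantiles:", st.quantiles(sales, n=4))
```

Other tools may use different interpolation rules by default, so their quartiles can differ slightly from these values.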
Concepts on symmetry of data
• Symmetry is an attribute used to describe the shape of a data distribution.
• When it is graphed, a symmetric distribution can be divided at the center so that each half is a
mirror image of the other. A non-symmetric distribution cannot.
• In a symmetrical distribution, values at equal distances on either side of the center occur with
equal frequency, so the mean and the median occur at the same point.
• Find the mean and median of the following symmetric distribution.
1, 1, 4, 4, 5, 6, 7, 7, 10, 10
Mean of distribution = (1+1+4+4+5+6+7+7+10+10)/10 = 55/10 = 5.5
Median = (5+6)/2 = 5.5
So the mean and median of this symmetric distribution are both 5.5.
• Find the mean and median of the following symmetric distribution.
2, 2, 4, 4, 5, 6, 7, 7, 9, 9
Mean of distribution = (2+2+4+4+5+6+7+7+9+9)/10 = 55/10 = 5.5
Median = (5+6)/2 = 5.5
So the mean and median of this symmetric distribution are both 5.5.
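Both examples can be checked programmatically. The sketch below tests symmetry by pairing the smallest value with the largest, the second smallest with the second largest, and so on: in a symmetric data set every such pair has the same sum. The function name is_mirror_symmetric is an illustrative choice, not part of any library.

```python
import statistics as st

def is_mirror_symmetric(values):
    """Return True when the sorted values mirror each other around a common center."""
    ordered = sorted(values)
    pair_sums = [low + high for low, high in zip(ordered, reversed(ordered))]
    return len(set(pair_sums)) == 1

first_sample = [1, 1, 4, 4, 5, 6, 7, 7, 10, 10]
second_sample = [2, 2, 4, 4, 5, 6, 7, 7, 9, 9]

for sample in (first_sample, second_sample):
    print(sample)
    print("  mean =", st.mean(sample), " median =", st.median(sample),
          " symmetric =", is_mirror_symmetric(sample))
```

Equal mean and median is a consequence of symmetry, not a proof of it: a non-symmetric data set can also happen to have the same mean and median.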
Normal Distribution
• The normal distribution is one of the most important concepts in statistics, since many classical
statistical tests assume normally distributed data.
• It describes what large samples of many kinds of data look like when they are plotted.
• It is sometimes called the "bell curve" or the "Gaussian curve".
#Normal Distribution plot
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np
x_min = 0.0
x_max = 16.0
mean = 8.0
std = 2.0
x = np.linspace(x_min, x_max, 100)
y = stats.norm.pdf(x,mean,std)
plt.plot(x,y, color='blue')
plt.grid()
plt.xlim(x_min,x_max)
plt.ylim(0,0.25)
plt.title('Normal distribution')
plt.xlabel('x')
plt.ylabel('Normal Distribution')
plt.show()
Skewness
• Skewness is a measure of the asymmetry of a distribution.
• It describes how far a distribution departs from a symmetric shape such as the normal distribution, either to
the left or to the right.
• The skewness value can be either positive, negative or zero.
• A perfect normal distribution has a skewness of zero because it is symmetric, so the mean equals the median.
• A positive skew occurs if the data is piled up to the left, which leaves the tail pointing to the right.
• A negative skew occurs if the data is piled up to the right, which leaves the tail pointing to the
left.
• A good measure of the skewness of a distribution is Pearson's skewness coefficient, which
provides a quick estimate of a distribution's asymmetry.
• To compute the skewness in pandas, use the skew() function.
Formula to find Skewness
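One commonly used quick formula is Pearson's second skewness coefficient, 3 × (mean - median) / standard deviation. The sketch below is an illustration under that assumption: the original slide's formula is not recoverable here, and the three samples are made up for the example. Note that pandas' skew() reports a bias-corrected moment-based estimate, so its numbers differ from this coefficient.

```python
import statistics as st

def pearson_second_skewness(values):
    """Pearson's second coefficient: 3 * (mean - median) / sample standard deviation."""
    return 3 * (st.mean(values) - st.median(values)) / st.stdev(values)

right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]   # piled up on the left, long right tail
left_tailed = [-x for x in right_tailed]          # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]                       # mean equals median

for name, sample in [("right-tailed", right_tailed),
                     ("left-tailed", left_tailed),
                     ("symmetric", symmetric)]:
    print(name, "->", round(pearson_second_skewness(sample), 3))
```

The sign matches the description above: positive for a right tail, negative for a left tail, and zero for the symmetric sample.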
Kurtosis
• Kurtosis measures whether the dataset is heavy-tailed or light-tailed compared to a normal
distribution.
• Data sets with high kurtosis have heavy tails and more outliers and data sets with low kurtosis tend to
have light tails and fewer outliers.
• A histogram is an effective way to show both the skewness and kurtosis of a data set because you can
easily spot if something is wrong with your data.
• A probability plot is also a great tool because normally distributed data would simply follow the straight
line.
• A good way to mathematically measure the kurtosis of a distribution is Fisher's measure of
kurtosis (excess kurtosis), on which a normal distribution scores zero.
Types of kurtosis:
• A normal distribution is called mesokurtic and has an excess kurtosis of zero or around zero.
• A platykurtic distribution has negative excess kurtosis, and its tails are very thin compared to the normal
distribution.
• Leptokurtic distributions have positive excess kurtosis (kurtosis greater than 3 on Pearson's scale, where
a normal distribution scores 3), and the fat tails mean that the distribution produces more extreme values
than a normal distribution.
• In pandas you can compute the kurtosis by calling the kurtosis() function, which reports excess kurtosis.
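To make the types concrete, the sketch below computes excess kurtosis as m4 / m2² - 3, where m2 and m4 are the second and fourth central moments with divisor n. This simple estimator is an assumption chosen for illustration; pandas' kurtosis() applies a bias correction, so its values differ somewhat. Both samples are invented for the example.

```python
def excess_kurtosis(values):
    """Excess kurtosis m4 / m2**2 - 3 using central moments with divisor n."""
    n = len(values)
    mean_value = sum(values) / n
    m2 = sum((x - mean_value) ** 2 for x in values) / n
    m4 = sum((x - mean_value) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

flat = list(range(1, 21))        # evenly spread values: thin tails (platykurtic)
heavy = [0] * 18 + [-10, 10]     # mostly central values plus two extremes (leptokurtic)

print("flat  :", round(excess_kurtosis(flat), 3))    # negative
print("heavy :", round(excess_kurtosis(heavy), 3))   # positive
```

A value near zero from this function would indicate tails similar to those of a normal distribution (mesokurtic).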
Calculate Sample Skewness, Sample Kurtosis for the following data:
X             0  1  2   3  4
Frequency(f)  1  5  10  6  3
Mean (x̄) = ∑f⋅x / n = 55/25 = 2.2
x      f     f⋅x      (x-x̄)   f⋅(x-x̄)²   f⋅(x-x̄)³   f⋅(x-x̄)⁴
0      1     0        -2.2    4.84       -10.648    23.4256
1      5     5        -1.2    7.2        -8.64      10.368
2      10    20       -0.2    0.4        -0.08      0.016
3      6     18       0.8     3.84       3.072      2.4576
4      3     12       1.8     9.72       17.496     31.4928
Total  n=25  ∑f⋅x=55          26         1.2        67.76
Each power column is built from the previous one: f⋅(x-x̄)² = f × (x-x̄)², f⋅(x-x̄)³ = (x-x̄) × f⋅(x-x̄)²,
and f⋅(x-x̄)⁴ = (x-x̄) × f⋅(x-x̄)³.
Sample standard deviation S = √(∑f⋅(x-x̄)² / (n-1)) = √(26/24) = 1.0408
Sample skewness = ∑f⋅(x-x̄)³ / ((n-1)⋅S³) = 1.2 / (24 × 1.0408³) = 0.0443
Sample kurtosis = ∑f⋅(x-x̄)⁴ / ((n-1)⋅S⁴) = 67.76 / (24 × (26/24)²) = 2.4057
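The table above can be verified with a short script. This is a sketch that assumes the formulas implied by the reported values, namely S = √(∑f(x-x̄)²/(n-1)), skewness = ∑f(x-x̄)³/((n-1)S³) and kurtosis = ∑f(x-x̄)⁴/((n-1)S⁴); the helper weighted_power_sum is a name introduced for illustration.

```python
from math import sqrt

values = [0, 1, 2, 3, 4]
frequencies = [1, 5, 10, 6, 3]

n = sum(frequencies)
mean_value = sum(f * x for x, f in zip(values, frequencies)) / n

def weighted_power_sum(power):
    """Sum of f * (x - mean)**power over the frequency table."""
    return sum(f * (x - mean_value) ** power for x, f in zip(values, frequencies))

std_dev = sqrt(weighted_power_sum(2) / (n - 1))
skewness = weighted_power_sum(3) / ((n - 1) * std_dev ** 3)
kurtosis = weighted_power_sum(4) / ((n - 1) * std_dev ** 4)

print("n =", n, " mean =", round(mean_value, 4))
print("S =", round(std_dev, 4))
print("Sample skewness =", round(skewness, 4))
print("Sample kurtosis =", round(kurtosis, 4))
```

The kurtosis here is on the scale where a normal distribution scores about 3, so a value near 2.4 indicates slightly lighter tails than normal.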
Calculate Sample Skewness, Sample Kurtosis from the following grouped data
Class      2-4  4-6  6-8  8-10
Frequency  3    4    2    1
Mean (x̄) = ∑f⋅x / n = 52/10 = 5.2
Class  Mid value (x)  f     f⋅x      (x-x̄)   f⋅(x-x̄)²   f⋅(x-x̄)³   f⋅(x-x̄)⁴
2-4    3              3     9        -2.2    14.52      -31.944    70.2768
4-6    5              4     20       -0.2    0.16       -0.032     0.0064
6-8    7              2     14       1.8     6.48       11.664     20.9952
8-10   9              1     9        3.8     14.44      54.872     208.5136
Total                 n=10  ∑f⋅x=52          35.6       34.56      299.792
Each power column is built from the previous one: f⋅(x-x̄)² = f × (x-x̄)², f⋅(x-x̄)³ = (x-x̄) × f⋅(x-x̄)²,
and f⋅(x-x̄)⁴ = (x-x̄) × f⋅(x-x̄)³.
Sample standard deviation S = √(∑f⋅(x-x̄)² / (n-1)) = √(35.6/9) = 1.9889
Sample skewness = ∑f⋅(x-x̄)³ / ((n-1)⋅S³) = 34.56 / (9 × 1.9889³) = 0.488
Sample kurtosis = ∑f⋅(x-x̄)⁴ / ((n-1)⋅S⁴) = 299.792 / (9 × (35.6/9)²) ≈ 2.129
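For grouped data the same calculation applies once each class is replaced by its mid value. The sketch below takes that approach by expanding the table so that every mid value appears f times; it assumes the same (n-1)-based formulas as the hand calculation. Carrying full precision gives a kurtosis of about 2.1289, so small differences in the last digits of a hand calculation come from intermediate rounding of S.

```python
from math import sqrt

classes = [(2, 4), (4, 6), (6, 8), (8, 10)]   # class intervals
frequencies = [3, 4, 2, 1]

# Represent each class by its mid value, repeated once per observation.
midpoints = [(low + high) / 2 for low, high in classes]
expanded = [x for x, f in zip(midpoints, frequencies) for _ in range(f)]

n = len(expanded)
mean_value = sum(expanded) / n

sum_sq = sum((x - mean_value) ** 2 for x in expanded)      # sum of f * (x - mean)^2
sum_cube = sum((x - mean_value) ** 3 for x in expanded)    # sum of f * (x - mean)^3
sum_fourth = sum((x - mean_value) ** 4 for x in expanded)  # sum of f * (x - mean)^4

std_dev = sqrt(sum_sq / (n - 1))
skewness = sum_cube / ((n - 1) * std_dev ** 3)
kurtosis = sum_fourth / ((n - 1) * std_dev ** 4)

print("Mid values:", midpoints)
print("n =", n, " mean =", round(mean_value, 4), " S =", round(std_dev, 4))
print("Sample skewness =", round(skewness, 3))
print("Sample kurtosis =", round(kurtosis, 4))
```

Using mid values treats every observation in a class as if it sat at the class center, so the results are approximations of what the raw data would give.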
• Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the same
to the left and right of the center point.
• Kurtosis is a measure of whether the data are heavy-tailed or
light-tailed relative to a normal distribution.