Basic Statistics Lecture Note 2024/2025

CHAPTER FOUR
4. MEASURES OF VARIATION
4.1 Introduction
In addition to locating the center of the observed values of the variable in the data, another
important aspect of a descriptive study of the variable is numerically measuring the extent of
variation around the center. Two data sets of the same variable may exhibit similar positions of
center but may be remarkably different with respect to variability.
Just as there are several different measures of center, there are also several different measures of
variation. In this section, we will examine three of the most frequently used measures of variation:
the sample range, the sample interquartile range and the sample standard deviation. Measures of
variation are used mainly for quantitative variables.
4.2 Objectives of Measuring Variation
The general objective of measuring dispersion is to obtain a single summary figure which adequately
exhibits whether the distribution is compact or spread out. The specific objectives are:
• To judge the reliability of measures of central tendency.
• To control variability itself.
• To compare two or more groups of numbers in terms of their variability.
• To make further statistical analysis possible.
4.3 Absolute and Relative Measures of Dispersion
The measures of dispersion which are expressed in terms of the original unit of a series are termed
as absolute measures. Such measures are not suitable for comparing the variability of two
distributions which are expressed in different units of measurement and different average size.
A relative measure of dispersion is the ratio or percentage of a measure of absolute dispersion to an
appropriate measure of central tendency and is thus a pure number independent of the units of
measurement. For comparing the variability of two distributions (even if they are not measured in
the same unit), we compute the relative measure of dispersion instead of the absolute measure of
dispersion.
Absolute measures are useful for comparing variation in two or more distributions only where the units of
measurement are the same. Various measures of dispersion are in use. The most commonly used measures of
dispersion are:
1. Range and Relative Range
2. Quartile Deviation and Coefficient of Quartile Deviation
3. Mean Deviation and Coefficient of Mean Deviation
4. Standard Deviation and Coefficient of Variation.
4.3.1 The Range and Relative Range
The Range (R): The range is the largest score minus the smallest score. It is a quick and dirty
measure of variability, although when a test is given back to students they very often wish to know
the range of scores. Because the range is greatly affected by extreme scores, it may give a distorted
picture of the scores. The range for a grouped frequency distribution is the upper class boundary of the
last class interval minus the lower class boundary of the first class interval, i.e., $R = UCB_{lci} - LCB_{fci}$.
BY: Habtamu W.(MSc in Biostatistics) Page 1
The following two distributions have the same range, 13, yet appear to differ greatly in the amount
of variability.
Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45
Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45
For this reason, among others, the range is not the most important measure of variability.
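The comparison above can be checked with a short Python sketch. The helper names `data_range` and `count_near_median`, and the window of 2 units around the median, are illustrative assumptions and are not part of the lecture note.

```python
from statistics import median

dist1 = [32, 35, 36, 36, 37, 38, 40, 42, 42, 43, 43, 45]
dist2 = [32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 45]

def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

def count_near_median(values, width=2):
    """Count the values lying within `width` units of the median (illustrative helper)."""
    m = median(values)
    return sum(1 for v in values if abs(v - m) <= width)

for name, d in [("Distribution 1", dist1), ("Distribution 2", dist2)]:
    print(name, "| range =", data_range(d), "| values near median =", count_near_median(d))
```

Both ranges come out equal, while far more values of Distribution 2 sit close to its median, which is the difference in variability that the range alone hides.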
Range for grouped data:
If the data are given in the form of a continuous frequency distribution, the range is computed as:
$R = UCL_k - LCL_1$, where $UCL_k$ is the upper class limit of the last class and
$LCL_1$ is the lower class limit of the first class.
This is sometimes expressed as:
$R = X_k - X_1$, where $X_k$ is the class mark of the last class and
$X_1$ is the class mark of the first class.
Merits and Demerits of range
Merits:
• It is rigidly defined.
• It is easy to calculate and simple to understand.
Demerits:
• It is not based on all observations.
• It is highly affected by extreme observations.
• It is affected by fluctuation in sampling.
• It cannot be computed in the case of an open-ended distribution.
• It is very sensitive to the size of the sample.

Relative Range (RR): It is also sometimes called the coefficient of range and is given by:
$RR = \frac{L - S}{L + S}$, where L is the largest (highest) value and S is the smallest value.
Example:
1. Find the relative range of the above two distributions. (Exercise!)
2. If the range and relative range of a series are 4 and 0.25 respectively, what is the value of the:
a. Smallest observation
b. Largest observation
Solution: (2)
$R = 4 \Rightarrow L - S = 4$ ........ (1)
$RR = 0.25 \Rightarrow \frac{L - S}{L + S} = 0.25 \Rightarrow L + S = 16$ ........ (2)
Solving (1) and (2) simultaneously, one obtains
$L = 10$ and $S = 6$.
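The same algebra can be sketched in Python. The function names `relative_range` and `extremes_from_range` are hypothetical helpers introduced here only for illustration.

```python
def relative_range(largest, smallest):
    """Relative range = (L - S) / (L + S)."""
    return (largest - smallest) / (largest + smallest)

def extremes_from_range(r, rr):
    """Recover (L, S) from the range r = L - S and the relative range rr = (L - S)/(L + S)."""
    total = r / rr                # L + S
    largest = (total + r) / 2     # add the two equations
    smallest = (total - r) / 2    # subtract the two equations
    return largest, smallest

largest, smallest = extremes_from_range(4, 0.25)
print("L =", largest, "S =", smallest)
print("check RR =", relative_range(largest, smallest))
```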

4.3.2 The Quartile Deviation and Coefficient of Quartile Deviation


The Quartile Deviation (Semi-interquartile range, Q.D): The interquartile range is the
difference between the third and the first quartiles of a set of items, and the semi-interquartile range is
half of the interquartile range.

$Q.D = \frac{Q_3 - Q_1}{2}$
Coefficient of Quartile Deviation (C.Q.D):

$C.Q.D = \frac{(Q_3 - Q_1)/2}{(Q_3 + Q_1)/2} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$
Remark: Q.D and C.Q.D are based only on the middle 50% of the observations.
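As a rough illustration, the sketch below computes Q.D and C.Q.D for the two distributions used in the range example. It assumes the quartile convention of Python's `statistics.quantiles` with its default "exclusive" method; other textbooks use slightly different quartile rules and may give slightly different values. The function name `quartile_deviation` is an assumption.

```python
from statistics import quantiles

dist1 = [32, 35, 36, 36, 37, 38, 40, 42, 42, 43, 43, 45]
dist2 = [32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 45]

def quartile_deviation(values):
    """Return (Q.D, C.Q.D) using the default quartile rule of statistics.quantiles."""
    q1, _, q3 = quantiles(values, n=4)   # Q1, median, Q3
    qd = (q3 - q1) / 2
    cqd = (q3 - q1) / (q3 + q1)
    return qd, cqd

qd1, cqd1 = quartile_deviation(dist1)
qd2, cqd2 = quartile_deviation(dist2)
print("Distribution 1: Q.D =", qd1, "C.Q.D =", round(cqd1, 4))
print("Distribution 2: Q.D =", qd2, "C.Q.D =", round(cqd2, 4))
```

Unlike the range, the quartile deviation separates the two distributions, because the single extreme value 45 in Distribution 2 lies outside the middle 50%.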
4.3.3 The Mean Deviation and Coefficient Of Mean Deviation
The Mean Deviation (M.D): The mean deviation of a set of items is defined as the arithmetic mean
of the absolute deviations of the values from a given average. Depending upon the type of average
used, we have different mean deviations.
a) Mean deviation about the mean:

$MD = \frac{1}{n}\sum_{i=1}^{n} |X_i - \bar{X}|$.
For frequency distribution data, where the values X1, X2, X3, …, Xm occur f1, f2, f3, …,
fm times respectively, the mean deviation is obtained by:

$MD = \frac{\sum_{i=1}^{m} f_i |X_i - \bar{X}|}{\sum_{i=1}^{m} f_i}$.
For grouped data, that is, data given in the form of a frequency distribution of k classes in
which $m_i$ and $f_i$ are the class mark and frequency of the ith class respectively, the mean
deviation is given by:

$MD = \frac{\sum_{i=1}^{k} f_i |m_i - \bar{X}|}{\sum_{i=1}^{k} f_i}$.
b) Mean deviation from the median: $MD = \frac{1}{n}\sum_{i=1}^{n} |X_i - Md|$
c) Mean deviation from the mode: $MD = \frac{1}{n}\sum_{i=1}^{n} |X_i - Mode|$
In the case of a frequency distribution, with $n = \sum f_i$:
b') Mean deviation from the median: $MD = \frac{1}{n}\sum f_i |X_i - Md|$
c') Mean deviation from the mode: $MD = \frac{1}{n}\sum f_i |X_i - Mode|$
Steps to calculate M.D:
1. Find the arithmetic mean, $\bar{X}$.
2. Find the deviation of each reading from $\bar{X}$.
3. Find the arithmetic mean of the deviations, ignoring sign.
Example: Calculate the mean deviation about the mean for the following data:
Xi    10    8    9    7    6
fi     8    9   13    6    3

Solution: First find the mean as $\bar{X} = \frac{\sum f_i X_i}{\sum f_i}$ = (10*8 + 8*9 + … + 6*3)/(8 + 9 + … + 3) = 329/39 ≈ 8.4, then
Xi                        10      8      9      7      6
fi                         8      9     13      6      3
$|X_i - \bar{X}|$         1.6    0.4    0.6    1.4    2.4
$f_i|X_i - \bar{X}|$     12.8    3.6    7.8    8.4    7.2

Thus, $MD = \frac{\sum f_i |X_i - \bar{X}|}{\sum f_i}$ = (12.8 + … + 7.2)/(8 + … + 3) = 39.8/39 = 1.02.


Interpretation: each value deviates on average 1.02 from the arithmetic mean, 8.4.
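The three steps can be sketched in Python for frequency data as follows. The function name `mean_deviation` is an assumption, and the code uses the unrounded mean (329/39) rather than 8.4, so digits beyond the second decimal place may differ slightly from the hand calculation.

```python
values = [10, 8, 9, 7, 6]
freqs = [8, 9, 13, 6, 3]

def mean_deviation(values, freqs):
    """Mean deviation about the mean for frequency-distribution data."""
    n = sum(freqs)
    # Step 1: arithmetic mean, weighting each value by its frequency
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    # Steps 2 and 3: average of the absolute deviations from the mean
    md = sum(f * abs(x - mean) for x, f in zip(values, freqs)) / n
    return mean, md

mean, md = mean_deviation(values, freqs)
print("mean =", round(mean, 1), "MD =", round(md, 2))
```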
Note: You can also calculate the mean deviation about the Median and Mode.
Coefficient of Mean Deviation (C.M.D):

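$CMD = \frac{MD}{\bar{X}}$ for the mean deviation about the mean; for the mean deviation about the median or the mode, divide by the median or the mode instead.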
Exercise: find the coefficient of mean deviation about the mean for the above example.
4.3.4 The Variance, Standard Deviation and the Coefficient of Variation
The Variance: The variance is the "average squared deviation from the mean"; that is, it measures the
average of the squared deviations of the observations from the mean.
Suppose we have a population of N observations, say X1, X2, X3, …, XN; then we define the
population variance as:

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N}X_i^2 - \mu^2$.
But most of the time we have a sample of n observations, say X1, X2, X3, …, Xn, from the population
of N; then we define the sample variance as:

$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$.
This measure of variation is universally used to show the scatter of the individual measurements
around the mean of all the measurements in a given distribution. But the disadvantage is that the
units of variance are the square of the units of the original observations. The easiest way around this
difficulty is to use the square root of the variance as a measure of variability, called the standard
deviation.

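The population and the sample standard deviations, denoted by σ and S respectively, are defined as:

$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2}$ and $S = \sqrt{S^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$.
For frequency distribution data, the population and sample variances are given as:

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{m} f_i (X_i - \mu)^2$ and $S^2 = \frac{1}{n-1}\sum_{i=1}^{m} f_i (X_i - \bar{X})^2$, where N (or n) is the total frequency $\sum f_i$,
and the square roots of these will give the corresponding standard deviations.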
Variance and Standard Deviation for Grouped Data
To obtain the variance and standard deviation of data presented in a grouped frequency distribution,
we make the same assumption that was made in the calculation of the mean for grouped data: the
observations falling into each class are represented by the class mark. The calculation uses the
same formula as for data given in a frequency distribution, except that Xi is replaced by the
midpoint of each class and m by k.
The following steps are used to calculate the sample variance:
1. Find the arithmetic mean.
2. Find the difference between each observation and the mean.
3. Square these differences.
4. Sum the squared differences.
5. Since the data is a sample, divide the number (from step 4 above) by the number of
observations minus one, (i.e., n-1), where n is the number of observations in the data set.
Example: The areas of sprayable surfaces with DDT from a sample of 15 houses are as follows (m²):
101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145. Find the variance and
standard deviation of the above distribution.
Solution: The mean of the sample is 125 m², then

$S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$ = {(101-125)² + (105-125)² + … + (145-125)²}/(15-1) = 2502/14 = 178.71 m⁴.


Hence, the standard deviation is S = (178.71 m⁴)^(1/2) = 13.37 m².
This implies that the sprayable surface area of a house deviates from the mean by about 13.37 m² on average.
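The five steps listed earlier can be traced in a short Python sketch on the same 15 areas. The function name `sample_variance` is an assumption; the standard library functions `statistics.variance` and `statistics.stdev` compute the same quantities directly.

```python
from math import sqrt

areas = [101, 105, 110, 114, 115, 124, 125, 125,
         130, 133, 135, 136, 137, 140, 145]

def sample_variance(xs):
    n = len(xs)
    mean = sum(xs) / n                        # step 1: arithmetic mean
    diffs = [x - mean for x in xs]            # step 2: deviations from the mean
    squared = [d ** 2 for d in diffs]         # step 3: square the deviations
    total = sum(squared)                      # step 4: sum of squared deviations
    return total / (n - 1)                    # step 5: divide by n - 1 (sample)

s2 = sample_variance(areas)
s = sqrt(s2)
print("variance =", round(s2, 2), "standard deviation =", round(s, 2))
```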
Examples: Find the variance and standard deviation of the following sample data.
a) 5, 17, 12, 10.
b) The data given in the form of a grouped frequency distribution:

Class     Frequency
40-44     7
45-49     10
50-54     22
55-59     15
60-64     12
65-69     6
70-74     3

Solutions: a) $\bar{X} = 11$
Xi                       5     10    12    17    Total
$(X_i - \bar{X})^2$      36    1     1     36    74

Then S² = 74/(4-1) = 24.67 and S = (24.67)^(1/2) = 4.97


b) $\bar{X} = 55$
mi (midpoint)              42      47     52     57    62     67     72     Total
$f_i(m_i - \bar{X})^2$     1183    640    198    60    588    864    867    4400

Then S² = 4400/(75-1) = 59.46 and S = (59.46)^(1/2) = 7.71
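Part b) can be reproduced with a small Python sketch in which each class is represented by its midpoint. The tuple layout `(lower limit, upper limit, frequency)` and the function name `grouped_variance` are illustrative assumptions.

```python
from math import sqrt

# (lower class limit, upper class limit, frequency)
classes = [(40, 44, 7), (45, 49, 10), (50, 54, 22), (55, 59, 15),
           (60, 64, 12), (65, 69, 6), (70, 74, 3)]

def grouped_variance(classes):
    """Sample mean and variance of grouped data, using class marks (midpoints)."""
    mids = [(lo + hi) / 2 for lo, hi, _ in classes]
    freqs = [f for _, _, f in classes]
    n = sum(freqs)
    mean = sum(f * m for m, f in zip(mids, freqs)) / n
    ss = sum(f * (m - mean) ** 2 for m, f in zip(mids, freqs))
    return mean, ss / (n - 1)

g_mean, g_var = grouped_variance(classes)
g_sd = sqrt(g_var)
print("mean =", g_mean, "variance =", round(g_var, 2), "SD =", round(g_sd, 2))
```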


Some Important Properties of Variance and Standard Deviation
1. For a normal (symmetric, bell-shaped) distribution the following holds.
• Approximately 68.27% of the data values fall within one standard deviation of the mean, i.e.
within $(\bar{X} - S, \bar{X} + S)$.
• Approximately 95.45% of the data values fall within two standard deviations of the mean, i.e.
within $(\bar{X} - 2S, \bar{X} + 2S)$.
• Approximately 99.73% of the data values fall within three standard deviations of the mean, i.e.
within $(\bar{X} - 3S, \bar{X} + 3S)$.
2. Chebyshev's Theorem
For any data set, no matter what the pattern of variation, the proportion of the values that fall within
k standard deviations of the mean, i.e. within $(\bar{X} - kS, \bar{X} + kS)$, will be at least $1 - \frac{1}{k^2}$, where k is a number
greater than 1. That is, the proportion of items falling beyond k standard deviations of the mean is at
most $\frac{1}{k^2}$.
Example: Suppose a distribution has mean 50 and standard deviation 6. What percent of the
numbers are:
a) Between 38 and 62
b) Between 32 and 68
c) Less than 38 or more than 62.
d) Less than 32 or more than 68.
Solutions:
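a) 38 and 62 are at equal distance from the mean, 50, and this distance is 12.
$\Rightarrow kS = 12 \Rightarrow k = \frac{12}{S} = \frac{12}{6} = 2$
Applying the above theorem, at least $\left(1 - \frac{1}{k^2}\right) \times 100\% = 75\%$ of the numbers lie between
38 and 62.
b) Similarly, 32 and 68 are each 18 from the mean, so k = 18/6 = 3 and at least (1 - 1/9) × 100% ≈ 88.89% of
the numbers lie between 32 and 68.
c) This is just the complement of a), i.e. at most $\frac{1}{k^2} \times 100\% = 25\%$ of the numbers are less than 38 or
more than 62.
d) Similarly, as the complement of b), at most (1/9) × 100% ≈ 11.11% of the numbers are less than 32 or
more than 68.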
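The four parts of this example follow one pattern, which the Python sketch below makes explicit. The function name `chebyshev_bounds` is an assumption, and the sketch only handles intervals that are symmetric about the mean, as in the example.

```python
def chebyshev_bounds(mean, sd, lower, upper):
    """Return (k, minimum proportion inside, maximum proportion outside)
    for an interval (lower, upper) that is symmetric about the mean."""
    if abs((mean - lower) - (upper - mean)) > 1e-12:
        raise ValueError("interval must be symmetric about the mean")
    k = (upper - mean) / sd
    if k <= 1:
        raise ValueError("Chebyshev's theorem needs k > 1")
    inside = 1 - 1 / k ** 2
    return k, inside, 1 - inside

for lo, hi in [(38, 62), (32, 68)]:
    k, inside, outside = chebyshev_bounds(50, 6, lo, hi)
    print(f"({lo}, {hi}): k = {k}, at least {inside:.2%} inside, at most {outside:.2%} outside")
```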
3. Consider a sample X1, …, Xn, which will be referred to as the original sample. To create a
translated sample, add a constant C to each data point: let Yi = Xi + C, i = 1, …, n.
For the standard deviation of the translated sample, we can show that
the following relationship holds: if Yi = Xi + C, i = 1, …, n, then Sy = Sx. Therefore, the
standard deviation of Y will be the same as the standard deviation of X.
4. What happens to the standard deviation if the units or scales being worked with are changed? A
re-scaled sample can be created: if Yi = CXi, i = 1, …, n, then Sy = |C|Sx and S²y = C²S²x.
Therefore, to find the variance and standard deviation of the Y's, compute the variance and
standard deviation of the X's and multiply them by C² and |C|, respectively.
Example: If we have a sample of temperatures in °C with a standard deviation of 1.8, then
what is the standard deviation of the same sample of temperatures in °F?
Solution: Let Yi denote the °F temperature that corresponds to a °C temperature of Xi.

The required transformation to convert the data to °F is Yi = (9/5)Xi + 32, i = 1,
2, 3, …, n. Adding the constant 32 does not change the standard deviation, so the standard deviation in °F is Sy = (9/5)(1.8) = 3.24 °F.
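Both properties can be checked numerically with a short Python sketch. The five Celsius readings below are hypothetical values made up for illustration; any sample would show the same relationships.

```python
from statistics import stdev

# Hypothetical temperatures in degrees Celsius (illustrative data only)
celsius = [20.0, 22.5, 19.0, 24.0, 21.5]

shifted = [c + 32 for c in celsius]              # translation: Y = X + C
scaled = [9 / 5 * c for c in celsius]            # re-scaling:  Y = C * X
fahrenheit = [9 / 5 * c + 32 for c in celsius]   # both together

s_c = stdev(celsius)
print("S of X         :", round(s_c, 4))
print("S of X + 32    :", round(stdev(shifted), 4))     # unchanged
print("S of (9/5)X    :", round(stdev(scaled), 4))      # multiplied by 9/5
print("S of Fahrenheit:", round(stdev(fahrenheit), 4))  # multiplied by 9/5
```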
5. On the other hand, where several standard deviations for a variable are available and we
need to compute the combined standard deviation, the pooled standard deviation (Sp) of the
entire group consisting of all the samples may be computed as:

$S_p = \sqrt{\frac{\sum_i (n_i - 1)S_i^2}{\sum_i (n_i - 1)}}$, where $n_i$ and $S_i$ represent the number of observations and the standard
deviation of each single sample, respectively.
6. The value of S is never negative, and it is zero only when all of the data values are the same.
Values close together will yield a small SD, whereas values spread apart will yield a larger SD.
Also, larger values of S indicate a greater amount of variation.

Example: The standard deviation of systolic blood pressure was found to be 10.6 and 15.2 mm Hg,
respectively, for two groups of 12 and 15 men. What is the standard deviation of systolic pressure of
all the 27 men?
Solution: Given: Group 1: S1 = 10.6 and n1 = 12; Group 2: S2 = 15.2 and n2 = 15, then

$S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}}$ = {(11*10.6² + 14*15.2²)/(11 + 14)}^(1/2) = 13.37 mm Hg.
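A minimal Python sketch of the pooled standard deviation, assuming the usual form in which each sample variance is weighted by its degrees of freedom (n - 1); the function name `pooled_sd` is an assumption.

```python
from math import sqrt

def pooled_sd(sizes, sds):
    """Pooled SD: square root of the (n_i - 1)-weighted average of the sample variances."""
    numerator = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    denominator = sum(n - 1 for n in sizes)
    return sqrt(numerator / denominator)

sp = pooled_sd([12, 15], [10.6, 15.2])
print("pooled SD =", round(sp, 2), "mm Hg")
```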

Coefficient of Variation (CV): The coefficient of variation (CV) is defined by $CV = \frac{S}{\bar{X}} \times 100\%$. The
coefficient of variation is most useful in comparing the variability of several different samples, each
with different means. This is because a higher variability is usually expected when the mean
increases, and the CV is a measure that accounts for this dependence.
The coefficient of variation is also useful for comparing the reproducibility of different variables.
CV is a relative measure free from unit of measurement.
Example: An analysis of the monthly wages paid (in Birr) to workers in two firms A and B
belonging to the same industry gives the following results. In which firm is there greater
variability in individual wages?
Value          Firm A    Firm B
Mean wage      52.5      47.5
Median wage    50.5      45.5
Variance       100       121

Solution: $C.V_A = \frac{S_A}{\bar{X}_A} \times 100\%$ = (10/52.5) × 100% = 19.05% and

$C.V_B = \frac{S_B}{\bar{X}_B} \times 100\%$ = (11/47.5) × 100% = 23.16%.


Since C.VA < C.VB, in firm B there is greater variability in individual wages.
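The comparison can be reproduced with a few lines of Python. The function name `coefficient_of_variation` is an assumption; it takes the variance, as given in the table, and converts it to a standard deviation first.

```python
from math import sqrt

def coefficient_of_variation(mean, variance):
    """CV in percent: (standard deviation / mean) * 100."""
    return sqrt(variance) / mean * 100

cv_a = coefficient_of_variation(52.5, 100)
cv_b = coefficient_of_variation(47.5, 121)
print("CV of firm A =", round(cv_a, 2), "%")
print("CV of firm B =", round(cv_b, 2), "%")
print("More variable firm:", "A" if cv_a > cv_b else "B")
```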
Exercises 4.1
1. Find the missing information from the following data.
                      Group 1    Group 2    All groups
Mean                  55         70         60
Sample size           100        ?          150
Standard deviation    15         10         ?
2. A meteorologist interested in the consistency of temperatures in three cities during a given week
collected the following data. The temperatures for the five days of the week in the three cities were
City 1    25 24 23 26 17
City 2    22 21 24 22 20
City 3    32 27 35 24 28
Which city has the most consistent temperature, based on these data?
4.5 The standard Score (Z-score)

The Z-score is the number of standard deviations that a given value X is below or above the mean

and is defined as $Z = \frac{X - \bar{X}}{S}$ (for sample data sets) and $Z = \frac{X - \mu}{\sigma}$ (for population data sets).
Values above the mean have positive Z-scores and values below the mean have negative Z-scores.
The numerical value of the Z-score reflects the relative standing of the value; because of this, the Z-score
is also referred to as a measure of relative standing. Scores are generally meaningless by themselves unless they are
compared to the distribution of scores from some reference group. In addition to comparing
data sets, the Z-score is useful for transforming a given data set into a new distribution whose
mean is zero and whose variance is one, as in the standard normal distribution (we will see it in
the chapters on hypothesis testing).
Note: A Z-score less than -2 or greater than 2 is considered an unusual value, while a Z-score between -2
and 2 is considered an ordinary value.
Examples 1. Two sections were given introduction to statistics examinations. The following
information was given.
Value Section 1 Section 2
Mean 78 90
Standard deviation 6 5
Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively speaking
who performed better?

Solution: $Z_A = \frac{X_A - \bar{X}_1}{S_1}$ = (90-78)/6 = 2 and $Z_B = \frac{X_B - \bar{X}_2}{S_2}$ = (95-90)/5 = 1.


Student A performed better relative to his section because the score of student A is two standard
deviations above the mean score of his section, while the score of student B is only one standard
deviation above the mean score of his section.
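The same comparison as a minimal Python sketch; the function name `z_score` is an assumption.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_a = z_score(90, 78, 6)   # student A, section 1
z_b = z_score(95, 90, 5)   # student B, section 2
print("Z_A =", z_a, "Z_B =", z_b)
print("Relatively better:", "A" if z_a > z_b else "B")
```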
Exercise 4.2
1. Two groups of people were trained for a 100km race and tested to find out which group is faster to
complete the race. For the two groups the following information was given:
Value        Group one    Group two
Mean         10.4 min     11.9 min
Stan. dev.   1.2 min      1.3 min
Relatively speaking:
a. Which group is more consistent in its performance?
b. Suppose person A from group one takes 9.2 minutes while person B from group two takes 9.3
minutes. Who was faster in completing the race? Why?
4.6. Moments, Skewness and Kurtosis
In describing a numerical data set it is not only necessary to summarize the data by presenting
appropriate measures of central tendency, dispersion and relative standing, it is also necessary to
consider the shape of the data – the manner, in which the data are distributed. There are two
measures of the shape of a data set: skewness and kurtosis.
Moments
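Moments are statistical measures used to describe the characteristics of a distribution. We can
have moments about any number A and/or about the mean (called central moments).
The rth moment of the distribution about the mean is:
$M_r = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^r$
for an ungrouped data set and
$M_r = \frac{1}{n}\sum_{i=1}^{k} f_i (X_i - \bar{X})^r$, where $n = \sum f_i$,
for a grouped data set.
The rth moment of the distribution about A is:
$M'_r = \frac{1}{n}\sum_{i=1}^{n}(X_i - A)^r$
for an ungrouped data set and
$M'_r = \frac{1}{n}\sum_{i=1}^{k} f_i (X_i - A)^r$
for a grouped data set.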
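A small Python sketch of moments for ungrouped data, assuming the divisor n (not n - 1) that is usual in the definition of moments. The function names `moment_about` and `central_moment` are hypothetical, and the data are the four values used in the earlier variance example.

```python
def moment_about(xs, r, a):
    """r-th moment of the data about the number a."""
    return sum((x - a) ** r for x in xs) / len(xs)

def central_moment(xs, r):
    """r-th moment about the mean (central moment)."""
    mean = sum(xs) / len(xs)
    return moment_about(xs, r, mean)

data = [5, 17, 12, 10]
print("1st central moment:", central_moment(data, 1))   # always zero
print("2nd central moment:", central_moment(data, 2))   # sum of squares 74 divided by n = 4
print("1st moment about A = 10:", moment_about(data, 1, 10))
```

Note that the second central moment divides by n, so it is slightly smaller than the sample variance 74/(4-1) computed earlier.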
Skewness
If the distribution of the data is not symmetrical, it is called asymmetrical or skewed. Skewness
characterizes the degree of asymmetry of a distribution around its mean.
The direction of the skewness depends upon the location of the extreme values. If the extreme
values are the larger observations, the mean will be the measure of location most greatly distorted
toward the upward direction. Since the mean exceeds the median and the mode, such distribution is
said to be positive or right-skewed. The tail of its distribution is extended to the right.
On the other hand, if the extreme values are the smaller observations, the mean will be the measure
of location most greatly reduced. Since the mean is exceeded by the median and the mode, such
distribution is said to be negative or left-skewed. The tail of its distribution is extended to the left.

[Figure: a right-skewed distribution and a left-skewed distribution]
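For sample data, the skewness is defined by the formula:
$Sk = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3$, where n is the number of observations in the sample and s is the standard deviation of the sample.
It is also possible to find skewness as: $SK = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$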
Properties of Skewness
• If SK = 0, then the distribution is symmetrical.
• If SK > 0, then the distribution is positively skewed.
• If SK < 0, then the distribution is negatively skewed.
• There is no theoretical limit to this measure; however, in practice the value given by
this formula falls between -3 and 3.
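As an illustration, the mode-based measure SK = (Mean - Mode)/Standard deviation can be sketched in Python. The function name `pearson_mode_skewness` is an assumption, the first data set is Distribution 2 from the range example (one large extreme value, 45), and the second is a small symmetric set made up for contrast.

```python
from statistics import mean, mode, stdev

def pearson_mode_skewness(xs):
    """SK = (mean - mode) / standard deviation; assumes the data have a single clear mode."""
    return (mean(xs) - mode(xs)) / stdev(xs)

right_tailed = [32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 45]
symmetric = [1, 2, 2, 3]   # illustrative symmetric data

sk_right = pearson_mode_skewness(right_tailed)
sk_sym = pearson_mode_skewness(symmetric)
print("SK (right-tailed data) =", round(sk_right, 3))   # positive: mean pulled above the mode
print("SK (symmetric data)    =", sk_sym)               # zero
```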
Kurtosis
Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the
bell-shaped distribution (normal distribution); that is, kurtosis is a measure of the degree of peakedness of a
distribution.
If a distribution is more peaked than a normal distribution, it is called a leptokurtic distribution;
if it is flatter, it is called platykurtic; and if it is moderate (normal), we call it mesokurtic.
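Kurtosis of a sample data set is calculated directly from the data by the formula:
$\text{Kurtosis} = \left\{\frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^4\right\} - \frac{3(n-1)^2}{(n-2)(n-3)}$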
It is also possible to calculate the measure of kurtosis from the moments about the mean of the
sample data as: $\beta_2 = \frac{M_4}{M_2^2}$, where $M_4$ is the 4th moment about the mean and $M_2$ is the 2nd moment about the mean.
Interpretation of the value of $\beta_2$
1. If $\beta_2 = 3$, then the distribution is mesokurtic.
2. If $\beta_2 > 3$, then the distribution is leptokurtic.
3. If $\beta_2 < 3$, then the distribution is platykurtic.
If we want our reference point to be zero, we can change the above coefficient as: φ = $\beta_2$ - 3.
Accordingly, If φ = 0, then the distribution is said to be mesokurtic
If φ > 0, then the distribution is said to be leptokurtic
If φ < 0, then the distribution is said to be platykurtic
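A rough Python sketch of the moment-based coefficient, assuming it is the standard ratio of the 4th central moment to the square of the 2nd central moment, with φ equal to that ratio minus 3. The function names and the five-value data set are illustrative assumptions.

```python
def central_moment(xs, r):
    """r-th moment about the mean, with divisor n."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** r for x in xs) / len(xs)

def kurtosis_beta2(xs):
    """Moment coefficient of kurtosis: M4 / M2**2."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

def classify(beta2, tol=1e-9):
    """Classify using phi = beta2 - 3 as the zero-referenced coefficient."""
    phi = beta2 - 3
    if abs(phi) <= tol:
        return "mesokurtic"
    return "leptokurtic" if phi > 0 else "platykurtic"

data = [1, 2, 3, 4, 5]   # evenly spread values, flatter than a bell shape
b2 = kurtosis_beta2(data)
print("beta2 =", round(b2, 2), "phi =", round(b2 - 3, 2), "->", classify(b2))
```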

[Figure: distributions with positive and negative kurtosis]
