Week_4
The various measures of central tendency (averages) give us a single value that
represents the entire data set. But the average alone cannot adequately describe a set
of observations, because measures of central value give no idea of how the
observations are spread around that value. For this reason, it is necessary to study
the dispersion (variability) along with the average when describing a data set.
Example: Populations or samples can have similar means while their variations are
very different.
Consider two datasets:
Both have the same mean, 22, but the variation in the second dataset is larger.
Measure of Dispersion
The measurement of the scatter of the values of a data set among themselves is
called a measure of dispersion or measure of variation. A measure of dispersion
conveys information regarding the amount of variability present in a set of data. If
all the values are the same, there is no dispersion; if they are not all the same,
dispersion is present in the data. The amount of dispersion may be small when the
values, though different, are close together. If the values are widely scattered, the
dispersion is greater. Other terms used synonymously with dispersion include
variation, spread, and scatter.
Figure 2.5.1 shows the frequency polygons for two populations that have equal
means but different amounts of variability. Population B, which is more variable
than population A, is more spread out.
(a) Range
(b) Variance or Standard Deviation
The Range:
The range of a set of data values is the difference between the highest and the
lowest values in the set. If x_l and x_s are the largest and smallest values,
respectively, in a set, the range, denoted by R, is defined as
R = x_l − x_s.
Example: Calculate sample range for the following data:
2, 4, 5, 8
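As a small check of the definition, the range for this example can be computed directly. This is only a sketch; the variable names are illustrative.

```python
# Sample range for the data in the example: R = largest - smallest.
data = [2, 4, 5, 8]

largest = max(data)    # x_l, the largest value
smallest = min(data)   # x_s, the smallest value
sample_range = largest - smallest

print(sample_range)  # 6
```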
Note:
1. The value of the range is always non-negative.
2. The unit of the range is the same as the unit of the data.
3. The range is a crude measure of variation, since it takes into
account only two of the values (the largest and the smallest); however,
it plays a significant role in some applications.
Variance:
The variance represents squared units and, therefore, is not an appropriate
measure of dispersion when we wish to express this concept in terms of the
original units.
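To illustrate the point about units, the sketch below computes the sample variance and the sample standard deviation for the values 2, 4, 5, 8 used in the range example. It assumes the usual sample definitions with divisor n − 1, which is what Python's `statistics.variance` and `statistics.stdev` use.

```python
import statistics

data = [2, 4, 5, 8]  # same values as the range example; units are arbitrary

s2 = statistics.variance(data)  # sample variance (divisor n - 1), in squared units
s = statistics.stdev(data)      # sample standard deviation, in the original units

print(s2)
print(s)
```

The standard deviation is the square root of the variance, which is why it returns to the original units of the data.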
Example:
Interquartile Range: A measure that reflects the variability among the middle 50
percent of the observations in a data set is the interquartile range. Interquartile
range, denoted by IQR, is defined as
𝐼𝑄𝑅 = 𝑄3 − 𝑄1 ,
where 𝑄1 and 𝑄3 are the first and third quartiles, respectively. A large IQR
indicates a large amount of variability among the middle 50 percent of the relevant
observations, and a small IQR indicates a small amount of variability among the
relevant observations.
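A minimal sketch of the IQR computation follows. The data are hypothetical and chosen only for illustration. Quartile conventions differ between textbooks and software; this sketch assumes the default "exclusive" method of Python's `statistics.quantiles`, so other conventions may give slightly different quartiles.

```python
import statistics

# Hypothetical data, chosen only for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# With n=4, statistics.quantiles returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # spread of the middle 50 percent of the observations

print(q1, q3, iqr)
```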
Box-and-Whisker Plot
A useful visual device for communicating the information contained in a data set is
the box-and-whisker plot. The construction of a box-and-whisker plot (sometimes
called, simply, a box plot) makes use of the quartiles of a data set. It depicts the
central tendency and variability of the data, and displays whether there are any
potential outliers (extreme observations).
Steps in drawing Box-and-Whisker plot for a given data set are
1. Compute 𝑄1 , 𝑄2 (Median), and 𝑄3 .
2. Calculate inner fences and outer fences:
Inner fences: Q1 – 1.5(IQR)
Q3 + 1.5(IQR)
Outer fences: Q1 – 3(IQR)
Q3 + 3(IQR)
3. Identify the smallest observation, a, and the largest observation, b,
that lie between the inner fences.
4. Draw a box that extends from Q1 to Q3. Draw a vertical line
through the box at the median.
5. Draw whiskers as lines that extend below Q1 and above Q3. Draw
one whisker from Q1 to a and the other whisker from Q3 to b.
6. Measurements that are located between inner and outer fences are
called mild outliers. Plot them using the symbol *.
7. Measurements that are located outside the outer fences are called
extreme outliers. Plot them using the symbol o.
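The fence calculations in the steps above can be sketched as follows. The function name, the hypothetical data, and the treatment of values lying exactly on a fence (counted as inside the inner fences, or as mild on the outer fences) are assumptions made for illustration. The quartiles are passed in as arguments so that any quartile convention can be used; here they come from `statistics.quantiles` with its default method.

```python
import statistics

def classify_outliers(data, q1, q3):
    """Return whisker ends and outliers using inner and outer fences."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    # Whisker ends: smallest (a) and largest (b) values between the inner fences.
    inside = [x for x in data if inner_low <= x <= inner_high]
    a, b = min(inside), max(inside)

    mild, extreme = [], []
    for x in data:
        if x < outer_low or x > outer_high:
            extreme.append(x)   # outside the outer fences
        elif x < inner_low or x > inner_high:
            mild.append(x)      # between the inner and outer fences
    return {"whiskers": (a, b), "mild": mild, "extreme": extreme}

# Hypothetical data, chosen only for illustration.
data = [-30, 2, 10, 12, 14, 15, 16, 18, 20, 36, 55]
q1, _, q3 = statistics.quantiles(data, n=4)
result = classify_outliers(data, q1, q3)
print(result)
```

For these data Q1 = 10 and Q3 = 20, so the inner fences are −5 and 35 and the outer fences are −20 and 50.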
Example:
Draw Box-and-Whisker plot for 20 customer satisfaction scores:
1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10
3. Identify the smallest (a) and largest (b) observations that are between the
inner fences
4. Draw a box that extends from Q1 to Q3. Draw a vertical line through the box
at Md
5. Draw whiskers as lines that extend below Q1 and above Q3. Draw one
whisker from Q1 to (a) and the other whisker from Q3 to (b).
6. Measurements that are located between inner and outer fences are called
mild outliers. Plot them using the symbol *
7. Measurements that are located outside the outer fences are called extreme
outliers. Plot them using the symbol o
There are two mild outliers and one extreme outlier in the dataset
Empirical Rule (Rule 68-95-99.7)
If a distribution is bell-shaped and symmetric about 𝜇, we expect approximately
68% of the observations to fall in the interval (𝜇 − 𝜎, 𝜇 + 𝜎)
(within one standard deviation of the mean),
95% of the observations to fall in the interval (𝜇 − 2𝜎, 𝜇 + 2𝜎)
(within two standard deviations of the mean), and
99.7% of the observations to fall in the interval (𝜇 − 3𝜎, 𝜇 + 3𝜎)
(within three standard deviations of the mean),
where 𝜇 is the population mean and 𝜎 is the population standard deviation.
Example 1: A study was conducted on the IQ scores of the employees of a
private firm. The scores were found to follow a normal distribution (bell-shaped
and symmetric) with mean 100 and standard deviation 15.
Estimate the percentage of the scores that fall between 70 and 130.
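One way to approach Example 1 is to express the interval endpoints as a number of standard deviations from the mean and then read off the Empirical Rule. The sketch below does this; the function name and the lookup table are illustrative, and the function only handles intervals of the form (μ − kσ, μ + kσ) with k = 1, 2 or 3.

```python
import math

# Approximate coverage under the Empirical Rule, keyed by k standard deviations.
EMPIRICAL_PERCENT = {1: 68.0, 2: 95.0, 3: 99.7}

def empirical_percent(mu, sigma, low, high):
    """Approximate % of a bell-shaped distribution lying in (low, high).

    Returns None unless (low, high) equals (mu - k*sigma, mu + k*sigma)
    for k = 1, 2 or 3.
    """
    k_low = (mu - low) / sigma    # standard deviations below the mean
    k_high = (high - mu) / sigma  # standard deviations above the mean
    for k, pct in EMPIRICAL_PERCENT.items():
        if math.isclose(k_low, k) and math.isclose(k_high, k):
            return pct
    return None

# Example 1: 70 = 100 - 2(15) and 130 = 100 + 2(15), so k = 2.
print(empirical_percent(mu=100, sigma=15, low=70, high=130))  # 95.0
```

So approximately 95% of the IQ scores are expected to fall between 70 and 130.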
Example 2: The scores on an entrance test for high school graduates in a
particular year were bell-shaped and symmetric. If the mean and standard deviation
were 490 and 100, respectively, then
(a) What percentage of students scored between 390 and 590 on this test?
(b) The score of a student was 795. What can you say about this
performance compared with the rest of the scores?
Solution:
a) Since 590 = 490 + 100 = μ + σ
and 390 = 490 − 100 = μ − σ,
we can say that approximately 68% of the students scored between 390 and
590 on this test.
b) Since 490 + 3 × 100 = 790 = μ + 3σ
and 490 − 3 × 100 = 190 = μ − 3σ,
we can say that about 99.7% of the test scores lie between 190 and 790. Since 795
lies above 790, it is one of the highest scores.
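The same conclusion can be reached by expressing the score as a number of standard deviations above the mean. This is a short sketch with illustrative variable names.

```python
mu, sigma = 490, 100   # mean and standard deviation from Example 2
score = 795

# Number of standard deviations by which the score exceeds the mean.
z = (score - mu) / sigma
print(z)
```

A value slightly above 3 places the score just outside the interval (μ − 3σ, μ + 3σ) that contains about 99.7% of the scores.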
Problem: If the average age of retirement for the entire population in a country is
64 years and the distribution is normal (bell shaped symmetric) with a standard
deviation of 3.5 years, what is the approximate age range in which 95% of people
retire?
78 64 76 82 68 69 67 79 69 74 83 76 71 84 72 75 72 71 83 76 68 75 73 69 73 76
77 68 71 72 70 77 71 74 75 75 75 71 74 71 70 76 64 65 76 78 70 69 82 77
x̄ = 73.42, s = 4.82
The scores that actually fall between 68.60 and 78.24 (within one standard
deviation of the mean) are:
69 69 69 69 70 70 70 71 71 71 71 71 71 72 72 72 73 73 74 74 74 75 75 75 75 75
76 76 76 76 76 76 77 77 77 78 78
There are 37 of them out of 50, so the actual percentage is 74%, compared with
the 68% estimated by the Empirical Rule.
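The count above can be reproduced with a short script. This sketch assumes s is the sample standard deviation (divisor n − 1), which matches the value 4.82 quoted for these 50 scores.

```python
import statistics

scores = [
    78, 64, 76, 82, 68, 69, 67, 79, 69, 74, 83, 76, 71, 84, 72, 75, 72,
    71, 83, 76, 68, 75, 73, 69, 73, 76, 77, 68, 71, 72, 70, 77, 71, 74,
    75, 75, 75, 71, 74, 71, 70, 76, 64, 65, 76, 78, 70, 69, 82, 77,
]

mean = statistics.mean(scores)
s = statistics.stdev(scores)  # sample standard deviation

# Scores strictly within one standard deviation of the mean.
low, high = mean - s, mean + s
within = [x for x in scores if low < x < high]
pct = 100 * len(within) / len(scores)

print(round(mean, 2), round(s, 2))  # 73.42 4.82
print(len(within), pct)             # 37 74.0
```

The gap between 74% and 68% is a reminder that the Empirical Rule is an approximation, and that a sample of 50 scores need not follow it closely.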
Chebyshev’s Theorem
This theorem is similar to the Empirical Rule, but it can be applied even when the
distribution is not bell-shaped and symmetric.
For any value k > 1, at least 100(1 − 1/k²)% of the population measurements
lie in the interval (𝜇 − 𝑘𝜎, 𝜇 + 𝑘𝜎).
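A minimal sketch of the bound follows. The function name is illustrative, and the values of k are chosen only as examples.

```python
def chebyshev_min_fraction(k):
    """Minimum fraction of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's bound requires k > 1")
    return 1 - 1 / k**2

for k in (2, 3):
    # Print the guaranteed minimum percentage for each k.
    print(k, round(100 * chebyshev_min_fraction(k), 1))
```

For k = 2 the guarantee is at least 75%, and for k = 3 at least about 88.9%. These are lower bounds that hold for any distribution, which is why they are weaker than the approximate 95% and 99.7% the Empirical Rule gives for bell-shaped data.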
Problem: A population data set of size N = 500 has mean μ = 5.2 and standard
deviation σ = 1.1. Find the minimum number of observations in the data set that
must lie:
Problem: Over the last decade, Amazon.com has sold the following number of
books (in millions):
103 106 114 177 111 162 148 119 120 144