
3. Descriptive Numerical Summary Measures

Uploaded by

Bekele S. Merga

Measures of Central Tendency and Dispersion


#3

19 February 2017 1
Mathematical Presentation (Summary Statistics)
• Summary statistics are single numbers which quantify the characteristics of a distribution of data
A. Measures of location
– Measures of central tendency
– Measures of non-central location (quartiles, percentiles)
B. Measures of dispersion
• A frequency distribution gives a general picture of the distribution of a variable
• But it cannot, by itself, indicate the average value or the spread of the values
Summary Measures: Population Parameters and Sample Statistics
• Measures of central tendency
– Mean
– Median
– Mode
– Weighted mean
• Measures of variability
– Range
– Inter-quartile range
– Variance
– Standard deviation
– Coefficient of variation
• Other summary measures
– Skewness
– Kurtosis
Measures of Central Tendency (MCT)
• On the scale of values of a variable there is a certain point around which the largest number of items tend to cluster.
• Since this point is usually in the centre of the distribution, the tendency of statistical data to concentrate around a certain value is called “central tendency”.
• The various methods of determining the point about which the observations tend to concentrate are called measures of central tendency (MCT).
MCT…
• The objective of calculating an MCT is to determine a single figure which may be used to represent the whole data set.
• In that sense it is an even more compact description of the statistical data than the frequency distribution.
• Since an MCT represents the entire data set, it facilitates comparison within one group or between groups of data.
MCT …
[Figure: histogram labelled “Position”; class intervals 0-9 to 90-99 on the horizontal axis]
Characteristics of a good MCT
An MCT is good or satisfactory if it possesses the following characteristics:
1. It should be based on all the observations
2. It should not be affected by extreme values
3. It should be as close to the maximum number of values as possible
4. It should have a definite value
5. It should not require complicated and tedious calculations
6. It should be capable of further algebraic treatment
7. It should be stable with regard to sampling
MCT …
• The most common measures of central tendency include:
– Arithmetic Mean
– Median
– Mode
Measures of Central Tendency or Location
• Median: the middle value when the data are sorted in order of magnitude; the 50th percentile
• Mode: the most frequently occurring value
• Mean: the average
1. Arithmetic Mean
A. Ungrouped Data
• The arithmetic mean is the "average" of the data set and by far the most widely used measure of central location
• It is usually denoted by $\bar{x}$
• It is the sum of all the observations divided by the total number of observations:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The Summation Notation

1. Arithmetic Mean…

• Example 1: Let x1 = 2, x2 = 5, x3 = 1, x4 = 4, x5 = 10, x6 = −5, x7 = 8
• Since there are 7 observations, i ranges from 1 to 7
– ∑xi = 2 + 5 + 1 + 4 + 10 − 5 + 8 = 25
– (∑xi)² = (25)² = 625
– ∑xi² = 4 + 25 + 1 + 16 + 100 + 25 + 64 = 235
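The square of the sum and the sum of the squares in Example 1 are easy to confuse, so a short Python check may help. This is only an illustrative sketch; the variable names are not from the slides.

```python
# Observations from Example 1
x = [2, 5, 1, 4, 10, -5, 8]

sum_x = sum(x)                            # sum of the values
square_of_sum = sum_x ** 2                # (sum of the values) squared
sum_of_squares = sum(v ** 2 for v in x)   # sum of the squared values

print(sum_x, square_of_sum, sum_of_squares)  # 25 625 235
```

The two quantities differ (625 versus 235), which is why the position of the exponent in the notation matters.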
Rules for working with summation
1) ∑(xi + yi) = ∑xi + ∑yi, where the number of x values equals the number of y values.
2) ∑k·xi = k × ∑xi, where k is a constant.
3) ∑k = n × k, where k is a constant and the sum runs over n terms.
1. Arithmetic Mean…
Example 2: The heart rates for n=10 patients were
as follows (beats per minute):
167, 120, 150, 125, 150, 140, 40, 136, 120, 150
What is the arithmetic mean for the heart rate of
these patients?
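As a quick check of Example 2, the mean can be computed with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration.

```python
from statistics import mean

# Heart rates (beats per minute) of the 10 patients in Example 2
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

mean_rate = mean(heart_rates)   # sum of the observations (1298) divided by n (10)
print(mean_rate)                # 129.8
```

Note how the single low value of 40 pulls the mean below most of the other observations.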

Exercise 1
• In a 2005 epidemiology course, the final exam results of ten 2nd-year public health students were: 40, 25, 40, 41, 50, 41, 39, 37, 39, 41
• What is the mean score?
Answer
Mean = (40 + 25 + 40 + 41 + 50 + 41 + 39 + 37 + 39 + 41) / 10 = 393 / 10 = 39.3
1. Arithmetic Mean …
b) Grouped data
In calculating the mean from grouped data, we assume that all values falling into a particular class interval are located at the mid-point of the interval. It is calculated as follows:
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where
k = the number of class intervals
m_i = the mid-point of the ith class interval
f_i = the frequency of the ith class interval
Example 3: Compute the mean age of 169 subjects from the grouped data.

Class interval   Mid-point (mi)   Frequency (fi)   mi × fi
10-19            14.5             4                58.0
20-29            24.5             66               1617.0
30-39            34.5             47               1621.5
40-49            44.5             36               1602.0
50-59            54.5             12               654.0
60-69            64.5             4                258.0
Total                             169              5810.5

Mean = 5810.5/169 = 34.38 years
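The grouped-mean calculation can be sketched in a few lines of Python. The function name `grouped_mean` is an illustrative choice, not something defined in the slides.

```python
def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    total = sum(m * f for m, f in zip(midpoints, frequencies))
    n = sum(frequencies)
    return total / n

# Mid-points and frequencies from the age table (169 subjects)
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

mean_age = grouped_mean(midpoints, frequencies)
print(round(mean_age, 2))  # 34.38
```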
1. Arithmetic Mean …
The mean can be thought of as a “balancing point” or “center of gravity” of the data.
• When the data are skewed, the mean is “dragged” in the direction of the skewness.
• In extreme cases all but one of the sample points can lie on one side of the arithmetic mean. In this case the mean is a poor measure of central location and does not reflect the center of the sample.
Characteristics of Arithmetic Mean
• It is influenced by each and every value in a data set.
• It is greatly affected by extreme values.
• The sum of the deviations about it is zero.
• The sum of the squared deviations from it is smaller than the sum computed about any other point.
• For a given set of data there is one and only one arithmetic mean (uniqueness).
• It is easy to calculate and understand (simple).
• For grouped data, it cannot be calculated if any class interval is open-ended.
Arithmetic Mean …
Advantages
• It is based on all values given in the distribution.
• It is the most easily understood.
• It is the most amenable to algebraic treatment.
Disadvantages
• It may be greatly affected by extreme items.
• When the distribution has open-end classes, its computation may not be valid.
2. Median
a) Ungrouped data
• An alternative measure of central location, perhaps second in popularity to the arithmetic mean.
• It is the value which divides the ordered data set into two equal parts.
• If the number of values is odd, the median is the middle value when all values are arranged in order of magnitude.
• If the number of observations is even, there is no single middle value but two middle observations. In this case the median is the mean of these two middle observations, again with all observations arranged in order of magnitude.
Example 4 – Median
Sales (as recorded): 9, 6, 12, 10, 13, 15, 16, 14, 14, 16, 17, 16, 24, 21, 22, 18, 19, 18, 20, 17
Sorted sales: 6, 9, 10, 12, 13, 14, 14, 15, 16, 16, 16, 17, 17, 18, 18, 19, 20, 21, 22, 24
Median = 50th percentile, located at position [(n + 1)p]th = (20 + 1)(50/100) = 10.5
The 10th and 11th sorted values are both 16, so Median = (16 + 16)/2 = 16
The median is the middle value of data sorted in order of magnitude. It is the 50th percentile.
• The median is a better description (than the mean) of
the majority when the distribution is skewed
• Example 5:
– Data: 14, 89, 93, 95, 96
– Skewness is reflected in the outlying low value of 14
– The sample mean is 77.4
– The median is 93

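The effect of the outlying value in Example 5 can be checked with Python's standard `statistics` module. This is a small illustrative sketch, not part of the original slides.

```python
from statistics import mean, median

data = [14, 89, 93, 95, 96]   # Example 5: one outlying low value (14)

sample_mean = mean(data)      # pulled down towards the outlier
sample_median = median(data)  # middle value of the sorted data

print(sample_mean, sample_median)  # 77.4 93
```

Four of the five observations lie above the mean, while the median sits among the bulk of the data.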
2. Median ….
b) Grouped data: In calculating the median from grouped data, we assume that the values within a class interval are evenly distributed through the interval.
• The first step is to locate the class interval which contains the median, using the following procedure.
• Find n/2 and take the first class interval whose cumulative frequency is at least n/2.
• Then use the following formula.
2. Median ….
$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$
where
L_m = lower true class boundary of the interval containing the median
F_c = cumulative frequency of the interval just before the median class interval (the row above it in the table)
f_m = frequency of the interval containing the median
W = class interval width
n = total number of observations
Example 6. Compute the median age of 169 subjects from the grouped data.

Class interval   Mid-point (mi)   Frequency (fi)   Cum. freq
10-19            14.5             4                4
20-29            24.5             66               70
30-39            34.5             47               117
40-49            44.5             36               153
50-59            54.5             12               165
60-69            64.5             4                169
Total                             169

n/2 = 169/2 = 84.5
The 3rd class interval (30-39) contains the median, because its cumulative frequency (117) is the first to reach 84.5.
2. Median …
• n/2 = 84.5, which falls in the 3rd class interval
• Lower true limit = 29.5, upper true limit = 39.5, so W = 10
• Frequency of the class f_m = 47; cumulative frequency before the class F_c = 70
$$\tilde{x} = L_m + \left(\frac{\frac{n}{2} - F_c}{f_m}\right) W$$
Median = 29.5 + ((84.5 − 70)/47) × 10 = 29.5 + (14.5/47) × 10 ≈ 32.59 ≈ 33 years
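The grouped-median procedure (locate the median class, then interpolate within it) can be sketched as follows. The function name and the list of lower true boundaries are assumptions made for illustration; only the boundary 29.5 is stated on the slide, and the others follow the same convention.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data: L_m + ((n/2 - F_c) / f_m) * W."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    cumulative = 0  # F_c: cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, frequencies):
        if f > 0 and cumulative + f >= half:
            # first class whose cumulative frequency reaches n/2
            return lower + (half - cumulative) / f * width
        cumulative += f

# Age table: lower true boundaries (assumed convention) and frequencies
lower_boundaries = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5]
frequencies = [4, 66, 47, 36, 12, 4]

median_age = grouped_median(lower_boundaries, frequencies, width=10)
print(round(median_age, 2))  # 32.59
```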
Properties of the median
• There is only one median for a given set of data (uniqueness)
• The median is easy to calculate
• The median is a positional average, so it is not sensitive to extreme values and can be calculated even in the case of open-end intervals
• It is determined mainly by the middle points and is less sensitive to the remaining data points (a weakness).
Median …
Advantages
• It is easily calculated and is not much disturbed by extreme values
• It is more typical of the series
• The median may be located even when the data are incomplete
• Good with ordinal data
Disadvantages
• The median is not so well suited to algebraic treatment as the mean.
• It is not so generally familiar as the arithmetic mean
3. Mode
• The mode is the most frequently occurring value
among all the observations in a set of data.
• It is not influenced by extreme values.
• It is possible to have more than one mode or no
mode.
• It is even less amenable (responsive) to mathematical
treatment than the median
• It is not a good summary of the majority of the data.

[Figure: histograms with the mode marked on each (vertical axis 0 to 20). Credit: T. Ancelle, D. Coulombie]
Example 7 - Mode
Value       6   9   10  12  13  14  15  16  17  18  19  20  21  22  24
Frequency   1   1   1   1   1   2   1   3   2   2   1   1   1   1   1

Mode = 16

The mode is the most frequently occurring value. It is the value with the highest frequency.
Mode…
• a) Ungrouped data: the mode is the value which occurs most frequently in a set of values.
• If all the values are different there is no mode; on the other hand, a set of values may have more than one mode.
• Example 8
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4 “Unimodal”
• Example 9
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes – 2 & 5
• This distribution is said to be “bi-modal”
• Example 10
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
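Examples 8 to 10 can be reproduced with a small helper that returns every mode, or an empty list when all values are different. The helper `modes` is a hypothetical name introduced here; it follows the slide's convention that a data set with no repeated value has no mode.

```python
from collections import Counter

def modes(values):
    """Return all modes in sorted order, or [] if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values different: no mode (slide convention)
    return sorted(v for v, c in counts.items() if c == highest)

example_8 = modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6])          # unimodal
example_9 = modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8])    # bi-modal
example_10 = modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12])   # no mode

print(example_8, example_9, example_10)  # [4] [2, 5] []
```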
Mode…
b) Grouped data
• We usually first refer to the modal class, which is the class interval with the highest frequency.
• If a single value for the mode of grouped data must be specified, it is taken as the mid-point of the modal class interval.
Mode …
$$\text{Mode} = LCB_{mod} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) w$$
• LCB_mod = lower class boundary of the modal class; w = class width
• ∆1 = f_mod − frequency of the class just before the modal class (preceding)
• ∆2 = f_mod − frequency of the class just after the modal class (following)
• Modal class = class with the highest frequency
E.g.:
Classes      Freq   Cum. freq
54.5-57.5    6      6
57.5-60.5    11     17
60.5-63.5    16     33
63.5-66.5    27     60
66.5-69.5    20     80
69.5-72.5    14     94
72.5-75.5    6      100
Total        100
• LCB_mod = 63.5, w = 3
• f_l = 16, so ∆1 = f_mod − f_l = 27 − 16 = 11
• f_h = 20, so ∆2 = f_mod − f_h = 27 − 20 = 7
Mode = 63.5 + (11/(11 + 7)) × 3 = 65.33 beats/min
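The interpolation formula for the mode of grouped data can be sketched as below. The function name is an illustrative assumption, and the sketch assumes a single modal class whose frequency exceeds that of at least one neighbour (otherwise the denominator would be zero).

```python
def grouped_mode(lower_boundaries, frequencies, width):
    """Mode of grouped data: LCB_mod + (d1 / (d1 + d2)) * w."""
    i = max(range(len(frequencies)), key=lambda j: frequencies[j])  # modal class
    f_mod = frequencies[i]
    f_prev = frequencies[i - 1] if i > 0 else 0                     # preceding class
    f_next = frequencies[i + 1] if i < len(frequencies) - 1 else 0  # following class
    d1 = f_mod - f_prev
    d2 = f_mod - f_next
    return lower_boundaries[i] + d1 / (d1 + d2) * width

# Class boundaries and frequencies from the worked example
lower_boundaries = [54.5, 57.5, 60.5, 63.5, 66.5, 69.5, 72.5]
frequencies = [6, 11, 16, 27, 20, 14, 6]

mode_value = grouped_mode(lower_boundaries, frequencies, width=3)
print(round(mode_value, 2))  # 65.33
```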
Properties of mode
• It is not affected by extreme values
• It can be calculated for distributions with open-end classes
• It is the most typical value of the distribution
• Often its value is not unique
• The main drawback of the mode is that often it does not exist
Mode …
Advantages
• Since it is the most typical value it is the most descriptive average
• Since the mode is usually an “actual value”, it indicates the precise value of an important part of the series.
Disadvantages
• It is not capable of mathematical treatment
• In a small number of items the mode may not exist.
4. Geometric mean (GM)
• Mainly used for many types of laboratory data, specifically data in the form of concentrations of one substance in another
• Example 11: the minimum inhibitory concentration of penicillin in urine for N. gonorrhoeae in 74 patients

Concentration (µg/ml)   Frequency
0.03125                 21
0.0625                  6
0.1250                  8
0.250                   19
0.50                    17
1.0                     3
Total                   74
4. Geometric mean (GM) …
If x_1, x_2, ..., x_n are n positive observed values, then
$$GM = \sqrt[n]{\prod_{i=1}^{n} x_i}$$
and
$$\log GM = \frac{\sum_{i=1}^{n} \log x_i}{n}$$
The geometric mean is generally used with data measured on a logarithmic scale, such as titers of anti-neutrophil immunoglobulin G.

It is also preferred when several values in a data set are much higher than all of the others. These higher values would tend to inflate or distort an arithmetic mean.
Example 11 (continued): the geometric mean (GM) of the above data is:
log GM = [21 log(0.03125) + 6 log(0.0625) + 8 log(0.125) + 19 log(0.25) + 17 log(0.5) + 3 log(1.0)]/74 = −0.846 (logarithms to base 10)
GM = the antilogarithm of −0.846 = 0.143 µg/ml
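The geometric mean of the frequency table can be verified with a short Python sketch using base-10 logarithms. Variable names are illustrative.

```python
import math

# Penicillin MIC table: concentration (µg/ml) and number of patients
concentrations = [0.03125, 0.0625, 0.125, 0.25, 0.5, 1.0]
frequencies = [21, 6, 8, 19, 17, 3]

n = sum(frequencies)  # total number of patients
# frequency-weighted mean of the base-10 logarithms
log_gm = sum(f * math.log10(c) for c, f in zip(concentrations, frequencies)) / n
gm = 10 ** log_gm     # antilogarithm

print(n, round(log_gm, 3), round(gm, 3))  # 74 -0.846 0.143
```

For comparison, the arithmetic mean of the same table is noticeably larger, because the few high concentrations carry more weight on the original scale.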
Characteristics of Geometric mean
• It is a calculated value and depends upon the size of all the items.
• It gives less importance to extreme items than does the arithmetic mean.
• For any series of positive items it is never larger than the arithmetic mean; the two are equal only when all the items are identical.
• It exists ordinarily only for positive values.
Geometric mean…
Advantages
• Since it is less affected by extremes, it is preferable to the arithmetic mean when extreme values are present
• It is capable of algebraic treatment
• It is based on all values given in the distribution.
Disadvantages
• Its computation is relatively difficult.
• It cannot be determined if there is any negative value in the distribution, or where one of the items has a zero value.
5. Harmonic mean (HM)
• Just as the geometric mean is based on the arithmetic mean of logarithms, the harmonic mean is based on the arithmetic mean of reciprocals.
• It pertains to rates and time.
• We define it as the reciprocal of the arithmetic mean of the reciprocals of the given numbers.
5. Harmonic mean (HM) …
If the given numbers are x_1, x_2, ..., x_n, then
$$HM = \frac{1}{\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{1}{x_i}}$$
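A minimal Python sketch of the harmonic mean follows. The speeds used are hypothetical values chosen only for illustration (they do not come from the slides): travelling the same distance at 40 and then at 60 km/h gives an average speed equal to the harmonic mean, not the arithmetic mean.

```python
from statistics import harmonic_mean

def hm(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    n = len(values)
    return 1 / (sum(1 / v for v in values) / n)

speeds = [40, 60]  # hypothetical speeds in km/h over equal distances

print(round(hm(speeds), 2))             # 48.0
print(round(harmonic_mean(speeds), 2))  # 48.0, standard-library equivalent
```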
6. Weighted mean (WM)
• In a weighted mean, separate outcomes have separate influences.
• The influence attached to an outcome is its weight.
• A familiar example is the calculation of a course grade as a weighted average of scores on separate outcomes.
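The course-grade idea can be sketched as below. The scores and weights are hypothetical numbers made up for illustration; they are not taken from the slides.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical course: assignment, mid-term, and final exam scores
scores = [80, 70, 90]
weights = [2, 3, 5]   # hypothetical relative weights (20%, 30%, 50%)

grade = weighted_mean(scores, weights)
print(grade)  # 82.0
```

With equal weights the result reduces to the ordinary arithmetic mean (80 for these scores).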
Which measure of central tendency is best with a given set of data?
• Two factors are important in making this decision:
– The scale of measurement (type of data)
– The shape of the distribution of the observations
• The mean can be used for discrete and continuous data
• The median is appropriate for discrete and continuous data as well, but can also be used for ordinal data
• The mode can be used for all types of data, but may be especially useful for nominal and ordinal measurements. For discrete or continuous data, the “modal class” can be used
• The geometric mean is used primarily for observations measured on a logarithmic scale.
• The harmonic mean is a suitable MCT when the data pertain to rates and time.
• The weighted mean is commonly used in the calculation of a mean for different outcomes.
Skewness
• If extremely low or extremely high observations are present in a distribution, then the mean tends to shift towards those scores (and in that case misses the central location).
• Based on the type of skewness, distributions can be:
(a) Symmetrical distribution: one half of the curve is the mirror image of the other half
i. Symmetric and unimodal distribution: mean, median, and mode should all be approximately the same
Mean = Median = Mode
ii. Symmetric Bimodal distribution
• Mean and median should be about the same,
but may take a value that is unlikely to occur;
two modes might be best

(b) Skewed to the right (positively skewed)
• Occurs when the majority of scores are at the left end of the curve and a few extreme large scores are scattered at the right end
• The mean is sensitive to extreme values, so the median might be more appropriate
[Figure: right-skewed curve with the mode, median, and mean marked]
Mean > Median > Mode
(c) Skewed to the left (negatively skewed)
• Occurs when the majority of scores are at the right end of the curve and a few small scores are scattered at the left end.
• Mean < Median < Mode
[Figure: left-skewed curve with the mode, median, and mean marked]
Skewness and Sample Distributions
Not all curves are normal, even if still bell-shaped
7. Percentiles
• Given any set of numerical observations, order them according to magnitude.
• The Pth percentile in the ordered set is the value below which lie P% (P percent) of the observations in the set.
• The position of the Pth percentile is given by (n + 1)P/100, where n is the number of observations in the set.
Percentiles
• Percentiles simply divide the ordered data into 100 pieces, e.g. a standard physical growth chart
• Percentiles are less sensitive to outliers and not greatly affected by the sample size (n).
Example 13-1
A large department store collects data on sales made by each of its salespeople. The number of sales made on a given day by each of 20 salespeople is shown below, followed by the same data sorted in order of magnitude.
Sales: 9, 6, 12, 10, 13, 15, 16, 14, 14, 16, 17, 16, 24, 21, 22, 18, 19, 18, 20, 17
Sorted sales: 6, 9, 10, 12, 13, 14, 14, 15, 16, 16, 16, 17, 17, 18, 18, 19, 20, 21, 22, 24
Example 13-1 (Continued) Percentiles
• Find the 50th, 80th, and 90th percentiles of this data set.
• To find the 50th percentile, determine the data point in position (n + 1)P/100 = (20 + 1)(50/100) = 10.5.
• Thus, the percentile is located at the 10.5th position.
• The 10th observation is 16, and the 11th observation is also 16.
• The 50th percentile lies halfway between the 10th and 11th values (which are both 16 in this case) and is thus 16.
Example 13-1 (Continued) Percentiles
• To find the 80th percentile, determine the data point in position (n + 1)P/100 = (20 + 1)(80/100) = 16.8.
• Thus, the percentile is located at the 16.8th position.
• The 16th observation is 19, and the 17th observation is 20.
• The 80th percentile is a point lying 0.8 of the way from 19 to 20 and is thus 19.8.
Example 13-1 (Continued) Percentiles
• To find the 90th percentile, determine the data point in position (n + 1)P/100 = (20 + 1)(90/100) = 18.9.
• Thus, the percentile is located at the 18.9th position.
• The 18th observation is 21, and the 19th observation is 22.
• The 90th percentile is a point lying 0.9 of the way from 21 to 22 and is thus 21.9.
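The (n + 1)P/100 rule with linear interpolation can be sketched in Python as follows. The function name `percentile_n_plus_1` is hypothetical, and clamping the position to the range 1..n for very small or very large P is an assumption added here, not something stated in the slides.

```python
import math

def percentile_n_plus_1(sorted_values, p):
    """Pth percentile located at position (n + 1) * P / 100 (1-based)."""
    n = len(sorted_values)
    position = (n + 1) * p / 100
    position = min(max(position, 1), n)   # assumption: clamp to a valid position
    rank = math.floor(position)           # observation just below the position
    fraction = position - rank            # how far to move towards the next one
    lower = sorted_values[rank - 1]
    if fraction == 0:
        return lower
    upper = sorted_values[rank]
    return lower + fraction * (upper - lower)

sorted_sales = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
                16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

p50 = percentile_n_plus_1(sorted_sales, 50)
p80 = percentile_n_plus_1(sorted_sales, 80)
p90 = percentile_n_plus_1(sorted_sales, 90)
print(round(p50, 2), round(p80, 2), round(p90, 2))  # 16 19.8 21.9
```

Other software may use different percentile conventions, so results can differ slightly from this rule.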
Quartiles – Special Percentiles
• Quartiles are the percentage points that break down the ordered data set into quarters.
• The median divides the data into two equal parts
• If the data are divided into four equal parts, we speak of quartiles.
• The first quartile is the 25th percentile. It is the point below which lie 1/4 of the data.
• The second quartile is the 50th percentile. It is the point below which lie 1/2 of the data. This is also called the median.
• The third quartile is the 75th percentile. It is the point below which lie 3/4 of the data.
Quartiles and Interquartile Range
• The first quartile, Q1 (25th percentile), is often called the lower quartile.
• The second quartile, Q2 (50th percentile), is often called the median or the middle quartile.
• The third quartile, Q3 (75th percentile), is often called the upper quartile.
• The interquartile range is the difference between the third and the first quartiles (Q3 − Q1).
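Example 13-2: Finding Quartiles
Sales: 9, 6, 12, 10, 13, 15, 16, 14, 14, 16, 17, 16, 24, 21, 22, 18, 19, 18, 20, 17
Sorted sales: 6, 9, 10, 12, 13, 14, 14, 15, 16, 16, 16, 17, 17, 18, 18, 19, 20, 21, 22, 24
Position = (n + 1)P/100; value = X1 + d(X2 − X1), where X1 and X2 are the sorted observations on either side of the position and d is the fractional part of the position.

Quartile         Position                  Value
First quartile   (20+1)25/100 = 5.25       13 + (.25)(14 − 13) = 13.25
Median           (20+1)50/100 = 10.5       16 + (.5)(0) = 16
Third quartile   (20+1)75/100 = 15.75      18 + (.75)(1) = 18.75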
Measures of Dispersion
Consider the following two sets of data:

A: 177 193 195 209 226 Mean = 200

B: 192 197 200 202 209 Mean = 200

Two or more sets may have the same mean and/or median but they
may be quite different.

These two distributions have the same mean,
median, and mode

• MCT alone are not enough to give a clear understanding of the distribution of the data.
• We need to know something about the variability or spread of the values: whether they tend to be clustered close together, or spread out over a broad range.
Measures of Dispersion
• Measures that quantify the variation or dispersion of a set of data from its central location
• Dispersion refers to the variety exhibited by the values of the data.
• The amount of dispersion is small when the values are close together.
• If all the values are the same, there is no dispersion.
Measures of Dispersion
Other synonymous terms:
– “Measure of Variation”
– “Measure of Spread”
– “Measure of Scatter”
• Measures of dispersion include:
– Range
– Inter-quartile range
– Variance
– Standard deviation
– Coefficient of variation
– Standard error
– Others

1. Range (R)
• The difference between the largest and
smallest observations in a sample.

• Range = Maximum value – Minimum value

• Example –
– Data values: 5, 9, 12, 16, 23, 34, 37, 42
– Range = 42-5 = 37
• A data set with a larger range exhibits more variability
Properties of range
• It is the simplest, crude measure and can be easily understood
• It takes into account only two values, which makes it a poor measure of dispersion
• It is very sensitive to extreme observations
• The larger the sample size, the larger the range tends to be
2. Interquartile range (IQR)
• Indicates the spread of the middle 50% of the observations, and is used with the median

IQR = Q3 − Q1

• Example: Suppose the first and third quartiles for weights of girls 12 months of age are 8.8 kg and 10.2 kg, respectively.
IQR = 10.2 kg – 8.8 kg = 1.4 kg
i.e., 50% of the infant girls weigh between 8.8 and 10.2 kg.
Properties of IQR:
• It is a simple and versatile measure
• It encloses the central 50% of the observations
• It is not based on all observations but only on
two specific values
• It is important in selecting cut-off points in the
formulation of clinical standards
• Since it excludes the lowest and highest 25%
values, it is not affected by extreme values
• Less sensitive to the size of the sample

Example - Range and Interquartile Range
(Data from Example 13-1)
Sales: 9, 6, 12, 10, 13, 15, 16, 14, 14, 16, 17, 16, 24, 21, 22, 18, 19, 18, 20, 17
Sorted sales (ranks 1 to 20): 6, 9, 10, 12, 13, 14, 14, 15, 16, 16, 16, 17, 17, 18, 18, 19, 20, 21, 22, 24
Minimum = 6 (rank 1), Maximum = 24 (rank 20)
Range: Maximum − Minimum = 24 − 6 = 18
First quartile: Q1 = 13 + (.25)(1) = 13.25
Q2 = Median = P50 = 16
Third quartile: Q3 = 18 + (.75)(1) = 18.75
Interquartile range: Q3 − Q1 = 18.75 − 13.25 = 5.5
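The range and interquartile range for the sales data can be reproduced with Python's standard library. As far as the `statistics` documentation describes it, `quantiles(..., method="exclusive")` places the quartiles at positions based on n + 1, which matches the (n + 1)P/100 rule used in the slides; treat that correspondence as an assumption worth checking for other data sets.

```python
from statistics import quantiles

sales = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16,
         17, 16, 24, 21, 22, 18, 19, 18, 20, 17]

data_range = max(sales) - min(sales)                       # 24 - 6
q1, q2, q3 = quantiles(sales, n=4, method="exclusive")     # quartiles
iqr = q3 - q1                                              # spread of the middle 50%

print(data_range, q1, q2, q3, iqr)  # 18 13.25 16.0 18.75 5.5
```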
3. Quartile deviation (QD)
$$QD = \frac{Q_3 - Q_1}{2}$$
A measure based on the distance between the two values (Q1 and Q3) which enclose the middle half of the distribution
4. Coefficient of quartile deviation (CQD)
$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$
• CQD is a pure number (unitless) and is useful to compare the variability among the middle 50% of observations.
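As an illustration, the quartile deviation and its coefficient can be computed for the sales data used in the percentile examples, where Q1 = 13.25 and Q3 = 18.75. The function names are hypothetical.

```python
def quartile_deviation(q1, q3):
    """Half the distance between the first and third quartiles."""
    return (q3 - q1) / 2

def coefficient_of_quartile_deviation(q1, q3):
    """Unitless ratio (Q3 - Q1) / (Q3 + Q1)."""
    return (q3 - q1) / (q3 + q1)

q1, q3 = 13.25, 18.75   # quartiles of the sales data

qd = quartile_deviation(q1, q3)
cqd = coefficient_of_quartile_deviation(q1, q3)
print(qd, round(cqd, 3))  # 2.75 0.172
```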
5. Mean deviation (MD)
• Mean deviation is the average of the absolute deviations taken from a central value, generally the mean or median.
• Consider a set of n observations x_1, x_2, ..., x_n. Then:
$$MD = \frac{1}{n}\sum_{i=1}^{n} |x_i - A|$$
• ‘A’ is a central value (arithmetic mean or median).
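The mean deviation can be illustrated with the two data sets A and B introduced under Measures of Dispersion, which share the same mean of 200 but differ in spread. The function name and the `center` parameter are illustrative choices.

```python
from statistics import mean, median

def mean_deviation(values, center=mean):
    """Average absolute deviation from a central value A (mean by default)."""
    a = center(values)
    return sum(abs(x - a) for x in values) / len(values)

set_a = [177, 193, 195, 209, 226]   # mean = 200
set_b = [192, 197, 200, 202, 209]   # mean = 200

md_a = mean_deviation(set_a)
md_b = mean_deviation(set_b)
print(md_a, md_b)                      # 14.0 4.4
print(mean_deviation(set_a, median))   # 13.0, deviations taken from the median (195)
```

Set A has the larger mean deviation, reflecting its wider scatter around the common mean.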
Properties of mean deviation:
• MD removes one main objection to the earlier measures, in that it involves every value
• It is not affected much by extreme values
• Its main drawback is that the algebraic negative signs of the deviations are ignored, which is mathematically unsound
6. Variance (σ², s²)
• The main objection to the mean deviation, that the negative signs are ignored, is removed by taking the square of the deviations from the mean.
• The variance is the average of the squares of the deviations taken from the mean.
Variance and Standard Deviation
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - \dfrac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N}, \qquad \sigma = \sqrt{\sigma^2}$$
Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}, \qquad s = \sqrt{s^2}$$
• The deviations are squared because the sum of the deviations of the individual observations of a sample about the sample mean is always 0:
$$\sum_{i=1}^{n}(x_i - \bar{x}) = 0$$
• The variance can be thought of as an average of squared deviations
• Variance is used to measure the dispersion of values relative to the mean.
• When values are close to their mean (narrow range) the dispersion is less than when there is scattering over a wide range.
– Population variance = σ²
– Sample variance = S²
a) Ungrouped data
• Let X_1, X_2, ..., X_N be the measurements on N population units; then:
$$\sigma^2 = \frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}, \quad \text{where} \quad \mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
is the population mean.
A sample variance is calculated for a sample of individual values (X1, X2, … Xn) and uses the sample mean ($\bar{x}$) rather than the population mean µ.
Why divide by n−1?
• Samples give us estimates of population parameters (population mean and variance)
• Dividing by n underestimates the population variance, and this is easily demonstrated.
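One way to demonstrate the underestimate is to enumerate every possible sample from a tiny population and average the two versions of the variance. The sketch below uses a hypothetical population of four values and samples of size 2 drawn with replacement; both choices are assumptions made only to keep the enumeration small.

```python
from itertools import product
from statistics import mean

population = [2, 4, 6, 8]   # hypothetical population
mu = mean(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)  # true variance

n = 2                       # sample size
divide_by_n = []
divide_by_n_minus_1 = []
for sample in product(population, repeat=n):   # all 16 ordered samples
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)  # sum of squared deviations
    divide_by_n.append(ss / n)
    divide_by_n_minus_1.append(ss / (n - 1))

print(sigma2)                       # 5.0
print(mean(divide_by_n))            # 2.5, too small on average
print(mean(divide_by_n_minus_1))    # 5.0, matches the population variance
```

Averaged over all samples, dividing by n gives only (n − 1)/n of the population variance, while dividing by n − 1 recovers it.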
Another feature of n−1
• In many statistical tests we sum variances from groups, and for each group we lose one data point's worth of information, which is referred to as a degree of freedom.
• For each sample estimate we therefore lose a degree of freedom: once the sample mean is fixed, all numbers on which the estimate is based are free to vary except one.
Calculation of Sample Variance

x      x − x̄     (x − x̄)²    x²
6      −9.85     97.0225     36
9      −6.85     46.9225     81
10     −5.85     34.2225     100
12     −3.85     14.8225     144
13     −2.85     8.1225      169
14     −1.85     3.4225      196
14     −1.85     3.4225      196
15     −0.85     0.7225      225
16     0.15      0.0225      256
16     0.15      0.0225      256
16     0.15      0.0225      256
17     1.15      1.3225      289
17     1.15      1.3225      289
18     2.15      4.6225      324
18     2.15      4.6225      324
19     3.15      9.9225      361
20     4.15      17.2225     400
21     5.15      26.5225     441
22     6.15      37.8225     484
24     8.15      66.4225     576
Total  317  0    378.5500    5403

Using the definition (x̄ = 317/20 = 15.85):
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{378.55}{20-1} = \frac{378.55}{19} = 19.923684$$
Using the computational form:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1} = \frac{5403 - \dfrac{317^2}{20}}{20-1} = \frac{5403 - \dfrac{100489}{20}}{19} = \frac{5403 - 5024.45}{19} = \frac{378.55}{19} = 19.923684$$
$$s = \sqrt{s^2} = \sqrt{19.923684} = 4.46$$
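Both routes to the sample variance of the sales data (the definition and the computational shortcut) can be checked with a short Python sketch; the standard-library `statistics.variance` is included for comparison. Variable names are illustrative.

```python
from statistics import variance

sales = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16,
         16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

n = len(sales)
xbar = sum(sales) / n                                   # 317 / 20 = 15.85

# Definition: squared deviations from the mean, divided by n - 1
s2_definition = sum((x - xbar) ** 2 for x in sales) / (n - 1)

# Shortcut: (sum of squares - (sum)^2 / n) / (n - 1)
s2_shortcut = (sum(x * x for x in sales) - sum(sales) ** 2 / n) / (n - 1)

s = s2_definition ** 0.5                                # standard deviation

print(round(s2_definition, 6), round(s2_shortcut, 6), round(s, 2))
# 19.923684 19.923684 4.46
print(round(variance(sales), 6))                        # 19.923684
```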
b) Grouped data
$$S^2 = \frac{\sum_{i=1}^{k}(m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$
where
m_i = the mid-point of the ith class interval
f_i = the frequency of the ith class interval
x̄ = the sample mean
k = the number of class intervals
Properties of Variance:
• The main disadvantage of variance is that its unit is the square of the unit of the original measurement values
• The variance gives more weight to the extreme values than to those near the mean, because the differences are squared.
• The drawbacks of variance are overcome by the standard deviation.
7. Standard deviation (σ, s)
• It is the square root of the variance.
• This produces a measure having the same scale as that of the individual values.
$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2}$$


• Following are the survival times of n=11 patients after heart transplant surgery.
• The survival time for the “ith” patient is represented as Xi for i = 1, …, 11.
• Calculate the sample variance and SD.
Example. Compute the variance and SD of the age of 169 subjects from the grouped data.
Mean = 5810.5/169 = 34.38 years
S² = 20197.63/(169 − 1) = 120.22
SD = √S² = √120.22 = 10.96 years

Class interval   mi     fi    mi − Mean   (mi − Mean)²   (mi − Mean)² × fi
10-19            14.5   4     −19.88      395.21         1580.86
20-29            24.5   66    −9.88       97.61          6442.55
30-39            34.5   47    0.12        0.01           0.68
40-49            44.5   36    10.12       102.41         3686.92
50-59            54.5   12    20.12       404.81         4857.77
60-69            64.5   4     30.12       907.21         3628.86
Total                   169                              ≈ 20197.63
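The grouped variance and SD can be computed directly from the mid-points and frequencies of the age table. The sketch below is illustrative; computed this way, the mean comes out at about 34.38 years, in agreement with Example 3 (5810.5/169), and the variance at about 120.22.

```python
# Mid-points and frequencies from the age table (169 subjects)
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

n = sum(frequencies)
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Sum of squared deviations of the mid-points, weighted by frequency
ss = sum(f * (m - grouped_mean) ** 2 for m, f in zip(midpoints, frequencies))
s2 = ss / (n - 1)     # grouped sample variance
sd = s2 ** 0.5        # grouped standard deviation

print(round(grouped_mean, 2), round(s2, 2), round(sd, 2))  # 34.38 120.22 10.96
```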


Properties of SD
• The SD has the advantage of being expressed in the same units of measurement as the mean
• SD is considered to be the best measure of dispersion and is used widely because of the properties of the theoretical normal curve.
• However, if the units of measurement of the variables in two data sets are not the same, then their variability can’t be compared by comparing the values of SD.
SD Vs Standard Error (SE)
• SD describes the variability among individual values in a given data set
• SE is used to describe the variability among separate sample means obtained from one sample to another
• We interpret the SE of the mean to indicate that another similarly conducted study may give a mean that lies within ± SE of the observed mean.
Standard Error
• SD is about the variability of individuals
• SE is used to describe the variability in the means of repeated samples taken from the same population.
• For example, imagine 5,000 samples, each of the same size n=11. This would produce 5,000 sample means. This new collection has its own pattern of variability. We describe this new pattern of variability using the SE, not the SD.


Example: The heart transplant surgery
n = 11, SD = 168.89 days, Mean = 161 days
• What happens if we repeat the study? What will our next mean be? Will it be close? How different will it be?
• The behavior of the mean from one replication of the study to the next is referred to as the sampling distribution of the mean.
• We can also have a sampling distribution of the median or the SD
• The SE of the mean is SD/√n = 168.89/√11 ≈ 50.9 days.
• We interpret this to mean that a similarly conducted study might produce an average survival time that is near 161 days, ±50.9 days.
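The ±50.9 days quoted for the heart transplant example is consistent with the usual formula for the standard error of the mean, SE = SD/√n. A minimal sketch, with illustrative variable names:

```python
import math

n = 11          # number of patients
sd = 168.89     # sample standard deviation (days)
mean_days = 161 # sample mean survival time (days)

se = sd / math.sqrt(n)   # standard error of the mean
print(round(se, 1))      # 50.9
print(round(mean_days - se, 1), round(mean_days + se, 1))  # 110.1 211.9
```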


8. Coefficient of variation (CV)
• When two data sets have different units of measurement, or their means differ sufficiently in size, the CV should be used as a measure of dispersion.
• It is the best measure for comparing the variability of two series or sets of observations.
• A data set with a smaller coefficient of variation is considered more consistent.
• CV is the ratio of the SD to the mean, multiplied by 100:
$$CV = \frac{S}{\bar{x}} \times 100$$

              SD         Mean        CV (%)
SBP           15 mmHg    130 mmHg    11.5
Cholesterol   40 mg/dl   200 mg/dl   20.0

• “Cholesterol is more variable than systolic blood pressure”
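The two CV values in the table can be reproduced with a one-line function. The function name is an illustrative choice.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = SD / mean * 100; unitless, so different scales can be compared."""
    return 100 * sd / mean

cv_sbp = coefficient_of_variation(15, 130)          # systolic blood pressure
cv_cholesterol = coefficient_of_variation(40, 200)  # cholesterol

print(round(cv_sbp, 1), round(cv_cholesterol, 1))   # 11.5 20.0
```

Because the units cancel, the two measurements can be compared even though one is a pressure and the other a concentration.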


NOTE:
• The IQR is used with the median
• The SD is used with the mean
• For nominal and ordinal data, a table or graph is often more effective than any numerical summary measure
