Summarisation and Presentation of Data
(Descriptive Statistics)
Rotimi F. Afolabi, PhD
Department of Epidemiology & Medical Statistics
College of Medicine, University of Ibadan
Session Objective
• Introduce students to the following different
methods of organising, summarising and presenting
data:
– frequency tables (tabular descriptive),
– graphs/charts (graphical descriptive) and
– summary (numerical descriptive) statistics
Descriptive Statistics_Lectures 3-6_2017 2
Tabular Presentation of Data
(Lecture 3)
Tables
♦ Definition
– Data presented in columns and rows by one or
more classification variables
• Uses
– demonstrate patterns, exceptions, differences and
other relationships
– serve as the basis for preparing more visual
displays of data, such as graphs and charts, where
some of the detail may be lost
Examples:
Represent the data on the age distribution of
adult admissions into UCH in June 1998.
Age (years) Frequency
10-19 16
20-29 27
30-39 23
40-49 24
50-59 23
60-69 19
70-79 15
The two-variable table
Table 1. Cases of Salmonella
Typhimurium-infection by age-group and sex,
Herøy, Norway, 1999
Age group (years)   Male   Female   Total
0 - 9 7 5 12
10 - 19 5 5 10
20 - 29 5 5 10
30 - 39 1 4 5
40 - 49 2 3 5
50 - 59 0 3 3
60 - 69 2 1 3
70+ 2 4 6
Total 24 30 54
Contingency Table
The 2x2 table for a cohort study
Table 5. Association between fish consumption and
gastrointestinal illness among customers at Uncle Mike's Fish &
Chips, Cambridge, October 1 2000
Ill Well Total Attack rate
Ate fish 42 16 58 0.72
Did not eat fish 5 59 64 0.078
Relative risk: 9.3 (95% confidence interval 3.9 - 22)
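The figures in Table 5 can be reproduced with a short sketch. The attack rates and relative risk follow directly from the table counts; the confidence interval method (normal approximation on the log scale) is an assumption, since the slide does not say how its interval was obtained.

```python
import math

# Counts from Table 5 (cohort study).
ill_exposed, well_exposed = 42, 16      # ate fish
ill_unexposed, well_unexposed = 5, 59   # did not eat fish

n_exposed = ill_exposed + well_exposed
n_unexposed = ill_unexposed + well_unexposed

# Attack rate = number ill / number at risk in each group.
attack_rate_exposed = ill_exposed / n_exposed
attack_rate_unexposed = ill_unexposed / n_unexposed

# Relative risk = ratio of the two attack rates.
relative_risk = attack_rate_exposed / attack_rate_unexposed

# Assumed method: 95% interval from the standard error of log(RR).
se_log_rr = math.sqrt(
    1 / ill_exposed - 1 / n_exposed + 1 / ill_unexposed - 1 / n_unexposed
)
ci_low = math.exp(math.log(relative_risk) - 1.96 * se_log_rr)
ci_high = math.exp(math.log(relative_risk) + 1.96 * se_log_rr)

print(f"Attack rates: {attack_rate_exposed:.2f} vs {attack_rate_unexposed:.3f}")
print(f"Relative risk: {relative_risk:.1f} ({ci_low:.1f} - {ci_high:.0f})")
```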
Frequency Distribution Table
• Arrangement of data by rows & columns
• Useful to summarize data.
• Has two main columns.
• Column 1 lists all values of the variable.
• Column 2 gives the frequency with which each
value occurs.
• For initial data exploration
• Construction depends on type of variable
Frequency Table - Qualitative variable
• 1st column: different categories of the
variable (mutually exclusive)
• 2nd column: frequency or count with which
each category occurred.
Reasons for physicians not smoking
Reasons Frequency
• Health 25
• Religious 15
• Social 12
• Profession 5
• Others 3
Frequency Intervals
• Must not overlap
Example
• 5 - 9, not 5 - 10
• 10 - 14, not 10 - 15
• 15 - 19, not 15 - 20
• Equal intervals easier to interpret but
• Unequal intervals may be used to illustrate specific attributes
of interest
Relative Frequency
• Proportion of total observations ascribed to
that value
• Divide frequency in the class interval by
total observation.
Cumulative Frequency
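• Number of observations with a certain value or less; the
cumulative relative frequency is the corresponding proportion
of total observations.
• Must correspond to end of class interval.
• Add each frequency (or relative frequency) to the sum of
those for the preceding values.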
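As an illustration, the relative and cumulative columns can be derived from the frequencies alone. The sketch below uses the UCH admissions table shown earlier; the variable names are illustrative only.

```python
from itertools import accumulate

# Age distribution of adult admissions into UCH, June 1998.
age_classes = ["10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79"]
frequencies = [16, 27, 23, 24, 23, 19, 15]

total = sum(frequencies)

# Cumulative frequency: running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Relative frequency: class frequency divided by total observations.
relative = [f / total for f in frequencies]

# Cumulative relative frequency: cumulative frequency divided by the total.
cumulative_relative = [c / total for c in cumulative]

print(f"{'Age':<8}{'f':>5}{'cum f':>8}{'rel %':>8}{'cum rel %':>11}")
rows = zip(age_classes, frequencies, cumulative, relative, cumulative_relative)
for label, f, c, r, cr in rows:
    print(f"{label:<8}{f:>5}{c:>8}{100 * r:>8.1f}{100 * cr:>11.1f}")
```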
Frequency Distribution Table with One Variable
Table 1 Number of Cases of Primary and Secondary Syphilis
by Age Group, USA, 1989.
Age group    Frequency   Cumulative   Relative        Cumulative relative
(years)      (number)    frequency    frequency (%)   frequency (%)
≤ 14              230          230        0.5               0.5
15-19           4,378        4,608       10.0              10.5
20-24          10,405       15,013       23.6              34.1
25-29           9,610       24,623       21.8              55.9
30-34           8,648       33,271       19.6              75.5
35-44           6,901       40,172       15.7              91.2
45-54           2,631       42,803        6.0              97.2
≥ 55            1,278       44,081        2.9             100.0
Total          44,081                    100.0             100.0
Qualities of Frequency Tables
Simple Information (not more than 3 variables)
• Clear title to indicate what? when? where?
• Good labeling of rows & columns
• Indicate units of measurements
• Row, column & grand totals MUST add up
Terms used in constructing frequency table
• Classes (Class Intervals): categories of grouping data
• Frequency: It is the number of times a value or group of
values of a variable occurs i.e. number of observations
that fall in a class
• Frequency distribution: listing of all classes and their
frequencies
• Relative frequency: ratio of frequency of a class to the
total number of observations
• Relative-frequency distribution: listing of all classes and
their relative frequencies
Terms used in constructing frequency table …
• Class Limits – the end numbers of each class. Upper class
limit is the largest number of the class and Lower class
limit is the smallest number of the class
• Class Boundaries – the points that separate adjacent classes
with no gap between them, so that the upper boundary of each
class is the same as the lower boundary of the next. Each is
obtained by taking the midpoint of the upper limit of one class
and the lower limit of the next
• Class Mark – It is the midpoint between the lower and
upper class limits of a class
• Class Size (or Width) – It is the difference between the
upper and lower class boundaries
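A minimal sketch of these terms, assuming integer class limits such as 10-19, 20-29 (the gap between consecutive classes is then 1, so each boundary lies 0.5 beyond the limit):

```python
# Class limits (lower, upper) as used in the age tables of this lecture.
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = class_limits[1][0] - class_limits[0][1]   # 20 - 19 = 1

summary = []
for lower, upper in class_limits:
    lower_boundary = lower - gap / 2      # midpoint with the previous upper limit
    upper_boundary = upper + gap / 2      # midpoint with the next lower limit
    class_mark = (lower + upper) / 2      # midpoint of the class limits
    width = upper_boundary - lower_boundary
    summary.append((lower_boundary, upper_boundary, class_mark, width))

for (lower, upper), row in zip(class_limits, summary):
    print(f"{lower}-{upper}: boundaries {row[0]}-{row[1]}, mark {row[2]}, width {row[3]}")
```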
Graphical Presentation of Data
(Lecture 4)
Graphical Presentation
Need:
– To aid in visually exploring the data
– Diagrams make better visual
impressions than numbers
• An adage says a picture is worth more than a
thousand words
Graphs/charts
Depends on type of data
• Quantitative or numerical data
– Histogram,
– frequency polygon,
– Cumulative frequency curve (known as “ogive”)
– Box plot
• Qualitative or categorical data
– Bar chart ,
– Pie chart
Histogram
• Plot of the frequency distribution
• Used to show data on an interval scale or
continuous variables
• Slender rectangles adjoin each other
• Area under the histogram is equivalent to
the total frequency
Histogram – Purpose/uses
• Provides information on range of data
values
• Shows the location of the highest
concentration of measurement
• Reveals the presence or absence of
symmetry
Example 1: Represent the data on the age distribution
of adult admissions into UCH in June 1998.
Age (years) Frequency
10-19 16
20-29 27
30-39 23
40-49 24
50-59 23
60-69 19
70-79 15
[Figure: Histogram of ages of adult admissions at UCH, June 1998; x-axis: Age (classes 10-19 to 70-79); y-axis: Frequency]
Frequency polygon
• Special line graph for frequency distribution
• Obtained by plotting frequencies against the
class marks
• From the histogram
– Plot the frequency at the midpoint
– Join consecutive points with straight lines
Lecture 3 - 2015/2016 Session 25
[Figure: Frequency polygon; x-axis: Age (Yrs), 10 to 80; y-axis: Frequency]
Cumulative frequency curve -ogive
• Plot of cumulative frequency against the upper class
boundaries, and joining all the consecutive points
• Additional point is obtained by plotting a frequency of
zero against the lowest lower boundary
• Used to show how many data values are accumulated up
to and including a specific class
• It may be applied to obtain measures of partition such as
– Quartiles
– Deciles
– percentiles
[Figure: Ogive; x-axis: Age (Yrs); y-axis: Cumulative frequency]
Box and Whisker Plot [ Box Plot]
• A simple but excellent tool for conveying location
and variation information in data sets
• It helps to display the symmetry properties of a
sample
• It can be used to visually describe the spread of
a sample
• It can also help identify possible outlying values
– that is, values that seem inconsistent with the rest of
the points in the sample.
[Figure: Box and whiskers plot of systolic blood pressure (mmHg), marking the 25th percentile, median, mean and 75th percentile]
Box Plot …
• The box plot is interpreted as follows:
• The midline of a box plot is the median or 50th
percentile.
• The body or box portion of the plot is the
interquartile range going from the 25th percentile to
the 75th percentile.
– The interquartile range is the middle 50% of the data
– The width of the interquartile range is equal to:
• 75th Percentile – 25th Percentile
– The interquartile range is a robust measure of variability
Box Plot …
• How can the median, upper quartile, and lower
quartile be used to judge the symmetry of a
distribution?
1. If the distribution is symmetric, then the upper and lower
quartiles should be approximately equally spaced from
the median.
2. If the upper quartile is farther from the median than the
lower quartile, then the distribution is positively skewed.
3. If the lower quartile is farther from the median than the
upper quartile, then the distribution is negatively skewed
• Uses the 5-number summary indices
5-Number Summary Indices
• Arrange data in ascending order
– Find Q1 = the point below which 1/4 of the data lie
– Find Median = the point below which 1/2 of the data lie
– Find Q3 = the point below which 3/4 of the data lie
– Find the Maximum score and
– Find the Minimum score
Bar Chart
• Slender rectangles to represent frequency of
values of variable
• Rectangles are separate and distinct
• Height of each rectangle corresponds to frequency
• Types of bar chart
– Simple: consisting of a set of non-joining bars
– Component: like simple bar chart except that each
bar is split up into constituent parts
– Multiple: the component values are shown as
separate bars joined and always in the same
sequence
Example : Reasons For Physicians Not Smoking
Reasons Frequency
• Health 25
• Religious 15
• Social 12
• Profession 5
• Others 3
[Figure: Bar chart of reasons for UCH physicians not smoking; x-axis: Health, Religious, Social, Profession, Others; y-axis: Frequency]
Pie Chart
• Graphical device consisting of a circle sub-divided
into sectors whose areas are proportional to the
quantities they represent
• Used to show the components of a total
• Sometimes gives a more intelligible visual impression
• Procedures:
– Draw a circle of any convenient radius, with a marked
centre, to represent the total observations.
– Divide the circle into sectors according to the frequency of
each attribute.
– Use (n/N) × 360° as the angle of each sector, where n is the
frequency of the attribute and N is the total frequency.
– Shade sectors in different colours to distinguish them.
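The sector angles for the physicians' data can be worked out as a quick check of the (n/N) × 360° rule. This is only a sketch of the arithmetic; it does not draw the chart.

```python
# Reasons for physicians not smoking (frequencies from the example table).
reasons = {"Health": 25, "Religious": 15, "Social": 12, "Profession": 5, "Others": 3}

total = sum(reasons.values())   # N = 60

# Sector angle in degrees: (n / N) x 360, written as n * 360 / N.
angles = {reason: n * 360 / total for reason, n in reasons.items()}

# Percentage share of each sector, rounded to whole numbers.
percentages = {reason: round(100 * n / total) for reason, n in reasons.items()}

for reason in reasons:
    print(f"{reason:<11}{angles[reason]:>6.0f} degrees{percentages[reason]:>5}%")
```

The rounded shares (42%, 25%, 20%, 8%, 5%) are the labels one would expect on the pie chart for these data.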
Pie Chart …
[Figure: Pie chart of reasons for physicians not smoking: Health 42%, Religious 25%, Social 20%, Profession 8%, Others 5%]
Numerical Summarisation of Data
(Lectures 5 & 6)
Types of Numerical Measures
Central Location / Position / Tendency - a
single value that represents (is a good
summary of) an entire distribution of data
Spread / Dispersion / Variability - how much
the distribution is spread or dispersed from
its central location
[Figure: Histogram of number of people by age group (0-9 to 90-99), annotated with "Central Location" and "Spread"]
Measures of central tendency
(Lecture 5)
Measures of Central Location
Definition:
– a single value that represents (a good summary
of) an entire distribution of data
Also known as:
– “Measure of central tendency”
– “Measure of central position”
Common measures:
– Arithmetic mean
– Median
– Mode
Arithmetic Mean (Average)
• Most useful measure of central tendency
• Mean retains all original measurements
• Not good when data is skewed
• Sensitive to extreme values
– One data point could make a great change in
sample mean
• Calculation steps:
– Add up data
– Divide by number of observations (pieces of
data) i.e. sample size (n)
Raw data set: ages of students in a class (in years)
s/n  Age
1    27
2    30
3    28
4    31
5    28
6    36
7    29
8    37
9    29
10   34
11   30
12   30
13   27
14   30
15   28
16   31
17   32
18   30
19   29
20   29
Mean = Sum of all observations / No. of observations
$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$
– $\bar{X}$ is usually pronounced "x bar"
Arithmetic Mean
Five systolic blood pressures (mmHg) (n = 5)
120, 80, 90, 110, 95
– Can be represented with mathematical notation:
$x_1 = 120, x_2 = 80, \ldots, x_5 = 95$
– The sample mean is computed by adding up the five values and
dividing by five. In statistical notation the sample mean is
frequently represented by a letter with a line over it:
$$\bar{x} = \frac{120 + 80 + 90 + 110 + 95}{5} = 99 \text{ mmHg}$$
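The two calculation steps (add up the data, divide by n) can be sketched for both raw data sets used above. The variable names are illustrative.

```python
# Five systolic blood pressures (mmHg).
sbp = [120, 80, 90, 110, 95]

# Ages (years) of the 20 students in the raw data set.
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

sbp_mean = arithmetic_mean(sbp)
ages_mean = arithmetic_mean(ages)

print(f"Mean SBP: {sbp_mean} mmHg")
print(f"Mean age: {ages_mean} years")
```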
PROCEDURE FOR CALCULATING MEAN
(Grouped data)
i. Find class mark for each interval
ii. Multiply the class mark of each interval by
its corresponding frequency
iii. Add results in (ii) across all intervals
iv. Divide results in (iii) by number of
observations or total frequency
Mean – Grouped Data
• The mean of the given data is calculated by
dividing the sum of all observations by the
number of observations
• If $x_1, x_2, x_3, \ldots, x_n$ are the given observations
with their respective frequencies $f_1, f_2, f_3, \ldots, f_n$ then,
• Sum of observations = $f_1 x_1 + f_2 x_2 + \cdots + f_n x_n$
• Sum of frequencies = $f_1 + f_2 + f_3 + \cdots + f_n$
Mean – Grouped Data
$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$
Example
• Compute the mean of the data below
Age Interval Frequency (f)
20 -29 2
30 -39 3
40-49 3
50 -59 2
Total 10
Solution
Age Interval Frequency (f) Class Mark (X) fx
20 -29 2 24.5 49
30 -39 3 34.5 103.5
40-49 3 44.5 133.5
50 -59 2 54.5 109
Total 10 395
$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{395}{10} = 39.5 \text{ years}$$
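The same solution can be checked with a short sketch that follows the four-step procedure for grouped data. The tuple layout (lower limit, upper limit, frequency) is just a convenient assumption for the example.

```python
# (lower limit, upper limit, frequency) for each age interval.
classes = [(20, 29, 2), (30, 39, 3), (40, 49, 3), (50, 59, 2)]

# Step i: class mark = midpoint of the class limits.
class_marks = [(lower + upper) / 2 for lower, upper, _ in classes]

# Step ii: multiply each class mark by its frequency.
fx = [mark * f for mark, (_, _, f) in zip(class_marks, classes)]

# Step iii: add the products across all intervals.
sum_fx = sum(fx)

# Step iv: divide by the total frequency.
n = sum(f for _, _, f in classes)
grouped_mean = sum_fx / n

print(f"Class marks: {class_marks}")
print(f"Sum of fx = {sum_fx}, n = {n}, mean = {grouped_mean} years")
```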
Median
• The median is the middle number (also called the
50th percentile or second quartile)
– Other percentiles/quartiles can be computed as well, but
are not measures of centre
80 90 95 110 120
• Best measure of central tendency when data is
skewed
• Concentrates on ranks of values rather than
absolute values
• Unaffected by extreme values
– For example, if 120 became 200, the median would remain the same, but
the mean would change to 115
• It is determined mainly by the middle value(s) in a sample and
is insensitive to other values, unlike the arithmetic mean, which uses
all observations
Calculation of median values for
ungrouped data
• Steps
– Arrange observations in ascending or
descending order
– If n is odd : Pick observation in the middle as
median
– If n is even : take arithmetic mean of two
middle observations
EXAMPLE ON MEDIAN
Find the Median of age at marriage of ten
pregnant women seen at ANC:
42, 52, 31, 35, 50, 40, 27, 43, 35, 28
Step 1: Arrange in Ascending Order
27, 28, 31, 35, 35, 40, 42, 43, 50, 52
Step 2: Pick the middle observations
(35+40)/2
= 37.5 years
The Median (P50) is the value that separates the lower 50%
from the upper 50% of the observations
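The steps for ungrouped data can be sketched as a small function and checked against the ANC example. The second part revisits the blood pressure values to show that an extreme value moves the mean but not the median.

```python
def median_by_steps(values):
    """Sort the data, then pick the middle value (odd n) or the
    mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

# Age at marriage of ten pregnant women seen at ANC.
ages_at_marriage = [42, 52, 31, 35, 50, 40, 27, 43, 35, 28]
anc_median = median_by_steps(ages_at_marriage)
print(f"Median age at marriage: {anc_median} years")

# Systolic blood pressures, before and after 120 is replaced by 200.
sbp = [120, 80, 90, 110, 95]
sbp_extreme = [200, 80, 90, 110, 95]
for label, data in (("original", sbp), ("with 200", sbp_extreme)):
    mean = sum(data) / len(data)
    print(f"{label}: median = {median_by_steps(data)}, mean = {mean}")
```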
Median – Grouped Data
Step 1: Construct the cumulative frequency distribution
Step 2: Identify the class that contains the median
Median class is the first class whose cumulative frequency is
at least n/2
Step 3: Find the median by using the following formula:
$$\text{Median} = l_b + \left(\frac{\frac{n}{2} - cf}{f_m}\right) w$$
$n$ = total frequency
$l_b$ = the lower class boundary of the median class
$cf$ = cumulative frequency of the class preceding the median class
$f_m$ = the frequency of the median class
$w$ = class width or class size
If n is odd, replace n/2 with (n+1)/2 in the equation
Recall the example on mean, and find its
median
Age Class Frequency Cumulative
Interval Boundaries (f) frequency (cf)
20 -29 19.5 – 29.5 2 2
30 -39 29.5 – 39.5 3 5
40-49 39.5 – 49.5 3 8
50 -59 49.5 - 59.5 2 10
Total 10
$$\text{Median} = 29.5 + \left(\frac{\frac{10}{2} - 2}{3}\right) \times 10 = 39.5$$
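A minimal sketch of the grouped-median formula, applied to the same table. The function name and argument layout are illustrative; it uses n/2 as the target position, as in the worked example.

```python
def grouped_median(boundaries, freqs):
    """Median = lb + ((n/2 - cf) / fm) * w for the first class whose
    cumulative frequency reaches n/2."""
    n = sum(freqs)
    cf = 0  # cumulative frequency of the classes before the current one
    for (lb, ub), fm in zip(boundaries, freqs):
        if fm > 0 and cf + fm >= n / 2:
            return lb + ((n / 2 - cf) / fm) * (ub - lb)
        cf += fm
    raise ValueError("frequencies must contain at least one positive count")

# Class boundaries and frequencies from the table above.
boundaries = [(19.5, 29.5), (29.5, 39.5), (39.5, 49.5), (49.5, 59.5)]
freqs = [2, 3, 3, 2]

median_age = grouped_median(boundaries, freqs)
print(f"Grouped median: {median_age}")
```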
MODE
Definition: Mode is the value that occurs most
frequently
• Least used measure of central tendency
• The observation that occurs most frequently
• Easiest measure to understand, explain,
identify
• Mathematical properties rather intractable
• A mode may not exist if there is a large
number of possible values
• May be more than one mode
• Insensitive to extreme values (outliers)
• Does not use all the data
Method for identification:
1. Arrange data into a frequency distribution
or histogram, showing all values of the
variable and the frequency with which
each value occurs
2. Identify the value that occurs most often
Obs  Age
1    27
2    27
3    28
4    28
5    28
6    29
7    29
8    29
9    29
10   30
11   30
12   30
13   30
14   30
15   31
16   31
17   32
18   34
19   36
20   37
Mode: the most frequent value of the variable
Mode = 30
[Figure: Bar chart of the frequency of each age; x-axis: Age (years), 27 to 37; y-axis: Frequency]
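The identification method (build a frequency distribution, then pick the most frequent value) can be sketched with a counter over the 20 ages. `statistics.multimode` from the standard library is used as a cross-check; it returns every value that ties for the highest count.

```python
from collections import Counter
from statistics import multimode

# The 20 ordered ages (years) from the table above.
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

# Step 1: frequency distribution of the values.
counts = Counter(ages)

# Step 2: value(s) with the highest frequency.
highest = max(counts.values())
modes = sorted(age for age, count in counts.items() if count == highest)

print(f"Frequency distribution: {dict(sorted(counts.items()))}")
print(f"Mode(s): {modes} (frequency {highest})")
```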
Mode – Grouped Data
Mode
•Mode is the value that has the highest frequency in a data set.
•For grouped data, class mode (or, modal class) is the class with the highest frequency.
•To find mode for grouped data, use the following formula:
$$\text{Mode} = l_b + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) w$$
Where:
$w$ is the class width
$\Delta_1 = f_m - f_1$ is the difference between the frequency of the class mode ($f_m$)
and the frequency of the class preceding the class mode ($f_1$)
$\Delta_2 = f_m - f_2$ is the difference between the frequency of the class mode ($f_m$) and
the frequency of the class succeeding the class mode ($f_2$)
$l_b$ is the lower boundary of the class mode
06/21/2025 Lecture 4 - Numerical measure of data I 61
Calculation of Grouped Data - Mode
Example: Based on the grouped data below, find the mode
Time to travel to work Frequency
1 – 10 8
11 – 20 14
21 – 30 12
31 – 40 9
41 – 50 7
Solution:
Based on the table,
$l_b = 10.5$, $\Delta_1 = (14 - 8) = 6$, $\Delta_2 = (14 - 12) = 2$ and $w = 10$
$$\text{Mode} = 10.5 + \left(\frac{6}{6 + 2}\right) \times 10 = 18$$
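The grouped-mode formula can be sketched as a function and checked against this example. Treating a missing neighbouring class as having frequency 0 is an assumption made here so the function also works when the modal class is first or last; the slides do not cover that case.

```python
def grouped_mode(boundaries, freqs):
    """Mode = lb + (d1 / (d1 + d2)) * w for the class with the highest frequency."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])   # index of modal class
    f_m = freqs[m]
    f_before = freqs[m - 1] if m > 0 else 0               # assumption: 0 if absent
    f_after = freqs[m + 1] if m < len(freqs) - 1 else 0   # assumption: 0 if absent
    d1 = f_m - f_before
    d2 = f_m - f_after
    lb, ub = boundaries[m]
    return lb + (d1 / (d1 + d2)) * (ub - lb)

# Time to travel to work: class boundaries and frequencies.
boundaries = [(0.5, 10.5), (10.5, 20.5), (20.5, 30.5), (30.5, 40.5), (40.5, 50.5)]
freqs = [8, 14, 12, 9, 7]

mode_time = grouped_mode(boundaries, freqs)
print(f"Grouped mode: {mode_time}")
```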
Mode can also be obtained from a histogram.
Step 1: Identify the modal class and the bar representing it
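Step 2: Draw two crossing lines: one from the top-left corner of the
modal bar to the top-left corner of the next bar, the other from the
top-right corner of the modal bar to the top-right corner of the
preceding bar.
Step 3: Drop a perpendicular from the intersection of the
two lines until it touches the horizontal axis.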
Step 4: Read the mode from the horizontal axis
Geometric Mean
• Useful in laboratory data whereby values are
concentrations of one substance in another,
assessed by dilution techniques in multiples of a
standard number
• The geometric mean of n positive observations is
the nth root of the product of the observations.
That is:
$$GM = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n}$$
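A short sketch of the nth-root definition follows. The titre values are hypothetical (the slides give no data for this measure); they were chosen as doubling dilutions so the result is easy to verify by hand. The log-based form is shown because it avoids very large products.

```python
import math
from statistics import geometric_mean

# Hypothetical antibody titres from doubling dilutions (not from the lecture).
titres = [10, 20, 40, 80, 160]
n = len(titres)

# Definition: nth root of the product of the observations.
gm_root = math.prod(titres) ** (1 / n)

# Equivalent form: antilog of the mean of the logarithms.
gm_logs = math.exp(sum(math.log(x) for x in titres) / n)

print(f"nth root of product: {gm_root:.2f}")
print(f"exp(mean of logs):   {gm_logs:.2f}")
print(f"statistics module:   {geometric_mean(titres):.2f}")
```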
Relationship among the three main
measures of location
• The relationship between mean and median is useful
in assessing symmetry of distribution of data
– Mean=Median=Mode implies symmetry
– Mean>Median>Mode implies positive skewness
– Mean<Median<Mode implies negative skewness
Comparison of Mode, Median and Mean
Symmetrical:
Mode = Median = Mean
Skewed right:
Mode < Median < Mean
Skewed left:
Mean < Median < Mode
Pearson’s Measure of Skewness
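• Skewness = (Mean – Median) / Standard Deviation
(many texts multiply this by 3, i.e. 3(Mean – Median) / Standard
Deviation; the sign interpretation below is the same)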
• Values:
1. Zero, if a perfect Symmetrical distribution
2. Negative, when negatively skewed or skewed to
the left
3. Positive, when positively skewed or skewed to
the right
Measures of Partition
• These are descriptive measures commonly
used for ORDERED observations
– Quartiles
– Deciles
– Percentiles
Measures of Partition
• Percentiles: divide a set of ordered observations
into 100 equal parts
– The 20th percentile is the value below which 20% of the
observations lie.
• Deciles: divide a set of ordered observations into
10 equal parts
• Quartiles: divide a set of ordered observations into
4 equal parts
– Q1 (1st quartile, 25th percentile), Q2 (2nd quartile, median,
50th percentile), Q3 (3rd quartile, 75th percentile)
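As a sketch, the quartiles and five-number summary can be computed for the ten ages at marriage from the median example. The standard library's `statistics.quantiles` with `method="exclusive"` places the quartiles at positions (n+1)/4, (n+1)/2 and 3(n+1)/4 with linear interpolation; other position conventions exist and may give slightly different values.

```python
from statistics import quantiles

# Age at marriage of ten pregnant women seen at ANC, in ascending order.
ages = sorted([42, 52, 31, 35, 50, 40, 27, 43, 35, 28])

# Quartiles at positions (n+1)/4, (n+1)/2 and 3(n+1)/4 (assumed convention).
q1, q2, q3 = quantiles(ages, n=4, method="exclusive")

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (min(ages), q1, q2, q3, max(ages))

# Interquartile range: width of the box in a box plot.
iqr = q3 - q1

print(f"Five-number summary: {five_number_summary}")
print(f"Interquartile range: {iqr}")
```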
For grouped data
Quartile  Percentile  Position (n odd)  Position (n even)  Description
Q1        P25         (n+1)/4           n/4                Middle observation of the lower half of the observations, i.e. the value that separates the lower 25% from the upper 75% of the observations
Q2        P50         (n+1)/2           n/2                Middle observation
Q3        P75         3(n+1)/4          3n/4               Middle observation of the upper half of the observations, i.e. the value that separates the lower 75% from the upper 25% of the observations
$$P_k = l_b + \left(\frac{\frac{kn}{100} - cf}{f_p}\right) w; \quad k = 1, 2, \ldots, 99$$
$$Q_k = l_b + \left(\frac{\frac{kn}{4} - cf}{f_q}\right) w; \quad k = 1, 2, 3$$
Here $l_b$, $cf$ and $w$ are defined as for the median, and $f_p$ ($f_q$) is the
frequency of the class containing the percentile (quartile).
Recall that Interquartile Range is a
function of Q1 and Q3
[Figure: Histogram (y-axis: N) with Q1 and Q3 marked and the interquartile interval between them]
Quartiles
Using the same method of calculation as in the Median,
we can get Q1 and Q3 equation as follows:
$$Q_1 = l_b + \left(\frac{\frac{n}{4} - cf}{f_q}\right) w \qquad Q_3 = l_b + \left(\frac{\frac{3n}{4} - cf}{f_q}\right) w$$
Example: Based on the grouped data below, find the Interquartile Range
Time to travel to work Frequency
1 – 10 8
11 – 20 14
21 – 30 12
31 – 40 9
41 – 50 7
Solution:
1st Step: Construct the cumulative frequency distribution
Time to travel Frequency Cumulative
to work Frequency
1 – 10 8 8
11 – 20 14 22
21 – 30 12 34
31 – 40 9 43
41 – 50 7 50
2nd Step: Determine the Q1 and Q3
Class Q1: $\frac{n}{4} = \frac{50}{4} = 12.5$
Class Q1 is the 2nd class
Therefore,
$$Q_1 = l_b + \left(\frac{\frac{n}{4} - cf}{f_q}\right) w = 10.5 + \left(\frac{12.5 - 8}{14}\right) \times 10 = 13.7143 \text{ mins}$$
Class Q3: $\frac{3n}{4} = \frac{3 \times 50}{4} = 37.5$
Class Q3 is the 4th class
Therefore,
$$Q_3 = l_b + \left(\frac{\frac{3n}{4} - cf}{f_q}\right) w = 30.5 + \left(\frac{37.5 - 34}{9}\right) \times 10 = 34.3889 \text{ mins}$$
Interquartile Range
IQR = Q3 – Q1
Therefore IQR = 34.3889 – 13.7143 = 20.6746 mins
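The quartile calculation generalises to any fraction of the data. The sketch below is one possible way to code the grouped-data formula (the function name and signature are illustrative) and reproduces Q1, Q3 and the interquartile range for the travel-time table.

```python
def grouped_quantile(boundaries, freqs, fraction):
    """lb + ((fraction * n - cf) / f) * w for the first class whose
    cumulative frequency reaches fraction * n."""
    n = sum(freqs)
    target = fraction * n
    cf = 0  # cumulative frequency of the classes before the current one
    for (lb, ub), f in zip(boundaries, freqs):
        if f > 0 and cf + f >= target:
            return lb + ((target - cf) / f) * (ub - lb)
        cf += f
    raise ValueError("fraction must lie between 0 and 1")

# Time to travel to work: class boundaries and frequencies.
boundaries = [(0.5, 10.5), (10.5, 20.5), (20.5, 30.5), (30.5, 40.5), (40.5, 50.5)]
freqs = [8, 14, 12, 9, 7]

q1 = grouped_quantile(boundaries, freqs, 0.25)
q3 = grouped_quantile(boundaries, freqs, 0.75)
iqr = q3 - q1

print(f"Q1 = {q1:.4f}, Q3 = {q3:.4f}, IQR = {iqr:.4f} mins")
```

Calling the same function with `fraction=0.5` gives the grouped median, and `fraction=k/100` gives the kth percentile.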
Measures of Variation/Dispersion/Spread
(Lecture 6)
Distributional Spread from the Centre
[Figure: Distributions illustrating high and low spread from the centre]
Measures of dispersion / spread / variation
• Range
• Interquartile range
• Variance
• Standard Deviation
• Coefficient of Variation
Range
• The difference between the smallest
observation (minimum value) and the largest
observation (maximum value) in a set of data
• range = maximum – minimum
• Relies on only 2 extreme values
• Easy to calculate
• But it is affected by outliers
Range
[Figure: Distribution with the minimum and maximum marked and the range spanning them]
Inter-quartile Range.
• The interquartile range is the difference
between the 75th percentile (3rd quartile) and
the 25th percentile (1st quartile) in a set of data
– Difference between 3rd quartile and 1st quartile.
• Concentration on the middle 50% of the
ordered observations
– i.e. it gives an idea of the middle 50 percent of the
observations
• Not affected by outliers.
[Figure: Histogram (y-axis: N) marking the minimum, 1st quartile, median, mode, 3rd quartile and maximum, with the interquartile interval and range indicated]
Variance
• Mean squared deviation from the mean
value.
• Square of the standard deviation.
• Units of measurement are the square of the original
units
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Standard Deviation
• Square root of variance
• Best measure of variation or
dispersion
• Unit same as original units
• Amenable to mathematical and
statistical manipulations
Steps to Calculate Variance and
Standard Deviation
$\bar{x}$ : mean
$x_i$ : value
$n$ : number of observations
$s^2$ : variance
$s$ : standard deviation
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
1. Calculate the arithmetic mean: $\bar{x}$
2. Subtract the mean from each observation: $x_i - \bar{x}$
3. Square the difference: $(x_i - \bar{x})^2$
4. Sum the squared differences: $\sum (x_i - \bar{x})^2$
5. Divide the sum of the squared differences by n – 1
6. Take the square root of the variance: $s = \sqrt{s^2}$
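The six steps can be traced on the five systolic blood pressures used in the arithmetic mean example. This is a sketch for illustration; `statistics.variance` and `statistics.stdev` in the standard library use the same n – 1 divisor and serve as a cross-check.

```python
import math
from statistics import stdev, variance

# Five systolic blood pressures (mmHg).
sbp = [120, 80, 90, 110, 95]
n = len(sbp)

# Step 1: arithmetic mean.
mean = sum(sbp) / n

# Steps 2-3: deviation of each observation from the mean, squared.
squared_deviations = [(x - mean) ** 2 for x in sbp]

# Step 4: sum of squared differences.
sum_of_squares = sum(squared_deviations)

# Step 5: divide by n - 1 to get the sample variance.
s2 = sum_of_squares / (n - 1)

# Step 6: square root gives the standard deviation.
s = math.sqrt(s2)

print(f"Mean = {mean}, sum of squares = {sum_of_squares}")
print(f"Variance = {s2} mmHg^2, standard deviation = {s:.2f} mmHg")
```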
Example:
The frequency distribution of the
weight of 100 patients with
Rheumatoid Arthritis is as follows:
Weight (kgs) Frequency Class-Mid-Mark
60 - 69 5 64.5
70 - 79 15 74.5
80 - 89 20 84.5
90 - 99 25 94.5
100 - 109 20 104.5
110 - 119 15 114.5
Calculate the mean, variance and standard deviation
SOLUTION
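$$\text{Mean} = \frac{\sum f_i x_i}{\sum f_i} = \frac{5(64.5) + 15(74.5) + 20(84.5) + 25(94.5) + 20(104.5) + 15(114.5)}{100}$$
$$= \frac{322.5 + 1117.5 + 1690 + 2362.5 + 2090 + 1717.5}{100} = \frac{9300}{100} = 93 \text{ kg}$$
$$\text{Variance} = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i - 1} = \frac{5(64.5 - 93)^2 + \cdots + 15(114.5 - 93)^2}{100 - 1} = \frac{20275}{99} = 204.798 \text{ kg}^2$$
$$\text{Standard deviation} = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i - 1}} = \sqrt{\frac{20275}{99}} = 14.31 \text{ kg}$$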
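The grouped-data solution can be verified with a brief sketch that weights each class mark by its frequency. Variable names are illustrative.

```python
import math

# Weight (kg) of 100 patients with rheumatoid arthritis: class marks and frequencies.
class_marks = [64.5, 74.5, 84.5, 94.5, 104.5, 114.5]
freqs = [5, 15, 20, 25, 20, 15]

n = sum(freqs)

# Grouped mean: sum of f * x divided by total frequency.
mean = sum(f * x for f, x in zip(freqs, class_marks)) / n

# Weighted sum of squared deviations from the mean.
sum_of_squares = sum(f * (x - mean) ** 2 for f, x in zip(freqs, class_marks))

# Sample variance and standard deviation (divisor n - 1).
variance = sum_of_squares / (n - 1)
sd = math.sqrt(variance)

print(f"Mean = {mean} kg")
print(f"Variance = {variance:.3f} kg^2, SD = {sd:.2f} kg")
```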
Example 2: Find the variance and standard deviation for the following data:
No. of order f
10 – 12 4
13 – 15 12
16 – 18 20
19 – 21 14
Total n = 50
Solution:
No. of order f x fx fx²
10 – 12 4 11 44 484
13 – 15 12 14 168 2352
16 – 18 20 17 340 5780
19 – 21 14 20 280 5600
Total n = 50 832 14216
Variance,
$$s^2 = \frac{\sum f x^2 - \frac{\left(\sum f x\right)^2}{n}}{n - 1} = \frac{14216 - \frac{(832)^2}{50}}{50 - 1} = 7.5820$$
Standard Deviation, $s = \sqrt{s^2} = \sqrt{7.5820} = 2.75$
Thus, the standard deviation of the number of orders
received at the office of this mail-order company
during the past 50 days is 2.75.
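The computational (shortcut) formula used in Example 2 can be checked with a short sketch built from the fx and fx² columns of the solution table.

```python
import math

# Number of orders: class marks (x) and frequencies (f) from the solution table.
class_marks = [11, 14, 17, 20]
freqs = [4, 12, 20, 14]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, class_marks))
sum_fx2 = sum(f * x ** 2 for f, x in zip(freqs, class_marks))

# Shortcut formula: s^2 = (sum fx^2 - (sum fx)^2 / n) / (n - 1).
variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

print(f"n = {n}, sum fx = {sum_fx}, sum fx^2 = {sum_fx2}")
print(f"Variance = {variance:.4f}, SD = {sd:.2f}")
```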
– Often abbreviated S, SD or sd
– The smaller the s, the lower the variability and the
more representative the mean becomes
– s measures the spread about the mean
– s can equal 0 only if there is no spread, i.e. all n
observations have the same value
– The units of s are the same as the units of the original
observations
COEFFICIENT OF VARIATION:
It is a measure of spread that corrects for differences in
magnitude or units of observations
– It is dimensionless thereby useful for comparing the spread
of two or more data sets efficiently, when the units are
different
– The lower the coefficient of variation , the smaller the
spread
It is defined as the ratio of standard deviation to the
mean of a data set, usually expressed as a percentage:
$$CV = \frac{s}{\bar{x}} \times 100\%$$
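As a sketch of how the ratio of standard deviation to mean allows comparison across different units, the two worked examples from this lecture can be compared: patient weights (mean 93 kg, s = 14.31 kg) and number of orders (s = 2.75, with mean 832/50 = 16.64 computed from the solution table). Expressing the ratio as a percentage is an assumed convention.

```python
def coefficient_of_variation(sd, mean):
    """Ratio of standard deviation to mean, expressed as a percentage."""
    return sd / mean * 100

# Weights of patients with rheumatoid arthritis (kg).
cv_weight = coefficient_of_variation(14.31, 93)

# Number of orders per day; mean = 832 / 50 from the solution table.
cv_orders = coefficient_of_variation(2.75, 832 / 50)

print(f"CV of weights: {cv_weight:.1f}%")
print(f"CV of orders:  {cv_orders:.1f}%")
```

Although the standard deviations are in different units and very different in size, the relative spread of the two data sets turns out to be similar.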
Choosing appropriate descriptive statistics
For single-peaked and symmetric distribution
– Position
• mean, median and mode are identical or nearly
equal
– Dispersion
• Standard Deviation
For data with significant outliers
– Position
• median is more informative than the mean
– Dispersion
• Range
• Interquartile interval
Review question
• What summary tools are the most appropriate
to use for the following sets of data?
– Salaries of physicians in a clinic
– Test scores of all students in a qualifying exam
– Serum sodium levels of healthy individuals
– Presence of diarrhoea in a group of children
– Disease stage of cervical cancer patients in UCH
Exercise:
Based on the grouped data below, find the mean,
median, standard deviation, coefficient of variation,
and Pearson measure of skewness
Time to travel to work Frequency
1 – 10 8
11 – 20 14
21 – 30 12
31 – 40 9
41 – 50 7
Assignment:
Based on the frequency distribution table below, find
the mean, median, interquartile range, standard
deviation, coefficient of variation, and Pearson
measure of skewness
Distribution of the number of previous pregnancies of a
group of women aged 30–34 attending an antenatal clinic
No. of previous No. of women
pregnancies
0 18
1 27
2 31
3 19
4 5