
Lectures 3 - 6 - 2017

The document introduces methods for organizing, summarizing, and presenting data, focusing on frequency tables, graphs/charts, and summary statistics. It covers tabular presentations, graphical representations, and numerical summaries, detailing their definitions, uses, and examples. Key concepts include frequency distribution, relative frequency, cumulative frequency, and various graphical tools like histograms, box plots, and pie charts.

Uploaded by

Stanley Ogili

Summarisation and Presentation of Data

(Descriptive Statistics)

Rotimi F. Afolabi, PhD


Department of Epidemiology & Medical Statistics
College of Medicine, University of Ibadan
Session Objective

• Introduce students to the following different methods of organising, summarising and presenting data:
– frequency tables (tabular descriptive),
– graphs/charts (graphical descriptive) and
– summary (numerical descriptive) statistics

Descriptive Statistics_Lectures 3-6_2017 2


Tabular Presentation of Data
(Lecture 3)
Tables
• Definition
– Data presented in columns and rows by one or
more classification variables
• Uses
– demonstrate patterns, exceptions, differences and
other relationships
– serve as the basis for preparing more visual
displays of data, such as graphs and charts, where
some of the detail may be lost



Examples:
Represent the data on the age distribution of
adult admissions into UCH in June 1998.

Age (years) Frequency


10-19 16
20-29 27
30-39 23
40-49 24
50-59 23
60-69 19
70-79 15
The two-variable table
Table 1. Cases of Salmonella
Typhimurium-infection by age-group and sex,
Herøy, Norway, 1999

Age group (years)   Male   Female   Total
0 - 9               7      5        12
10 - 19             5      5        10
20 - 29             5      5        10
30 - 39             1      4        5
40 - 49             2      3        5
50 - 59             0      3        3
60 - 69             2      1        3
70 -                2      4        6
Total               24     30       54


Contingency Table
The 2x2 table for a cohort study
Table 5. Association between fish consumption and
gastrointestinal illness among customers at Uncle Mike's Fish &
Chips, Cambridge, October 1 2000

                   Ill   Well   Total   Attack rate
Ate fish           42    16     58      0.72
Did not eat fish   5     59     64      0.078

Relative risk: 9.3 (95% confidence interval 3.9 - 22)
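The attack rates and relative risk in the 2x2 table can be reproduced with a short sketch. The slide does not say how the confidence interval was obtained; the log (Katz) method used here is an assumption, chosen because it gives the same rounded interval. Variable names are illustrative.

```python
import math

# 2x2 cohort table from the fish-consumption example
ill_fish, well_fish = 42, 16        # ate fish
ill_no_fish, well_no_fish = 5, 59   # did not eat fish

n_fish = ill_fish + well_fish
n_no_fish = ill_no_fish + well_no_fish

# Attack rate = number ill / number at risk in that row
attack_fish = ill_fish / n_fish
attack_no_fish = ill_no_fish / n_no_fish

# Relative risk = attack rate in exposed / attack rate in unexposed
relative_risk = attack_fish / attack_no_fish

# Assumption: 95% confidence interval by the log (Katz) method
se_log_rr = math.sqrt(1 / ill_fish - 1 / n_fish + 1 / ill_no_fish - 1 / n_no_fish)
ci_low = math.exp(math.log(relative_risk) - 1.96 * se_log_rr)
ci_high = math.exp(math.log(relative_risk) + 1.96 * se_log_rr)

print(f"Attack rates: {attack_fish:.2f} and {attack_no_fish:.3f}")
print(f"Relative risk: {relative_risk:.1f} ({ci_low:.1f} - {ci_high:.0f})")
```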
Frequency Distribution Table
• Arrangement of data by rows & columns
• Useful to summarize data.
• Has two main columns.
• Column 1 lists all values of the variable.
• Column 2 lists the frequency with which each
value occurs.
• Used for initial data exploration
• Construction depends on type of variable



Frequency Table - Qualitative variable

• 1st column: different categories of the variable (mutually exclusive)
• 2nd column: frequency or count with which each category occurred.


Reasons for physicians not smoking

Reasons Frequency

• Health 25
• Religious 15
• Social 12
• Profession 5
• Others 3



Frequency Intervals
• Must not overlap
Example
• 5 - 9 not 5 - 10
• 10 - 14 not 10 - 15
• 15 - 19 not 15 - 20

• Equal intervals are easier to interpret, but
• Unequal intervals may be used to illustrate specific attributes of interest


Relative Frequency

• Proportion of total observations ascribed to that value
• Divide the frequency in the class interval by the total number of observations.


Cumulative Frequency
• Number (or, for cumulative relative frequency, proportion) of total observations with a certain value or less.

• Must correspond to the end of a class interval.

• Add the frequency (or relative frequency) of each class to the total for the preceding classes.


Frequency Distribution Table with One Variable
Table 1 Number of Cases of Primary and Secondary Syphilis
by Age Group, USA, 1989.
Age group    Frequency   Cumulative   Relative frequency   Cumulative relative
(in years)   (number)    frequency    (percentage)         frequency (percentage)
≤ 14         230         230          0.5                  0.5
15-19        4,378       4,608        10.0                 10.5
20-24        10,405      15,013       23.6                 34.1
25-29        9,610       24,623       21.8                 55.9
30-34        8,648       33,271       19.6                 75.5
35-44        6,901       40,172       15.7                 91.2
45-54        2,631       42,803       6.0                  97.2
≥ 55         1,278       44,081       2.9                  100.0
Total        44,081                   100                  100.0
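The cumulative and relative frequency columns of the syphilis table can be derived from the frequency column alone. The sketch below is one way to do it; the list and variable names are illustrative.

```python
from itertools import accumulate

# Frequencies from the syphilis table, in age-group order
age_groups = ["<=14", "15-19", "20-24", "25-29", "30-34", "35-44", "45-54", ">=55"]
cases = [230, 4378, 10405, 9610, 8648, 6901, 2631, 1278]

total = sum(cases)

# Cumulative frequency: running total of the frequencies
cumulative = list(accumulate(cases))

# Relative frequency (%): class frequency divided by the total
relative_pct = [100 * f / total for f in cases]

# Cumulative relative frequency (%): cumulative frequency divided by the total
cumulative_pct = [100 * c / total for c in cumulative]

for row in zip(age_groups, cases, cumulative, relative_pct, cumulative_pct):
    print("{:<6} {:>6} {:>6} {:>5.1f} {:>6.1f}".format(*row))
```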


Qualities of Frequency Tables

• Simple information (not more than 3 variables)
• Clear title to indicate what? when? where?
• Good labelling of rows & columns
• Indicate units of measurement
• Row, column & grand totals MUST add up


Terms used in constructing frequency table
• Classes (Class Intervals): categories for grouping data
• Frequency: the number of times a value or group of values of a variable occurs, i.e. the number of observations that fall in a class
• Frequency distribution: listing of all classes and their frequencies
• Relative frequency: ratio of the frequency of a class to the total number of observations
• Relative-frequency distribution: listing of all classes and their relative frequencies
Terms used in constructing frequency table …
• Class Limits – the end numbers of each class. The upper class limit is the largest number of the class and the lower class limit is the smallest number of the class
• Class Boundaries – the points that separate adjacent classes, so that the upper boundary of each class is the same as the lower boundary of the next. Each boundary is obtained by taking the midpoint of the upper limit of one class and the lower limit of the next
• Class Mark – the midpoint between the lower and upper class limits of a class
• Class Size (or Width) – the difference between the upper and lower class boundaries
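As a small illustration of these terms, the sketch below derives the boundaries, class mark and class width from a pair of class limits. The function name and the `gap` parameter (the distance between the upper limit of one class and the lower limit of the next, assumed to be 1 for whole-number data) are illustrative choices.

```python
def class_summary(lower_limit, upper_limit, gap=1):
    """Return (lower boundary, upper boundary, class mark, class width).

    gap is the distance between one class's upper limit and the next
    class's lower limit (assumed 1 for data recorded in whole numbers).
    """
    half_gap = gap / 2
    lower_boundary = lower_limit - half_gap
    upper_boundary = upper_limit + half_gap
    class_mark = (lower_limit + upper_limit) / 2
    class_width = upper_boundary - lower_boundary
    return lower_boundary, upper_boundary, class_mark, class_width


# Class 20-29 from an age distribution with classes 10-19, 20-29, 30-39, ...
print(class_summary(20, 29))
```

For the class 20-29 this gives boundaries 19.5 and 29.5, class mark 24.5 and width 10.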
Graphical Presentation of Data
(Lecture 4)
Graphical Presentation
Need:
– To aid in visually exploring the data
– Diagrams make better visual impressions than numbers
• An adage says a picture is worth a thousand words


Graphs/charts
Depends on type of data
• Quantitative or numerical data
– Histogram,
– frequency polygon,
– Cumulative frequency curve (known as “ogive”)
– Box plot
• Qualitative or categorical data
– Bar chart ,
– Pie chart



Histogram

• Plot of the frequency distribution
• Used to show data on an interval scale or continuous variables
• Slender rectangles adjoin each other
• Area under the histogram is proportional to the total frequency


Histogram – Purpose/uses
• Provides information on range of data
values
• Shows the location of the highest concentration of measurements
• Reveals the presence or absence of symmetry


Example 1: Represent the data on the age distribution
of adult admissions into UCH in June 1998.

Age (years) Frequency


10-19 16
20-29 27
30-39 23
40-49 24
50-59 23
60-69 19
70-79 15
[Figure: Histogram of ages of adult admissions at UCH, June 1998. x-axis: Age (classes 10-19 to 70-79); y-axis: Frequency]


Frequency polygon
• Special line graph for frequency distribution
• Obtained by plotting frequencies against the
class marks
• From the histogram
– Plot the frequency at the midpoint of each class
– Join the points with straight lines


[Figure: Frequency polygon. x-axis: Age (Yrs); y-axis: Frequency]


Cumulative frequency curve (ogive)
• Plot of cumulative frequency against the upper class boundaries, joining all the consecutive points
• An additional point is obtained by plotting a frequency of zero against the lowest lower boundary
• Used to show how many data values are accumulated up to and including a specific class
• It may be applied to obtain measures of partition such as
– Quartiles
– Deciles
– Percentiles


[Figure: Ogive. x-axis: Age (Yrs); y-axis: Cumulative frequency]


Box and Whisker Plot [ Box Plot]
• A simple but excellent tool for conveying location
and variation information in data sets
• It helps to display the symmetry properties of a
sample
• It can be used to visually describe the spread of
a sample
• It can also help identify possible outlying values
– that is, values that seem inconsistent with the rest of
the points in the sample.



[Figure: Box and whiskers plot of systolic blood pressure (mmHg), marking the 25th percentile, median, mean and 75th percentile]


Box Plot …
• The box plot is interpreted as follows:
• The midline of a box plot is the median or 50th
percentile.
• The body or box portion of the plot is the
interquartile range going from the 25th percentile to
the 75th percentile.
– The interquartile range is the middle 50% of the data
– The width of the interquartile range is equal to:
• 75th Percentile – 25th Percentile
– The interquartile range is a robust measure of variability
Box Plot …
• How can the median, upper quartile, and lower
quartile be used to judge the symmetry of a
distribution?
1. If the distribution is symmetric, then the upper and lower
quartiles should be approximately equally spaced from
the median.
2. If the upper quartile is farther from the median than the
lower quartile, then the distribution is positively skewed.
3. If the lower quartile is farther from the median than the
upper quartile, then the distribution is negatively skewed.
• The box plot uses the 5-number summary indices
5-Number Summary Indices

• Arrange data in ascending order
– Find Q1 = 1/4 of the data lies below this point
– Find Median = 1/2 of the data lies below this point
– Find Q3 = 3/4 of the data lies below this point
– Find Maximum score and
– Find Minimum score
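A minimal sketch of the 5-number summary, using Python's standard `statistics` module on an illustrative set of 20 ages. Quartile conventions differ between textbooks and software; the `method="inclusive"` interpolation used here is an assumption, so other tools may give slightly different Q1 and Q3 values.

```python
import statistics

# Illustrative data: ages (years) of 20 students
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

ordered = sorted(ages)  # arrange in ascending order first

# statistics.quantiles with n=4 returns the three quartile cut points
q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")

five_number_summary = (ordered[0], q1, median, q3, ordered[-1])
print("min, Q1, median, Q3, max =", five_number_summary)
```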


Bar Chart
• Slender rectangles to represent frequency of
values of variable
• Rectangles are separate and distinct
• Height of each rectangle corresponds to frequency
• Types of bar chart
– Simple: consisting of a set of non-joining bars
– Component: like simple bar chart except that each
bar is split up into constituent parts
– Multiple: the component values are shown as
separate bars joined and always in the same
sequence



Example : Reasons For Physicians Not Smoking

Reasons Frequency

• Health 25
• Religious 15
• Social 12
• Profession 5
• Others 3



[Figure: Bar chart of reasons for UCH physicians not smoking. x-axis: Health, Religious, Social, Profession, Others; y-axis: Frequency]


Pie Chart

• Graphical device consisting of a circle sub-divided into sectors whose areas are proportional to each component's share of the whole quantity
• Used to show the components of a total
• Sometimes gives a clearer visual impression

• Procedure:
– Draw a circle of any convenient radius, with a marked centre, to represent the total number of observations.
– Divide the circle into sectors according to the frequency of each attribute.
– Use (n/N) × 360° as the angle of each sector, where n is the frequency of the attribute and N is the total frequency.
– Shade sectors in different colours to distinguish them.
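The sector angles can be worked out directly from the frequencies. The sketch below applies (n/N) × 360° to the data on reasons for physicians not smoking; the dictionary name is illustrative.

```python
# Frequencies of reasons for physicians not smoking
reasons = {"Health": 25, "Religious": 15, "Social": 12, "Profession": 5, "Others": 3}

total = sum(reasons.values())  # N

# Sector angle in degrees; n * 360 / N is the same as (n / N) * 360
angles = {reason: n * 360 / total for reason, n in reasons.items()}

# Percentage share of each sector, as usually printed on the chart
shares = {reason: 100 * n / total for reason, n in reasons.items()}

for reason in reasons:
    print(f"{reason:<11} {angles[reason]:6.1f} degrees  {shares[reason]:5.1f}%")
```

The angles always sum to 360°, which is a quick check on the arithmetic.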
Pie Chart …
[Figure: Pie chart of reasons for physicians not smoking. Health 42%, Religious 25%, Social 20%, Profession 8%, Others 5%]


Numerical Summarisation of Data
(Lectures 5 & 6)
Types of Numerical Measures
Central Location / Position / Tendency - a single value that represents (is a good summary of) an entire distribution of data

Spread / Dispersion / Variability - how much the distribution is spread or dispersed from its central location


[Figure: Histogram of number of people by age (classes 0-9 to 90-99), annotated to show central location and spread]


Measures of central tendency
(Lecture 5)
Measures of Central Location
• Definition:
– a single value that represents (is a good summary of) an entire distribution of data

• Also known as:
– “Measure of central tendency”
– “Measure of central position”

• Common measures:
– Arithmetic mean
– Median
– Mode
Arithmetic Mean (Average)
• Most useful measure of central tendency
• Mean retains all original measurements
• Not good when data is skewed
• Sensitive to extreme values
– One data point could make a great change in
sample mean

• Calculation steps:
– Add up data
– Divide by number of observations (pieces of
data) i.e. sample size (n)



Raw data set: Ages of students in a class (by years of age)

s/n  Age    s/n  Age
1    27     11   30
2    30     12   30
3    28     13   27
4    31     14   30
5    28     15   28
6    36     16   31
7    29     17   32
8    37     18   30
9    29     19   29
10   34     20   29

Mean = Sum of all observations / No. of observations

$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}$$

– Usually pronounced “x bar”
Arithmetic Mean
Five systolic blood pressures (mmHg) (n = 5)
120, 80, 90, 110, 95
– Can be represented with mathematical notation:
x1 = 120, x2 = 80, . . . , x5 = 95

– The sample mean is easily computed by adding up the five values and dividing by five. In statistical notation the sample mean is frequently represented by a letter with a line over it:

$$\bar{x} = \frac{120 + 80 + 90 + 110 + 95}{5} = 99 \text{ mmHg}$$
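The same calculation as a short sketch, which also shows how sensitive the mean is to a single extreme value. Replacing 120 by 200 is an illustrative change, not part of the data set.

```python
pressures = [120, 80, 90, 110, 95]  # systolic blood pressures (mmHg)

# Add up the data, then divide by the number of observations
mean = sum(pressures) / len(pressures)
print(f"Mean = {mean} mmHg")

# One extreme value is enough to shift the mean noticeably
with_outlier = [200, 80, 90, 110, 95]
mean_with_outlier = sum(with_outlier) / len(with_outlier)
print(f"Mean with 120 replaced by 200 = {mean_with_outlier} mmHg")
```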


PROCEDURE FOR CALCULATING MEAN
(Grouped data)
i. Find the class mark for each interval
ii. Multiply the class mark of each interval by its corresponding frequency
iii. Add the results in (ii) across all intervals
iv. Divide the result in (iii) by the number of observations or total frequency


Mean – Grouped Data
• The mean of the given data is calculated by
dividing the sum of all observations by the
number of observations
• If x1, x2, x3, ..., xn are the given observations with respective frequencies f1, f2, f3, ..., fn, then
• Sum of observations = f1x1 + f2x2 + ... + fnxn
• Sum of frequencies = f1 + f2 + f3 + ... + fn
• Mean = sum of observations divided by sum of frequencies:

$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$


Example
• Compute the mean of the data below
Age Interval   Frequency (f)
20 - 29        2
30 - 39        3
40 - 49        3
50 - 59        2
Total          10


Solution
Age Interval   Frequency (f)   Class Mark (x)   fx
20 - 29        2               24.5             49
30 - 39        3               34.5             103.5
40 - 49        3               44.5             133.5
50 - 59        2               54.5             109
Total          10                               395

$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i} = \frac{395}{10} = 39.5 \text{ years}$$
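The grouped-mean procedure can be sketched as follows, using the age-interval example. The function name and the tuple layout (lower limit, upper limit, frequency) are illustrative choices.

```python
def grouped_mean(classes):
    """Mean of grouped data given (lower limit, upper limit, frequency) tuples."""
    total_fx = 0.0
    total_f = 0
    for lower, upper, freq in classes:
        class_mark = (lower + upper) / 2   # step i: class mark
        total_fx += class_mark * freq      # steps ii and iii: sum of f * x
        total_f += freq
    return total_fx / total_f              # step iv: divide by total frequency


age_classes = [(20, 29, 2), (30, 39, 3), (40, 49, 3), (50, 59, 2)]
print(f"Mean = {grouped_mean(age_classes)} years")
```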


Median
• The median is the middle number (also called the
50th percentile or second quartile)
– Other percentiles/quartiles can be computed as well, but
are not measures of centre
80 90 95 110 120

• Best measure of central tendency when data is
skewed
• Concentrates on ranks of values rather than
absolute values
• Unaffected by extreme values
– For example, if 120 became 200, the median would remain the same, but
the mean would change to 115
• It is determined mainly by the middle value(s) in a sample and is insensitive to other values, unlike the arithmetic mean, which uses all observations
Calculation of median values for
ungrouped data
• Steps
– Arrange observations in ascending or descending order
– If n is odd: pick the observation in the middle as the median
– If n is even: take the arithmetic mean of the two middle observations


EXAMPLE ON MEDIAN
Find the Median of age at marriage of ten
pregnant women seen at ANC:
42, 52, 31, 35, 50, 40, 27, 43, 35, 28

Step 1: Arrange in Ascending Order


27, 28, 31, 35, 35, 40, 42, 43, 50, 52

Step 2: Pick the two middle observations (n = 10 is even) and take their mean
(35 + 40)/2 = 37.5 years

The Median (P50) is the value that separates the lower 50%
from the upper 50% of the observations
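The steps for ungrouped data translate into a few lines of code. This is a sketch with an illustrative function name; Python's built-in `statistics.median` follows the same rule.

```python
def median(values):
    """Median of ungrouped data: middle value, or mean of the two middle values."""
    ordered = sorted(values)       # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                 # n odd: the single middle observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: mean of the two middle ones


ages_at_marriage = [42, 52, 31, 35, 50, 40, 27, 43, 35, 28]
print(f"Median = {median(ages_at_marriage)} years")
```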
Median – Grouped Data
Step 1: Construct the cumulative frequency distribution
Step 2: Identify the class that contains the median
The median class is the first class whose cumulative frequency is at least n/2
Step 3: Find the median using the following formula:

$$\text{Median} = l_b + \left(\frac{\frac{n}{2} - cf}{f_m}\right) w$$

n = total frequency
lb = the lower class boundary of the median class
cf = cumulative frequency of the class preceding the median class
fm = the frequency of the median class
w = class width or class size

If n is odd, replace n/2 with (n+1)/2 in the equation
Recall the example on mean, and find its
median
Age Interval   Class Boundaries   Frequency (f)   Cumulative frequency (cf)
20 - 29        19.5 – 29.5        2               2
30 - 39        29.5 – 39.5        3               5
40 - 49        39.5 – 49.5        3               8
50 - 59        49.5 – 59.5        2               10
Total                             10

$$\text{Median} = 29.5 + \left(\frac{\frac{10}{2} - 2}{3}\right) 10 = 39.5$$
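A sketch of the grouped-median formula, applied to the same age data. The function name and the tuple layout (lower boundary, upper boundary, frequency) are illustrative; the sketch always uses n/2 as the target position and does not apply the (n+1)/2 adjustment for odd n.

```python
def grouped_median(classes):
    """Median of grouped data given (lower boundary, upper boundary, frequency) tuples."""
    n = sum(freq for _, _, freq in classes)
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, upper, freq in classes:
        # The median class is the first class whose cumulative frequency reaches n/2
        if cumulative + freq >= n / 2:
            width = upper - lower
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("classes must contain at least one observation")


age_classes = [(19.5, 29.5, 2), (29.5, 39.5, 3), (39.5, 49.5, 3), (49.5, 59.5, 2)]
print(f"Median = {grouped_median(age_classes)} years")
```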


MODE
Definition: Mode is the value that occurs most
frequently
• Least used measure of central tendency
• The observation that occurs most frequently
• Easiest measure to understand, explain,
identify
• Mathematical properties rather intractable
• A mode may not exist if there is a large number of possible values
• There may be more than one mode
• Insensitive to extreme values (outliers)
• Does not use all the data
Method for identification:

1. Arrange data into a frequency distribution or histogram, showing all values of the variable and the frequency with which each value occurs

2. Identify the value that occurs most often


Mode
The most frequent value of the variable
Mode = 30

Obs  Age    Obs  Age
1    27     11   30
2    27     12   30
3    28     13   30
4    28     14   30
5    28     15   31
6    29     16   31
7    29     17   32
8    29     18   34
9    29     19   36
10   30     20   37

[Figure: Bar chart of frequency against Age (years), 27 to 37, with the tallest bar at age 30]


Mode – Grouped Data
• Mode is the value that has the highest frequency in a data set.
• For grouped data, the class mode (or modal class) is the class with the highest frequency.
• To find the mode for grouped data, use the following formula:

$$\text{Mode} = l_b + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) w$$

Where:
w is the class width
Δ1 = fm – f1 is the difference between the frequency of the class mode (fm) and the frequency of the class preceding the class mode (f1)
Δ2 = fm – f2 is the difference between the frequency of the class mode (fm) and the frequency of the class succeeding the class mode (f2)
lb is the lower boundary of the class mode
Calculation of Grouped Data - Mode
Example: Based on the grouped data below, find the mode

Time to travel to work   Frequency
1 – 10                   8
11 – 20                  14
21 – 30                  12
31 – 40                  9
41 – 50                  7

Solution:
Based on the table, the class mode is 11 – 20, so

lb = 10.5, Δ1 = (14 – 8) = 6, Δ2 = (14 – 12) = 2 and

$$\text{Mode} = 10.5 + \left(\frac{6}{6 + 2}\right) 10 = 18$$
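The grouped-mode formula can be sketched in code as below. The function name is illustrative, and treating a missing neighbouring class (when the modal class is the first or last) as having frequency 0 is an assumption not stated on the slide.

```python
def grouped_mode(classes):
    """Mode of grouped data given (lower boundary, upper boundary, frequency) tuples."""
    freqs = [freq for _, _, freq in classes]
    m = freqs.index(max(freqs))                 # position of the modal class
    lower, upper, f_m = classes[m]
    # Assumption: a missing neighbour is treated as frequency 0
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0
    delta1 = f_m - f_prev
    delta2 = f_m - f_next
    return lower + delta1 / (delta1 + delta2) * (upper - lower)


travel_times = [(0.5, 10.5, 8), (10.5, 20.5, 14), (20.5, 30.5, 12),
                (30.5, 40.5, 9), (40.5, 50.5, 7)]
print(f"Mode = {grouped_mode(travel_times)}")
```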
Mode can also be obtained from a histogram.
Step 1: Identify the modal class and the bar representing it
Step 2: Draw two cross lines as shown in the diagram.
Step 3: Drop a perpendicular from the intersection of the
two lines until it touches the horizontal axis.
Step 4: Read the mode from the horizontal axis


Geometric Mean
• Useful in laboratory data whereby values are
concentrations of one substance in another,
assessed by dilution techniques in multiples of a
standard number
• The geometric mean of n positive observations is the nth root of the product of the observations. That is:

$$GM = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n}$$
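A minimal sketch of the geometric mean as the nth root of the product. The titre values are hypothetical doubling dilutions chosen only to make the arithmetic easy to follow; Python's standard library also provides `statistics.geometric_mean` for the same purpose.

```python
import math


def geometric_mean(values):
    """nth root of the product of n positive observations."""
    if any(v <= 0 for v in values):
        raise ValueError("all observations must be positive")
    return math.prod(values) ** (1 / len(values))


# Hypothetical titres from a doubling-dilution assay
titres = [2, 4, 8, 16, 32]
gm = geometric_mean(titres)
print(f"Geometric mean = {gm:.2f}, arithmetic mean = {sum(titres) / len(titres):.2f}")
```

For multiplicative data like these, the geometric mean sits at the middle dilution, while the arithmetic mean is pulled upwards by the largest values.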


Relationship among the three main
measures of location

• The relationship between the mean and the median is useful in assessing the symmetry of the distribution of data
– Mean = Median = Mode implies symmetry
– Mean > Median > Mode implies positive skewness
– Mean < Median < Mode implies negative skewness


Comparison of Mode, Median and Mean

Symmetrical:
Mode = Median = Mean

Skewed right:
Mode < Median < Mean

Skewed left:
Mean < Median < Mode



Pearson’s Measure of Skewness

• Skewness = (Mean – Median) / Standard Deviation
• Values:
1. Zero, if a perfect Symmetrical distribution
2. Negative, when negatively skewed or skewed to
the left
3. Positive, when positively skewed or skewed to
the right

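The measure can be computed with the standard `statistics` module. This sketch uses the form given on the slide, (mean – median) / standard deviation, on the ages of the 20 students from the arithmetic mean example; the variable names are illustrative.

```python
import statistics

# Ages (years) of the 20 students from the arithmetic mean example
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

mean = statistics.mean(ages)
median = statistics.median(ages)
sd = statistics.stdev(ages)  # sample standard deviation (divisor n - 1)

skewness = (mean - median) / sd
print(f"mean = {mean}, median = {median}, sd = {sd:.2f}, skewness = {skewness:.3f}")
```

A positive result, as here, indicates that the mean lies above the median, that is, a distribution skewed to the right.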


Measures of Partition
• These are descriptive measures commonly
used for ORDERED observations
– Quartiles
– Deciles
– Percentiles



Measures of Partition
• Percentiles: divide a set of ordered observations into 100 equal parts
– The 20th percentile is the value below which 20% of the observations lie.
• Deciles: divide a set of ordered observations into 10 equal parts
• Quartiles: divide a set of ordered observations into 4 equal parts
– Q1 (1st quartile, 25th percentile), Q2 (2nd quartile, median, 50th percentile), Q3 (3rd quartile, 75th percentile)
For grouped data
Quartile   Percentile   Position (n odd)   Position (n even)   Meaning
Q1         P25          (n+1)/4            n/4                 Middle observation of the lower half of observations, i.e. the value that separates the lower 25% from the upper 75% of the observations
Q2         P50          (n+1)/2            n/2                 Middle observation
Q3         P75          3(n+1)/4           3n/4                Middle observation of the upper half of observations, i.e. the value that separates the lower 75% from the upper 25% of the observations

$$P_k = l_b + \left(\frac{\frac{kn}{100} - cf}{f_p}\right) w; \quad k = 1, 2, \ldots, 99$$

$$Q_k = l_b + \left(\frac{\frac{kn}{4} - cf}{f_q}\right) w; \quad k = 1, 2, 3$$

Here lb, cf and w are defined as for the median, applied to the class containing the percentile (or quartile), and fp (or fq) is the frequency of that class.
Recall that Interquartile Range is a
function of Q1 and Q3
[Figure: Frequency distribution (N on the vertical axis, 0 to 20) with Q1 and Q3 marked and the interquartile interval between them]
Quartiles
Using the same method of calculation as for the median,
we can get the Q1 and Q3 equations as follows:

$$Q_1 = l_b + \left(\frac{\frac{n}{4} - cf}{f_q}\right) w \qquad Q_3 = l_b + \left(\frac{\frac{3n}{4} - cf}{f_q}\right) w$$

Example: Based on the grouped data below, find the Interquartile Range

Time to travel to work   Frequency
1 – 10                   8
11 – 20                  14
21 – 30                  12
31 – 40                  9
41 – 50                  7


Solution:
1st Step: Construct the cumulative frequency distribution

Time to travel to work   Frequency   Cumulative frequency
1 – 10                   8           8
11 – 20                  14          22
21 – 30                  12          34
31 – 40                  9           43
41 – 50                  7           50

2nd Step: Determine Q1 and Q3

$$\text{Class } Q_1: \quad \frac{n}{4} = \frac{50}{4} = 12.5$$

Class Q1 is the 2nd class
Therefore,

$$Q_1 = l_b + \left(\frac{\frac{n}{4} - cf}{f_q}\right) w = 10.5 + \left(\frac{12.5 - 8}{14}\right) 10 = 13.7143 \text{ mins}$$

$$\text{Class } Q_3: \quad \frac{3n}{4} = \frac{3(50)}{4} = 37.5$$

Class Q3 is the 4th class
Therefore,

$$Q_3 = l_b + \left(\frac{\frac{3n}{4} - cf}{f_q}\right) w = 30.5 + \left(\frac{37.5 - 34}{9}\right) 10 = 34.3889 \text{ mins}$$

Interquartile Range

IQR = Q3 – Q1 = 34.3889 – 13.7143 = 20.6746 mins
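The quartile calculation for grouped data can be sketched with one general function that takes the fraction of observations lying below the required point (0.25 for Q1, 0.75 for Q3). The function name and tuple layout (lower boundary, upper boundary, frequency) are illustrative.

```python
def grouped_quantile(classes, fraction):
    """Quantile of grouped data by linear interpolation within the target class.

    classes: (lower boundary, upper boundary, frequency) tuples in ascending order
    fraction: proportion of observations below the quantile, e.g. 0.25 for Q1
    """
    n = sum(freq for _, _, freq in classes)
    target = fraction * n          # e.g. n/4 for Q1, 3n/4 for Q3
    cumulative = 0                 # cumulative frequency before the current class
    for lower, upper, freq in classes:
        if cumulative + freq >= target:
            return lower + (target - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("fraction must lie between 0 and 1")


travel_times = [(0.5, 10.5, 8), (10.5, 20.5, 14), (20.5, 30.5, 12),
                (30.5, 40.5, 9), (40.5, 50.5, 7)]

q1 = grouped_quantile(travel_times, 0.25)
q3 = grouped_quantile(travel_times, 0.75)
iqr = q3 - q1
print(f"Q1 = {q1:.4f}, Q3 = {q3:.4f}, IQR = {iqr:.4f} mins")
```

The same function gives the grouped median with `fraction=0.5` and any percentile Pk with `fraction=k/100`.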
Measures of Variation/Dispersion/Spread
(Lecture 6)
[Figure: Distributional spread from the centre, contrasting a distribution with high spread and one with low spread]
Measures of dispersion / spread / variation

• Range
• Interquartile range
• Variance
• Standard Deviation
• Coefficient of Variation


Range
• The difference between the smallest
observation (minimum value) and the largest
observation (maximum value) in a set of data
• range = maximum – minimum
• Relies on only 2 extreme values
• Easy to calculate
• But it is affected by outliers


Range
[Figure: The range shown as the distance from the minimum to the maximum]
Inter-quartile Range

• The interquartile range is the difference between the 25th percentile (1st quartile) and the 75th percentile (3rd quartile) in a set of data
– Difference between the 3rd quartile and the 1st quartile.
• Concentrates on the middle 50% of the ordered observations
– i.e. it gives an idea of the spread of the middle 50 percent of the observations
• Not affected by outliers.
[Figure: Frequency distribution (N on the vertical axis) annotated with the median, mode, 1st quartile, 3rd quartile, interquartile interval, minimum, maximum and range]


Variance

• Mean of the squared deviations from the mean value.
• Square of the standard deviation.
• Units of measurement are the square of the original units

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$


Standard Deviation

• Square root of variance
• Best measure of variation or dispersion
• Unit same as original units
• Amenable to mathematical and statistical manipulations
Steps to Calculate Variance and
Standard Deviation
$\bar{x}$ : mean
xi : value
n : number of observations
s² : variance
s : standard deviation

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

1. Calculate the arithmetic mean $\bar{x}$
2. Subtract the mean from each observation: $x_i - \bar{x}$
3. Square the difference: $(x_i - \bar{x})^2$
4. Sum the squared differences: $\sum (x_i - \bar{x})^2$
5. Divide the sum of the squared differences by n – 1
6. Take the square root of the variance: $s = \sqrt{s^2}$
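The six steps can be followed line by line in code. This sketch uses the five systolic blood pressures from the arithmetic mean example as input; the variable names are illustrative.

```python
import math

pressures = [120, 80, 90, 110, 95]  # systolic blood pressures (mmHg)
n = len(pressures)

mean = sum(pressures) / n                          # step 1: arithmetic mean
deviations = [x - mean for x in pressures]         # step 2: subtract the mean
squared = [d ** 2 for d in deviations]             # step 3: square each difference
sum_of_squares = sum(squared)                      # step 4: sum the squared differences
variance = sum_of_squares / (n - 1)                # step 5: divide by n - 1
sd = math.sqrt(variance)                           # step 6: square root of the variance

print(f"mean = {mean}, variance = {variance} mmHg^2, sd = {sd:.2f} mmHg")
```

`statistics.variance` and `statistics.stdev` from the standard library use the same n – 1 divisor and can serve as a cross-check.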
Example:
The frequency distribution of the
weight of 100 patients with
Rheumatoid Arthritis is as follows:

Weight (kg)   Frequency   Class Mark
60 - 69       5           64.5
70 - 79       15          74.5
80 - 89       20          84.5
90 - 99       25          94.5
100 - 109     20          104.5
110 - 119     15          114.5

Calculate the mean, variance and standard deviation
SOLUTION

$$\text{Mean} = \frac{\sum f_i x_i}{\sum f_i} = \frac{5(64.5) + 15(74.5) + 20(84.5) + 25(94.5) + 20(104.5) + 15(114.5)}{100}$$

$$= \frac{322.5 + 1117.5 + 1690 + 2362.5 + 2090 + 1717.5}{100} = \frac{9300}{100} = 93 \text{ kg}$$

$$\text{Variance} = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i - 1} = \frac{5(64.5 - 93)^2 + \cdots + 15(114.5 - 93)^2}{100 - 1} = \frac{20275}{99} = 204.798 \text{ kg}^2$$

$$\text{Standard deviation} = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i - 1}} = \sqrt{\frac{20275}{99}} = 14.31 \text{ kg}$$
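The grouped-data solution can be checked with a short sketch that applies the same definitional formula to the class marks and frequencies. The variable names are illustrative.

```python
import math

# (class mark in kg, frequency) for the 100 Rheumatoid Arthritis patients
weights = [(64.5, 5), (74.5, 15), (84.5, 20), (94.5, 25), (104.5, 20), (114.5, 15)]

n = sum(f for _, f in weights)
mean = sum(f * x for x, f in weights) / n

# Sum of f * (x - mean)^2 across classes, then divide by (total frequency - 1)
sum_sq_dev = sum(f * (x - mean) ** 2 for x, f in weights)
variance = sum_sq_dev / (n - 1)
sd = math.sqrt(variance)

print(f"mean = {mean} kg, variance = {variance:.3f} kg^2, sd = {sd:.2f} kg")
```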


Example 2: Find the variance and standard deviation for the following data:
No. of orders   f
10 – 12         4
13 – 15         12
16 – 18         20
19 – 21         14
Total           n = 50

Solution:

No. of orders   f        x    fx    fx²
10 – 12         4        11   44    484
13 – 15         12       14   168   2352
16 – 18         20       17   340   5780
19 – 21         14       20   280   5600
Total           n = 50        832   14216


Variance,

$$s^2 = \frac{\sum f x^2 - \frac{\left(\sum f x\right)^2}{n}}{n - 1} = \frac{14216 - \frac{(832)^2}{50}}{50 - 1} = 7.5820$$

Standard Deviation,

$$s = \sqrt{s^2} = \sqrt{7.5820} = 2.75$$

Thus, the standard deviation of the number of orders received at the office of this mail-order company during the past 50 days is 2.75.
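The computational (shortcut) formula used in Example 2 needs only the column totals Σfx and Σfx². A sketch, with illustrative names:

```python
import math

# (class mark x, frequency f) for the number of orders per day
orders = [(11, 4), (14, 12), (17, 20), (20, 14)]

n = sum(f for _, f in orders)
sum_fx = sum(f * x for x, f in orders)          # column total of fx
sum_fx2 = sum(f * x ** 2 for x, f in orders)    # column total of fx^2

# Shortcut formula: s^2 = (sum fx^2 - (sum fx)^2 / n) / (n - 1)
variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

print(f"sum fx = {sum_fx}, sum fx^2 = {sum_fx2}")
print(f"variance = {variance:.4f}, sd = {sd:.2f}")
```

This form is algebraically equivalent to summing the squared deviations from the mean, but avoids computing each deviation separately.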


– Often abbreviated S, SD or sd
– The smaller the s, the lesser the variability and the better the mean summarises the data
– s measures the spread about the mean
– s can equal 0 only if there is no spread, i.e. all n observations have the same value
– The units of s are the same as the units of the original observations
COEFFICIENT OF VARIATION:

It is a measure of spread that corrects for differences in magnitude or units of observations
– It is dimensionless, and therefore useful for comparing the spread of two or more data sets when the units are different
– The lower the coefficient of variation, the smaller the spread

It is defined as the ratio of the standard deviation to the mean of a data set; mathematically expressed as:

$$CV = \frac{s}{\bar{x}}$$
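A short sketch comparing the spread of two data sets measured in different units, using the summary values from the two worked examples (patient weights in kg, and daily orders as counts). Expressing the ratio as a percentage is a common convention and an assumption here; the function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Ratio of standard deviation to mean, expressed as a percentage (assumed convention)."""
    return sd / mean * 100


# Summary values from the worked examples
cv_weight = coefficient_of_variation(sd=14.31, mean=93)         # weights in kg
cv_orders = coefficient_of_variation(sd=2.75, mean=832 / 50)    # orders per day

print(f"CV of weights = {cv_weight:.1f}%")
print(f"CV of orders  = {cv_orders:.1f}%")
```

Although the standard deviations (14.31 kg and 2.75 orders) cannot be compared directly, the two coefficients of variation can, because the units cancel.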


Choosing appropriate descriptive statistics
For single-peaked and symmetric distribution
– Position
• mean, median and mode are identical or nearly
equal
– Dispersion
• Standard Deviation
For data with significant outliers
– Position
• median is more informative than the mean
– Dispersion
• Range
• Interquartile interval



Review question
• What summary tools are the most appropriate
to use for the following sets of data?
– Salaries of physicians in a clinic
– Test scores of all students in a qualifying exam
– Serum sodium levels of healthy individuals
– Presence of diarrhoea in a group of children
– Disease stage of cervical cancer patients in UCH



Exercise:
Based on the grouped data below, find the mean,
median, standard deviation, coefficient of variation,
and Pearson measure of skewness

Time to travel to work Frequency

1 – 10 8
11 – 20 14
21 – 30 12
31 – 40 9
41 – 50 7



Assignment:
Based on the frequency distribution table below, find
the mean, median, interquartile range, standard
deviation, coefficient of variation, and Pearson
measure of skewness
Distribution of the number of previous pregnancies of a
group of women aged 30–34 attending an antenatal clinic
No. of previous pregnancies   No. of women
0                             18
1                             27
2                             31
3                             19
4                             5
