Significance of statistical activities
To deal with disorderly and unsystematic information: individual observations are
uncertain, but the whole follows rules.
Medical statistics finds rules and makes predictions through the analysis of medical
data.
Statistical analysis includes statistical description and statistical inference.
Statistical description
Definition: to describe and calculate the quantitative characteristics and distribution
regularities of the original data using appropriate statistical tables, charts and
indicators.
Purpose: to reveal the inner rules and characteristics of data in simple and
understandable ways.
It is the foundation of statistical inference.
Different approaches are used for different types of data.
Descriptive statistics of numerical data
Numerical data (quantitative data)
Continuous: the values are measurements and can take any value within a certain range,
e.g. body weight, blood pressure, concentration of CO2 …
Discrete: the values for each individual are non-negative integers, e.g. pulse of normal
people, the number of children in a family …
Statistical description of numerical data
(univariate numerical data )
1. Using frequency table/chart
2. Using statistical indicators
Frequency distribution
Frequency: in a sample, the number of times the same or a similar event appears, namely
the number of observations with one measured value, or the number of subjects in a
given range. It describes how often the observations appear in one group.
Frequency distribution table:
divides the original data into appropriate groups according to the measured values and
counts the individuals in each group, in order to understand the range, form and
regularity of the data.
Methods: manual or by software
1. Frequency distribution of discrete variable
e.g. times of antenatal examination for 96 pregnant women in a mountainous area, 2008.
0, 3, 2, 0, 1, 5, 6, 3, 2, 4, 1, 0, 6, 5, 1, 3, 3, …, 4 (a total of 96 values)
Table: antenatal examinations for 96 pregnant women in 2008
Times   Frequency   Frequency rate (%)   Cumulative frequency   Cumulative frequency rate (%)
0        4           4.2                  4                       4.2
1        7           7.3                 11                      11.5
2       11          11.5                 22                      22.9
3       13          13.5                 35                      36.5
4       26          27.1                 61                      63.5
5       23          24.0                 84                      87.5
6       12          12.5                 96                     100.0
Total   96         100.0
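The columns of the table above can be recomputed from the frequencies alone. The following is a minimal sketch in Python; the function name `frequency_table` and the dictionary layout are illustrative choices, not part of the original notes.

```python
def frequency_table(counts):
    """Return rows of (value, frequency, rate %, cumulative frequency, cumulative rate %)."""
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for value in sorted(counts):
        f = counts[value]
        cumulative += f
        rows.append((value, f, round(100 * f / total, 1),
                     cumulative, round(100 * cumulative / total, 1)))
    return rows

# Frequencies taken from the antenatal examination table (n = 96).
antenatal_counts = {0: 4, 1: 7, 2: 11, 3: 13, 4: 26, 5: 23, 6: 12}
antenatal_table = frequency_table(antenatal_counts)
for row in antenatal_table:
    print(row)
```

Each printed row should match the corresponding line of the table, rounded to one decimal place.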
Tips
Multiple values at the tail end can be combined into one group.
Graph of frequency distribution by bar chart
[Figure 2: antenatal examination for 96 women. Bar chart; x-axis: times of antenatal examination; y-axis: frequency rate (%).]
2. Frequency distribution of continuous variable
To divide data into continuous subgroups, count the number in each subgroup, then
form a table that indicates the data distribution.
e.g. height values of 120 boys at the age of ten
142.3 156.6 142.7 145.7 138.2 141.6 142.5 130.5 132.1 135.5
134.5 148.8 134.4 148.8 137.9 151.3 140.8 149.8 143.6 149.0
145.2 141.8 146.8 135.1 150.3 133.1 142.7 143.9 142.4 139.6
151.1 144.0 145.4 146.2 143.3 156.3 141.9 140.7 145.9 144.4
141.2 141.5 148.8 140.1 150.6 139.5 146.4 143.8 150.0 142.1
143.5 139.2 144.7 139.3 141.9 147.8 140.5 138.9 148.9 142.4
134.7 147.3 138.1 140.2 137.4 145.1 145.8 147.9 146.7 143.4
150.8 144.5 137.1 147.1 142.9 134.9 143.6 142.3 143.3 140.2
125.9 132.7 152.9 147.9 141.8 141.4 140.9 141.4 146.7 138.7
154.2 137.9 139.9 149.7 147.5 136.9 148.1 144.0 137.4 160.9
134.7 138.5 138.9 137.7 138.5 139.6 143.5 142.9 146.5 145.4
129.4 142.5 141.2 148.9 154.0 147.7 152.3 146.6 139.2 139.9
1. To calculate the range
The range is the difference between the maximum and minimum values of all individuals
in the sample. It describes the extent of variation of the data.
Range in the present example:
R = 160.9 - 125.9 = 35.0 cm
2. Divide subgroups
(1) To decide the group number
The aim of the frequency table is to simplify the data, so the number of groups should
be neither too many nor too few.
The appropriate group number depends on the sample size n: when n is below 50, 5~8
groups may be used; when n is 50 or above, 9~15 groups. Usually we adopt 10.
We need to create a number of class intervals. We can use what is called Sturges' rule.
The number of class intervals:
$k = 1 + 3.32 \times \log_{10}(n)$, where n is the sample size.
(2) determine the group interval (i)
interval = range / the group number
rounded to an integer or a neat number according to professional habit
i = 35/10 = 3.5 ≈ 4
(3) determine the group limits
lower limit: starting point of each group
upper limit: ending point of each group
upper limit = lower limit + interval
The lower limit of the 1st group is a neat number below the minimum value in the
sample. In this example the lower limit of the 1st group is 125 (< 125.9), and its
upper limit = 125 + 4 = 129.
Groups must not overlap: each group is a semi-open interval that includes its lower
limit but not its upper limit.
3. Calculate the frequencies of each group
and list the whole frequency table
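The steps above (group number, interval, group limits, counting) can be sketched in Python as follows. This is an illustration under stated assumptions: the helper names `sturges_groups` and `grouped_frequencies` are hypothetical, and only the first ten heights of the sample are used to keep the example short.

```python
import math

def sturges_groups(n):
    """Number of class intervals by Sturges' rule: k = 1 + 3.32 * log10(n)."""
    return 1 + 3.32 * math.log10(n)

def grouped_frequencies(values, lower, width):
    """Count values in semi-open groups [lower + j*width, lower + (j+1)*width)."""
    n_groups = int((max(values) - lower) // width) + 1
    counts = [0] * n_groups
    for v in values:
        counts[int((v - lower) // width)] += 1
    return [(lower + j * width, lower + (j + 1) * width, c)
            for j, c in enumerate(counts)]

# First ten heights (cm) from the example data; the full sample has n = 120.
heights = [142.3, 156.6, 142.7, 145.7, 138.2, 141.6, 142.5, 130.5, 132.1, 135.5]

k = sturges_groups(120)                  # about 7.9 for n = 120
data_range = 160.9 - 125.9               # range of the full sample, 35 cm
interval = math.ceil(data_range / 10)    # 35 / 10 = 3.5, rounded up to 4
height_groups = grouped_frequencies(heights, lower=125, width=interval)
for lo, hi, count in height_groups:
    print(f"{lo}~{hi}: {count}")
```

For n = 120, Sturges' rule suggests about 8 groups, while the example adopts the customary 10; both lie in a reasonable range, and the choice mainly affects how fine the table is.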
Graph of frequency distribution by histogram: symmetric distribution
[Figure: histogram of the heights; x-axis group limits: 125, 129, 133, 137, 141, 145, 149, 153, 157.]
Frequency distribution (unimodal)
To summarize the data: the range, the maximum value, the minimum value, the central
position, the dispersion tendency, etc.
1. Central position: the tendency of data to gather around one position. The heights
of most boys concentrate in the middle groups.
2. Dispersion tendency: the dispersion and variation of data. A few boys have very low
or very high heights, and the frequencies gradually decrease on both sides.
Common unimodal distribution of data
1. Symmetrical distribution:
bilateral symmetry, the central position locates at
the very middle place.
-- Normal distribution: full symmetry
2. Skewed distribution :
the central position locates at one side.
Positive skewed distribution: peak on the left,
tail on the right .
Negative skewed distribution: peak on the right,
tail on the left.
[Figure 2-1: frequency distribution of serum total cholesterol in 101 normal adult women. Symmetrical distribution.]
[Figure 2-2: incubation period (h) of 59 patients with streptococcal pharyngitis; y-axis: cases. Positive skewed distribution.]
[Figure: myohemoglobin of 101 normal people; y-axis: frequency. Negative skewed distribution.]
Characteristic parameters of distribution
1. Coefficient of skewness (SKEW):
= 0, symmetrical distribution
> 0, positive skewed distribution (tail on the right)
< 0, negative skewed distribution (tail on the left)
2. Coefficient of kurtosis (KURT):
= 0, peak as in the normal distribution
> 0, the peak is sharper than the normal distribution
< 0, the peak is flatter than the normal distribution
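The notes do not give formulas for these two coefficients. As a hedged sketch, the Python below uses the simple moment-based definitions (third and fourth central moments divided by powers of the second moment, with 3 subtracted from the kurtosis so that a normal distribution gives 0). This choice is an assumption; statistical packages often apply small-sample corrections and may report slightly different values. Under this convention, a longer right tail gives a positive skewness.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3
    return skew, kurt

# Illustrative data sets (hypothetical, not from the notes).
right_tailed = [1, 1, 1, 2, 10]   # peak on the left, tail on the right
symmetric = [1, 2, 3, 4, 5]

skew_right, _ = skewness_and_kurtosis(right_tailed)
skew_sym, kurt_sym = skewness_and_kurtosis(symmetric)
print(skew_right, skew_sym, kurt_sym)
```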
5. Application of frequency table
To detect the distribution and characteristics of different types of data.
To identify some suspicious data.
To estimate probability from the frequency rate of a specific group.
To calculate statistical indicators and help statistical analysis.
Statistical indicators for numerical data
I. Indicators for description of Central position
known as the average numbers, to describe the
central tendency and average level.
Representative value of average level, used for
comparison among different groups.
Commonly used average numbers :
1. Arithmetic mean (also called mean)
2. Geometric mean
3. Median
(1) Mean
Mean of population: written as $\mu$
Mean of sample: written as $\bar{X}$
Application condition:
Numerical data in a normal or approximately normal distribution
1. direct calculation
$\mu = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum X}{N}$
$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum X}{n}$
2. weighting method (large sample size)
When making a frequency table, the actual measured values in each group can be
substituted by the class mid-value. If the frequency of a group is f, the total of
that group is taken as class mid-value × f.
Formula:
$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{n} = \frac{\sum f x}{n}$
Assuming the measured values are evenly distributed within each group, the class
mid-value = (lower limit + upper limit)/2 is used in place of the actual values.
The result of the weighting method is close to that of direct calculation when the
sample size is large.
Precise calculation or approximate calculation ?
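To make the comparison concrete, the sketch below computes the mean both ways for the first ten heights of the example, grouped with lower limit 125 and interval 4. The function name `grouped_mean` and the restriction to ten values are illustrative assumptions.

```python
def grouped_mean(groups):
    """Weighting method: groups is a list of (lower limit, upper limit, frequency)."""
    n = sum(f for _, _, f in groups)
    return sum(f * (lo + hi) / 2 for lo, hi, f in groups) / n

# First ten heights (cm) from the example data.
heights = [142.3, 156.6, 142.7, 145.7, 138.2, 141.6, 142.5, 130.5, 132.1, 135.5]

# The same ten values counted in groups of width 4 starting at 125.
height_groups = [(125, 129, 0), (129, 133, 2), (133, 137, 1), (137, 141, 1),
                 (141, 145, 4), (145, 149, 1), (149, 153, 0), (153, 157, 1)]

direct_mean = sum(heights) / len(heights)     # precise calculation
weighted_mean = grouped_mean(height_groups)   # approximate calculation
print(direct_mean, weighted_mean)
```

With only ten values the two results differ by about 0.2 cm; the difference is expected to shrink as the sample size grows.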
Characteristics of mean
For a symmetrical distribution, $\bar{X}$ is located in the center. It is likely to be
affected by extreme values.
$\sum (x - \bar{x}) = 0$
$\sum (x - \bar{x})^2 < \sum (x - a)^2 \quad (a \neq \bar{x})$
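Both properties can be checked numerically. The sketch below uses the twelve antenatal examination counts listed earlier in the notes; the helper name `sum_of_squares_about` is an illustrative choice.

```python
def sum_of_squares_about(values, a):
    """Sum of squared deviations of the values from a reference point a."""
    return sum((v - a) ** 2 for v in values)

# The first twelve antenatal examination counts listed in the notes.
exams = [0, 3, 2, 0, 1, 5, 6, 3, 2, 4, 1, 0]
mean = sum(exams) / len(exams)                  # 2.25

deviation_sum = sum(v - mean for v in exams)    # property 1: equals 0
ss_mean = sum_of_squares_about(exams, mean)     # property 2: the minimum
ss_two = sum_of_squares_about(exams, 2)
ss_three = sum_of_squares_about(exams, 3)
print(deviation_sum, ss_mean, ss_two, ss_three)
```

Any reference point other than the mean gives a larger sum of squares, which is why the mean is the natural center for squared-deviation measures such as the variance.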
Applications of mean
To reflect the average level of homogeneous
observed values
For comparison as representative value
To describe the central position of unimodal
symmetrical distribution (be influenced by
outliers)
(2) Geometric mean (G)
often used to describe: ① distributions that become symmetrical after logarithmic
transformation; ② titer data, often in equal ratios (antibody titer data,
bacteriological data, serum data, material concentration, etc.)
Formula:
$G = \sqrt[n]{X_1 X_2 \cdots X_n}$
$G = \lg^{-1}\left(\frac{\sum \lg X}{n}\right)$
For frequency table data:
$G = \lg^{-1}\left(\frac{\sum f \lg X}{\sum f}\right)$
e.g.
The reciprocals of the serologic titers of 10 persons are 2, 2, 4, 4, 8, 8, 8, 8, 32, 32;
calculate the average titer.
$G = \lg^{-1}\left(\frac{\lg 2 + \lg 2 + \lg 4 + \lg 4 + \lg 8 + \lg 8 + \lg 8 + \lg 8 + \lg 32 + \lg 32}{10}\right) \approx 7$
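The titer example can be reproduced with a few lines of Python. This is a minimal sketch; `geometric_mean` is written out by hand here to mirror the logarithmic formula rather than to suggest a particular library.

```python
import math

def geometric_mean(values):
    """Antilog of the mean of base-10 logarithms."""
    return 10 ** (sum(math.log10(v) for v in values) / len(values))

# Reciprocals of the serologic titers of the 10 persons in the example.
titers = [2, 2, 4, 4, 8, 8, 8, 8, 32, 32]
g = geometric_mean(titers)
print(round(g, 2))
```

The result is just under 7, so the average titer is reported as about 1:7.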
(3) Median (M)
Measured values are arranged from smallest to largest, and the middle value is the
median. Among all values, the numbers of values ≥ M and ≤ M are equal (each
proportion is 50%).
Direct calculation:
$M = \frac{1}{2}\left(X_{(n/2)} + X_{(n/2+1)}\right)$, n is an even number
$M = X_{((n+1)/2)}$, n is an odd number
e.g.
apo_B in VLDL measured in five people (mg/dl): 0.84, 2.85, 5.46, 8.58, 9.60
M = 5.46 (mg/dl)
if four people are measured: 0.84, 2.85, 8.58, 9.60
M = (2.85+8.58)/2 = 5.72 (mg/dl)
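The direct calculation for odd and even n can be sketched as follows; the function name `median` is an illustrative choice and the data are the apo_B values from the example.

```python
def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

five_people = [0.84, 2.85, 5.46, 8.58, 9.60]
four_people = [0.84, 2.85, 8.58, 9.60]
m_five = median(five_people)
m_four = median(four_people)
print(m_five, m_four)
```

The even-sample result is 5.715, which the notes round to 5.72 mg/dl.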
Data of frequency table
Table 1 serum triglyceride of 630 girls (mg/dl)
Value     Frequency   Cumulative f   Cumulative f rate (%)
0.10 ~      27           27             4.3
0.40 ~     169          196            31.1
0.70 ~     167          363            57.6   (M lies in this group)
1.00 ~      94          457            72.5
1.30 ~      81          538            85.4
1.60 ~      42          580            92.1
1.90 ~      28          608            96.5
2.20 ~      14          622            98.7
2.50 ~       4          626            99.4
2.80 ~       3          629            99.8
3.10 ~       1          630           100.0
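[Figure: histogram of serum triglyceride (mg/dL) for the 630 girls; y-axis: frequency. Values are assumed to be uniformly distributed within the median group, and M is located at the cumulative frequency 630 × 0.5.]
$M = L + \frac{i}{f_M}\left(0.5n - \sum f_L\right)$
where L is the lower limit of the median group, i is the group interval, $f_M$ is the
frequency of the median group, and $\sum f_L$ is the cumulative frequency below the
median group.
$M = 0.70 + 0.30 \times \frac{630 \times 0.5 - 196}{167} = 0.914$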
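The grouped-median calculation can be sketched in Python using the frequencies of Table 1. The function name `grouped_median` is hypothetical, and the code assumes, as the figure does, that values are spread uniformly within the median group.

```python
def grouped_median(lower_limits, width, freqs):
    """M = L + i / f_M * (0.5 * n - cumulative frequency below the median group)."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("frequency table is empty")
    cumulative = 0
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= 0.5 * n:
            return lower + width / f * (0.5 * n - cumulative)
        cumulative += f

# Lower limits and frequencies from Table 1 (serum triglyceride, 630 girls).
lower_limits = [0.10, 0.40, 0.70, 1.00, 1.30, 1.60, 1.90, 2.20, 2.50, 2.80, 3.10]
freqs = [27, 169, 167, 94, 81, 42, 28, 14, 4, 3, 1]
m_grouped = grouped_median(lower_limits, 0.30, freqs)
print(round(m_grouped, 3))
```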
Application of M
The central position of quantitative data in any distribution can be described by the
median.
The median is mainly used for skewed distributions.
By comparison, the mean makes full use of all measured values and is more stable.
The relationship of mean and median
Normal distribution: $\bar{X} = M$
Positive skewed distribution: $\bar{X} > M$
Negative skewed distribution: $\bar{X} < M$
(4) Mode
Among all measured values, the value that appears most often.
The mode is always at the peak of a frequency distribution.
(5) Percentile
Percentile is a kind of position index.
Measured values are arranged from smallest to largest; the value at the X% position is
written as $P_X$.
$P_X$ cuts all the values into two groups: X% of the values are ≤ $P_X$, and the
remaining (100-X)% are ≥ $P_X$.
Median is a particular case of percentile, it
is the P50 .
For continuous numerical data in a frequency table, it can be calculated by the
formula:
$P_X = L_X + \frac{i_X}{f_X}\left(n \cdot X\% - \sum f_L\right)$
Multiple percentiles can be used to indicate dispersion, define a reference range,
divide data into ranks …
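As a sketch of the percentile formula, the Python below applies it to the serum triglyceride frequency table (Table 1) to obtain P25, P50 and P75. The function name `grouped_percentile` is an illustrative assumption, and values are assumed to be evenly spread within each group.

```python
def grouped_percentile(lower_limits, width, freqs, x):
    """P_X = L_X + i / f_X * (n * X% - cumulative frequency below the group)."""
    n = sum(freqs)
    target = n * x / 100
    cumulative = 0
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= target:
            return lower + width / f * (target - cumulative)
        cumulative += f
    raise ValueError("percentile could not be located in the table")

# Lower limits and frequencies from Table 1 (serum triglyceride, 630 girls).
lower_limits = [0.10, 0.40, 0.70, 1.00, 1.30, 1.60, 1.90, 2.20, 2.50, 2.80, 3.10]
freqs = [27, 169, 167, 94, 81, 42, 28, 14, 4, 3, 1]

p25 = grouped_percentile(lower_limits, 0.30, freqs, 25)
p50 = grouped_percentile(lower_limits, 0.30, freqs, 50)
p75 = grouped_percentile(lower_limits, 0.30, freqs, 75)
print(round(p25, 3), round(p50, 3), round(p75, 3))
```

P50 reproduces the grouped median of 0.914, and P75 - P25 gives a measure of spread that does not depend on the extreme values.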
II. Indicators for description of dispersion/variation
The overall data distribution cannot be understood from average indicators alone.
e.g. systolic pressure (mmHg) measured on five consecutive days for two patients
patient A: 162 145 178 142 186 ( $\bar{x}_A$ = 162.6 )
patient B: 164 160 163 159 166 ( $\bar{x}_B$ = 162.4 )
The mean systolic pressures of the two patients are similar; however, the fluctuation
for patient A is larger.
1. Overall difference: Range (R), Inter-quartile Range (Q)
2. Average difference:
Mean difference (l), Sum of squares (SS), Mean square deviation (σ², s²),
Standard deviation (σ, s), Coefficient of variation (CV).
(1) Range, R=max-min
$R_A = 186 - 142 = 44$ (mmHg)
$R_B = 166 - 159 = 7$ (mmHg)
Advantages: simple and clear
Disadvantage: rough
① uses only the two extreme values and does not make full use of the data.
② R may increase when the sample size is enlarged, because there is more chance of
including extreme values.
③ unstable when the data are in a skewed distribution.
(2) Q (inter- quartile range)
The range might be affected by uncertain data at the two ends, so we cut off 25% of
the data at each tail and get:
Q = P75 - P25 = Qu - Ql
e.g. the P25 and P75 of 120 elderly people are 63.2 mg/dl and 135.7 mg/dl respectively,
so:
Q = 135.7 - 63.2 = 72.5 (mg/dl)
Q is more stable than R, but it does not use every measured value either.
(3) Mean Difference
$l = \frac{\sum |X - \bar{X}|}{n}$
Patient A: $l_A = \frac{|162 - 162.6| + |145 - 162.6| + \cdots + |186 - 162.6|}{5} = 15.52$ (mmHg)
Patient B: $l_B = \frac{|164 - 162.4| + |160 - 162.4| + \cdots + |166 - 162.4|}{5} = 2.32$ (mmHg)
Characteristics: easy to understand, but it introduces the absolute value, which is
not convenient for arithmetical operations.
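The two mean differences can be reproduced with a short Python sketch; the function name `mean_difference` follows the term used in the notes and is otherwise an illustrative choice.

```python
def mean_difference(values):
    """Average absolute deviation of the values from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

# Systolic pressure (mmHg) over five days, from the example.
patient_a = [162, 145, 178, 142, 186]
patient_b = [164, 160, 163, 159, 166]
l_a = mean_difference(patient_a)
l_b = mean_difference(patient_b)
print(round(l_a, 2), round(l_b, 2))
```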
(4) Sum of squares (SS)
Formula:
$SS = \sum (x - \bar{x})^2$
Can be transformed into:
$\sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$
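A quick numerical check that the definition and the transformed form agree, using the two patients from the systolic pressure example. The function names are illustrative.

```python
def ss_definition(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def ss_shortcut(values):
    """Sum of x^2 minus (sum of x)^2 / n."""
    return sum(v * v for v in values) - sum(values) ** 2 / len(values)

patient_a = [162, 145, 178, 142, 186]
patient_b = [164, 160, 163, 159, 166]
ss_a = ss_shortcut(patient_a)
ss_b = ss_shortcut(patient_b)
print(ss_a, ss_definition(patient_a), ss_b, ss_definition(patient_b))
```

The shortcut needs only the running totals of x and x², which is why it was preferred for hand calculation.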
(5) Variance / mean square deviation
Variance of population: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
In practice, μ is usually unknown, so $\bar{X}$ is used in the calculation, and we get
the variance of the sample, $S^2$:
$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
n-1 is called the degrees of freedom (df): the number of values that can take
arbitrary values (change freely).
If the values are restricted by k conditions, df = n-k.
When calculating S, there are n values in the sample and n deviations from the mean;
however, the deviations are limited by one condition, $\sum (x - \bar{x}) = 0$, and
thus only n-1 of them can take arbitrary values.
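The restriction can be illustrated numerically: once n-1 deviations from the mean are known, the last one is fixed. The sketch below uses patient B from the systolic pressure example; the helper name `last_deviation` is hypothetical.

```python
def last_deviation(first_deviations):
    """Deviations from the mean sum to zero, so the last one is fixed by the others."""
    return -sum(first_deviations)

patient_b = [164, 160, 163, 159, 166]
mean_b = sum(patient_b) / len(patient_b)
deviations = [v - mean_b for v in patient_b]

# Only the first n-1 deviations are free; the fifth is determined by them.
forced = last_deviation(deviations[:-1])
print(forced, deviations[-1])
```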
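(6) Standard deviation (SD)
Taking the square root of the variance restores the unit of the variable.
SD of population: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
μ is unknown, SD of sample:
$S = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}$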
e.g. systolic pressure (mmHg) measured on five consecutive days for two patients
patient A: 162 145 178 142 186 ( xA =162.6 )
patient B: 164 160 163 159 166 ( xB =162.4 )
For Patient A:
$\sum X = 813$, $\sum X^2 = 133713$, $n = 5$
$S = \sqrt{\frac{133713 - 813^2 / 5}{5 - 1}} = 19.49$ (mmHg)
For Patient B:
$S = 2.88$ (mmHg)
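The two standard deviations can be verified with the shortcut formula and cross-checked against Python's standard library. The function name `sample_sd` is an illustrative choice; `statistics.stdev` is the standard-library sample standard deviation (divisor n-1).

```python
import math
import statistics

def sample_sd(values):
    """Sample SD via the shortcut: sqrt((sum x^2 - (sum x)^2 / n) / (n - 1))."""
    n = len(values)
    ss = sum(v * v for v in values) - sum(values) ** 2 / n
    return math.sqrt(ss / (n - 1))

patient_a = [162, 145, 178, 142, 186]
patient_b = [164, 160, 163, 159, 166]
sd_a = sample_sd(patient_a)
sd_b = sample_sd(patient_b)
print(round(sd_a, 2), round(sd_b, 2))
print(statistics.stdev(patient_a), statistics.stdev(patient_b))
```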
Advantages of S
1. Can be used to calculate the pooled S (when the population variances are the same)
2. Together with $\bar{x}$, generally describes a normal distribution
3. Compares variation when the means are similar and the units are the same.
Application: to describe variation, calculate the standard error and CV, describe the
normal distribution, and estimate the reference range.
(7) Coefficient of variation, $CV = \frac{S}{\bar{x}} \times 100\%$
Application condition:
① units are different.
② units are the same, but the means differ greatly.
Used to compare variation between two or more indicators/groups.
Examples
1. Comparison of height and weight of 40 boys
2. Blood pressure measurement: the mean of diastolic pressure is 77.5 mmHg, SD is
10.7 mmHg; the mean of systolic pressure is 122.9 mmHg, SD is 17.1 mmHg. Try to compare
the variation.
diastolic pressure: $CV = \frac{10.7}{77.5} \times 100\% = 13.8\%$
systolic pressure: $CV = \frac{17.1}{122.9} \times 100\% = 13.9\%$
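The blood pressure comparison can be sketched in Python as follows; the function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: SD divided by the mean, times 100."""
    return 100 * sd / mean

cv_diastolic = coefficient_of_variation(10.7, 77.5)
cv_systolic = coefficient_of_variation(17.1, 122.9)
print(round(cv_diastolic, 1), round(cv_systolic, 1))
```

Although the two SDs differ considerably, the relative variation is almost the same once each SD is scaled by its own mean.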
Commonly used combinations of descriptive indicators for numerical data
Normal distribution: use mean & SD, written jointly as $\bar{x} \pm S$
Skewed distribution: use median & quartile range, reported together as M and Q