
20-Introduction To Analysis of Variance

The document provides an introduction to analysis of variance (ANOVA). It defines the F-statistic as the ratio of the variance between groups to the variance within groups. A higher F-statistic indicates more variation between groups relative to within groups, suggesting the group means are likely different. The ANOVA procedure involves formulating hypotheses, calculating the F-statistic, determining the p-value, and making a conclusion about whether to reject or fail to reject the null hypothesis of equal group means. An example calculates F for a study comparing rat weights across 5 different diets.


Introduction to analysis

of variance
[email protected]
+265993375505
Multi-sample z- and t-tests
• We can use z- or t-tests to compare more than two samples by testing them in pairs.
• If the number of samples is n, the number of possible pairwise comparisons is given by nC2.

Number of samples    Possible pairwise comparisons (nC2)
2                    1
3                    3
4                    6
5                    10
10                   45
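The pairwise-comparison counts in the table above can be checked with a short sketch using the standard library:

```python
import math

# Number of possible pairwise comparisons among n samples: nC2 = n(n - 1)/2
for n in [2, 3, 4, 5, 10]:
    print(n, "samples ->", math.comb(n, 2), "comparisons")
```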
A review of the Type I Error
• Remember that when you test a hypothesis at confidence level (1 − α), you accept a probability α of committing a Type I error.
• That is to say, you might be rejecting the null hypothesis wrongly with a probability of α.
• When you are dealing with a multi-sample situation and you carry out n z- or t-tests, the probability of correctly retaining a true null hypothesis in all n tests, (1 − α)^n, becomes smaller, and we stand a higher chance of making a wrong decision overall.
A review of the Type I Error
• The Type I error for the combined set of these comparisons is given by 1 − (1 − α)^n and is called the experiment-wise error.
• At α = 0.05 this equals 0.05 = α for a two-sample z- or t-test (1 comparison), 0.143 > α for a three-sample test (3 comparisons) and 0.265 > α for a four-sample test (6 comparisons).
• To maintain the Type I error at 0.05 for the whole experiment, which is what we want, we use the F statistic to confidently compare more than two samples.
• The F-statistic test is popularly called analysis of variance (ANOVA).
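The experiment-wise error figures quoted above can be reproduced directly; a minimal sketch:

```python
# Experiment-wise Type I error, 1 - (1 - alpha)^n, for n pairwise tests at alpha = 0.05
alpha = 0.05
for samples, n_tests in [(2, 1), (3, 3), (4, 6)]:
    ewe = 1 - (1 - alpha) ** n_tests
    print(f"{samples} samples ({n_tests} tests): experiment-wise error = {ewe:.3f}")
```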
The F statistic
• It is defined as the ratio of the mean sum of squares due to the variability between groups to the mean sum of squares due to the variability within groups:

F = variance between groups / variance within groups

• The critical value of F is read off from tables of the F-distribution, knowing the Type I error and the degrees of freedom between and within the groups.
F statistic assumptions
• The populations have normal distributions.
• The populations have the same variance (and hence the same standard deviation).
• The samples are simple random samples.
• The samples are independent of each other.
Definition of the F statistic
• Consider the weaning weights of some kids as given
in the table below.

Kid 1 2 3 4 5 6
Weight 8 10 10 8 8 10

• The mean is 54/6 = 9.
• The variance of the data is

[(−1)² + 1² + 1² + (−1)² + (−1)² + 1²] / 5 = 6/5 = 1.2
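The mean and sample variance above can be verified with Python's standard library:

```python
from statistics import mean, variance

# The six weaning weights from the table
weights = [8, 10, 10, 8, 8, 10]
print(mean(weights))      # 54/6 = 9
print(variance(weights))  # sample variance, divisor n - 1 = 5, giving 6/5 = 1.2
```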
Definition of the F statistic
• We may discover that the kids are of different sex as
in the table below

Kid 1 2 3 4 5 6
Sex F M M F F M
Weight 8 10 10 8 8 10

• From this, it can clearly be seen that the variability in the weights is due to sex.
Definition of the F statistic
• Apart from the sex, there is also another source of
variation.
• This is due to errors that are random.
• Note that the weights have been given to the nearest
whole number.
• Let us suppose that the actual weights, expressed to one decimal place, are as follows:

Kid 1 2 3 4 5 6
Sex F M M F F M
Weight 8.4 9.8 9.9 7.7 7.9 10.3
Definition of the F statistic
• There is some variation of each of the observation
from the mean of the group

Female   8.4   7.7   7.9
Male     9.8   9.9   10.3

Female   8 + 0.4    8 − 0.3    8 − 0.1
Male     10 − 0.2   10 − 0.1   10 + 0.3
• Remember that each observation also varies
from the overall mean.
• This can be shown in the table below:
Definition of the F statistic
Female   9 − 1 + 0.4   9 − 1 − 0.3   9 − 1 − 0.1
Male     9 + 1 − 0.2   9 + 1 − 0.1   9 + 1 + 0.3

• This shows that each weight is actually composed of 3 parts:
  • the (overall) population mean,
  • an amount associated with the mean of the group, and
  • the variation of the weight due to random error.
• The three parts of which the weight is actually
composed can be given in an equation as follows:

yij = μ + si + eij
• Where
• yij is the weight of individual j belonging to sex i
• μ is the (overall) population mean
• si is the mean deviation of sex i (i = male,
female) from the population mean
• eij is the deviation of the weight j from the
overall mean not attributable to sex i (also
called random error)
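The decomposition yij = μ + si + eij can be computed from the one-decimal weights given earlier; a sketch:

```python
# Split each weight into overall mean + sex effect + random error
groups = {"F": [8.4, 7.7, 7.9], "M": [9.8, 9.9, 10.3]}

all_w = [w for ws in groups.values() for w in ws]
mu = sum(all_w) / len(all_w)           # overall mean (9.0 for these data)

for sex, ws in groups.items():
    s_i = sum(ws) / len(ws) - mu       # sex effect: -1.0 for F, +1.0 for M
    for w in ws:
        e_ij = w - mu - s_i            # random error (deviation from group mean)
        print(f"{sex}: {w} = {mu:.1f} + {s_i:+.1f} + {e_ij:+.1f}")
```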
ANOVA Procedure
Step 1: Formulation of the hypotheses
• Null hypothesis

H0: μ1 = μ2 = μ3 = ⋯ = μi
where 𝑖 is the number of populations to
be compared
• Alternate hypothesis

HA: Not all means are equal
Step 2
• State the significance level, alpha.
• Remember, if this has not been
specified, use 0.05
Step 3: Calculate the F-statistic
• The F-statistic is given by

F = variance between groups / variance within groups
Step 4: Find the P-value
• The P-value for an ANOVA F-test is always one-sided.
• The P-value is Pr(F(df1, df2) ≥ F_calculated).
Step 5. Reject or fail to reject H0
based on the P-value

• If the P-value is less than or equal to α, reject H0.
• If the P-value is greater than α, fail to reject H0.
Step 6. State your conclusion.
• If H0 is rejected, “There is significant
statistical evidence that at least one of
the population means is different from
another.”
• If H0 is not rejected, “There is not
significant statistical evidence that at
least one of the population means is
different from another.”
Consider this
• The comparison of the effect of 5
different diets on the weights of rats,
with 4 rats randomly allocated to each
of the 5 diets
• Let us designate the number of diets a, the index of each diet i, the number of rats on each diet n, the index of each rat within a diet j, and the total number of observations N.
ANOVA summary of results
Source of       Degrees of     Sum of        Mean     F         p(F)
variation       freedom (df)   squares (SS)  square
Between diets   a − 1          SST           MST      MST/MSE
Within diets    N − a          SSE           MSE
Total           N − 1          TSS           TMS
ANOVA summary of results
Source of       df      Sum of squares (SS)                      Mean square         F
variation
Between diets   a − 1   SST = Σ ni(x̄i − x̿)²                      MST = SST/(a − 1)   MST/MSE
Within diets    N − a   SSE = Σi Σj (xij − x̄i)² = Σ(ni − 1)si²   MSE = SSE/(N − a)
Total           N − 1   TSS = Σi Σj (xij − x̿)²                   TMS = TSS/(N − 1)

(where x̄i is the mean of group i and x̿ is the overall mean)
Example: the rat diet data
The data
Rat
Diet 1 2 3 4
a 81.5 80.7 80.3 79.8
b 81.6 81.9 80.4 80.4
c 83.3 81.6 82.2 81.3
d 82.4 83.1 82.8 81.8
e 83.2 82.8 82.1 82.1
Diet   x      x̄      x̄ − x̿   (x̄ − x̿)²   x − x̄      (x − x̄)²    x − x̿     (x − x̿)²
a      81.5   80.6   -1.2    1.4161     0.9250    0.855625   -0.265    0.07
a      80.7   80.6   -1.2    1.4161     0.1250    0.015625   -1.065    1.13
a      80.3   80.6   -1.2    1.4161    -0.2750    0.075625   -1.465    2.15
a      79.8   80.6   -1.2    1.4161    -0.7750    0.600625   -1.965    3.86
b      81.6   81.1   -0.7    0.4761     0.5250    0.275625   -0.165    0.03
b      81.9   81.1   -0.7    0.4761     0.8250    0.680625    0.135    0.02
b      80.4   81.1   -0.7    0.4761    -0.6750    0.455625   -1.365    1.86
b      80.4   81.1   -0.7    0.4761    -0.6750    0.455625   -1.365    1.86
c      83.3   82.1    0.3    0.1122     1.2000    1.440000    1.535    2.36
c      81.6   82.1    0.3    0.1122    -0.5000    0.250000   -0.165    0.03
c      82.2   82.1    0.3    0.1122     0.1000    0.010000    0.435    0.19
c      81.3   82.1    0.3    0.1122    -0.8000    0.640000   -0.465    0.22
d      82.4   82.5    0.8    0.5776    -0.1250    0.015625    0.635    0.40
d      83.1   82.5    0.8    0.5776     0.5750    0.330625    1.335    1.78
d      82.8   82.5    0.8    0.5776     0.2750    0.075625    1.035    1.07
d      81.8   82.5    0.8    0.5776    -0.7250    0.525625    0.035    0.00
e      83.2   82.55   0.8    0.6162     0.6500    0.422500    1.435    2.06
e      82.8   82.55   0.8    0.6162     0.2500    0.062500    1.035    1.07
e      82.1   82.55   0.8    0.6162    -0.4500    0.202500    0.335    0.11
e      82.1   82.55   0.8    0.6162    -0.4500    0.202500    0.335    0.11

Overall mean (x̿) = 81.8              SST      SSE      TSS
Sum of squares                       12.793   7.5925   20.39
Mean sum of squares                  3.1982   0.5062
F-value = 3.1982/0.5062 = 6.3186

(The x̄ and x̄ − x̿ columns are rounded for display; the squared columns were computed from the unrounded means, e.g. (80.575 − 81.765)² = 1.4161.)
Alternatively,
diet      n    mean     standard    variance   sum of squares   sum of squares
                        deviation              between          within
a         4    80.6     0.718       0.516      5.664            1.548
b         4    81.1     0.789       0.622      1.904            1.867
c         4    82.1     0.883       0.780      0.449            2.340
d         4    82.5     0.562       0.316      2.310            0.947
e         4    82.55    0.545       0.297      2.465            0.890
Overall   20   81.765   1.036       1.073      12.793           7.593

Mean square: 3.198 (between), 0.506 (within)
F calculated = 6.319   (df between = 4, df within = 15)
F critical = 3.06      P = 0.00346
Decision: We reject the null hypothesis.
Conclusion: At least two of the diet means are different.
The information in the excel sheet
can be summarized as follows
Source of variation   Degrees of     Sum of         Mean square   F      p(F)
                      freedom (df)   squares (SS)
Between diets         4              12.7930        3.1982        6.32   0.003
Within diets          15             7.5925         0.5062
Total                 19             20.3855

• The variation between the diets is called the treatment effect.
• The variation within the diets is called the residual variation, or the variation of the rats not accounted for by the diets.
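The rat-diet results can be checked with the same from-scratch computation used for the definition of F (a sketch; software such as scipy.stats.f_oneway would report the same F and its p-value):

```python
# Reproduce the rat-diet ANOVA sums of squares and F from the raw data
diets = {"a": [81.5, 80.7, 80.3, 79.8],
         "b": [81.6, 81.9, 80.4, 80.4],
         "c": [83.3, 81.6, 82.2, 81.3],
         "d": [82.4, 83.1, 82.8, 81.8],
         "e": [83.2, 82.8, 82.1, 82.1]}

a = len(diets)
N = sum(len(g) for g in diets.values())
grand = sum(sum(g) for g in diets.values()) / N

# Between-diets and within-diets sums of squares
sst = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in diets.values())
sse = sum((x - sum(g) / len(g)) ** 2 for g in diets.values() for x in g)

mst, mse = sst / (a - 1), sse / (N - a)
print(f"SST = {sst:.4f}, SSE = {sse:.4f}, F = {mst / mse:.2f}")
```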
In practice, we use the following formulae
C = (Σi Σj xij)² / N
    The correction term C is the squared sum of all observations divided by their number. It comes from the (Σx)²/N term in the numerator of the variance formula.

TSS = Σi Σj xij² − C
    The total sum of squares, which includes all sources of variation. This is the total SS.

SST = Σi [(Σj xij)² / ni] − C
    The sum of squares attributable to the variable of classification. This is the between SS, or among-groups SS, or treatment SS.

SSE = TSS − SST
    The sum of squares among individuals treated alike. This is the within-groups SS, or residual SS, or error SS. It is easier to calculate as a difference.
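The shortcut formulae can be applied to the rat-diet data to confirm they give the same sums of squares as the direct method; a sketch:

```python
# Shortcut computation of C, TSS, SST and SSE for the rat-diet data
diets = [[81.5, 80.7, 80.3, 79.8],
         [81.6, 81.9, 80.4, 80.4],
         [83.3, 81.6, 82.2, 81.3],
         [82.4, 83.1, 82.8, 81.8],
         [83.2, 82.8, 82.1, 82.1]]

N = sum(len(g) for g in diets)
C = sum(x for g in diets for x in g) ** 2 / N       # correction term
TSS = sum(x ** 2 for g in diets for x in g) - C     # total SS
SST = sum(sum(g) ** 2 / len(g) for g in diets) - C  # between (treatment) SS
SSE = TSS - SST                                     # within (error) SS, by difference
print(f"TSS = {TSS:.4f}, SST = {SST:.4f}, SSE = {SSE:.4f}")
```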
Separation of means: post-hoc
tests and contrasts
• After determining that at least two of the means
are different, the job of the statistician is to
determine which.
• Most of the available tools are based on t-tests designed to keep α constant even across many pairwise comparisons.
• However, there are tools for making predesigned
comparisons in what is called contrast analysis.
• You will learn these things in Design and Analysis
of Experiments or similar courses.
Practice exercise
• Three different cultures of clover (a legume) were
inoculated with strains of the nitrogen-fixing bacteria
from another legume, alfalfa. As a sort of control, a
fourth trial was run in which a composite of the three
clover cultures was inoculated. Each of the trials was
repeated 5 times (we say that it was replicated 5
times). The table shows the mean number of nodules
per plant in each of the cultures. Are there any
differences between the cultures?

3DOK1   3DOK5   3DOK4   Composite
19.4    17.7    17.0    17.3
32.6    24.8    19.4    19.4
27.0    27.9     9.1    19.1
32.1    25.2    11.9    16.9
33.0    24.3    15.8    20.8
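A sketch you can use to check your hand calculations for this exercise (it reuses the same from-scratch F computation as the rat-diet example):

```python
# One-way ANOVA F-statistic for the clover-inoculation exercise
cultures = {"3DOK1":     [19.4, 32.6, 27.0, 32.1, 33.0],
            "3DOK5":     [17.7, 24.8, 27.9, 25.2, 24.3],
            "3DOK4":     [17.0, 19.4, 9.1, 11.9, 15.8],
            "Composite": [17.3, 19.4, 19.1, 16.9, 20.8]}

a = len(cultures)
N = sum(len(g) for g in cultures.values())
grand = sum(sum(g) for g in cultures.values()) / N

sst = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in cultures.values())
sse = sum((x - sum(g) / len(g)) ** 2 for g in cultures.values() for x in g)

f_stat = (sst / (a - 1)) / (sse / (N - a))
print(f"F = {f_stat:.2f} with df ({a - 1}, {N - a})")  # compare against F tables
```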
End of presentation
