
ANALYSIS OF VARIANCE

“Analysis of variance is not a mathematical theorem, but rather a convenient method of arranging the arithmetic.”
– Ronald Fisher
Introduction to ANOVA
• Analysis of Variance (ANOVA) is a hypothesis testing technique used for comparing the mean values of more than two groups simultaneously.

• When we have to compare the impact of a factor on the mean response across more than two groups (created by different levels of the factor) simultaneously, hypothesis tests such as the two-sample t-tests discussed in Chapter 6 are not an ideal approach, since they can result in incorrect Type I and Type II errors. We use Analysis of Variance (ANOVA) to understand the differences in population means among more than two populations.
INTRODUCTION TO ANOVA

Means Model: It is given by

Yij = μ + εij

where Yij is the value of the outcome variable for the jth observation at the ith factor level, μ is the overall mean value of all observations, and εij is the error, assumed to follow a normal distribution with mean 0 and standard deviation σ.

The model defined in the above equation is often called the reduced model, in which the mean μ is common for all levels of the factor.
Factor Effect Model

Factor Effect Model: It is given by

Yij = μ + τi + εij

Here μ is the overall mean and τi is the effect of factor level i (the factor effect). τi is the difference between the mean of factor level i and the overall mean. Our interest in this case is to check whether the values of τi are different from zero.

The model in the above equation is called the full model.
Multiple t-Tests for Comparing Several Means

When we have more than two levels of discounts, one option is to compare the population parameters two at a time (two discount values at a time).
For example, we can compare the following three cases using two-sample t-tests:
• Test between 0% and 10%
• Test between 0% and 20%
• Test between 10% and 20%

However, when we want to test the hypothesis simultaneously, the Type I and Type II errors will not be the same if we conduct the three different tests listed above. For example, assume that the mean sale (population mean) at 0%, 10%, and 20% discount is μ0, μ10, and μ20, respectively.
Consider the three two-sample t-tests shown in the table below.

Test    Null Hypothesis     Alternative Hypothesis    Significance (α)
A       H0: μ0 = μ10        HA: μ0 ≠ μ10              α = 0.05
B       H0: μ0 = μ20        HA: μ0 ≠ μ20              α = 0.05
C       H0: μ10 = μ20       HA: μ10 ≠ μ20             α = 0.05

Let,
P(A) = P(Retain H0 in test A | H0 in test A is true)
P(B) = P(Retain H0 in test B | H0 in test B is true)
P(C) = P(Retain H0 in test C | H0 in test C is true)
Note that P(A) = P(B) = P(C) = 1 − α = 1 − 0.05 = 0.95
The conditional probability of simultaneously retaining all 3 null hypotheses
when they are true is P(A ∩ B ∩ C) = 0.95³ = 0.8573.
Multiple t-Tests for Comparing Several Means

Now consider the following null hypothesis:

H0: μ0 = μ10 = μ20     (7.3)

If we retain the null hypothesis based on the three individual t-tests, then the significance (Type I error) is not α but much higher than α (Lunney, 1969; Siegel, 1990).

For the case discussed above, if we retain the null hypothesis based on the 3 individual tests, then the Type I error is 1 − 0.8573 ≈ 0.1426. That is, when more than 2 groups are involved, checking the population parameter values simultaneously using t-tests is inappropriate since the Type I and Type II errors will be estimated incorrectly. For this reason, we use Analysis of Variance (ANOVA) whenever we need to compare 3 or more groups for population parameter values simultaneously.
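The arithmetic behind the 0.8573 and 0.1426 figures can be checked in a couple of lines; the sketch below (plain Python, no libraries) assumes the three t-tests behave as independent tests at α = 0.05, as in the discussion above.

# Familywise Type I error for three simultaneous t-tests, assuming independence
alpha = 0.05
n_tests = 3

# Probability of correctly retaining all three true null hypotheses
p_retain_all = (1 - alpha) ** n_tests        # 0.95 ** 3, approximately 0.8574

# Probability of at least one false rejection (familywise Type I error)
familywise_error = 1 - p_retain_all          # approximately 0.1426

print(f"P(A and B and C) = {p_retain_all:.4f}")
print(f"Familywise Type I error = {familywise_error:.4f}")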
One-Way Analysis of Variance (ANOVA)
One-way ANOVA is appropriate under the following conditions:

1. We would like to study the impact of a single treatment (also known as a factor) at different levels (thus forming different groups) on a continuous response variable (or outcome variable). In the example discussed above, the variable ‘price discount’ is the treatment (or factor) and the 0%, 10%, and 20% price discounts are the different levels (3 levels in this case). Different levels of discount are likely to have a varying impact on the sales of the product, where sales is the outcome variable. We would like to understand the impact of different levels of price discount on the response variable, sales. The term ‘treatment’ is used since one of the initial applications of ANOVA was to find the impact of different fertilizer treatments on agricultural yield, as studied by the British statistician R. A. Fisher (1934).
One-Way Analysis of Variance (ANOVA)
2. In each group, the population response variable follows a normal distribution and the sample subjects are chosen using random sampling.
3. The population variances for the different groups are assumed to be the same. That is, the variability in the response variable values within different groups is the same.

Although conditions 2 and 3 are necessary for one-way ANOVA, the model is robust and minor violations of the assumptions may not result in an incorrect decision about the null hypothesis.
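As an aside not in the original slides, conditions 2 and 3 can be checked informally with standard tests. The sketch below (assuming scipy is installed) applies the Shapiro-Wilk normality test and Levene's equal-variance test to the first ten observations of each discount group from the ENZO example that appears later in this deck.

from scipy import stats

# First ten observations of each discount group from the ENZO sales table
group_0 = [39, 32, 25, 25, 37, 28, 26, 26, 40, 29]     # 0% discount
group_10 = [34, 41, 45, 39, 38, 33, 35, 41, 47, 34]    # 10% discount
group_20 = [42, 43, 44, 46, 41, 52, 43, 42, 50, 41]    # 20% discount

# Condition 2: approximate normality within each group (Shapiro-Wilk test)
for name, sample in [("0%", group_0), ("10%", group_10), ("20%", group_20)]:
    stat, p = stats.shapiro(sample)
    print(f"Group {name}: Shapiro-Wilk p-value = {p:.3f}")

# Condition 3: equal variances across groups (Levene's test)
stat, p = stats.levene(group_0, group_10, group_20)
print(f"Levene's test p-value = {p:.3f}")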
Setting up an Analysis of Variance
Assume that we would like to study the impact of a factor (such as discount) with k levels on a continuous variable (such as sales quantity). Then the null and alternative hypotheses for one-way ANOVA are given by

• H0: μ1 = μ2 = μ3 = … = μk
• HA: Not all μ values are equal

Note that the alternative hypothesis, ‘not all μ values are equal’, implies that some of them could still be equal. The null hypothesis is equivalent to stating that the factor effects τ1, τ2, …, τk defined in Yij = μ + τi + εij are all zero.
Comparing three means (μ1, μ2, and μ3):

If the mean values of the different groups are not equal, then the variation of cases within each group will be much smaller compared to the variation between groups.
One-Way Analysis of Variance (ANOVA)
We are interested in analyzing a single factor effect with k levels; thus we will have k groups.
Let
k = Number of groups (or samples)
ni = Number of observations in group i (i = 1, 2, …, k)
n = Total number of observations = Σ_{i=1}^{k} ni
Yij = Observation j in group i
μi = Mean of group i = (1/ni) Σ_{j=1}^{ni} Yij
μ = Overall mean = (1/n) Σ_{i=1}^{k} Σ_{j=1}^{ni} Yij
Total Variation
To arrive at the test statistic, we calculate the following measures of variation within groups and between groups:
• Sum of Squares of Total Variation (SST): Total variation is the sum of squared deviations of all values of the response variable (Yij) from the overall mean (μ) and is given by

SST = Σ_{i=1}^{k} Σ_{j=1}^{ni} (Yij − μ)²

The degrees of freedom for SST is (n − 1), since only the value of μ is estimated from the n observations and thus only one degree of freedom is lost. The Mean Square Total (MST) variation is given by

MST = SST / (n − 1)
Variation Between Groups
• Sum of Squares of Between-Group Variation (SSB): The sum of squares of between-group variation is the sum of squared deviations of the group means (μi) from the overall mean (μ) of the data and is given by

SSB = Σ_{i=1}^{k} ni (μi − μ)²

The degrees of freedom is (k − 1). Since the overall mean μ is estimated from the data, one degree of freedom is lost. The mean square of between-group variation (MSB) is given by

MSB = SSB / (k − 1)
Variation Within Groups
• Sum of Squares of Within-Group Variation (SSW): The sum of squares of within-group variation is the sum of squared deviations of all observations (Yij) from their group mean (μi) and is given by

SSW = Σ_{i=1}^{k} Σ_{j=1}^{ni} (Yij − μi)²

The degrees of freedom for SSW is (n − k). Here k degrees of freedom are lost since we estimate the k group means (μi). The mean square of within-group variation is

MSW = SSW / (n − k)
Algebraically
We can prove that

Σ_{i=1}^{k} Σ_{j=1}^{ni} (Yij − μ)² = Σ_{i=1}^{k} ni (μi − μ)² + Σ_{i=1}^{k} Σ_{j=1}^{ni} (Yij − μi)²
That is
SST = SSB + SSW
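A quick numerical check of this identity; the three small groups below are illustrative values (not from the text) and NumPy is assumed to be available.

import numpy as np

# Three illustrative groups of observations
groups = [np.array([20.0, 16, 24, 20, 19]),
          np.array([28.0, 23, 25, 31, 25]),
          np.array([32.0, 29, 28, 27, 30])]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

sst = ((all_obs - grand_mean) ** 2).sum()                          # total variation
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between groups
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)             # within groups

print(f"SST = {sst:.4f}")
print(f"SSB + SSW = {ssb + ssw:.4f}")   # equals SST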
Cochran’s Theorem
According to Cochran’s theorem (Kutner et al., 2013, page 70):
‘If Y1, Y2, …, Yn are drawn from a normal distribution with mean μ and standard deviation σ, and the sum of squares of total variation [Eq. (7.4)] is decomposed into k sums of squares (SSr) with degrees of freedom dfr, then the ratios (SSr/σ²) are independent χ² variables with dfr degrees of freedom if

Σ_{r=1}^{k} dfr = n − 1’

Note that in SST = SSB + SSW the SST is decomposed into two sums of squares (SSB and SSW), and thus SSB/σ² and SSW/σ² are chi-square variables.
The F-test
• If the null hypothesis is true, then there will be no difference in the mean values, which will result in no difference between MSB and MSW.
• Alternatively, if the means are different, then MSB will be larger than MSW.
• That is, the ratio MSB/MSW will be close to 1 if there is no difference between the mean values and will be larger than 1 if the means are different.
The F-test
Following Cochran’s theorem (Kirk, 1995), MSB/MSW is a ratio of two independent chi-square variates (each divided by its degrees of freedom), which follows an F-distribution. Thus the statistic for testing the null hypothesis is

F = [SSB / (k − 1)] / [SSW / (n − k)] = MSB / MSW

Note that the test is a one-tailed (right-tailed) test since we are interested in finding whether the variation between groups is greater than the variation within the groups.
ANOVA Example
Ms Rachael Khanna, the brand manager of ENZO detergent powder at the ‘one stop’ retail chain, was interested in understanding whether price discounts have any impact on the sales quantity of ENZO. To test whether the price discounts had any impact, price discounts of 0% (no discount), 10%, and 20% were given on randomly selected days. The quantity (in kilograms) of ENZO sold in a day under the different discount levels is shown in the table on the next slide. Conduct a one-way ANOVA to check whether discount had any significant impact on the sales quantity at α = 0.05.
Sales of ENZO at different price discounts
No Discount (0% discount)

39 32 25 25 37 28 26 26 40 29

37 34 28 36 38 38 34 31 39 36

34 25 33 26 33 26 26 27 32 40

10% Discount

34 41 45 39 38 33 35 41 47 34

47 44 46 38 42 33 37 45 38 44

38 35 34 34 37 39 34 34 36 41

20% Discount

42 43 44 46 41 52 43 42 50 41

41 47 55 55 47 48 41 42 45 48
40 50 52 43 47 55 49 46 55 42
Solution
In this case, the number of groups k = 3; n1 = n2 = n3 = 30; μ1 = 32, μ2 = 38.77, μ3 = 46.4; and μ = 39.05.
The sum of squares of between-group variation (SSB) is given by

SSB = Σ_{i=1}^{k} ni (μi − μ)² = 30 × [(32 − 39.05)² + (38.77 − 39.05)² + (46.4 − 39.05)²] ≈ 3114.156

So

MSB = SSB / (k − 1) = 3114.156 / 2 = 1557.078
Solution Continued…
The sum of squares of within-group variation is given by

SSW = Σ_{i=1}^{k} Σ_{j=1}^{ni} (Yij − μi)² = Σ_{j=1}^{30} (Y1j − 32)² + Σ_{j=1}^{30} (Y2j − 38.77)² + Σ_{j=1}^{30} (Y3j − 46.4)² = 2056.567

MSW = SSW / (n − k) = 2056.567 / (90 − 3) = 23.6387

The F-statistic value is

F(2, 87) = MSB / MSW = 1557.078 / 23.6387 = 65.86
Solution Continued
The critical F-value with degrees of freedom (2, 87) for α = 0.05 is 3.101 [Excel function FINV(0.05, 2, 87) or F.INV.RT(0.05, 2, 87)].

The p-value for F(2, 87) = 65.86 is 3.82 × 10⁻¹⁸ [using the Excel function FDIST(65.86, 2, 87) or F.DIST.RT(65.86, 2, 87)].

Since the calculated F-statistic is much higher than the critical F-value, we reject the null hypothesis and conclude that the mean sales quantities under the different discounts are different.
ANOVA Excel output

Anova: Single Factor


SUMMARY
Groups          Count    Sum     Average     Variance
No Discount     30       960     32          27.17241
10% Discount    30       1163    38.76667    20.46092
20% Discount    30       1392    46.4        23.28276

ANOVA
Source of Variation    SS            df    MS          F           P-value     F crit
Between Groups         3114.15556    2     1557.078    65.86986    3.82E-18    3.101296
Within Groups          2056.56667    87    23.6387
Total                  5170.72222    89
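The Excel output above can be cross-checked outside Excel; the sketch below (assuming scipy is installed) runs scipy's one-way ANOVA on the three discount groups transcribed from the sales table and should return approximately F = 65.87 and p = 3.8 × 10⁻¹⁸.

from scipy import stats

no_discount = [39, 32, 25, 25, 37, 28, 26, 26, 40, 29,
               37, 34, 28, 36, 38, 38, 34, 31, 39, 36,
               34, 25, 33, 26, 33, 26, 26, 27, 32, 40]
discount_10 = [34, 41, 45, 39, 38, 33, 35, 41, 47, 34,
               47, 44, 46, 38, 42, 33, 37, 45, 38, 44,
               38, 35, 34, 34, 37, 39, 34, 34, 36, 41]
discount_20 = [42, 43, 44, 46, 41, 52, 43, 42, 50, 41,
               41, 47, 55, 55, 47, 48, 41, 42, 45, 48,
               40, 50, 52, 43, 47, 55, 49, 46, 55, 42]

# One-way ANOVA across the three discount levels
f_stat, p_value = stats.f_oneway(no_discount, discount_10, discount_20)
print(f"F = {f_stat:.2f}, p-value = {p_value:.2e}")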
Example
Share Raja Khan (SRK) is a top stockbroker and believes that the average annual stock return depends on the industrial sector. To validate his belief, SRK collected the annual returns of shares from three different industrial sectors: consumer goods, services, and industrial goods. The annual returns of shares in 2015-2016 for the different sectors are shown in the table on the next slide.
Annual return of stocks under different industrial sector
Annual return on 30 consumer goods stocks
6.32% 14.73% 11.95% 12.36% 10.28% 3.81% 10.15% 11.06% 6.29% 5.15%

8.44% 14.28% 8.89% 5.98% 6.96% 11.62% 5.22% 5.34% 5.93% 7.10%

10.91% 8.20% 10.19% 9.04% 8.61% 9.39% 2.63% 2.77% 4.76% 9.60%

Annual return on 30 services stocks


13.70% 3.58% 1.36% 17.41% 10.01% 10.88% 15.63% -0.04% 10.32% 7.40%

11.48% 9.71% 11.19% 8.21% 1.64% 1.45% 10.12% 13.85% -10.27% 5.26%

12.05% 4.47% 8.71% 5.59% 10.02% 7.65% 10.03% 7.87% 6.59% 13.60%

Annual return on 30 industrial goods stocks

6.74% 7.11% 5.69% 2.48% 5.42% 8.00% 2.55% 8.34% 4.99% 3.39%

8.73% 13.85% 5.29% 9.06% 2.84% 5.82% 7.66% 4.12% 9.10% 8.76%

10.77% 1.48% 4.71% 10.66% 0.44% 2.94% 6.55% 2.84% 3.90% 7.28%
Solution
In this case, the number of groups k = 3; n1 = n2 = n3 = 30; μ1 = 0.082, μ2 = 0.079, μ3 = 0.0605; and μ = 0.0743.
The sum of squares of between-group variation (SSB) is given by

SSB = Σ_{i=1}^{k} ni (μi − μ)² = 30 × [(0.082 − 0.0743)² + (0.079 − 0.0743)² + (0.0605 − 0.0743)²] ≈ 0.0087

Therefore

MSB = SSB / (k − 1) = 0.0087 / 2 = 0.0043
Solution Continued…
The sum of squares of within-group variation is given by

SSW = Σ_{i=1}^{k} Σ_{j=1}^{ni} (Yij − μi)² = Σ_{j=1}^{30} (Y1j − 0.082)² + Σ_{j=1}^{30} (Y2j − 0.079)² + Σ_{j=1}^{30} (Y3j − 0.0605)² = 0.1463

So

MSW = SSW / (n − k) = 0.1463 / (90 − 3) ≈ 0.0017

The F-statistic value is

F(2, 87) = MSB / MSW ≈ 2.592 (computed with the unrounded values of MSB and MSW)
The critical F-value with degrees of freedom (2, 87) for α = 0.05 is 3.101 [Excel function FINV(0.05, 2, 87) or F.INV.RT(0.05, 2, 87)].
The p-value for F(2, 87) = 2.592 is 0.0805 [using the Excel function FDIST(2.592, 2, 87) or F.DIST.RT(2.592, 2, 87)].
Since the calculated F-statistic is less than the critical F-value, we retain the null hypothesis and conclude that the average annual returns for the consumer goods, services, and industrial goods sectors are not significantly different.
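For readers working outside Excel, the FINV and FDIST calls used above can be mirrored with scipy's F distribution; a minimal sketch (assuming scipy is installed):

from scipy.stats import f

df1, df2 = 2, 87

f_critical = f.isf(0.05, df1, df2)   # equivalent of FINV(0.05, 2, 87), approx. 3.101
p_value = f.sf(2.592, df1, df2)      # equivalent of FDIST(2.592, 2, 87), approx. 0.08

print(f"Critical F(2, 87) at alpha = 0.05: {f_critical:.3f}")
print(f"p-value for F = 2.592: {p_value:.4f}")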
F-distribution with critical value
ANOVA Table
Microsoft Excel ANOVA Table for Example

ANOVA: Single Factor


SUMMARY
Groups Count Sum Average Variance

Consumer Goods 30 2.4796 0.082653 0.00101

Services 30 2.3947 0.079823 0.003073

Industrial Goods 30 1.8151 0.060503 0.000963


ANOVA

Source of Variation SS df MS F P-value F critical

Between Groups 0.008722 2 0.004361 2.59294 0.080572 3.101296

Within Groups 0.146317 87 0.001682

Total 0.155039 89
Two-Way Analysis of Variance (ANOVA)

• The values of the response variable may be influenced by several factors. For example, in addition to price discounts, the location of the stores may also play an important role in the sales quantity.

• The discounts may not have much impact if the store is located near an affluent community, compared to stores located near a non-affluent community.
Two-Way Analysis of Variance (ANOVA)
We would like to understand the impact of both factors (price discount and location) simultaneously on sales by trying to answer the following questions:

• Are there differences in the average sales quantity with different levels of price discounts?

• Are there differences in the average sales quantity with respect to different locations?

• Are there interactions between price discounts and locations with respect to the average sales quantity?
Two-Way ANOVA Model
The two-way ANOVA model can be expressed as

Yijk = μ + αi + βj + (αβ)ij + εijk

where
Yijk = Value of the kth observation (k = 1, 2, …, K) of the response variable at level i (i = 1, 2, …, a) of factor A and level j (j = 1, 2, …, b) of factor B
μ = Overall mean value of the response variable Yijk
αi = Effect of level i of factor A (i = 1, 2, …, a)
βj = Effect of level j of factor B (j = 1, 2, …, b)
(αβ)ij = Interaction effect of the ith level of factor A and the jth level of factor B
εijk = Error associated with the kth observation at level i of factor A and level j of factor B
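To make the roles of μ, αi, βj, and the interaction term concrete, here is a minimal simulation sketch (not from the text); the effect sizes, error standard deviation, and number of replicates are all hypothetical, and NumPy is assumed to be available.

import numpy as np

rng = np.random.default_rng(0)

mu = 25.0                                   # overall mean
alpha = np.array([-3.0, 0.0, 3.0])          # factor A effects, a = 3 levels
beta = np.array([-1.0, 1.0])                # factor B effects, b = 2 levels
interaction = np.array([[ 0.5, -0.5],       # (alpha*beta)_ij interaction effects
                        [ 0.0,  0.0],
                        [-0.5,  0.5]])
sigma, K = 2.0, 20                          # error std. deviation, replicates per cell

# Y_ijk = mu + alpha_i + beta_j + (alpha*beta)_ij + eps_ijk
Y = (mu
     + alpha[:, None, None]
     + beta[None, :, None]
     + interaction[:, :, None]
     + rng.normal(0.0, sigma, size=(3, 2, K)))

print(Y.shape)          # (a, b, K) = (3, 2, 20)
print(Y.mean(axis=2))   # cell means approximate mu + alpha_i + beta_j + interaction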
ANOVA Hypothesis Tests
The hypothesis tests associated with two-way ANOVA are as follows:
• Test of Factor A Main Effects:
H0: αi = 0 for all i (i = 1, 2, …, a)
HA: Not all αi are zero

• Test of Factor B Main Effects:
H0: βj = 0 for all j (j = 1, 2, …, b)
HA: Not all βj are zero

• Test of Interaction Effects:
H0: (αβ)ij = 0 for all i (i = 1, 2, …, a) and j (j = 1, 2, …, b)
HA: Not all (αβ)ij are zero
Sum of Squares
The sum of squares in the case of two-way ANOVA with equal sample sizes is given by

SST = SSA + SSB + SSAB + SSW

The various components in the above equation are given below:

• Sum of squares of total deviation (SST):

SST = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{c} (Yijk − μ)²

where c is the number of observations in each group and μ is the overall mean.

• Sum of squares of deviation due to factor A (SSA):

SSA = b × c × Σ_{i=1}^{a} (μi − μ)²

where μi is the mean of all observations at level i of factor A and c is the number of observations in each group (assumed to be the same for all groups).

• Sum of squares of deviation due to factor B (SSB):

SSB = a × c × Σ_{j=1}^{b} (μj − μ)²

Here μj is the mean of all observations at level j of factor B.

• Sum of squares of deviation due to the interaction of factors A and B (SSAB):

SSAB = c × Σ_{i=1}^{a} Σ_{j=1}^{b} (μij − μi − μj + μ)²

where μij is the mean of the observations at the ith level of factor A and the jth level of factor B.

• Sum of squares of deviation within groups (SSW):

SSW = Σ_{i=1}^{a} Σ_{j=1}^{b} Σ_{k=1}^{c} (Yijk − μij)²
Sum of squares of deviation for various effects and the corresponding
F-statistic in a two-way ANOVA with equal sample size

Sum of Squared Variation    Degrees of Freedom    Mean Squared Variation           F-Statistic
SSA                         a − 1                 MSA = SSA/(a − 1)                F = MSA/MSW
SSB                         b − 1                 MSB = SSB/(b − 1)                F = MSB/MSW
SSAB                        (a − 1)(b − 1)        MSAB = SSAB/[(a − 1)(b − 1)]     F = MSAB/MSW
SSW                         ab(c − 1)             MSW = SSW/[ab(c − 1)]
Example

The table on the next slide shows the sales quantity of detergents at different discount values and different locations, collected over 20 days. Conduct a two-way ANOVA at α = 0.05 to test the effects of discount and location on sales.
              Location 1                Location 2
Discount:   0%    10%    20%          0%    10%    20%
20 28 32 20 19 20
16 23 29 21 27 31
24 25 28 23 23 35
20 31 27 19 30 25
19 25 30 25 25 31
10 24 26 22 21 31
24 28 37 25 33 31
16 23 33 21 26 23
25 26 27 26 22 22
16 25 31 22 28 32
18 22 37 25 24 22
20 24 28 23 23 29
17 26 25 23 26 25
26 28 23 24 16 34
16 21 26 20 30 30
21 27 33 23 22 25
24 25 28 18 16 39
19 20 30 19 25 32
19 26 30 19 34 29
21 26 26 30 23 22
Solution
The two-way ANOVA with replication (since the data in Table 7.7 are repeated for the two locations) output from Microsoft Excel is shown below.

ANOVA
Source of Variation     SS          df     MS          F           P-value     F crit
Sample (Location)       7.008333    1      7.008333    0.443898    0.506593    3.924330
Columns (Discount)      1240.317    2      620.1583    39.27997    1.06E-13    3.075853
Interaction             84.81667    2      42.40833    2.686085    0.072460    3.075853
Within                  1799.85     114    15.78816
Total                   3131.992    119


In the table
• ‘Sample’ stands for the row factor (which in this case is location), ‘Columns’ stands for the column factor (discount in this case), and ‘Interaction’ stands for the interaction effect (location × discount).
• The p-value for location (data in rows) is 0.5065; thus it is not statistically significant (we retain the null hypothesis that location has no influence on sales), whereas for discount rates (data in columns) the p-value is 1.06 × 10⁻¹³, so we reject the null hypothesis (that is, the discount rate has an influence on sales).
• The p-value for the interaction effect is 0.0724 and is not significant. That is, only the factor discount is statistically significant at α = 0.05.
Summary
• Analysis of Variance (ANOVA) is a hypothesis testing procedure used for comparing means from several groups simultaneously.
• In a one-way ANOVA, we test whether the mean values of an outcome variable for different levels of a factor are different. Using multiple two-sample t-tests to simultaneously test group means results in incorrect estimation of the Type I error; ANOVA overcomes this problem.
• ANOVA plays an important role in multiple linear regression model diagnostics. The overall significance of the model is tested using ANOVA.
• In a two-way ANOVA, we check the impact of more than one factor simultaneously on several groups.
