Notes on Multicollinearity and
Heteroskedasticity
We begin with the basic multiple linear
regression (MLR) model and its assumptions.
1. The Population Model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + u$

where:

$y$ : Dependent variable; observable random variable (r.v.)

$x_i,\ i = 1, \ldots, k$ : Explanatory / independent variables; observable r.v.

$u$ : Disturbance / error term; unobservable r.v.

$\beta_0, \beta_i,\ i = 1, \ldots, k$ : Unobservable parameters / constants
2. The Assumptions
2.1 The model is linear in parameters
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + u$
2.2 Random Sampling
$\{(x_{1i}, x_{2i}, \ldots, x_{ki},\ y_i):\ i = 1, \ldots, n\}$ is a random sample from the population, such that

$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots + \beta_k x_{ki} + u_i, \quad i = 1, 2, \ldots, n$

Note: From the definition of a random sample it follows that the $u_i$ are independently and identically distributed (iid), so that the correlation between $u_i$ and $u_j$ is zero for any $i \neq j$.
2.3 There is no Perfect Collinearity
This means: (i) none of the $x_i$ is a constant; and (ii) there is no exact linear relation among the $x_i$ (no explanatory variable can be written as an exact linear combination of the others).
2.4 Zero Conditional Mean

$E(u|x_1, x_2, \ldots, x_k) = 0$

and for the random sample

$E(u_i|x_{1i}, x_{2i}, \ldots, x_{ki}) = 0$

The zero conditional mean assumption implies:

(i) $E(y|x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$

and $E(y_i|x_{1i}, \ldots, x_{ki}) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots + \beta_k x_{ki}$

(ii) That u and the x's are uncorrelated.

In fact $E(u|x_1, x_2, \ldots, x_k) = E(u)$, and every function (and not just linear functions) of the x's is uncorrelated with u.
2.5 Homoskedasticity
This is the assumption that the disturbance variance is constant:

$Var(u|x_1, x_2, \ldots, x_k) = \sigma^2$

and

$Var(u_i|x_{1i}, x_{2i}, \ldots, x_{ki}) = \sigma^2$
2.6 The classical assumption regarding the distribution of the error term is that the $u_i$ are independent of the $x_i$ and normally distributed, i.e.,

$u_i \sim N(0, \sigma^2)$

and

$y_i \sim N(\beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki},\ \sigma^2)$

We use the MLR model to estimate the unknown parameters $\beta_i$, using the estimators $\hat{\beta}_i$.
3. Some Basic Concepts
We will use the 2-variable model to discuss some basic concepts related to estimators: their properties, the link between the OLS assumptions and these properties, and the variances of the estimators.
3.1 The OLS Estimators
The 2-variable simple linear model is

$y = \alpha + \beta x + u$

or $E(y|x) = \alpha + \beta x$

A random sample of observations on x and y is used to derive the ordinary least squares (OLS) estimators for $\alpha$ and $\beta$.

In the 2-variable model, the OLS estimators of the unknown parameters $\alpha$ and $\beta$ are:

$\hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$
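As a minimal sketch (not part of the original notes), the following Python snippet computes $\hat{\alpha}$ and $\hat{\beta}$ directly from these formulas on simulated data; the variable names and the true parameter values (2.0 and 0.5) are illustrative assumptions.

```python
# Minimal sketch: OLS estimates in the 2-variable model from the closed-form
# formulas above. Simulated data; true alpha = 2.0, beta = 0.5 are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
u = rng.normal(size=n)                     # homoskedastic errors with variance 1
y = 2.0 + 0.5 * x + u

beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
print(alpha_hat, beta_hat)                 # should be close to 2.0 and 0.5
```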
3.2 Statistical Properties of the OLS estimators
As we saw in our Statistics classes, estimators are random variables, so $\hat{\alpha}$ and $\hat{\beta}$ are random variables. Since the errors are normally distributed, so is y, and so are $\hat{\alpha}$ and $\hat{\beta}$ (being linear functions of y).

According to the Gauss-Markov theorem the OLS estimators (i.e., $\hat{\alpha}$ and $\hat{\beta}$) are Best, Linear, Unbiased estimators (B.L.U.E.) of $\alpha$ and $\beta$.

i.e., $\hat{\beta} \sim N(\beta,\ Var(\hat{\beta}))$; since $\hat{\beta}$ is unbiased, $E(\hat{\beta}) = \beta$

Similarly $\hat{\alpha} \sim N(\alpha,\ Var(\hat{\alpha}))$ and $E(\hat{\alpha}) = \alpha$

$\hat{\alpha}$ and $\hat{\beta}$ are 'best', i.e., in the class of all linear and unbiased estimators of $\alpha$ and $\beta$, these have the least variance, or are the most 'efficient'.
3.3 The OLS Residuals
The method of least squares minimizes the sum of squared residuals:

$\sum_i \hat{u}_i^2 = \sum_i (y_i - \hat{\alpha} - \hat{\beta} x_i)^2$

The $\hat{u}_i$ are the least squares residuals, which are observable and:

$\hat{u}_i = y_i - \hat{y}_i$

or $\hat{u}_i = y_i - \hat{\alpha} - \hat{\beta} x_i$

The residuals $\hat{u}_i$ are used as estimators of the errors or disturbances ($u_i$), which are unobservable.
3.4 Variance of Estimators in OLS Model
It follows from assumptions 2.1 to 2.6 that in the 2-variable model:

$\hat{\beta} \sim N(\beta,\ Var(\hat{\beta}))$, where $Var(\hat{\beta}) = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}$

Similarly $\hat{\alpha} \sim N(\alpha,\ Var(\hat{\alpha}))$, where $Var(\hat{\alpha}) = \frac{\sigma^2 \sum_i x_i^2}{n \sum_i (x_i - \bar{x})^2}$

Both $Var(\hat{\alpha})$ and $Var(\hat{\beta})$ are functions of the error variance $\sigma^2$, which is unknown.

To estimate $Var(\hat{\alpha})$ and $Var(\hat{\beta})$ we need an estimate of $\sigma^2$. How do we get that?
Estimation of the Disturbance Variance $\sigma^2$

We use the LS residuals $\hat{u}_i$ to proxy for the unobserved errors, $u_i$.

So the residual variance seems a natural estimator for the disturbance variance, $Var(u) = \sigma^2$, i.e.,

$Var(\hat{u}) = \frac{1}{n} \sum_i \hat{u}_i^2$

($\hat{u}$ is also an r.v. since it is a function of the r.v.s $y$ and $\hat{y}$.)

However, $Var(\hat{u})$ is not an unbiased estimator of $\sigma^2$. We can show that:

$E\left(\frac{\sum_i \hat{u}_i^2}{n}\right) \neq \sigma^2$

$E\left(\sum_i \hat{u}_i^2\right) = (n - 2)\,\sigma^2$

So we use $\hat{\sigma}^2 = \frac{\sum_i \hat{u}_i^2}{n - 2}$ as an estimator of $\sigma^2$ in the 2-variable model. $\hat{\sigma}^2$ is an unbiased estimator of $\sigma^2$, since

$E(\hat{\sigma}^2) = \sigma^2$

i.e., $E\left(\frac{\sum_i \hat{u}_i^2}{n - 2}\right) = \sigma^2$
In the multiple regression model with k explanatory variables,

$\hat{\sigma}^2 = \frac{\sum_i \hat{u}_i^2}{n - k - 1}$

Note: $\hat{\sigma}$ is not an unbiased estimator of $\sigma$, but it is consistent, i.e., $\text{plim}_{n \to \infty}\, \hat{\sigma} = \sigma$.
We replace $\sigma^2$ by $\hat{\sigma}^2$ in the expressions for $Var(\hat{\alpha})$ and $Var(\hat{\beta})$ to get the estimated variances of $\hat{\alpha}$ and $\hat{\beta}$, i.e.,

$\widehat{Var}(\hat{\beta}) = \frac{\hat{\sigma}^2}{\sum_i (x_i - \bar{x})^2}$

and, $\widehat{Var}(\hat{\alpha}) = \frac{\hat{\sigma}^2 \sum_i x_i^2}{n \sum_i (x_i - \bar{x})^2}$

The positive square root of these estimated variances gives us the standard errors of the estimators $\hat{\alpha}$ and $\hat{\beta}$, i.e.,

$\sqrt{\widehat{Var}(\hat{\beta})}$ = s.e. of $\hat{\beta}$; and $\sqrt{\widehat{Var}(\hat{\alpha})}$ = s.e. of $\hat{\alpha}$
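A short continuation of the earlier sketch (still illustrative; it assumes the simulated x, y, n and the estimates alpha_hat, beta_hat from the previous snippet) that estimates $\sigma^2$ and the standard errors using exactly these expressions.

```python
# Minimal sketch: sigma^2-hat and the estimated variances / standard errors in
# the 2-variable model, continuing x, y, n, alpha_hat, beta_hat from above.
u_hat = y - alpha_hat - beta_hat * x                     # OLS residuals
sigma2_hat = np.sum(u_hat ** 2) / (n - 2)                # unbiased estimator of sigma^2

var_beta_hat = sigma2_hat / np.sum((x - x.mean()) ** 2)
var_alpha_hat = sigma2_hat * np.sum(x ** 2) / (n * np.sum((x - x.mean()) ** 2))

se_beta_hat = np.sqrt(var_beta_hat)                      # s.e. of beta_hat
se_alpha_hat = np.sqrt(var_alpha_hat)                    # s.e. of alpha_hat
```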
3.5 Assumptions & Properties of OLS Estimators
According to the Gauss-Markov theorem the OLS estimators (i.e., $\hat{\alpha}$ and $\hat{\beta}$) are Best, Linear, Unbiased estimators (B.L.U.E.) of $\alpha$ and $\beta$.
Assumptions 2.1 to 2.4 above are required for obtaining the estimators and for proving the unbiasedness of $\hat{\alpha}$ and $\hat{\beta}$. If Assumption 2.4 (zero conditional mean) is violated and there is correlation between the errors and the x-variables, the estimators would be biased.

Assumption 2.5 (homoskedasticity) is required to prove the efficiency of $\hat{\alpha}$ and $\hat{\beta}$.

Assumption 2.6 (normally distributed errors) is NOT required to prove that $\hat{\alpha}$ and $\hat{\beta}$ are B.L.U.E. This assumption is required for carrying out hypothesis tests.
4. Multicollinearity
4.1 What is Multicollinearity ?
Multicollinearity refers to the presence of high linear correlation between the explanatory variables. It is part of the data generating process (DGP) and is very common in business / social sciences.
4.2 How does it affect properties of OLS estimators ?
The OLS estimators remain BLUE in the presence of multicollinearity (but perfect multicollinearity is ruled out by Assumption 2.3 above).
Why do we rule out perfect multicollinearity ? In our discussion of the dummy variable trap we saw the OLS model in matrix form and the X-matrix. In the presence of perfect multicollinearity the columns of the X-matrix become linearly dependent (as we saw in the case of the dummy variable trap), so the inverse of $X'X$ does not exist and we cannot estimate the unknown parameters $\beta_0, \beta_1, \ldots, \beta_k$.
4.3 What is the problem due to multicollinearity?
To understand why multicollinearity is a problem, look at the variance of the estimators in the MLR model:

$Var(\hat{\beta}_j) = \frac{\sigma^2}{SST_{x_j}\,(1 - R_j^2)}$, where

$Var(\hat{\beta}_j)$ : Variance of $\hat{\beta}_j$, the estimated coefficient of $x_j$, the jth explanatory variable

$SST_{x_j}$ : Total sample variation in $x_j$, i.e., $\sum_i (x_{ji} - \bar{x}_j)^2$

$R_j^2$ : $R^2$ from regressing $x_j$ on all the other explanatory variables

High correlation between explanatory variables means $R_j^2$ is very high (close to 1) and $1 - R_j^2$ is very low, so that $Var(\hat{\beta}_j)$ tends to be high, for given values of $SST_{x_j}$ and $\sigma^2$.
High $Var(\hat{\beta}_j)$ leads to high standard errors and low values of t-statistics (remember, the t-statistic of $\hat{\beta}_j$ is $\frac{\hat{\beta}_j}{se(\hat{\beta}_j)}$).

In the presence of multicollinearity, variables can be statistically insignificant (very low t-statistics for individual variables), even though the F-statistic for overall significance is significant.
4.4 How do we detect Multicollinearity ?

We can calculate the Variance Inflation Factor or VIF for each explanatory variable:

VIF of $x_j$ = $\frac{1}{1 - R_j^2}$

The general cutoff value for VIF is 10 (when $R_j^2$ is close to 1, say 0.9, indicating high correlation between $x_j$ and the other explanatory variables).

The VIF indicates to what extent the variance of $\hat{\beta}_j$ is inflated due to the correlation of $x_j$ with the other $x_i$, $i \neq j$.

But there is no exact measure or cutoff for what is 'too high'. In practice, therefore, VIF has limited relevance.
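As a minimal sketch (illustrative; the function name and the assumption that X holds the explanatory variables column-wise are mine, not the notes'), the VIFs can be computed directly from the auxiliary regressions described above.

```python
# Minimal sketch: VIF of each explanatory variable via the auxiliary regression
# of x_j on a constant and the other x's. X is an (n, k) array of regressors
# (no constant column); names are illustrative assumptions.
import numpy as np

def vif(X):
    n, k = X.shape
    out = []
    for j in range(k):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ coef
        r2_j = 1.0 - resid.var() / xj.var()    # R_j^2 from the auxiliary regression
        out.append(1.0 / (1.0 - r2_j))         # VIF_j = 1 / (1 - R_j^2)
    return np.array(out)
```

Packages such as statsmodels provide a ready-made VIF helper that does the same computation; the manual version above simply mirrors the formula in the notes.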
4.5 What can we do about Multicollinearity ?
We try to have large datasets to ensure high variation in the x-variables. Look at the components of $Var(\hat{\beta}_j)$. If the total variation in $x_j$ is high, $SST_{x_j}$ would be high and this would reduce $Var(\hat{\beta}_j)$ and lower standard errors, for given values of $\sigma^2$ and $R_j^2$.

Do you see that when data sets are small, $SST_{x_j}$ is also lower? This creates a problem of 'micronumerosity'! Small data sets have low values of $SST_{x_j}$, which can lead to high $Var(\hat{\beta}_j)$, high standard errors and low t-statistics, even if there is no multicollinearity.
Taking the log of variables may also help in the presence of multicollinearity.

There is no simple rule of thumb to tackle this problem.

Be careful about dropping explanatory variables to avoid multicollinearity. This can create a problem of omitted variable bias. Remember, the estimators are BLUE in the presence of multicollinearity, but dropping a relevant variable can lead to biased estimators.
Readings : Section on Multicollinearity from
Chapter 3 in Wooldridge (2012).
5. Heteroskedasticity
5.1 What is Heteroskedasticity ?
It is a violation of Assumption 2.5 above and is best understood in contrast with homoskedasticity, the assumption that the disturbance variance is constant.

Under homoskedasticity:

$Var(u|x_1, x_2, \ldots, x_k) = \sigma^2$

Since $E(u|x_1, x_2, \ldots, x_k) = 0$, we can write

$E(u^2|x_1, x_2, \ldots, x_k) = \sigma^2$

In contrast, under heteroskedasticity:

$Var(u_i|x_1, x_2, \ldots, x_k) = \sigma_i^2, \quad i = 1, \ldots, n$

Since $E(u_i|x_1, x_2, \ldots, x_k) = 0$, we can write

$E(u_i^2|x_1, x_2, \ldots, x_k) = \sigma_i^2$
Heteroskedasticity essentially means that the error variances are not constant; rather, they may be functions of the explanatory variables (the x-variables).
5.2 What problem does Heteroskedasticity cause ?
(i) When there is a relation between the error variance and the x-variables, the assumption of homoskedasticity is violated, so the OLS estimators are not efficient: they are not BLUE any more.

(ii) As you saw above, the error variance ($\sigma^2$) affects the variance of the estimators ($Var(\hat{\beta}_j)$). Hence the standard errors of the estimators are affected and the results of the usual hypothesis tests are no longer valid.

(iii) But the estimators are still unbiased and consistent; this follows as long as Assumption 2.4 is valid, i.e., there is zero correlation between the errors and the x-variables.
5.3 What to do about Heteroskedasticity ?
First we discuss what to do when we do not know the form of heteroskedasticity. (When the form of heteroskedasticity is known, we use WLS, as discussed in Section 5.5 below.)

Suppose we suspect

$E(u_i^2|x_1, x_2, \ldots, x_k) = \sigma_i^2 = f(x_1, x_2, \ldots, x_k)$

but we do not know the exact functional form of $f(x_1, x_2, \ldots, x_k)$.

In this case we can use heteroskedasticity-robust standard errors.
How do we estimate robust standard errors ?

Recall that in the 2-variable model, under homoskedasticity:

$Var(\hat{\beta}) = \frac{\sum_i (x_i - \bar{x})^2 \sigma^2}{\left[\sum_i (x_i - \bar{x})^2\right]^2} = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}$

Under heteroskedasticity:

$Var(\hat{\beta}) = \frac{\sum_i (x_i - \bar{x})^2 \sigma_i^2}{\left[\sum_i (x_i - \bar{x})^2\right]^2}$

To estimate $Var(\hat{\beta})$ in the presence of heteroskedasticity of any form, replace the error variance $\sigma_i^2$ above by its estimator, the squared residual $\hat{u}_i^2$ (recall, the residuals $\hat{u}_i$ also have zero mean).

So robust standard error estimation involves using the following to estimate $Var(\hat{\beta})$:

$\widehat{Var}(\hat{\beta}) = \frac{\sum_i (x_i - \bar{x})^2 \hat{u}_i^2}{\left[\sum_i (x_i - \bar{x})^2\right]^2}$
Note: Robust standard errors can always be used with cross-section data; they are valid even under homoskedasticity, when the data set is large.
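A minimal sketch of the robust variance formula above (illustrative; it continues the simulated x, y and residuals u_hat from the earlier snippets):

```python
# Minimal sketch: heteroskedasticity-robust (White) variance and standard error
# of beta_hat in the 2-variable model, replacing sigma_i^2 with u_hat_i^2.
sst_x = np.sum((x - x.mean()) ** 2)
var_beta_robust = np.sum((x - x.mean()) ** 2 * u_hat ** 2) / sst_x ** 2
se_beta_robust = np.sqrt(var_beta_robust)
```

In practice, regression packages report this kind of robust standard error directly (e.g., statsmodels' OLS fit accepts a robust covariance option); the manual computation is only to make the formula concrete.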
5.4 Tests for Heteroskedasticity
(i) With heteroskedasticity, the error variance ($\sigma_i^2$) is a function of the x-variables. Therefore tests check for the presence or absence of a relation between the squared residuals ($\hat{u}^2$, a proxy for the error variance) and the included explanatory variables, on the basis of an auxiliary regression of the following type:

$\hat{u}^2 = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + \delta_3 x_1^2 + \delta_4 x_2^2 + \delta_5 x_1 x_2 + \text{error}$

After this regression we test:

$H_0: \delta_1 = \delta_2 = \ldots = \delta_5 = 0$ against $H_1$: at least one $\delta_j \neq 0$

You can see that with just 2 explanatory variables in the model, the auxiliary regression has to estimate 6 parameters. So there can be problems with degrees of freedom in models with more than 2 regressors.
(ii) The following test for heteroskedasticity can be used to conserve degrees of freedom:

$\hat{u}^2 = \delta_0 + \delta_1 \hat{y} + \delta_2 \hat{y}^2 + \text{error}$

where $\hat{y}$ are the predicted (fitted) values from the regression of y on the x-variables.
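A minimal sketch of test (ii) (illustrative; it continues the simulated variables from the earlier snippets): regress $\hat{u}^2$ on $\hat{y}$ and $\hat{y}^2$ and use the LM statistic $n \cdot R^2$, which under $H_0$ is approximately chi-squared with 2 degrees of freedom.

```python
# Minimal sketch: the "special" White-type test above, using the LM statistic
# n * R^2 from the auxiliary regression of u_hat^2 on a constant, y_hat, y_hat^2.
# Continues x, y, n, u_hat, alpha_hat, beta_hat from earlier snippets.
from scipy import stats

y_fit = alpha_hat + beta_hat * x
Z = np.column_stack([np.ones(n), y_fit, y_fit ** 2])
g, *_ = np.linalg.lstsq(Z, u_hat ** 2, rcond=None)
aux_resid = u_hat ** 2 - Z @ g
r2_aux = 1.0 - aux_resid.var() / (u_hat ** 2).var()   # R^2 of the auxiliary regression

lm_stat = n * r2_aux
p_value = 1.0 - stats.chi2.cdf(lm_stat, df=2)         # small p-value: reject homoskedasticity
```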
5.5 Weighted Least Squares (WLS) Estimators
Suppose our regression model has
heteroskedasticity and
we know the exact functional form relating the
error variance to the x-variables. In this case
WLS estimators are used as they correct for
heteroskedasticity.
Recall from our statistics class that, for any random variable y and a constant m:

$Var(my) = m^2\, Var(y)$

If $Var(y) = \sigma^2$, then $Var(my) = m^2 \sigma^2$

Also remember that, for a set of independent random variables (e.g., the error terms are assumed to be independent), the variance of the sum is equal to the sum of the variances:

If $Var(u_i) = \sigma^2$ for all $i$, then

$Var\left(\sum_i u_i\right) = n \sigma^2$

We will use these results as we proceed.
Weighted Least Squares (WLS) - Example 1
Consider the following example :
Model:

$y_i = \alpha + \beta x_i + u_i$   (1)

where $Var(u_i|x_i) = \sigma^2 f(x_i) = \sigma^2 x_i$

In this case we use WLS and transform our model as follows, where each term is weighted by $\frac{1}{\sqrt{x_i}}$:

$\frac{y_i}{\sqrt{x_i}} = \alpha \frac{1}{\sqrt{x_i}} + \beta \frac{x_i}{\sqrt{x_i}} + \frac{u_i}{\sqrt{x_i}}$   (2)

or, $\frac{y_i}{\sqrt{x_i}} = \frac{\alpha}{\sqrt{x_i}} + \beta \sqrt{x_i} + \frac{u_i}{\sqrt{x_i}}$   (2)

What is the variance of the error term in Model (2)?

$Var\left(\frac{u_i}{\sqrt{x_i}}\right) = \frac{1}{x_i}\, Var(u_i) = \frac{1}{x_i}\, \sigma^2 x_i = \sigma^2$

i.e., $Var\left(\frac{u_i}{\sqrt{x_i}}\right) = \sigma^2$

Using WLS we have homoskedastic errors, i.e., the error variance is constant in Model (2)!
So when the exact form of heteroskedasticity is known we use Generalized Least Squares (GLS), based on the following general principle. Suppose:

$Var(u|x) = \sigma^2 f(x)$

We transform the model using WLS with the weights $\frac{1}{\sqrt{f(x)}}$ as follows:

$\frac{y}{\sqrt{f(x)}} = \alpha \frac{1}{\sqrt{f(x)}} + \beta \frac{x}{\sqrt{f(x)}} + \frac{u}{\sqrt{f(x)}}$   (3)

So that:

$Var\left(\frac{u}{\sqrt{f(x)}}\right) = \frac{1}{f(x)}\, Var(u) = \frac{1}{f(x)}\, \sigma^2 f(x)$

or, $Var\left(\frac{u}{\sqrt{f(x)}}\right) = \sigma^2$

Look at the estimated coefficients in the WLS Models (2) and (3). Their interpretation is exactly the same as in Model (1).

Clearly, when the form of heteroskedasticity is known, WLS estimators are more efficient.
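A minimal sketch of WLS for Example 1 (illustrative; the simulated data and true parameter values are assumptions). Each variable is divided by $\sqrt{x_i}$ and OLS is run on the transformed model, where the original intercept becomes the coefficient on $\frac{1}{\sqrt{x_i}}$:

```python
# Minimal sketch: WLS when Var(u_i | x_i) = sigma^2 * x_i, i.e. f(x) = x.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
u = rng.normal(scale=np.sqrt(x))               # heteroskedastic: Var(u_i | x_i) = x_i
y = 2.0 + 0.5 * x + u                          # true alpha = 2.0, beta = 0.5 (assumed)

w = 1.0 / np.sqrt(x)                           # weight 1 / sqrt(f(x))
Z = np.column_stack([w, x * w])                # columns multiply alpha and beta in Model (2)
coef, *_ = np.linalg.lstsq(Z, y * w, rcond=None)
alpha_wls, beta_wls = coef
```

Library routines for WLS (e.g., in statsmodels, with weights proportional to $1/f(x)$) carry out the same transformation internally; the manual version above simply mirrors Model (2).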
Weighted Least Squares (WLS) - Example 2
Suppose the model is:

$y = \alpha + \beta x + u$

and $Var(u) = \sigma^2$

But data are not available for y and x; only data on averages are available. E.g., instead of the years of education (y) and age (x) of each employee, for each firm you have the average years of education and the average age of all its employees.

i.e., the model you are estimating is:

$\bar{y} = \alpha + \beta \bar{x} + \bar{u}$

where, for a firm with n employees,

$Var(\bar{u}) = Var\left(\frac{\sum_i u_i}{n}\right) = \frac{1}{n^2}\, Var\left(\sum_i u_i\right) = \frac{1}{n^2}\, n\sigma^2 = \frac{1}{n}\, \sigma^2$
The errors are heteroskedastic in the averages model and the form of heteroskedasticity is known. So we use WLS as discussed above.

Here $f(x) = \frac{1}{n}$, so the weights used will be $\frac{1}{\sqrt{f(x)}} = \sqrt{n}$
Using WLS the model is transformed as follows:

$\bar{y}_i \sqrt{n_i} = \alpha \sqrt{n_i} + \beta \bar{x}_i \sqrt{n_i} + \bar{u}_i \sqrt{n_i}$

In this model the errors are homoskedastic:

$Var(\bar{u}_i \sqrt{n_i}) = n_i\, Var(\bar{u}_i) = n_i \cdot \frac{1}{n_i}\, \sigma^2 = \sigma^2$
This example shows that if we are working with cross-section data on, say, firm-level averages for the employees of each firm, it is best to weight each observation by the square root of the number of employees in the firm. This WLS estimator is more efficient than the OLS estimator.
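A minimal sketch for Example 2 (illustrative; the function and argument names are assumptions, not from the notes): given firm-level means and each firm's employee count, weight each observation by $\sqrt{n_i}$.

```python
# Minimal sketch: WLS on firm-level averages, weighting each firm by sqrt(n_i).
import numpy as np

def wls_group_means(y_bar, x_bar, n_emp):
    """y_bar, x_bar: firm-level means; n_emp: number of employees per firm."""
    w = np.sqrt(n_emp)                         # weight sqrt(n_i) for each firm
    Z = np.column_stack([w, x_bar * w])        # columns multiply alpha and beta
    coef, *_ = np.linalg.lstsq(Z, y_bar * w, rcond=None)
    return coef                                # (alpha_wls, beta_wls)
```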
Carefully go through the examples in
Wooldridge Chapter 8 to understand this
better.
Note: When the form of the heteroskedasticity is unknown, i.e., $f(x)$ is not known, it may be estimated using the residuals to obtain $\hat{f}(x)$. When $\hat{f}(x)$ is used to transform the data we are using Feasible Generalized Least Squares (FGLS).
Suggested Readings for Heteroskedasticity :
Chapter 8 in Wooldridge (2012).
Wooldridge, J.M. (2012), Introductory Econometrics: A Modern Approach, Cengage Learning (latest edition).