This document describes the key concepts and methods for multiple linear regression analysis. It discusses: 1. The multiple linear regression model, assumptions, and matrix formulation. 2. Methods for estimating the regression coefficients (β) using least squares and their properties, including being unbiased, minimum variance (Gauss-Markov theorem), and scale invariant. 3. Methods for estimating the variance (σ2) including an unbiased estimator used in practice. 4. When normality is assumed, the maximum likelihood estimators are the same as the least squares estimators and their properties are described.

Ch6. Multiple Regression: Estimation

1 The model
yi = β0 + β1 xi1 + β2 xi2 + · · · + βk xik + εi ,  i = 1, 2, · · · , n.  (1)

The assumptions for εi and yi are analogous to those for simple linear regression, namely:

1. E(εi) = 0 for all i = 1, 2, · · · , n, or, equivalently, E(yi) = β0 + β1 xi1 + β2 xi2 + · · · + βk xik.

2. var(εi) = σ² for all i = 1, 2, · · · , n, or, equivalently, var(yi) = σ².

3. cov(εi, εj) = 0 for all i ≠ j, or, equivalently, cov(yi, yj) = 0.


In matrix form, the model can be written as

$$
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1k} \\
1 & x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nk}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}
+
\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix},
$$

or
y = Xβ + ε.

The assumptions on ε or y can be expressed as

1. E(ε) = 0 or E(y) = Xβ.

2. cov(ε) = σ²I or cov(y) = σ²I.

The matrix X is n × (k + 1) and is called the design matrix. In this chapter, we assume that n > k + 1 and rank(X) = k + 1.
2 Estimation of β and σ²

2.1 Least squares estimator for β


The least squares approach is to seek the estimator of β which minimizes the sum of squared deviations of the n observed y's from their predicted values ŷ, i.e., minimizes
$$
\sum_{i=1}^{n} \hat\epsilon_i^{\,2} = \sum_{i=1}^{n} (y_i - \hat y_i)^2 .
$$

Theorem 2.1 If y = Xβ + ε, where X is n × (k + 1) of rank k + 1 < n, then the least squares estimator of β is

β̂ = (X′X)⁻¹X′y.

PROOF: Exercise.
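As a quick numerical illustration (not part of the original notes), the closed-form estimator can be computed directly with NumPy on simulated data; the dimensions, coefficients, and random seed below are arbitrary choices, and np.linalg.lstsq is included only as a more numerically stable cross-check of the same least squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3                                                # assumed sample size and number of predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # design matrix with a column of 1's
beta_true = np.array([1.0, 2.0, -0.5, 0.3])                 # assumed true coefficients
y = X @ beta_true + rng.normal(scale=1.0, size=n)           # y = X beta + eps

# Least squares estimator: solve the normal equations (X'X) beta_hat = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with a numerically stable least squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))   # True
```

Solving the normal equations X′Xβ̂ = X′y (or using a QR-based solver) avoids forming (X′X)⁻¹ explicitly, which matters when X′X is ill conditioned.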

2.2 Properties of the least squares estimator β̂


Theorem 2.2 If E(y) = Xβ , then β̂ is an unbiased estimator for β .
PROOF:

E(β̂) = E[(X′X)⁻¹X′y] = (X′X)⁻¹X′E(y) = (X′X)⁻¹X′Xβ = β.

Theorem 2.3 If cov(y) = σ²I, the covariance matrix for β̂ is given by σ²(X′X)⁻¹.

PROOF: Exercise.

Theorem 2.4 (Gauss-Markov Theorem) If E(y) = Xβ and cov(y) = σ²I, the least squares estimators β̂j, j = 0, 1, · · · , k, have minimum variance among all linear unbiased estimators, i.e., the least squares estimators β̂j, j = 0, 1, · · · , k, are best linear unbiased estimators (BLUE).
PROOF: We consider a linear estimator Ay of β and seek the matrix A for which Ay
is a minimum variance unbiased estimator of β . Since Ay is to be unbiased for β , we
have
E(Ay) = AE(y) = AXβ = β,
which gives the unbiasedness condition

AX = I

since the relationship AXβ = β must hold for every possible value of β.
The covariance matrix for Ay is

cov(Ay) = A(σ 2 I)A′ = σ 2 AA′ .

The variances of the estimators of the βj's are on the diagonal of σ²AA′, and therefore we need to choose A (subject to AX = I) so that the diagonal elements of AA′ are minimized. Since

AA′ = [A − (X′X)⁻¹X′ + (X′X)⁻¹X′][A − (X′X)⁻¹X′ + (X′X)⁻¹X′]′
    = [A − (X′X)⁻¹X′][A − (X′X)⁻¹X′]′ + (X′X)⁻¹.
Note that in the last equality, AX = I is used. Since [A − (X′X)⁻¹X′][A − (X′X)⁻¹X′]′ is positive semidefinite, its diagonal elements are greater than or equal to zero. These diagonal elements can be made equal to zero by choosing A = (X′X)⁻¹X′. (This value of A also satisfies the unbiasedness condition AX = I.)
The resulting minimum variance estimator of β is

Ay = (X ′ X)−1 X ′ y,

which is equal to the least squares estimator β̂.

Remark: The remarkable feature of the Gauss-Markov theorem is its distributional generality. The result holds for any distribution of y; normality is not required. The only assumptions used in the proof are E(y) = Xβ and cov(y) = σ²I. If these assumptions do not hold, β̂ may be biased or each β̂j may have a larger variance than that of some other estimator.
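The following sketch (illustrative only, with a simulated design and an arbitrarily chosen weight matrix) compares the OLS coefficient variances with those of another linear unbiased estimator Ay satisfying AX = I, here A = (X′WX)⁻¹X′W; by the Gauss-Markov theorem the OLS diagonal entries should be the smallest.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

# OLS "A" matrix and an alternative linear unbiased estimator with AX = I
A_ols = np.linalg.solve(X.T @ X, X.T)              # (X'X)^{-1} X'
W = np.diag(rng.uniform(0.5, 2.0, size=n))         # arbitrary positive definite diagonal weights
A_alt = np.linalg.solve(X.T @ W @ X, X.T @ W)      # (X'WX)^{-1} X'W, also satisfies AX = I

print(np.allclose(A_alt @ X, np.eye(k + 1)))       # unbiasedness condition holds

# Variances (up to sigma^2) are the diagonal elements of AA'
var_ols = np.diag(A_ols @ A_ols.T)
var_alt = np.diag(A_alt @ A_alt.T)
print(var_ols)
print(var_alt)
print(np.all(var_alt >= var_ols - 1e-12))          # Gauss-Markov: OLS variances are smallest
```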

Corollary 2.1 If E(y) = Xβ and cov(y) = σ²I, the best linear unbiased estimator of a′β is a′β̂, where β̂ = (X′X)⁻¹X′y.
A fourth property of β̂ is the following: the predicted value ŷ = β̂0 + β̂1 x1 + · · · + β̂k xk = β̂′x is invariant to simple linear changes of scale on the x's, where x = (1, x1, x2, · · · , xk)′.

Theorem 2.5 If x = (1, x1, · · · , xk)′ and z = (1, c1 x1, · · · , ck xk)′, where each cj ≠ 0, then ŷ = β̂′x = β̂z′z, where β̂z is the least squares estimator from the regression of y on z.

PROOF: We can write Z = XD, where D = diag(1, c1, c2, · · · , ck). Substituting Z = XD into β̂z, we have

β̂z = [(XD)′(XD)]⁻¹(XD)′y = D⁻¹(X′X)⁻¹X′y = D⁻¹β̂.

Hence,

β̂z′z = (D⁻¹β̂)′Dx = β̂′D⁻¹Dx = β̂′x.
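A brief numerical check of Theorem 2.5 (a sketch on simulated data; the scale factors cj are arbitrary): rescaling the predictor columns rescales the coefficients by 1/cj but leaves the fitted values unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(size=n)

c = np.array([1.0, 10.0, 0.25, 3.0])        # scale factors (first entry 1 for the intercept column)
Z = X * c                                   # Z = X D with D = diag(1, c1, ..., ck)

beta_x, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_z, *_ = np.linalg.lstsq(Z, y, rcond=None)

print(np.allclose(beta_z, beta_x / c))      # beta_z = D^{-1} beta_hat
print(np.allclose(Z @ beta_z, X @ beta_x))  # fitted values are invariant to the rescaling
```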
2.3 An estimator for σ²
By assumption 1, E(yi) = xi′β, where xi′ = (1, xi1, · · · , xik) is the ith row of X, and by assumption 2, σ² = E[yi − E(yi)]². We therefore have σ² = E(yi − xi′β)².

Hence, σ² can be estimated by

$$
s^2 = \frac{1}{n-k-1}\sum_{i=1}^{n}\bigl(y_i - \mathbf{x}_i'\hat{\boldsymbol\beta}\bigr)^2
    = \frac{(\mathbf{y} - X\hat{\boldsymbol\beta})'(\mathbf{y} - X\hat{\boldsymbol\beta})}{n-k-1}
    = \frac{SSE}{n-k-1}.
$$
With the denominator n − k − 1, s2 is an unbiased estimator of σ 2 .

Theorem 2.6 If E(y) = Xβ and cov(y) = σ 2 I , then

E(s2 ) = σ 2 .
PROOF: Exercise.
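The unbiasedness of s² can also be seen empirically; the Monte Carlo sketch below uses assumed values of n, k, and σ² and simply averages s² over repeated simulated samples.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 25, 2, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, -2.0, 0.5])

s2_values = []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    s2_values.append(resid @ resid / (n - k - 1))   # SSE / (n - k - 1)

print(np.mean(s2_values))   # close to sigma2 = 4.0
```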

Corollary 2.2 An unbiased estimator of cov(β̂) is given by

ĉov(β̂) = s²(X′X)⁻¹.

Theorem 2.7 If E(y) = Xβ, cov(y) = σ²I, and E(εi⁴) = 3σ⁴ for the linear model y = Xβ + ε, then s² is the best (minimum variance) quadratic unbiased estimator of σ².

PROOF: See Graybill (1954) or Wang and Chow (1994, pp. 161-163).

3 The model in centered form


In matrix form, the centered model for multiple linear regression becomes

$$
\mathbf{y} = (\mathbf{j}, X_c)\begin{pmatrix} \alpha \\ \boldsymbol\beta_1 \end{pmatrix} + \boldsymbol\epsilon,
$$

where j is an n × 1 vector of 1's, α = β0 + β1 x̄1 + β2 x̄2 + · · · + βk x̄k, β1 = (β1, β2, · · · , βk)′, and

$$
X_c = \Bigl(I - \tfrac{1}{n}J\Bigr)X_1 =
\begin{pmatrix}
x_{11}-\bar x_1 & x_{12}-\bar x_2 & \cdots & x_{1k}-\bar x_k \\
x_{21}-\bar x_1 & x_{22}-\bar x_2 & \cdots & x_{2k}-\bar x_k \\
\vdots & \vdots & & \vdots \\
x_{n1}-\bar x_1 & x_{n2}-\bar x_2 & \cdots & x_{nk}-\bar x_k
\end{pmatrix},
$$

where X1 contains the last k columns of X (the x's without the column of 1's) and J is the n × n matrix of 1's. The matrix I − (1/n)J is sometimes called the centering matrix.


The corresponding least squares estimator becomes

$$
\begin{pmatrix} \hat\alpha \\ \hat{\boldsymbol\beta}_1 \end{pmatrix}
= [(\mathbf{j}, X_c)'(\mathbf{j}, X_c)]^{-1}(\mathbf{j}, X_c)'\mathbf{y}
= \begin{pmatrix} n & \mathbf{0}' \\ \mathbf{0} & X_c'X_c \end{pmatrix}^{-1}
  \begin{pmatrix} n\bar y \\ X_c'\mathbf{y} \end{pmatrix}
= \begin{pmatrix} \bar y \\ (X_c'X_c)^{-1}X_c'\mathbf{y} \end{pmatrix}.
$$
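A short numerical check of the centered form (a sketch with simulated data): centering the predictor columns gives α̂ = ȳ, the same slope estimates as the uncentered fit, and identical fitted values.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 60, 3
X1 = rng.normal(loc=5.0, size=(n, k))                 # predictor columns (no intercept column)
y = 2.0 + X1 @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)

# Uncentered fit: y on (1, x1, ..., xk)
X = np.column_stack([np.ones(n), X1])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Centered fit: Xc = (I - J/n) X1
Xc = X1 - X1.mean(axis=0)
beta1_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)      # (Xc'Xc)^{-1} Xc'y
alpha_hat = y.mean()                                  # first component of the centered estimator

print(np.allclose(beta1_hat, beta_hat[1:]))                    # same slope estimates
print(np.allclose(alpha_hat + Xc @ beta1_hat, X @ beta_hat))   # identical fitted values
```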
4 Normal model

4.1 Assumptions
Normality assumption:

y is Nn(Xβ, σ²I) or ε is Nn(0, σ²I).

Under normality, cov(y) = cov(ε) = σ²I implies that the y's are independent as well as uncorrelated.

4.2 Maximum likelihood estimators for β and σ²


Theorem 4.1 If y is Nn(Xβ, σ²I), where X is n × (k + 1) of rank k + 1 < n, the maximum likelihood estimators of β and σ² are

β̂ = (X′X)⁻¹X′y,
σ̂² = (1/n)(y − Xβ̂)′(y − Xβ̂).
PROOF: Exercise.

The maximum likelihood estimator β̂ is the same as the least squares estimator β̂. The estimator σ̂² is biased since the denominator is n rather than n − k − 1. We often use the unbiased estimator s² to estimate σ².

4.3 Properties of β̂ and σ̂²


Theorem 4.2 Suppose y is Nn(Xβ, σ²I), where X is n × (k + 1) of rank k + 1 < n and β = (β0, · · · , βk)′. Then the maximum likelihood estimators β̂ and σ̂² have the following distributional properties:

(i) β̂ is Nk+1(β, σ²(X′X)⁻¹).

(ii) nσ̂²/σ² is χ²(n − k − 1), or equivalently, (n − k − 1)s²/σ² is χ²(n − k − 1).

(iii) β̂ and σ̂² (or s²) are independent.

PROOF: (i) Since y is normal and β̂ = (X′X)⁻¹X′y is a linear function of y, with E(β̂) = β and cov(β̂) = σ²(X′X)⁻¹, we have β̂ ∼ Nk+1(β, σ²(X′X)⁻¹).

(ii) We can write

$$
\frac{n\hat\sigma^2}{\sigma^2} = \Bigl(\frac{\mathbf{y}}{\sigma}\Bigr)'\bigl(I - X(X'X)^{-1}X'\bigr)\Bigl(\frac{\mathbf{y}}{\sigma}\Bigr),
$$

and I − X(X′X)⁻¹X′ is idempotent of rank n − k − 1, hence nσ̂²/σ² is χ²(n − k − 1).

(iii) Since β̂ = (X′X)⁻¹X′y and

nσ̂² = y′(I − X(X′X)⁻¹X′)y,

and since (X′X)⁻¹X′(I − X(X′X)⁻¹X′) = O, it follows that β̂ and σ̂² (or s²) are independent.
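These distributional properties can be checked by simulation; the sketch below (with assumed n, k, and σ²) verifies that (n − k − 1)s²/σ² averages to its χ² degrees of freedom and that the empirical covariance of β̂ matches σ²(X′X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, sigma2 = 30, 2, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([0.5, 1.0, -1.5])
df = n - k - 1

chi2_stats, beta_hats = [], []
for _ in range(4000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta_hat) ** 2)
    chi2_stats.append(sse / sigma2)          # (n - k - 1) s^2 / sigma^2
    beta_hats.append(beta_hat)

print(np.mean(chi2_stats), df)               # the mean of a chi-square(df) variable is df
emp_cov = np.cov(np.array(beta_hats).T)      # empirical covariance of beta_hat across replications
print(np.round(emp_cov, 3))
print(np.round(sigma2 * np.linalg.inv(X.T @ X), 3))   # theoretical cov(beta_hat)
```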

Theorem 4.3 If y is Nn (Xβ, σ 2 I), then β̂ and σ̂ 2 are jointly sufficient for β and
σ2 .

PROOF: Apply the Neyman factorization theorem; for details, see Rencher and Schaalje (2008, pp. 159-160).

Since β̂ and σ̂ 2 are jointly sufficient for β and σ 2 , no other estimators can improve
on the information they extract from the sample to estimate β and σ 2 . Thus, it is not
surprising that β̂ and s2 are minimum variance unbiased estimators.
Theorem 4.4 If y is Nn (Xβ, σ 2 I), then β̂ and s2 have minimum variance among
all unbiased estimators.

PROOF: See Graybill (1976, p. 176) or Christensen (1996, pp. 25-27).

5 R² in fixed-x regression

The proportion of the total sum of squares due to regression is measured by

$$
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},
$$

where $SST = \sum_{i=1}^{n}(y_i - \bar y)^2$, $SSR = \sum_{i=1}^{n}(\hat y_i - \bar y)^2 = \hat{\boldsymbol\beta}'X'\mathbf{y} - n\bar y^2$, and $SSE = \sum_{i=1}^{n}(y_i - \hat y_i)^2$, so that

SST = SSR + SSE.
The R2 is called the coefficient of determination or the squared multiple correla-
tion. The positive square root R is called the multiple correlation coefficient. If the x’s
were random, R would estimate a population multiple correlation.
We list some properties of R2 and R.

1. The range of R2 is 0 ≤ R2 ≤ 1. If all the β̂j ’s were zero, except for β̂0 , R2
would be zero. (This event has probability zero for continuous data.) If all the
y -values fell on the fitted surface, that is, if yi = ŷi , i = 1, 2, · · · , n, then R2
would be 1.

2. R = ryŷ ; that is, the multiple correlation is equal to the simple correlation
between the observed yi ’s and the fitted ŷi ’s.

3. Adding a variable x to the model increases (cannot decrease) the value of R².

4. If β1 = β2 = · · · = βk = 0, then E(R²) = k/(n − 1). Note that the β̂j's will not be zero even when the βj's are zero.
5. R² cannot be partitioned into k components, each of which is uniquely attributable to an xj, unless the x's are mutually orthogonal, that is, unless $\sum_{i=1}^{n}(x_{ij} - \bar x_j)(x_{im} - \bar x_m) = 0$ for j ≠ m.

6. R² is invariant to full-rank linear transformations on the x's and to a scale change on y (but not invariant to a joint linear transformation including y and the x's).

Adjusted R² (AdjR²):

$$
R_a^2 = \mathrm{Adj}R^2
      = \frac{\bigl(R^2 - k/(n-1)\bigr)(n-1)}{n-k-1}
      = \frac{(n-1)R^2 - k}{n-k-1}
      = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}.
$$
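A short sketch (simulated data, arbitrary settings) computing R² and the adjusted R² directly from the sums of squares defined above:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
print(np.isclose(ssr + sse, sst))        # SST = SSR + SSE
print(r2, r2_adj)                        # the adjusted R^2 is slightly smaller
```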
R² can also be expressed in terms of sample variances and covariances:

$$
R^2 = \frac{\hat{\boldsymbol\beta}_1' X_c'X_c \hat{\boldsymbol\beta}_1}{\sum_{i=1}^{n}(y_i - \bar y)^2}
    = \frac{\mathbf{s}_{yx}' S_{xx}^{-1}(n-1)S_{xx}S_{xx}^{-1}\mathbf{s}_{yx}}{\sum_{i=1}^{n}(y_i - \bar y)^2}
    = \frac{\mathbf{s}_{yx}' S_{xx}^{-1}\mathbf{s}_{yx}}{s_y^2},
$$

where $S_{xx} = X_c'X_c/(n-1)$ and $\mathbf{s}_{yx} = X_c'\mathbf{y}/(n-1)$. Note that $\hat{\boldsymbol\beta}_1 = (n-1)(X_c'X_c)^{-1}\,X_c'\mathbf{y}/(n-1) = \bigl(X_c'X_c/(n-1)\bigr)^{-1}X_c'\mathbf{y}/(n-1) = S_{xx}^{-1}\mathbf{s}_{yx}$. This form of R² will facilitate a comparison with R² for the random-x case.
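The sample-covariance form can be verified directly; in the sketch below np.cov supplies S_xx and s_yx (both with the n − 1 divisor), and the result is compared with R² from the fitted regression.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 50, 3
X1 = rng.normal(size=(n, k))                              # predictors without the intercept column
y = 1.0 + X1 @ np.array([0.8, -0.4, 1.2]) + rng.normal(size=n)

# Sample covariance matrix of (y, x1, ..., xk); block off S_xx and s_yx
full_cov = np.cov(np.column_stack([y, X1]).T)
s_y2 = full_cov[0, 0]
s_yx = full_cov[0, 1:]
S_xx = full_cov[1:, 1:]

r2_cov = s_yx @ np.linalg.solve(S_xx, s_yx) / s_y2        # s_yx' S_xx^{-1} s_yx / s_y^2

# Compare with R^2 from the fitted regression
X = np.column_stack([np.ones(n), X1])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
print(np.isclose(r2_cov, 1 - sse / sst))                  # True
```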
Geometrically, R is the cosine of the angle θ between y and ŷ, corrected for their means. The mean of ŷ is ȳ, the same as the mean of y. Thus, the centered forms of y and ŷ are y − ȳj and ŷ − ȳj, and

$$
\cos\theta = \frac{(\mathbf{y} - \bar y\mathbf{j})'(\hat{\mathbf{y}} - \bar y\mathbf{j})}
{\sqrt{\bigl[(\mathbf{y} - \bar y\mathbf{j})'(\mathbf{y} - \bar y\mathbf{j})\bigr]\bigl[(\hat{\mathbf{y}} - \bar y\mathbf{j})'(\hat{\mathbf{y}} - \bar y\mathbf{j})\bigr]}}.
$$
Note that

$$
(\mathbf{y} - \bar y\mathbf{j})'(\hat{\mathbf{y}} - \bar y\mathbf{j})
= [(\hat{\mathbf{y}} - \bar y\mathbf{j}) + (\mathbf{y} - \hat{\mathbf{y}})]'(\hat{\mathbf{y}} - \bar y\mathbf{j})
= (\hat{\mathbf{y}} - \bar y\mathbf{j})'(\hat{\mathbf{y}} - \bar y\mathbf{j}) + (\mathbf{y} - \hat{\mathbf{y}})'(\hat{\mathbf{y}} - \bar y\mathbf{j})
= (\hat{\mathbf{y}} - \bar y\mathbf{j})'(\hat{\mathbf{y}} - \bar y\mathbf{j}) + 0,
$$

since the residual vector y − ŷ is orthogonal to both ŷ and j.

Hence,

$$
\cos\theta = \sqrt{\frac{(\hat{\mathbf{y}} - \bar y\mathbf{j})'(\hat{\mathbf{y}} - \bar y\mathbf{j})}{(\mathbf{y} - \bar y\mathbf{j})'(\mathbf{y} - \bar y\mathbf{j})}} = \sqrt{\frac{SSR}{SST}} = R.
$$
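The identity cos θ = R can be confirmed numerically (a sketch on simulated data): the cosine of the angle between the centered y and the centered ŷ equals the positive square root of R².

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 35, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.5, -0.7]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

yc = y - y.mean()              # y - ybar * j
yhc = y_hat - y_hat.mean()     # yhat - ybar * j (yhat has the same mean as y)

cos_theta = (yc @ yhc) / np.sqrt((yc @ yc) * (yhc @ yhc))
R = np.sqrt(1 - np.sum((y - y_hat) ** 2) / (yc @ yc))   # R = sqrt(1 - SSE/SST)
print(np.isclose(cos_theta, R))   # True
```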

6 Generalized least squares: cov(y) = σ²V

The model is now

y = Xβ + ε,  E(y) = Xβ,  cov(y) = Σ = σ²V,

where V is a known positive definite matrix.

Theorem 6.1 Let y = Xβ + ε, E(y) = Xβ, and cov(y) = cov(ε) = σ²V, where X is a full-rank matrix and V is a known positive definite matrix. For this model, we obtain the following results:

(i) The best linear unbiased estimator (BLUE) of β is

β̂ = (X′V⁻¹X)⁻¹X′V⁻¹y.

(ii) The covariance matrix for β̂ is

cov(β̂) = σ²(X′V⁻¹X)⁻¹.

(iii) An unbiased estimator of σ² is

s² = (y − Xβ̂)′V⁻¹(y − Xβ̂)/(n − k − 1).
PROOF: (i) Since V is positive definite, there exists an n × n nonsingular matrix P such that V = PP′. Multiplying y = Xβ + ε by P⁻¹, we obtain the transformed model

P⁻¹y = P⁻¹Xβ + P⁻¹ε,

for which cov(P⁻¹y) = P⁻¹(σ²V)(P⁻¹)′ = σ²P⁻¹PP′(P′)⁻¹ = σ²I, so the ordinary least squares assumptions hold. Applying the least squares approach to this transformed model, we get

β̂ = [X′(P⁻¹)′P⁻¹X]⁻¹X′(P⁻¹)′P⁻¹y = (X′V⁻¹X)⁻¹X′V⁻¹y,

since (P⁻¹)′P⁻¹ = (PP′)⁻¹ = V⁻¹. Note that since X is full rank, X′V⁻¹X is positive definite. The estimator β̂ = (X′V⁻¹X)⁻¹X′V⁻¹y is usually called the generalized least squares estimator.

(ii) and (iii) are left as exercises.
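The transformation argument in the proof can be mirrored in code (an illustrative sketch; the AR(1)-type V below is just one convenient choice of known positive definite matrix): factor V = PP′ with a Cholesky decomposition, apply ordinary least squares to P⁻¹y and P⁻¹X, and compare with the closed-form GLS estimator.

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, -1.0, 2.0])

# A known positive definite V (AR(1)-style correlation structure, chosen for illustration)
rho = 0.6
V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

P = np.linalg.cholesky(V)                 # V = P P'
eps = P @ rng.normal(size=n)              # errors with covariance proportional to V
y = X @ beta + eps

# Closed-form GLS: (X' V^{-1} X)^{-1} X' V^{-1} y
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Same estimate via OLS on the transformed model P^{-1} y = P^{-1} X beta + P^{-1} eps
y_t = np.linalg.solve(P, y)
X_t = np.linalg.solve(P, X)
beta_ols_t, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)

print(np.allclose(beta_gls, beta_ols_t))  # True
```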

Theorem 6.2 If y is Nn(Xβ, σ²V), where X is n × (k + 1) of rank k + 1 and V is a known positive definite matrix, then the maximum likelihood estimators for β and σ² are

β̂ = (X′V⁻¹X)⁻¹X′V⁻¹y,
σ̂² = (1/n)(y − Xβ̂)′V⁻¹(y − Xβ̂).

PROOF: Exercise.
