
Advanced Regression Models

Lecture 5: Collinearity and Inference for Multivariable Linear Regression

Heping Zhang

October 1, 2024
Acknowledgement and Appreciation

The instructor uses lecture notes originally prepared by Professor Yize Zhao, and thanks Professor Zhao for her generosity.
A few notes on the midterm

▶ The midterm will be on Oct 15
▶ Most students will take the exam in the classroom, starting at 10am
▶ A few students with pre-arranged conflicts will take the exam in the conference room of Suite 523, 300 George Street, starting at 9am
▶ Two pages of letter-size cheat sheet are allowed (two-sided is fine)
▶ Coverage: contents from the first six lectures
▶ Calculators and/or laptops are allowed
Learning Objectives

▶ Understand the consequences of collinearity


▶ Know how to detect and address collinearity
▶ Understand multiple hypothesis testing
▶ Infer the importance of one or more predictors in the presence
of the others (extra sum of squares)
▶ Infer the importance of linear combinations of regression
coefficients
▶ Infer the simultaneous importance of regression coefficients
(global tests)
Recap

Model:
y = Xβ + ϵ

▶ Desired properties of LSE


▶ BLUE
▶ MLE under Gaussian error

▶ Non-identifiability and collinearity: $(X^TX)^{-1}$ may not exist or may be numerically unstable
▶ Consequences: LSE, inference, prediction
How to detect collinearity?
▶ Examine the pairwise correlations among all predictors:
- High correlation in any pair is the most obvious case and is easy to detect from the correlation matrix.
▶ Examine the eigenvalues of $X^TX$:
- Eigendecomposition: $A = U\Sigma U^T$, where $U$ is the matrix whose columns are the eigenvectors of $A$, and $\Sigma$ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues of $A$.
- Look for relatively small eigenvalues of $X^TX$.
▶ Regress each predictor on the others:
▶ Let $R_j^2$ be the R-squared from regressing $X_j$ on all the other predictors.
▶ Obtain $R_1^2, \ldots, R_p^2$.
▶ Inspect whether any $R_j^2$ is close to 1 (e.g., $> 0.8$).
▶ Calculate the variance inflation factor: $\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$, $j = 1, \ldots, p$.
▶ If $R_j^2$ is close to 1, $\mathrm{VIF}_j$ will be large (e.g., $> 5$); see the sketch below.
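As a concrete version of the last bullets, here is a minimal R sketch (my own illustration, using the seatpos data introduced on the next slide) that computes each $R_j^2$ and VIF by hand:

library(faraway)
X <- seatpos[, -9]  # predictors only; drop the response hipcenter
vifs <- sapply(seq_along(X), function(j) {
  # regress predictor j on all the other predictors and record R_j^2
  r2 <- summary(lm(X[, j] ~ ., data = X[, -j]))$r.squared
  1 / (1 - r2)  # VIF_j = 1 / (1 - R_j^2)
})
names(vifs) <- names(X)
vifs  # compare with faraway::vif(X) used later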
Let’s see an example

▶ A dataset on 38 drivers, used to study how drivers position the seat depending on their body size and age.
library(faraway)
head(seatpos)

## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter


## 1 46 180 187.2 184.9 95.2 36.1 45.3 41.3 -206.300
## 2 31 175 167.5 165.5 83.8 32.9 36.5 35.9 -178.210
## 3 23 100 153.6 152.2 82.9 26.0 36.6 31.0 -71.673
## 4 19 185 190.3 187.4 97.3 37.4 44.1 41.0 -257.720
## 5 23 159 178.0 174.1 93.9 29.5 40.1 36.9 -173.230
## 6 47 170 178.7 177.0 92.4 36.0 43.2 37.4 -185.150

Both height with shoes (HtShoes) and without shoes (Ht) are included in the model, as is seated height (Seated). These are all troublemakers!
Fit the model (sign of collinearity)
g=lm(hipcenter~., seatpos)
summary(g)

##
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.827 -22.833 -3.678 25.017 62.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.43213 166.57162 2.620 0.0138 *
## Age 0.77572 0.57033 1.360 0.1843
## Weight 0.02631 0.33097 0.080 0.9372
## HtShoes -2.69241 9.75304 -0.276 0.7845
## Ht 0.60134 10.12987 0.059 0.9531
## Seated 0.53375 3.76189 0.142 0.8882
## Arm -1.32807 3.90020 -0.341 0.7359
## Thigh -1.14312 2.66002 -0.430 0.6706
## Leg -6.43905 4.71386 -1.366 0.1824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001
## F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05
Examine the correlation matrix

round(cor(seatpos),3)

## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter


## Age 1.000 0.081 -0.079 -0.090 -0.170 0.360 0.091 -0.042 0.205
## Weight 0.081 1.000 0.828 0.829 0.776 0.698 0.573 0.784 -0.640
## HtShoes -0.079 0.828 1.000 0.998 0.930 0.752 0.725 0.908 -0.797
## Ht -0.090 0.829 0.998 1.000 0.928 0.752 0.735 0.910 -0.799
## Seated -0.170 0.776 0.930 0.928 1.000 0.625 0.607 0.812 -0.731
## Arm 0.360 0.698 0.752 0.752 0.625 1.000 0.671 0.754 -0.585
## Thigh 0.091 0.573 0.725 0.735 0.607 0.671 1.000 0.650 -0.591
## Leg -0.042 0.784 0.908 0.910 0.812 0.754 0.650 1.000 -0.787
## hipcenter 0.205 -0.640 -0.797 -0.799 -0.731 -0.585 -0.591 -0.787 1.000

Except for age, the variables are all highly correlated with one another.


Examine the eigenvalues of X T X

X=as.matrix(seatpos)[,-9] #exclude the last column


e=eigen(t(X)%*%X)
e$val

## [1] 3.653671e+06 2.147948e+04 9.043225e+03 2.989526e+02 1.483948e+02


## [6] 8.117397e+01 5.336194e+01 7.298209e+00

There is a large range of the eigenvalues, and the smallest one is


particularly small relative to the largest one.
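One common way to summarize this spread (an addition, not shown in the slide) is the condition number $\sqrt{\lambda_{\max}/\lambda_{\min}}$; values above roughly 30 are often read as a warning sign:

# condition number of X: square root of the largest-to-smallest eigenvalue ratio
sqrt(max(e$val) / min(e$val))  # about 707 here, far beyond the usual ~30 warning level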
Let’s check the VIFs

vif(X)

## Age Weight HtShoes Ht Seated Arm Thigh


## 1.997931 3.647030 307.429378 333.137832 8.951054 4.496368 2.762886
## Leg
## 6.694291

Two predictors (HtShoes and Ht) have very large VIFs, and three variables (Seated, Leg, Arm) have relatively large VIFs.
Some take away messages for collinearity

▶ Collinearity can (and does) happen, so be careful


▶ Often contributes to the problem of variable selection, which
we’ll touch on later
▶ Generally does not impact the estimate of the overall variability explained by the model ($R^2$), so you don't need to worry too much if you only care about prediction
Hypothesis testing in SLR

▶ In SLR we have learned hypothesis testing and the construction of CIs for $\beta_1$ (and $\beta_0$)

▶ Under $H_0: \beta_1 = \beta_1^{(0)}$, our test statistic is $\frac{\hat\beta_1 - \beta_1^{(0)}}{se(\hat\beta_1)} \sim T_{n-2}$

▶ Under $H_0: \beta_1 = 0$, there is no linear association between $Y$ and $X$. Then $\frac{\hat\beta_1}{se(\hat\beta_1)} \sim T_{n-2}$ and $f = \frac{MSR}{MSE} \sim F_{1,n-2}$ yield equivalent t- and F-tests, respectively.
Rejection and non-rejection regions

[Figure: rejection and non-rejection regions of the test.]
MLR

MLR involves multiple predictors. Can we infer the importance of


one or more predictors in the presence of the others?
Revisit the income dataset
▶ Regress income against working hours, age, gender and race.
## income employment hrs_work race
## Min. : 50 Length:745 Min. : 1.00 Length:745
## 1st Qu.: 16000 Class :character 1st Qu.:36.00 Class :character
## Median : 34000 Mode :character Median :40.00 Mode :character
## Mean : 47492 Mean :39.23
## 3rd Qu.: 58000 3rd Qu.:42.00
## Max. :450000 Max. :99.00
## age gender citizen time_to_work
## Min. :16.00 Length:745 Length:745 Min. : 1.00
## 1st Qu.:30.00 Class :character Class :character 1st Qu.: 10.00
## Median :43.00 Mode :character Mode :character Median : 20.00
## Mean :42.76 Mean : 26.22
## 3rd Qu.:54.00 3rd Qu.: 30.00
## Max. :94.00 Max. :163.00
## lang married edu disability
## Length:745 Length:745 Length:745 Length:745
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## birth_qrtr
## Length:745
## Class :character
## Mode :character
##
##
##
Self-compute LSE

income$income=income$income/1000
X=cbind(1,income$age,income$hrs_work,income$race=="asian",income$race=="black",income$race=="white",income$gender=="male")
y=income$income
betahat=solve(t(X)%*%X)%*%t(X)%*%y
betahat

## [,1]
## [1,] -57.2980196
## [2,] 0.7672869
## [3,] 1.3124369
## [4,] 46.4238090
## [5,] -2.3020209
## [6,] 11.3375245
## [7,] 17.0306377
Questions of interest

▶ Can we say anything about whether the effect of age is


“significant” after adjusting for other variables?
▶ Can we compare this model to a model with only race and
gender?
▶ ...
Sampling distribution

If our usual assumptions are satisfied and $\epsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2)$, then

$\hat\beta \sim N(\beta, \sigma^2(X^TX)^{-1})$

$\hat\beta_j \sim N(\beta_j, \sigma^2(X^TX)^{-1}_{jj})$

where $(X^TX)^{-1}_{jj}$ is the $j$th diagonal element of $(X^TX)^{-1}$.

▶ This will be used for inference on any individual $\beta_j$, such as for the null hypothesis $H_0: \beta_j = 0$.
Testing procedure

As before, we calculate the probability of the observed data (or


more extreme data) under a null hypothesis.
▶ For example, $H_0: \beta_1 = 0$ and $H_a: \beta_1 \neq 0$
▶ Set α = P (falsely rejecting a true null hypothesis) (type I error
rate, e.g., 0.05)
▶ Calculate the value of the test statistic from the data
▶ Under the null distribution, compute the p-value

$P(\text{data as or more extreme than the observed test statistic} \mid H_0)$

▶ Reject or fail to reject H0


Individual coefficients

For an individual coefficient: $H_0: \beta_j = \beta_{j0}$ (usually 0)

▶ We can use the test statistic

$t = \frac{\hat\beta_j - \beta_{j0}}{se(\hat\beta_j)} = \frac{\hat\beta_j - \beta_{j0}}{\sqrt{\hat\sigma^2 (X^TX)^{-1}_{jj}}} \sim T_{n-p-1}$

▶ For a two-sided test of size $\alpha$, the rejection region is $|t| > T_{1-\alpha/2,\, n-p-1}$

▶ The p-value is $2P(T_{n-p-1} > |t_{obs}| \mid H_0)$


Even though we test one coefficient at a time, we prefer to keep all predictors in the model when estimating the variance of the errors. Hence, the degrees of freedom are $n - p - 1$. A hand computation is sketched below.
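A minimal sketch of this test on the income example, assuming the X, y, and betahat computed by hand on the earlier slide:

n <- nrow(X); p <- ncol(X) - 1  # X includes the intercept column, so p = 6
sigma2hat <- sum((y - X %*% betahat)^2) / (n - p - 1)  # estimated error variance
se <- sqrt(sigma2hat * diag(solve(t(X) %*% X)))  # se(betahat_j)
tstat <- betahat / se  # tests H0: beta_j = 0
pval <- 2 * pt(abs(tstat), df = n - p - 1, lower.tail = FALSE)
cbind(betahat, se, tstat, pval)  # matches the summary() output on the next slides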
Caution on the notation

▶ $t$ may denote a random variable following the t-distribution with $n-p-1$ degrees of freedom.
▶ $T_{n-p-1}$ may also denote such a t-distributed random variable.
▶ $T_{1-\alpha/2,\, n-p-1}$ is the $1-\alpha/2$ quantile of the t-distribution $T_{n-p-1}$.
Revisit the income example
##
## Call:
## lm(formula = income ~ age + hrs_work + relevel(factor(race),
## ref = "other") + gender, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.29 -23.47 -9.00 8.14 356.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.2980 10.4454 -5.485 5.67e-08
## age 0.7673 0.1343 5.714 1.60e-08
## hrs_work 1.3124 0.1640 8.003 4.71e-15
## relevel(factor(race), ref = "other")asian 46.4238 11.4483 4.055 5.54e-05
## relevel(factor(race), ref = "other")black -2.3020 9.6375 -0.239 0.811
## relevel(factor(race), ref = "other")white 11.3375 7.6504 1.482 0.139
## gendermale 17.0306 4.0560 4.199 3.01e-05
##
## (Intercept) ***
## age ***
## hrs_work ***
## relevel(factor(race), ref = "other")asian ***
## relevel(factor(race), ref = "other")black
## relevel(factor(race), ref = "other")white
## gendermale ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.64 on 738 degrees of freedom
## Multiple R-squared: 0.2005, Adjusted R-squared: 0.194
## F-statistic: 30.85 on 6 and 738 DF, p-value: < 2.2e-16
The impact of age on income

library(broom)
model1=lm(income~age+hrs_work+relevel(factor(race),ref='other')+gender,data=income)
tidy(model1)

## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 "(Intercept)" -57.3 10.4 -5.49 5.67e- 8
## 2 "age" 0.767 0.134 5.71 1.60e- 8
## 3 "hrs_work" 1.31 0.164 8.00 4.71e-15
## 4 "relevel(factor(race), ref = \"other\")~ 46.4 11.4 4.06 5.54e- 5
## 5 "relevel(factor(race), ref = \"other\")~ -2.30 9.64 -0.239 8.11e- 1
## 6 "relevel(factor(race), ref = \"other\")~ 11.3 7.65 1.48 1.39e- 1
## 7 "gendermale" 17.0 4.06 4.20 3.01e- 5
Inference for linear combinations

Sometimes we are interested in making claims about $c^T\beta$ for some question-specific $c$.
Examples:
▶ Do age and working hours have the same effect (a weird question, but we could test it)?
▶ $H_0: \beta_1 = \beta_2$, or $H_0: \beta_1 - \beta_2 = 0$.

▶ Do white and black have the same mean income?
▶ For black the mean is $\beta_0 + \beta_4$ and for white it is $\beta_0 + \beta_5$.
▶ $H_0: \beta_0 + \beta_4 = \beta_0 + \beta_5$, or $H_0: \beta_4 - \beta_5 = 0$.
Inference for linear combinations of coefficients

▶ Define $H_0: c^T\beta = c^T\beta^0$; e.g., $H_0: c^T\beta = 0$.
▶ Let's find out the $c$:
$\beta = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5, \beta_6)^T$
▶ Do age and working hours have the same effect (a weird question, but we could test it)? For $H_0: \beta_1 - \beta_2 = 0$, $c = (0, 1, -1, 0, 0, 0, 0)^T$.
▶ Do white and black have the same mean income? For $H_0: \beta_4 - \beta_5 = 0$, $c = (0, 0, 0, 0, 1, -1, 0)^T$.
Inference for linear combinations of coefficients

▶ We can use the test statistic

$t = \frac{c^T\hat\beta - c^T\beta^0}{\hat{se}(c^T\hat\beta)} = \frac{c^T\hat\beta - c^T\beta^0}{\sqrt{\hat\sigma^2\, c^T(X^TX)^{-1}c}} \sim T_{n-p-1}$

▶ For a two-sided test of size $\alpha$, we reject if $|t| > T_{1-\alpha/2,\, n-p-1}$

Linear combinations of normally distributed random variables are


still normally distributed.
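A minimal sketch of this test for $H_0: \beta_4 - \beta_5 = 0$ (black vs. white) in the income model, again assuming the hand-built X, y, and betahat from earlier:

cvec <- c(0, 0, 0, 0, 1, -1, 0)  # contrasts beta4 (black) and beta5 (white)
n <- nrow(X); p <- ncol(X) - 1
sigma2hat <- sum((y - X %*% betahat)^2) / (n - p - 1)
est <- sum(cvec * betahat)  # c^T betahat
se  <- sqrt(sigma2hat * t(cvec) %*% solve(t(X) %*% X) %*% cvec)
tstat <- est / se
2 * pt(abs(tstat), df = n - p - 1, lower.tail = FALSE)  # two-sided p-value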
Inference about multiple coefficients

Our model contains multiple parameters; therefore we may need to perform multiple tests in order to draw conclusions about multiple predictors:

$H_{01}: \beta_1 = 0$
$H_{02}: \beta_2 = 0$
$\quad\vdots$
$H_{0k}: \beta_k = 0$

▶ Each individual test has type I error rate $\alpha$: $P(\text{reject } H_{0i} \mid H_{0i} \text{ true}) = \alpha$
▶ Across many tests, the overall type I error is inflated
Family-wise error rate (FWER)

▶ Family-wise error rate is the probability of making at least one


type I error
▶ To calculate the FWER
▶ First note that, if the tests are independent, $P(\text{no rejections} \mid \text{all } H_{0i} \text{ true}) = (1-\alpha)^k$
▶ It follows that $P(\text{at least one rejection} \mid \text{all } H_{0i} \text{ true}) = 1 - (1-\alpha)^k$
▶ For example, with $\alpha = 0.05$ and $k = 10$:
FWER=1-(1-0.05)^10
FWER

## [1] 0.4012631
Linear approximation to the family-wise error rate

[Figure: FWER as a function of the number of tests k, for α = 0.05; the blue line is the linear approximation kα and the black curve is 1 − (1 − α)^k.]

The blue line is $k\alpha$ and the black curve is $1 - (1-\alpha)^k$. The former lies on top of the latter, since $1 - (1-\alpha)^k \le k\alpha$.
Addressing multiple comparisons

▶ Correct for multiple comparisons
▶ Often we use the Bonferroni correction, testing each hypothesis at level $\alpha_i = \alpha/k$
▶ Thanks to the Bonferroni inequality, this gives an overall FWER $\le \alpha$
▶ Control the false discovery rate instead (see the R sketch below)
▶ Use a global test (F test)
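In R, both corrections can be applied with the base function p.adjust; a small sketch with made-up p-values:

pvals <- c(0.001, 0.012, 0.030, 0.040, 0.200)  # illustrative raw p-values from k = 5 tests
p.adjust(pvals, method = "bonferroni")  # controls the FWER: multiplies by k, capped at 1
p.adjust(pvals, method = "BH")  # Benjamini-Hochberg: controls the false discovery rate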
Global tests

▶ Extend ANOVA to MLR

Source   Sum of squares       df          MS
Model    SSR = Σ(ŷi − ȳ)²     p           MSR = SSR/p
Error    SSE = Σ(yi − ŷi)²    n − p − 1   MSE = SSE/(n − p − 1)
Total    SSTO = Σ(yi − ȳ)²    n − 1
Global tests

▶ F test for a regression relationship

$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$
$H_1:$ not all $\beta_k$ equal zero

▶ We use the test statistic

$f = \frac{MSR}{MSE}$

▶ $f \sim F(p, n-p-1)$ under $H_0$
▶ The p-value is $P(F(p, n-p-1) > f_{obs})$; reject $H_0$ when it falls below $\alpha$. A hand computation is sketched below.
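A sketch of the global F computation by hand, assuming model1 (the full income model fit earlier):

SSE <- sum(resid(model1)^2)  # error sum of squares
SSR <- sum((fitted(model1) - mean(income$income))^2)  # model sum of squares
p <- length(coef(model1)) - 1; n <- nrow(income)
f <- (SSR / p) / (SSE / (n - p - 1))  # MSR / MSE, about 30.85
pf(f, p, n - p - 1, lower.tail = FALSE)  # global p-value, < 2.2e-16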
Example revisited

▶ What can we conclude globally?


model1=lm(income~age+hrs_work+relevel(factor(race),ref='other')+gender,data=income)
anova(model1)

## Analysis of Variance Table


##
## Response: income
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 123920 123920 44.7139 4.502e-11 ***
## hrs_work 1 266469 266469 96.1498 < 2.2e-16 ***
## relevel(factor(race), ref = "other") 3 73761 24587 8.8718 8.811e-06 ***
## gender 1 48862 48862 17.6309 3.009e-05 ***
## Residuals 738 2045287 2771
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Extra Sum of Squares

▶ How much do one or multiple predictors contribute in addition


to the others in a model?
▶ Start a model with X1 : SSR(X1 )

▶ Adding X2 into the model


▶ The extra increase in SSR is:
SSR(X2 | X1 ) = SSR(X1 , X2 ) − SSR(X1 )
▶ Equivalently: SSR(X2 | X1 ) = SSE (X1 ) − SSE (X1 , X2 )

▶ We can keep going by adding $X_3, \ldots$


Extra Sum of Squares

$SSR(X_3 \mid X_1, X_2) = SSR(X_1, X_2, X_3) - SSR(X_1, X_2)$
$\qquad = SSR(X_1, X_2, X_3) - [SSR(X_1) + SSR(X_2 \mid X_1)]$

$SSR(X_1, X_2, X_3) = SSR(X_1) + SSR(X_2 \mid X_1) + SSR(X_3 \mid X_1, X_2)$


Decompose SSR into Extra Sum of Squares (ANOVA)

Source           Sum of squares        df          MS
Model            SSR(X1, ..., Xp)      p           MSR(X1, ..., Xp)
  X1             SSR(X1)               1           MSR(X1)
  X2 | X1        SSR(X2 | X1)          1           MSR(X2 | X1)
  X3 | X1, X2    SSR(X3 | X1, X2)      1           MSR(X3 | X1, X2)
  ...            ...                   ...         ...
Error            SSE(X1, ..., Xp)      n − p − 1   MSE(X1, ..., Xp)
Total            SSTO                  n − 1
Test whether several βk = 0

▶ What if we want to test whether race is a significant predictor of income, given the other variables?
▶ What is the null hypothesis?

H0 : β3 = β4 = β5 = 0.

This is because there are 3 dummies for the race variable.


Test whether several βk = 0

▶ To test the null hypothesis H0 : β3 = β4 = β5 = 0, we basically


compare a smaller “null” model to a larger “alternative” model
▶ What is the “null” model, and what is the “alternative” model
in this case?
$H_0: y = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{hours} + \beta_6\,\text{gender} + \epsilon$
$H_1: y = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{hours} + \beta_3\,\text{asian} + \beta_4\,\text{black} + \beta_5\,\text{white} + \beta_6\,\text{gender} + \epsilon$
We can use F test in such cases

▶ We will use something like

$f = \frac{MSR(X_3, X_4, X_5 \mid X_1, X_2, X_6)}{MSE(X_1, \ldots, X_6)} = \frac{[SSR(X_1, \ldots, X_6) - SSR(X_1, X_2, X_6)]/(6-3)}{MSE(X_1, \ldots, X_6)}$

▶ In order to use F test, we need to satisfy the smaller model


must be nested in the larger model
▶ That is, the smaller model must be a special case of the larger
model
Test whether several βk = 0

▶ In general,

$f = \frac{(SSE_S - SSE_L)/(df_S - df_L)}{SSE_L/df_L}$

▶ If $H_0$ is true, then $f \sim F_{df_S - df_L,\, df_L}$
▶ Note $df_S = n - p_S - 1$ and $df_L = n - p_L - 1$
▶ We reject the null hypothesis if the p-value is below $\alpha$, where

$\text{p-value} = P(F_{df_S - df_L,\, df_L} > f_{obs})$

A generic computation is sketched below.
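A generic helper, sketched under the assumption that you already have the two error sums of squares and their degrees of freedom (the function name is my own):

# F test comparing a smaller nested model (S) to a larger model (L)
nested_f_test <- function(sse_s, df_s, sse_l, df_l) {
  f <- ((sse_s - sse_l) / (df_s - df_l)) / (sse_l / df_l)
  c(f = f, p_value = pf(f, df_s - df_l, df_l, lower.tail = FALSE))
}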


Nested models

▶ These models are nested:

Smaller = Regression of Y on X1
Larger = Regression of Y on X1 , X2 , X3 , X4

▶ These models are not nested:

Smaller = Regression of Y on X2
Larger = Regression of Y on X1 , X3
Example
model_0=lm(income~1,data=income)
model_1=lm(income~age,data=income)
model_2=lm(income~age+hrs_work,data=income)
model_3=lm(income~age+hrs_work+relevel(factor(race),ref='other'),data=income)
model_full=lm(income~age+hrs_work+relevel(factor(race),ref='other')+gender,data=income)
summary.aov(model_0); summary.aov(model_1);summary.aov(model_2);summary.aov(model_3)

## Df Sum Sq Mean Sq F value Pr(>F)


## Residuals 744 2558299 3439

## Df Sum Sq Mean Sq F value Pr(>F)


## age 1 123920 123920 37.82 1.27e-09 ***
## Residuals 743 2434379 3276
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Df Sum Sq Mean Sq F value Pr(>F)


## age 1 123920 123920 42.41 1.36e-10 ***
## hrs_work 1 266469 266469 91.20 < 2e-16 ***
## Residuals 742 2167911 2922
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Df Sum Sq Mean Sq F value Pr(>F)


## age 1 123920 123920 43.730 7.23e-11 ***
## hrs_work 1 266469 266469 94.034 < 2e-16 ***
## relevel(factor(race), ref = "other") 3 73761 24587 8.677 1.16e-05 ***
## Residuals 739 2094149 2834
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Example
summary.aov(model_full)

## Df Sum Sq Mean Sq F value Pr(>F)


## age 1 123920 123920 44.714 4.50e-11 ***
## hrs_work 1 266469 266469 96.150 < 2e-16 ***
## relevel(factor(race), ref = "other") 3 73761 24587 8.872 8.81e-06 ***
## gender 1 48862 48862 17.631 3.01e-05 ***
## Residuals 738 2045287 2771
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_1=lm(income~hrs_work,data=income)
model_2=lm(income~age+hrs_work,data=income)
summary.aov(model_1)

## Df Sum Sq Mean Sq F value Pr(>F)


## hrs_work 1 303875 303875 100.1 <2e-16 ***
## Residuals 743 2254424 3034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(model_2)

## Df Sum Sq Mean Sq F value Pr(>F)


## age 1 123920 123920 42.41 1.36e-10 ***
## hrs_work 1 266469 266469 91.20 < 2e-16 ***
## Residuals 742 2167911 2922
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Example

model_full_new=lm(income~age+relevel(factor(race),ref='other')+gender+hrs_work,data=income)
summary.aov(model_full_new)

## Df Sum Sq Mean Sq F value Pr(>F)


## age 1 123920 123920 44.71 4.50e-11 ***
## relevel(factor(race), ref = "other") 3 91514 30505 11.01 4.47e-07 ***
## gender 1 120082 120082 43.33 8.78e-11 ***
## hrs_work 1 177496 177496 64.05 4.71e-15 ***
## Residuals 738 2045287 2771
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Example
▶ Now suppose we want to test whether race and gender are jointly significant predictors within the full model
model_al=lm(income~age+hrs_work,data=income)
anova(model_al,model_full_new)

## Analysis of Variance Table


##
## Model 1: income ~ age + hrs_work
## Model 2: income ~ age + relevel(factor(race), ref = "other") + gender +
## hrs_work
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 742 2167911
## 2 738 2045287 4 122624 11.062 1.021e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_al,model_full)

## Analysis of Variance Table


##
## Model 1: income ~ age + hrs_work
## Model 2: income ~ age + hrs_work + relevel(factor(race), ref = "other") +
## gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 742 2167911
## 2 738 2045287 4 122624 11.062 1.021e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
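As a check, the printed sums of squares reproduce the F statistic directly (a quick sketch):

f <- ((2167911 - 2045287) / (742 - 738)) / (2045287 / 738)
f  # 11.062
pf(f, 4, 738, lower.tail = FALSE)  # 1.021e-08, matching the anova() output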
Global F tests

▶ There are a couple of important special cases of the F test
▶ The null model contains the intercept only
▶ The null model and the alternative model differ by only one term
▶ This gives a way of testing a single coefficient
▶ It turns out to be equivalent to a two-sided t-test: $T_{df_L}^2 = F_{1,\, df_L}$
Example

▶ Let's consider the question of testing the effect of age on income, holding the others fixed.

Null/Reduced = Regression of Y on X2 , X3 , X4 , X5 , X6
Alternative/Full = Regression of Y on X1 , X2 , X3 , X4 , X5 , X6
Example

Let’s state the hypothesis

H0 : the null model fits as well as the alternative model


H1 : the alternative model fits significantly better

$H_0$: age does not contribute significantly to predicting income, given that working hours, race, and gender are already considered
$H_1$: age contributes significantly, given that working hours, race, and gender are already considered

$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$
Example
##
## Call:
## lm(formula = income ~ age + hrs_work + relevel(factor(race),
## ref = "other") + gender, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.29 -23.47 -9.00 8.14 356.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.2980 10.4454 -5.485 5.67e-08
## age 0.7673 0.1343 5.714 1.60e-08
## hrs_work 1.3124 0.1640 8.003 4.71e-15
## relevel(factor(race), ref = "other")asian 46.4238 11.4483 4.055 5.54e-05
## relevel(factor(race), ref = "other")black -2.3020 9.6375 -0.239 0.811
## relevel(factor(race), ref = "other")white 11.3375 7.6504 1.482 0.139
## gendermale 17.0306 4.0560 4.199 3.01e-05
##
## (Intercept) ***
## age ***
## hrs_work ***
## relevel(factor(race), ref = "other")asian ***
## relevel(factor(race), ref = "other")black
## relevel(factor(race), ref = "other")white
## gendermale ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.64 on 738 degrees of freedom
## Multiple R-squared: 0.2005, Adjusted R-squared: 0.194
## F-statistic: 30.85 on 6 and 738 DF, p-value: < 2.2e-16
Example

model_full_d=lm(income~hrs_work+relevel(factor(race),ref='other')+gender,data=income)
summary.aov(model_full_d)

## Df Sum Sq Mean Sq F value Pr(>F)


## hrs_work 1 303875 303875 105.144 < 2e-16 ***
## relevel(factor(race), ref = "other") 3 71088 23696 8.199 2.25e-05 ***
## gender 1 47555 47555 16.455 5.51e-05 ***
## Residuals 739 2135781 2890
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_full_d,model_full)

## Analysis of Variance Table


##
## Model 1: income ~ hrs_work + relevel(factor(race), ref = "other") + gender
## Model 2: income ~ age + hrs_work + relevel(factor(race), ref = "other") +
## gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 739 2135781
## 2 738 2045287 1 90494 32.653 1.599e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
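The $T_{df_L}^2 = F_{1,\, df_L}$ equivalence noted earlier can be verified from the printed statistics (a quick check):

5.714^2  # = 32.65, matching the F statistic 32.653 for dropping age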
