
Advanced Regression Models

Lecture 5: Collinearity and Inference for Multivariable Linear Regression

Heping Zhang

October 1, 2024
Acknowledgement and Appreciation

The instructor uses lecture notes originally prepared by Professor Yize Zhao, and thanks Professor Zhao for her generosity.
A few notes on the midterm

▶ The midterm will be on Oct 15
▶ Most students will take the exam in the classroom, starting at 10am
▶ A few students with pre-arranged conflicts will take the exam in the conference room of Suite 523, 300 George Street, starting at 9am
▶ Two pages of letter-size cheat sheet are allowed (two-sided is fine)
▶ Coverage: contents from the first six lectures
▶ Calculators and/or laptops are allowed
Learning Objectives

▶ Understand the consequences of collinearity


▶ Know how to detect and address collinearity
▶ Understand multiple hypothesis testing
▶ Infer the importance of one or more predictors in the presence
of the others (extra sum of squares)
▶ Infer the importance of linear combinations of regression
coefficients
▶ Infer the simultaneous importance of regression coefficients
(global tests)
Recap

Model:
y = Xβ + ϵ

▶ Desired properties of LSE


▶ BLUE
▶ MLE under Gaussian error

▶ Non-identifiability and collinearity: $(X^TX)^{-1}$ may not exist or may be numerically unstable
▶ Consequences: LSE, inference, prediction
How to detect collinearity?
▶ Examine the pairwise correlations among all predictors:
- High correlation in any pair is the most obvious case and is easy to detect from the correlation matrix.
▶ Examine the eigenvalues of $X^TX$:
- Eigendecomposition: $A = U\Sigma U^T$, where $U$ is the matrix whose columns are the eigenvectors of $A$, and $\Sigma$ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues of $A$.
- Look for relatively small eigenvalues of $X^TX$.
▶ Regress each predictor on the others:
▶ Let $R_j^2$ be the R-squared from regressing $X_j$ on all the other predictors.
▶ Obtain $R_1^2, \ldots, R_p^2$.
▶ Inspect whether any $R_j^2$ is close to 1 (e.g., $> 0.8$).
▶ Calculate the variance inflation factor: $\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$, $j = 1, \ldots, p$.
▶ If $R_j^2$ is close to 1, $\mathrm{VIF}_j$ will be large (e.g., $> 5$); see the sketch below.
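As a concrete version of the last bullets, here is a minimal R sketch (my own illustration, using the seatpos data introduced on the next slide) that computes each $R_j^2$ and VIF by hand:

library(faraway)
X <- seatpos[, -9]  # predictors only; drop the response hipcenter
vifs <- sapply(seq_along(X), function(j) {
  # regress predictor j on all the other predictors and record R_j^2
  r2 <- summary(lm(X[, j] ~ ., data = X[, -j]))$r.squared
  1 / (1 - r2)  # VIF_j = 1 / (1 - R_j^2)
})
names(vifs) <- names(X)
vifs  # compare with faraway::vif(X) used later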
Let’s see an example

▶ A dataset on 38 drivers, used to study how drivers position the seat depending on their body size and age.
library(faraway)
head(seatpos)

## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter


## 1 46 180 187.2 184.9 95.2 36.1 45.3 41.3 -206.300
## 2 31 175 167.5 165.5 83.8 32.9 36.5 35.9 -178.210
## 3 23 100 153.6 152.2 82.9 26.0 36.6 31.0 -71.673
## 4 19 185 190.3 187.4 97.3 37.4 44.1 41.0 -257.720
## 5 23 159 178.0 174.1 93.9 29.5 40.1 36.9 -173.230
## 6 47 170 178.7 177.0 92.4 36.0 43.2 37.4 -185.150

Both height with shoes (HtShoes) and without shoes (Ht) are included in the model, as is seated height (Seated). These are all troublemakers!
Fit the model (sign of collinearity)
g=lm(hipcenter~., seatpos)
summary(g)

##
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.827 -22.833 -3.678 25.017 62.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.43213 166.57162 2.620 0.0138 *
## Age 0.77572 0.57033 1.360 0.1843
## Weight 0.02631 0.33097 0.080 0.9372
## HtShoes -2.69241 9.75304 -0.276 0.7845
## Ht 0.60134 10.12987 0.059 0.9531
## Seated 0.53375 3.76189 0.142 0.8882
## Arm -1.32807 3.90020 -0.341 0.7359
## Thigh -1.14312 2.66002 -0.430 0.6706
## Leg -6.43905 4.71386 -1.366 0.1824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001
## F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05
Examine the correlation matrix

round(cor(seatpos),3)

## Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter


## Age 1.000 0.081 -0.079 -0.090 -0.170 0.360 0.091 -0.042 0.205
## Weight 0.081 1.000 0.828 0.829 0.776 0.698 0.573 0.784 -0.640
## HtShoes -0.079 0.828 1.000 0.998 0.930 0.752 0.725 0.908 -0.797
## Ht -0.090 0.829 0.998 1.000 0.928 0.752 0.735 0.910 -0.799
## Seated -0.170 0.776 0.930 0.928 1.000 0.625 0.607 0.812 -0.731
## Arm 0.360 0.698 0.752 0.752 0.625 1.000 0.671 0.754 -0.585
## Thigh 0.091 0.573 0.725 0.735 0.607 0.671 1.000 0.650 -0.591
## Leg -0.042 0.784 0.908 0.910 0.812 0.754 0.650 1.000 -0.787
## hipcenter 0.205 -0.640 -0.797 -0.799 -0.731 -0.585 -0.591 -0.787 1.000

Except for age, the variables are all highly correlated with one another.


Examine the eigenvalues of X T X

X=as.matrix(seatpos)[,-9] #exclude the last column


e=eigen(t(X)%*%X)
e$val

## [1] 3.653671e+06 2.147948e+04 9.043225e+03 2.989526e+02 1.483948e+02


## [6] 8.117397e+01 5.336194e+01 7.298209e+00

There is a large range of the eigenvalues, and the smallest one is


particularly small relative to the largest one.
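One common way to summarize this spread (an addition, not shown in the slide) is the condition number $\sqrt{\lambda_{\max}/\lambda_{\min}}$; values above roughly 30 are often read as a warning sign:

# condition number of X: square root of the largest-to-smallest eigenvalue ratio
sqrt(max(e$val) / min(e$val))  # about 707 here, far beyond the usual ~30 warning level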
Let’s check the VIFs

vif(X)

## Age Weight HtShoes Ht Seated Arm Thigh


## 1.997931 3.647030 307.429378 333.137832 8.951054 4.496368 2.762886
## Leg
## 6.694291

Two predictors (HtShoes and Ht) have very large VIFs, and three variables (Seated, Leg, Arm) have relatively large VIFs.
Some take away messages for collinearity

▶ Collinearity can (and does) happen, so be careful


▶ Often contributes to the problem of variable selection, which
we’ll touch on later
▶ Generally does not impact the estimate of the overall variability explained by the model ($R^2$), so you don't need to worry too much if you only care about prediction
Hypothesis testing in SLR

▶ In SLR we have learned hypothesis testing and the construction of CIs for $\beta_1$ (and $\beta_0$)

▶ Under $H_0: \beta_1 = \beta_1^{(0)}$, our test statistic is $\frac{\hat\beta_1 - \beta_1^{(0)}}{se(\hat\beta_1)} \sim T_{n-2}$

▶ Under $H_0: \beta_1 = 0$, there is no linear association between $Y$ and $X$. Then $\frac{\hat\beta_1}{se(\hat\beta_1)} \sim T_{n-2}$ and $f = \frac{MSR}{MSE} \sim F_{1,n-2}$ yield equivalent t- and F-tests, respectively.
Rejection and non-rejection regions

[Figure: rejection and non-rejection regions of the test.]
MLR

MLR involves multiple predictors. Can we infer the importance of


one or more predictors in the presence of the others?
Revisit the income dataset
▶ Regress income against working hours, age, gender and race.
## income employment hrs_work race
## Min. : 50 Length:745 Min. : 1.00 Length:745
## 1st Qu.: 16000 Class :character 1st Qu.:36.00 Class :character
## Median : 34000 Mode :character Median :40.00 Mode :character
## Mean : 47492 Mean :39.23
## 3rd Qu.: 58000 3rd Qu.:42.00
## Max. :450000 Max. :99.00
## age gender citizen time_to_work
## Min. :16.00 Length:745 Length:745 Min. : 1.00
## 1st Qu.:30.00 Class :character Class :character 1st Qu.: 10.00
## Median :43.00 Mode :character Mode :character Median : 20.00
## Mean :42.76 Mean : 26.22
## 3rd Qu.:54.00 3rd Qu.: 30.00
## Max. :94.00 Max. :163.00
## lang married edu disability
## Length:745 Length:745 Length:745 Length:745
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## birth_qrtr
## Length:745
## Class :character
## Mode :character
##
##
##
Self-compute LSE

income$income=income$income/1000
X=cbind(1,income$age,income$hrs_work,income$race=="asian",income$race=="black",income$race=="white",income$gender=="male")
y=income$income
betahat=solve(t(X)%*%X)%*%t(X)%*%y
betahat

## [,1]
## [1,] -57.2980196
## [2,] 0.7672869
## [3,] 1.3124369
## [4,] 46.4238090
## [5,] -2.3020209
## [6,] 11.3375245
## [7,] 17.0306377
Questions of interest

▶ Can we say anything about whether the effect of age is


“significant” after adjusting for other variables?
▶ Can we compare this model to a model with only race and
gender?
▶ ...
Sampling distribution

If our usual assumptions are satisfied and $\epsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2)$, then

$\hat\beta \sim N(\beta, \sigma^2(X^TX)^{-1})$

$\hat\beta_j \sim N(\beta_j, \sigma^2(X^TX)^{-1}_{jj})$

where $(X^TX)^{-1}_{jj}$ is the $j$th diagonal element of $(X^TX)^{-1}$.

▶ This will be used for inference on any individual $\beta_j$, such as for the null hypothesis $H_0: \beta_j = 0$.
Testing procedure

As before, we calculate the probability of the observed data (or


more extreme data) under a null hypothesis.
▶ For example, $H_0: \beta_1 = 0$ and $H_a: \beta_1 \neq 0$
▶ Set α = P (falsely rejecting a true null hypothesis) (type I error
rate, e.g., 0.05)
▶ Calculate the value of the test statistic from the data
▶ Under the null distribution, compute the p-value

$P(\text{data as or more extreme than the observed test statistic} \mid H_0)$

▶ Reject or fail to reject H0


Individual coefficients

For an individual coefficient: $H_0: \beta_j = \beta_{j0}$ (usually 0)

▶ We can use the test statistic

$t = \frac{\hat\beta_j - \beta_{j0}}{se(\hat\beta_j)} = \frac{\hat\beta_j - \beta_{j0}}{\sqrt{\hat\sigma^2 (X^TX)^{-1}_{jj}}} \sim T_{n-p-1}$

▶ For a two-sided test of size $\alpha$, the rejection region is $|t| > T_{1-\alpha/2,\, n-p-1}$

▶ The p-value is $2P(T_{n-p-1} > |t_{obs}| \mid H_0)$


Even though we test one coefficient at a time, we prefer to keep all predictors in the model when estimating the variance of the errors. Hence, the degrees of freedom are $n - p - 1$. A hand computation is sketched below.
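A minimal sketch of this test on the income example, assuming the X, y, and betahat computed by hand on the earlier slide:

n <- nrow(X); p <- ncol(X) - 1  # X includes the intercept column, so p = 6
sigma2hat <- sum((y - X %*% betahat)^2) / (n - p - 1)  # estimated error variance
se <- sqrt(sigma2hat * diag(solve(t(X) %*% X)))  # se(betahat_j)
tstat <- betahat / se  # tests H0: beta_j = 0
pval <- 2 * pt(abs(tstat), df = n - p - 1, lower.tail = FALSE)
cbind(betahat, se, tstat, pval)  # matches the summary() output on the next slides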
Caution on the notation

▶ $t$ may denote a random variable following the t-distribution with $n-p-1$ degrees of freedom.
▶ $T_{n-p-1}$ may also denote such a t-distributed random variable.
▶ $T_{1-\alpha/2,\, n-p-1}$ is the $1-\alpha/2$ quantile of the t-distribution $T_{n-p-1}$.
Revisit the income example
##
## Call:
## lm(formula = income ~ age + hrs_work + relevel(factor(race),
## ref = "other") + gender, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.29 -23.47 -9.00 8.14 356.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.2980 10.4454 -5.485 5.67e-08
## age 0.7673 0.1343 5.714 1.60e-08
## hrs_work 1.3124 0.1640 8.003 4.71e-15
## relevel(factor(race), ref = "other")asian 46.4238 11.4483 4.055 5.54e-05
## relevel(factor(race), ref = "other")black -2.3020 9.6375 -0.239 0.811
## relevel(factor(race), ref = "other")white 11.3375 7.6504 1.482 0.139
## gendermale 17.0306 4.0560 4.199 3.01e-05
##
## (Intercept) ***
## age ***
## hrs_work ***
## relevel(factor(race), ref = "other")asian ***
## relevel(factor(race), ref = "other")black
## relevel(factor(race), ref = "other")white
## gendermale ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.64 on 738 degrees of freedom
## Multiple R-squared: 0.2005, Adjusted R-squared: 0.194
## F-statistic: 30.85 on 6 and 738 DF, p-value: < 2.2e-16
The impact of age on income

library(broom)
model1=lm(income~age+hrs_work+relevel(factor(race),ref='other')+gender,data=income)
tidy(model1)

## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 "(Intercept)" -57.3 10.4 -5.49 5.67e- 8
## 2 "age" 0.767 0.134 5.71 1.60e- 8
## 3 "hrs_work" 1.31 0.164 8.00 4.71e-15
## 4 "relevel(factor(race), ref = \"other\")~ 46.4 11.4 4.06 5.54e- 5
## 5 "relevel(factor(race), ref = \"other\")~ -2.30 9.64 -0.239 8.11e- 1
## 6 "relevel(factor(race), ref = \"other\")~ 11.3 7.65 1.48 1.39e- 1
## 7 "gendermale" 17.0 4.06 4.20 3.01e- 5
Inference for linear combinations

Sometimes we are interested in making claims about $c^T\beta$ for some question-specific $c$.
Examples:
▶ Do age and working hours have the same effect (a weird question, but we could test it)?
▶ $H_0: \beta_1 = \beta_2$, or $H_0: \beta_1 - \beta_2 = 0$.

▶ Do white and black have the same mean income?
▶ For black the mean is $\beta_0 + \beta_4$ and for white it is $\beta_0 + \beta_5$.
▶ $H_0: \beta_0 + \beta_4 = \beta_0 + \beta_5$, or $H_0: \beta_4 - \beta_5 = 0$.
Inference for linear combinations of coefficients

▶ Define $H_0: c^T\beta = c^T\beta^0$; e.g., $H_0: c^T\beta = 0$.
▶ Let's find out the $c$:
$\beta = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5, \beta_6)^T$
▶ Do age and working hours have the same effect (a weird question, but we could test it)? For $H_0: \beta_1 - \beta_2 = 0$, $c = (0, 1, -1, 0, 0, 0, 0)^T$.
▶ Do white and black have the same mean income? For $H_0: \beta_4 - \beta_5 = 0$, $c = (0, 0, 0, 0, 1, -1, 0)^T$.
Inference for linear combinations of coefficients

▶ We can use the test statistic

$t = \frac{c^T\hat\beta - c^T\beta^0}{\hat{se}(c^T\hat\beta)} = \frac{c^T\hat\beta - c^T\beta^0}{\sqrt{\hat\sigma^2\, c^T(X^TX)^{-1}c}} \sim T_{n-p-1}$

▶ For a two-sided test of size $\alpha$, we reject if $|t| > T_{1-\alpha/2,\, n-p-1}$

Linear combinations of normally distributed random variables are


still normally distributed.
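A minimal sketch of this test for $H_0: \beta_4 - \beta_5 = 0$ (black vs. white) in the income model, again assuming the hand-built X, y, and betahat from earlier:

cvec <- c(0, 0, 0, 0, 1, -1, 0)  # contrasts beta4 (black) and beta5 (white)
n <- nrow(X); p <- ncol(X) - 1
sigma2hat <- sum((y - X %*% betahat)^2) / (n - p - 1)
est <- sum(cvec * betahat)  # c^T betahat
se  <- sqrt(sigma2hat * t(cvec) %*% solve(t(X) %*% X) %*% cvec)
tstat <- est / se
2 * pt(abs(tstat), df = n - p - 1, lower.tail = FALSE)  # two-sided p-value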
Inference about multiple coefficients

Our model contains multiple parameters; therefore we may need to perform multiple tests in order to draw conclusions about multiple predictors:

$H_{01}: \beta_1 = 0$
$H_{02}: \beta_2 = 0$
$\quad\vdots$
$H_{0k}: \beta_k = 0$

▶ Each individual test has type I error rate $\alpha$: $P(\text{reject } H_{0i} \mid H_{0i} \text{ true}) = \alpha$
▶ Across many tests, the overall type I error is inflated
Family-wise error rate (FWER)

▶ Family-wise error rate is the probability of making at least one


type I error
▶ To calculate the FWER
▶ First note that, if the tests are independent, $P(\text{no rejections} \mid \text{all } H_{0i} \text{ true}) = (1-\alpha)^k$
▶ It follows that $P(\text{at least one rejection} \mid \text{all } H_{0i} \text{ true}) = 1 - (1-\alpha)^k$
▶ For example, with $\alpha = 0.05$ and $k = 10$:
FWER=1-(1-0.05)^10
FWER

## [1] 0.4012631
Linear approximation to the family-wise error rate

[Figure: FWER as a function of the number of tests k, for α = 0.05; the blue line is the linear approximation kα and the black curve is 1 − (1 − α)^k.]

The blue line is $k\alpha$ and the black curve is $1 - (1-\alpha)^k$. The former lies on top of the latter, since $1 - (1-\alpha)^k \le k\alpha$.
Addressing multiple comparisons

▶ Correct for multiple comparisons
▶ Often we use the Bonferroni correction, testing each hypothesis at level $\alpha_i = \alpha/k$
▶ Thanks to the Bonferroni inequality, this gives an overall FWER $\le \alpha$
▶ Control the false discovery rate instead (see the R sketch below)
▶ Use a global test (F test)
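In R, both corrections can be applied with the base function p.adjust; a small sketch with made-up p-values:

pvals <- c(0.001, 0.012, 0.030, 0.040, 0.200)  # illustrative raw p-values from k = 5 tests
p.adjust(pvals, method = "bonferroni")  # controls the FWER: multiplies by k, capped at 1
p.adjust(pvals, method = "BH")  # Benjamini-Hochberg: controls the false discovery rate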
Global tests

▶ Extend ANOVA to MLR

Source   Sum of squares       df          MS
Model    SSR = Σ(ŷi − ȳ)²     p           MSR = SSR/p
Error    SSE = Σ(yi − ŷi)²    n − p − 1   MSE = SSE/(n − p − 1)
Total    SSTO = Σ(yi − ȳ)²    n − 1
Global tests

▶ F test for a regression relationship

$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$
$H_1:$ not all $\beta_k$ equal zero

▶ We use the test statistic

$f = \frac{MSR}{MSE}$

▶ $f \sim F(p, n-p-1)$ under $H_0$
▶ The p-value is $P(F(p, n-p-1) > f_{obs})$; reject $H_0$ when it falls below $\alpha$. A hand computation is sketched below.
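A sketch of the global F computation by hand, assuming model1 (the full income model fit earlier):

SSE <- sum(resid(model1)^2)  # error sum of squares
SSR <- sum((fitted(model1) - mean(income$income))^2)  # model sum of squares
p <- length(coef(model1)) - 1; n <- nrow(income)
f <- (SSR / p) / (SSE / (n - p - 1))  # MSR / MSE, about 30.85
pf(f, p, n - p - 1, lower.tail = FALSE)  # global p-value, < 2.2e-16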
Example revisited

▶ What can we conclude globally?


model1=lm(income~age+hrs_work+relevel(factor(race),ref='other')+gender,data=income)
anova(model1)

## Analysis of Variance Table


##
## Response: income
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 123920 123920 44.7139 4.502e-11 ***
## hrs_work 1 266469 266469 96.1498 < 2.2e-16 ***
## relevel(factor(race), ref = "other") 3 73761 24587 8.8718 8.811e-06 ***
## gender 1 48862 48862 17.6309 3.009e-05 ***
## Residuals 738 2045287 2771
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Extra Sum of Squares

▶ How much do one or multiple predictors contribute in addition


to the others in a model?
▶ Start a model with X1 : SSR(X1 )

▶ Adding X2 into the model


▶ The extra increase in SSR is:
SSR(X2 | X1 ) = SSR(X1 , X2 ) − SSR(X1 )
▶ Equivalently: SSR(X2 | X1 ) = SSE (X1 ) − SSE (X1 , X2 )

▶ We can keep going by adding $X_3, \ldots$


Extra Sum of Squares

$SSR(X_3 \mid X_1, X_2) = SSR(X_1, X_2, X_3) - SSR(X_1, X_2)$
$\qquad = SSR(X_1, X_2, X_3) - [SSR(X_1) + SSR(X_2 \mid X_1)]$

$SSR(X_1, X_2, X_3) = SSR(X_1) + SSR(X_2 \mid X_1) + SSR(X_3 \mid X_1, X_2)$


Decompose SSR into Extra Sum of Squares (ANOVA)

Source           Sum of squares        df          MS
Model            SSR(X1, ..., Xp)      p           MSR(X1, ..., Xp)
  X1             SSR(X1)               1           MSR(X1)
  X2 | X1        SSR(X2 | X1)          1           MSR(X2 | X1)
  X3 | X1, X2    SSR(X3 | X1, X2)      1           MSR(X3 | X1, X2)
  ...            ...                   ...         ...
Error            SSE(X1, ..., Xp)      n − p − 1   MSE(X1, ..., Xp)
Total            SSTO                  n − 1
Test whether several βk = 0

▶ What if we want to test whether race is a significant predictor of income, given the other variables?
▶ What is the null hypothesis?

H0 : β3 = β4 = β5 = 0.

This is because there are 3 dummies for the race variable.


Test whether several βk = 0

▶ To test the null hypothesis H0 : β3 = β4 = β5 = 0, we basically


compare a smaller “null” model to a larger “alternative” model
▶ What is the “null” model, and what is the “alternative” model
in this case?
$H_0: y = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{hours} + \beta_6\,\text{gender} + \epsilon$
$H_1: y = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{hours} + \beta_3\,\text{asian} + \beta_4\,\text{black} + \beta_5\,\text{white} + \beta_6\,\text{gender} + \epsilon$
We can use F test in such cases

▶ We will use something like

$f = \frac{MSR(X_3, X_4, X_5 \mid X_1, X_2, X_6)}{MSE(X_1, \ldots, X_6)} = \frac{[SSR(X_1, \ldots, X_6) - SSR(X_1, X_2, X_6)]/(6-3)}{MSE(X_1, \ldots, X_6)}$

▶ In order to use F test, we need to satisfy the smaller model


must be nested in the larger model
▶ That is, the smaller model must be a special case of the larger
model
Test whether several βk = 0

▶ In general,

$f = \frac{(SSE_S - SSE_L)/(df_S - df_L)}{SSE_L/df_L}$

▶ If $H_0$ is true, then $f \sim F_{df_S - df_L,\, df_L}$
▶ Note $df_S = n - p_S - 1$ and $df_L = n - p_L - 1$
▶ We reject the null hypothesis if the p-value is below $\alpha$, where

$\text{p-value} = P(F_{df_S - df_L,\, df_L} > f_{obs})$

A generic computation is sketched below.
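A generic helper, sketched under the assumption that you already have the two error sums of squares and their degrees of freedom (the function name is my own):

# F test comparing a smaller nested model (S) to a larger model (L)
nested_f_test <- function(sse_s, df_s, sse_l, df_l) {
  f <- ((sse_s - sse_l) / (df_s - df_l)) / (sse_l / df_l)
  c(f = f, p_value = pf(f, df_s - df_l, df_l, lower.tail = FALSE))
}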


Nested models

▶ These models are nested:

Smaller = Regression of Y on X1
Larger = Regression of Y on X1 , X2 , X3 , X4

▶ These models are not nested:

Smaller = Regression of Y on X2
Larger = Regression of Y on X1 , X3
Example
model_0=lm(income~1,data=income)
model_1=lm(income~age,data=income)
model_2=lm(income~age+hrs_work,data=income)
model_3=lm(income~age+hrs_work+relevel(factor(race),ref='other'),data=income)
model_full=lm(income~age+hrs_work+relevel(factor(race),ref='other')+gender,data=income)
summary.aov(model_0); summary.aov(model_1);summary.aov(model_2);summary.aov(model_3)

## Df Sum Sq Mean Sq F value Pr(>F)


## Residuals 744 2558299 3439

## Df Sum Sq Mean Sq F value Pr(>F)


## age 1 123920 123920 37.82 1.27e-09 ***
## Residuals 743 2434379 3276
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Df Sum Sq Mean Sq F value Pr(>F)


## age 1 123920 123920 42.41 1.36e-10 ***
## hrs_work 1 266469 266469 91.20 < 2e-16 ***
## Residuals 742 2167911 2922
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Df Sum Sq Mean Sq F value Pr(>F)


## age 1 123920 123920 43.730 7.23e-11 ***
## hrs_work 1 266469 266469 94.034 < 2e-16 ***
## relevel(factor(race), ref = "other") 3 73761 24587 8.677 1.16e-05 ***
## Residuals 739 2094149 2834
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Example
summary.aov(model_full)

## Df Sum Sq Mean Sq F value Pr(>F)


## age 1 123920 123920 44.714 4.50e-11 ***
## hrs_work 1 266469 266469 96.150 < 2e-16 ***
## relevel(factor(race), ref = "other") 3 73761 24587 8.872 8.81e-06 ***
## gender 1 48862 48862 17.631 3.01e-05 ***
## Residuals 738 2045287 2771
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model_1=lm(income~hrs_work,data=income)
model_2=lm(income~age+hrs_work,data=income)
summary.aov(model_1)

## Df Sum Sq Mean Sq F value Pr(>F)


## hrs_work 1 303875 303875 100.1 <2e-16 ***
## Residuals 743 2254424 3034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(model_2)

## Df Sum Sq Mean Sq F value Pr(>F)


## age 1 123920 123920 42.41 1.36e-10 ***
## hrs_work 1 266469 266469 91.20 < 2e-16 ***
## Residuals 742 2167911 2922
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Example

model_full_new=lm(income~age+relevel(factor(race),ref='other')+gender+hrs_work,data=income)
summary.aov(model_full_new)

## Df Sum Sq Mean Sq F value Pr(>F)


## age 1 123920 123920 44.71 4.50e-11 ***
## relevel(factor(race), ref = "other") 3 91514 30505 11.01 4.47e-07 ***
## gender 1 120082 120082 43.33 8.78e-11 ***
## hrs_work 1 177496 177496 64.05 4.71e-15 ***
## Residuals 738 2045287 2771
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Example
▶ Now suppose we want to test whether race and gender are jointly significant predictors within the full model
model_al=lm(income~age+hrs_work,data=income)
anova(model_al,model_full_new)

## Analysis of Variance Table


##
## Model 1: income ~ age + hrs_work
## Model 2: income ~ age + relevel(factor(race), ref = "other") + gender +
## hrs_work
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 742 2167911
## 2 738 2045287 4 122624 11.062 1.021e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_al,model_full)

## Analysis of Variance Table


##
## Model 1: income ~ age + hrs_work
## Model 2: income ~ age + hrs_work + relevel(factor(race), ref = "other") +
## gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 742 2167911
## 2 738 2045287 4 122624 11.062 1.021e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
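As a check, the printed sums of squares reproduce the F statistic directly (a quick sketch):

f <- ((2167911 - 2045287) / (742 - 738)) / (2045287 / 738)
f  # 11.062
pf(f, 4, 738, lower.tail = FALSE)  # 1.021e-08, matching the anova() output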
Global F tests

▶ There are a couple of important special cases of the F test
▶ The null model contains the intercept only
▶ The null model and the alternative model differ by only one term
▶ This gives a way of testing a single coefficient
▶ It turns out to be equivalent to a two-sided t-test: $T_{df_L}^2 = F_{1,\, df_L}$
Example

▶ Let's consider the question of testing the effect of age on income, holding the others fixed.

Null/Reduced = Regression of Y on X2 , X3 , X4 , X5 , X6
Alternative/Full = Regression of Y on X1 , X2 , X3 , X4 , X5 , X6
Example

Let’s state the hypothesis

H0 : the null model fits as well as the alternative model


H1 : the alternative model fits significantly better

$H_0$: age does not contribute significantly to predicting income, given that working hours, race, and gender are already considered
$H_1$: age contributes significantly, given that working hours, race, and gender are already considered

$H_0: \beta_1 = 0$ vs. $H_1: \beta_1 \neq 0$
Example
##
## Call:
## lm(formula = income ~ age + hrs_work + relevel(factor(race),
## ref = "other") + gender, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.29 -23.47 -9.00 8.14 356.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.2980 10.4454 -5.485 5.67e-08
## age 0.7673 0.1343 5.714 1.60e-08
## hrs_work 1.3124 0.1640 8.003 4.71e-15
## relevel(factor(race), ref = "other")asian 46.4238 11.4483 4.055 5.54e-05
## relevel(factor(race), ref = "other")black -2.3020 9.6375 -0.239 0.811
## relevel(factor(race), ref = "other")white 11.3375 7.6504 1.482 0.139
## gendermale 17.0306 4.0560 4.199 3.01e-05
##
## (Intercept) ***
## age ***
## hrs_work ***
## relevel(factor(race), ref = "other")asian ***
## relevel(factor(race), ref = "other")black
## relevel(factor(race), ref = "other")white
## gendermale ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.64 on 738 degrees of freedom
## Multiple R-squared: 0.2005, Adjusted R-squared: 0.194
## F-statistic: 30.85 on 6 and 738 DF, p-value: < 2.2e-16
Example

model_full_d=lm(income~hrs_work+relevel(factor(race),ref='other')+gender,data=income)
summary.aov(model_full_d)

## Df Sum Sq Mean Sq F value Pr(>F)


## hrs_work 1 303875 303875 105.144 < 2e-16 ***
## relevel(factor(race), ref = "other") 3 71088 23696 8.199 2.25e-05 ***
## gender 1 47555 47555 16.455 5.51e-05 ***
## Residuals 739 2135781 2890
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model_full_d,model_full)

## Analysis of Variance Table


##
## Model 1: income ~ hrs_work + relevel(factor(race), ref = "other") + gender
## Model 2: income ~ age + hrs_work + relevel(factor(race), ref = "other") +
## gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 739 2135781
## 2 738 2045287 1 90494 32.653 1.599e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
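The $T_{df_L}^2 = F_{1,\, df_L}$ equivalence noted earlier can be verified from the printed statistics (a quick check):

5.714^2  # = 32.65, matching the F statistic 32.653 for dropping age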
