Lecture 5
Heping Zhang
October 1, 2024
Acknowledgement and Appreciation
Model:
y = Xβ + ϵ
Both height with shoes (HtShoes) and height without shoes (Ht) are included
in the model, as is the height when seated (Seated). These are all troublemakers!
Fit the model (signs of collinearity)
library(faraway)   # provides the seatpos data and vif()
g = lm(hipcenter ~ ., data = seatpos)
summary(g)
##
## Call:
## lm(formula = hipcenter ~ ., data = seatpos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.827 -22.833 -3.678 25.017 62.337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.43213 166.57162 2.620 0.0138 *
## Age 0.77572 0.57033 1.360 0.1843
## Weight 0.02631 0.33097 0.080 0.9372
## HtShoes -2.69241 9.75304 -0.276 0.7845
## Ht 0.60134 10.12987 0.059 0.9531
## Seated 0.53375 3.76189 0.142 0.8882
## Arm -1.32807 3.90020 -0.341 0.7359
## Thigh -1.14312 2.66002 -0.430 0.6706
## Leg -6.43905 4.71386 -1.366 0.1824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.72 on 29 degrees of freedom
## Multiple R-squared: 0.6866, Adjusted R-squared: 0.6001
## F-statistic: 7.94 on 8 and 29 DF, p-value: 1.306e-05
Examine the correlation matrix
round(cor(seatpos), 3)
X = model.matrix(g)[, -1]   # predictor columns only, dropping the intercept
vif(X)
Two predictors (HtShoes and Ht) have very large VIFs, and three
variables (Seated, Leg, Arm) have relatively large VIFs.
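The VIFs above can be reproduced by hand via the identity VIFj = 1/(1 − Rj²), where Rj² comes from regressing predictor j on the remaining predictors. A minimal sketch on simulated data (the variable names here are made up, not from seatpos; x3 is built to be nearly collinear with x1):

```r
# Simulate three predictors where x3 is almost a copy of x1,
# which should inflate the VIFs of x1 and x3.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.1)   # strong collinearity with x1
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the others
vif_by_hand <- sapply(names(X), function(j) {
  r2 <- summary(lm(reformulate(setdiff(names(X), j), response = j),
                   data = X))$r.squared
  1 / (1 - r2)
})
round(vif_by_hand, 2)
```

The collinear pair x1, x3 shows VIFs far above 10 while x2 stays near 1, the same pattern seen for HtShoes and Ht above.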
Some take-away messages for collinearity
income$income = income$income / 1000
X = cbind(1, income$age, income$hrs_work,
          income$race == "asian", income$race == "black",
          income$race == "white", income$gender == "male")
y = income$income
betahat = solve(t(X) %*% X) %*% t(X) %*% y
betahat
## [,1]
## [1,] -57.2980196
## [2,] 0.7672869
## [3,] 1.3124369
## [4,] 46.4238090
## [5,] -2.3020209
## [6,] 11.3375245
## [7,] 17.0306377
Questions of interest
If our usual assumptions are satisfied and the ϵi are i.i.d. N(0, σ²), then
β̂ ∼ N(β, σ²(XᵀX)⁻¹),
where (XᵀX)⁻¹jj denotes the jth diagonal element of (XᵀX)⁻¹.
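The standard errors reported by summary() are exactly the square roots of the diagonal of σ̂²(XᵀX)⁻¹. A small sketch on simulated data (variable names are made up) confirms this:

```r
set.seed(2)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X      <- model.matrix(fit)                      # design matrix, with intercept
sigma2 <- sum(resid(fit)^2) / (n - ncol(X))      # sigma-hat^2 = SSE / (n - p - 1)
se     <- sqrt(diag(sigma2 * solve(t(X) %*% X))) # sqrt of diag of Var(beta-hat)

cbind(by_hand = se, from_lm = coef(summary(fit))[, "Std. Error"])
```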
library(broom)
model1=lm(income~age+hrs_work+relevel(factor(race),ref='other')+gender,data=income)
tidy(model1)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 "(Intercept)" -57.3 10.4 -5.49 5.67e- 8
## 2 "age" 0.767 0.134 5.71 1.60e- 8
## 3 "hrs_work" 1.31 0.164 8.00 4.71e-15
## 4 "relevel(factor(race), ref = \"other\")~ 46.4 11.4 4.06 5.54e- 5
## 5 "relevel(factor(race), ref = \"other\")~ -2.30 9.64 -0.239 8.11e- 1
## 6 "relevel(factor(race), ref = \"other\")~ 11.3 7.65 1.48 1.39e- 1
## 7 "gendermale" 17.0 4.06 4.20 3.01e- 5
Inference for linear combinations
▶ Define H0 : cᵀβ = cᵀβ0; e.g., H0 : cᵀβ = 0.
▶ Let’s find the c for β = (β0, β1, β2, β3, β4, β5, β6)ᵀ.
▶ Whether age and working hours have the same effect (a weird
one, but we could do it)? For H0 : β1 − β2 = 0,
c = (0, 1, −1, 0, 0, 0, 0)ᵀ.
▶ Whether white and black have the same income?
c = (0, 0, 0, 0, 1, −1, 0)ᵀ.
Inference for linear combinations of coefficients
t = (cᵀβ̂ − cᵀβ0) / sê(cᵀβ̂) = (cᵀβ̂ − cᵀβ0) / sqrt(σ̂² cᵀ(XᵀX)⁻¹c)
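This t-statistic can be computed directly from the fitted model. A sketch on simulated data (names made up) testing H0 : β1 − β2 = 0, with the hand-computed standard error cross-checked against vcov():

```r
set.seed(3)
n  <- 80
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)   # beta1 = beta2, so H0 is true
fit <- lm(y ~ x1 + x2)

cc <- c(0, 1, -1)                       # contrast for beta1 - beta2
X  <- model.matrix(fit)
sigma2 <- sum(resid(fit)^2) / df.residual(fit)

se_hand <- sqrt(sigma2 * t(cc) %*% solve(t(X) %*% X) %*% cc)
tstat   <- sum(cc * coef(fit)) / se_hand
pval    <- 2 * pt(-abs(tstat), df.residual(fit))
c(t = tstat, p = pval)
```

The shortcut sqrt(t(cc) %*% vcov(fit) %*% cc) gives the same standard error, since vcov() returns σ̂²(XᵀX)⁻¹.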
H01 : β1 = 0
H02 : β2 = 0
⋮
H0k : βk = 0
For example, with α = 0.05 and k = 10 independent tests, the family-wise
error rate is 1 − (1 − 0.05)¹⁰:
1-(1-0.05)^10
## [1] 0.4012631
Linear approximation to the family-wise error rate
[Figure: family-wise error rate (fwer, vertical axis from 0.25 to 1.00)
against the number of tests k (horizontal axis from 0 to 100).]
The blue line is kα and the black curve is 1 − (1 − α)ᵏ. The former
lies above the latter.
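Both curves are easy to reproduce, and a quick check confirms that the linear (Bonferroni-style) bound kα dominates the exact rate 1 − (1 − α)ᵏ everywhere; for k = 10 this also reproduces the 0.4012631 computed earlier:

```r
alpha <- 0.05
k     <- 1:100
fwer  <- 1 - (1 - alpha)^k   # exact FWER for k independent tests
bonf  <- k * alpha           # linear approximation (blue line in the figure)

# the linear bound sits above the exact curve for every k
all(bonf >= fwer)
```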
Addressing multiple comparisons
H0 : β1 = β2 = · · · = βp = 0
H1 : not all βk equal zero
▶ f ∼ F (p, n − p − 1) under H0
▶ The decision rule: reject H0 when the p-value P(F (p, n − p − 1) > fobs ) falls below α.
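This overall F-statistic can be recovered from R² alone: f = (R²/p) / ((1 − R²)/(n − p − 1)). A sketch on simulated data (names made up), compared against what lm() reports:

```r
set.seed(4)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

r2 <- summary(fit)$r.squared
p  <- 2                                   # number of predictors
f_hand <- (r2 / p) / ((1 - r2) / (n - p - 1))

c(by_hand = f_hand,
  from_lm = unname(summary(fit)$fstatistic["value"]))
```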
Example revisit
H0 : β3 = β4 = β5 = 0.
f = MSR(X3 , X4 , X5 | X1 , X2 , X6 ) / MSE (X1 , . . . , X6 )
  = { [SSR(X1 , X2 , X3 , X4 , X5 , X6 ) − SSR(X1 , X2 , X6 )] / (6 − 3) } / MSE (X1 , . . . , X6 )
▶ In general
f = [(SSE_S − SSE_L ) / (df_S − df_L )] / (SSE_L / df_L )
▶ If H0 is true, then f ∼ F (df_S − df_L , df_L )
▶ Note df_S = n − p_S − 1 and df_L = n − p_L − 1
▶ We reject the null hypothesis if the p-value is below α
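The general formula can be checked against anova(): fit nested smaller and larger models, compute f from the SSEs and residual degrees of freedom by hand, and compare (simulated data, names made up):

```r
set.seed(5)
n  <- 70
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)

small <- lm(y ~ x1)                   # reduced model
large <- lm(y ~ x1 + x2 + x3)         # full model

sse_s <- sum(resid(small)^2); df_s <- df.residual(small)
sse_l <- sum(resid(large)^2); df_l <- df.residual(large)

f_hand <- ((sse_s - sse_l) / (df_s - df_l)) / (sse_l / df_l)
c(by_hand = f_hand, from_anova = anova(small, large)$F[2])
```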
Smaller = Regression of Y on X1
Larger = Regression of Y on X1 , X2 , X3 , X4
Smaller = Regression of Y on X2
Larger = Regression of Y on X1 , X3
(These two are not nested, since X2 does not appear in the larger model, so the F-test does not apply.)
Example
model_0=lm(income~1,data=income)
model_1=lm(income~age,data=income)
model_2=lm(income~age+hrs_work,data=income)
model_3=lm(income~age+hrs_work+relevel(factor(race),ref='other'),data=income)
model_full=lm(income~age+hrs_work+relevel(factor(race),ref='other')+gender,data=income)
summary.aov(model_0); summary.aov(model_1);summary.aov(model_2);summary.aov(model_3)
model_full_new=lm(income~age+relevel(factor(race),ref='other')+gender+hrs_work,data=income)
summary.aov(model_full_new)
Null/Reduced = Regression of Y on X2 , X3 , X4 , X5 , X6
Alternative/Full = Regression of Y on X1 , X2 , X3 , X4 , X5 , X6
Example
H0 : β1 = 0 vs. H1 : β1 ≠ 0
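For a single-coefficient null like this, the partial F-test obtained by dropping that one predictor is the square of the usual t-test, so anova() and summary() agree. A sketch (simulated data, names made up):

```r
set.seed(6)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + x2 + rnorm(n)

full    <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x2)                  # drop x1, i.e. H0: beta1 = 0

tstat <- coef(summary(full))["x1", "t value"]
fstat <- anova(reduced, full)$F[2]
c(t_squared = tstat^2, F = fstat)      # the two statistics coincide
```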
Example
##
## Call:
## lm(formula = income ~ age + hrs_work + relevel(factor(race),
## ref = "other") + gender, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -100.29 -23.47 -9.00 8.14 356.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.2980 10.4454 -5.485 5.67e-08
## age 0.7673 0.1343 5.714 1.60e-08
## hrs_work 1.3124 0.1640 8.003 4.71e-15
## relevel(factor(race), ref = "other")asian 46.4238 11.4483 4.055 5.54e-05
## relevel(factor(race), ref = "other")black -2.3020 9.6375 -0.239 0.811
## relevel(factor(race), ref = "other")white 11.3375 7.6504 1.482 0.139
## gendermale 17.0306 4.0560 4.199 3.01e-05
##
## (Intercept) ***
## age ***
## hrs_work ***
## relevel(factor(race), ref = "other")asian ***
## relevel(factor(race), ref = "other")black
## relevel(factor(race), ref = "other")white
## gendermale ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.64 on 738 degrees of freedom
## Multiple R-squared: 0.2005, Adjusted R-squared: 0.194
## F-statistic: 30.85 on 6 and 738 DF, p-value: < 2.2e-16
Example
model_full_d=lm(income~hrs_work+relevel(factor(race),ref='other')+gender,data=income)
summary.aov(model_full_d)