
Multiple linear regression models:

Model selection

A new example (p. 2)
A multiple linear regression model (p. 3)
Some candidate linear regression models (p. 4)
Coefficients of multiple linear determination (p. 5)
Adjusted R2 (p. 6)
Comparing R2a for models with the same number of parameters (p. 7)
Best subset selection - exhaustive search based on R2a (p. 8)
Overall best model selected according to R2a (p. 9)
Best subset selection - exhaustive search based on AIC (p. 10)
Best subset selection - exhaustive search based on SBC/BIC (p. 11)
Overall best model selected according to SBC/BIC (p. 12)
Comparison between AIC and SBC/BIC (p. 13)

A new example
In a study evaluating which characteristics of a house affect its sale price, the following information has been
collected on a sample of 24 houses sold in a given area during a given year:
■ Y: Sale price of the house (thousands of dollars)
■ X1: Taxes (local, county, school - thousands of dollars)
■ X2: Number of bathrooms
■ X3: Lot size (thousands of square feet)
■ X4: Living space (thousands of square feet)
■ X5: Number of garage stalls
■ X6: Number of bedrooms
■ X7: Age of the house (years)
■ X8: Number of fireplaces

Stat. Mod. Giuliano Galimberti – 2

A multiple linear regression model


Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.4054 4.3603 3.3038 0.0048
X1 1.8164 0.8243 2.2036 0.0436
X2 7.1389 4.0135 1.7787 0.0956
X3 0.1472 0.4768 0.3087 0.7618
X4 2.7334 4.2492 0.6433 0.5298
X5 2.0652 1.3388 1.5425 0.1438
X6 -1.9124 1.7936 -1.0662 0.3032
X7 -0.0383 0.0651 -0.5887 0.5648
X8 1.4875 1.6593 0.8964 0.3842
Residual standard error: 2.878 on 15 degrees of freedom
Multiple R-squared: 0.8506, Adjusted R-squared: 0.771
F-statistic: 10.68 on 8 and 15 DF, p-value: 5.936e-05

        X1      X2      X3      X4      X5      X6      X7      X8
VIFk    4.7235  2.5938  2.4376  3.8278  1.8197  2.8489  2.3201  1.4963
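As a minimal sketch, the output above could be reproduced in R along the following lines, assuming the 24 observations are stored in a data frame named houses (a hypothetical name) with columns Y, X1, ..., X8; vif() comes from the car package:

fit_full <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = houses)  # full model
summary(fit_full)   # coefficient table, residual standard error, R-squared, F-statistic
car::vif(fit_full)  # variance inflation factors VIFk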

⇒ Which regressors should be kept in the model?

Stat. Mod. Giuliano Galimberti – 3

Some candidate linear regression models
Model R formula Regressors
M1 Y~X3 Lot size
M2 Y~X4 Living space
M3 Y~X4+X5 Living space, number of garage stalls
M4 Y~X2+X5 Number of bathrooms, number of garage stalls
M5 Y~X2+X4+X6 Number of bathrooms, living space, number of bedrooms
M6 Y~X3+X4+X6 Lot size, living space, number of bedrooms
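As a small sketch (again assuming the hypothetical data frame houses), the R2 values reported on the next page could be obtained by fitting each candidate model:

candidates <- list(
  M1 = Y ~ X3,
  M2 = Y ~ X4,
  M3 = Y ~ X4 + X5,
  M4 = Y ~ X2 + X5,
  M5 = Y ~ X2 + X4 + X6,
  M6 = Y ~ X3 + X4 + X6
)
sapply(candidates, function(f) summary(lm(f, data = houses))$r.squared)  # R2 of each model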

Stat. Mod. Giuliano Galimberti – 4

Coefficients of multiple linear determination


Model R formula R2
M1 Y~X3 0.4194
M2 Y~X4 0.5009
M3 Y~X4+X5 0.5503
M4 Y~X2+X5 0.6001
M5 Y~X2+X4+X6 0.6058
M6 Y~X3+X4+X6 0.5947
Note that, when comparing models with the same number of regressors:
■ M2 is better than M1
■ M4 is better than M3
■ M5 is (slightly) better than M6
Furthermore:
■ the R2 values for M2 and M5 could be compared using a partial F test, since M2 is nested in M5 (see the sketch after this list)
However:
■ the R2 for M2 and M4 should not be compared (M2 is not nested in M4)
■ the R2 for M4 and M5 should not be compared (M4 is not nested in M5)
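A minimal sketch of this partial F test, under the same assumption of a data frame named houses:

fit_M2 <- lm(Y ~ X4, data = houses)            # reduced model M2
fit_M5 <- lm(Y ~ X2 + X4 + X6, data = houses)  # larger model M5
anova(fit_M2, fit_M5)  # partial F test of H0: the coefficients of X2 and X6 are both zero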

Stat. Mod. Giuliano Galimberti – 5

Adjusted R2
Model R formula R2 R2a
M1 Y~X3 0.4194 0.3930
M2 Y~X4 0.5009 0.4782
M3 Y~X4+X5 0.5503 0.5075
M4 Y~X2+X5 0.6001 0.5620
M5 Y~X2+X4+X6 0.6058 0.5467
M6 Y~X3+X4+X6 0.5947 0.5339
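For reference, and consistently with the derivation on p. 7, the adjusted coefficient of determination is

\[ R^2_a = 1 - \frac{SSE/(n-p)}{SSTO/(n-1)} = 1 - \frac{n-1}{n-p}\,\bigl(1 - R^2\bigr), \]

where n is the sample size and p is the number of regression parameters.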
Note that, when comparing the R2a values for models with the same number of regressors:
■ M2 is better than M1
■ M4 is better than M3
■ M5 is (slightly) better than M6
Furthermore:
■ according to R2a, M4 is better than both M2 and M5

Stat. Mod. Giuliano Galimberti – 6

Comparing R2a for models with the same number of parameters


Consider two models, fitted on the same sample of units:
M1: \( y = X_1\beta_1 + \varepsilon_1 \)
M2: \( y = X_2\beta_2 + \varepsilon_2 \)

such that \( p_1 = p_2 = p \) (where \( p_1 \) and \( p_2 \) denote the numbers of columns of \( X_1 \) and \( X_2 \), respectively). Then

\[
\begin{aligned}
R^2_a(\mathrm{M1}) > R^2_a(\mathrm{M2})
&\iff 1 - \frac{SSE(\mathrm{M1})}{SSTO}\cdot\frac{n-1}{n-p} \;>\; 1 - \frac{SSE(\mathrm{M2})}{SSTO}\cdot\frac{n-1}{n-p} \\
&\iff \frac{SSE(\mathrm{M1})}{SSTO}\cdot\frac{n-1}{n-p} \;<\; \frac{SSE(\mathrm{M2})}{SSTO}\cdot\frac{n-1}{n-p} \\
&\iff SSE(\mathrm{M1}) < SSE(\mathrm{M2}) \\
&\iff R^2(\mathrm{M1}) > R^2(\mathrm{M2})
\end{aligned}
\]

Stat. Mod. Giuliano Galimberti – 7

Best subset selection - exhaustive search based on R2a

p  Best model (R formula)         R2      R2a
1  Y~1                            0       0
2  Y~X1                           0.7637  0.7530
3  Y~X1+X2                        0.7981  0.7788
4  Y~X1+X2+X8                     0.8112  0.7829
5  Y~X1+X2+X5+X6                  0.8321  0.7968
6  Y~X1+X2+X5+X6+X8               0.8396  0.7950
7  Y~X1+X2+X4+X5+X6+X8            0.8456  0.7911
8  Y~X1+X2+X4+X5+X6+X7+X8         0.8497  0.7839
9  Y~.                            0.8506  0.7710

[Figure: R2 and R2a of the best model of each size, plotted against the number of parameters p]

■ Note that the sequence of best models is not necessarily a nested sequence
(the best model for p = 4 is not nested in the best model for p = 5)
■ According to R2a , the overall best model is Y~X1+X2+X5+X6
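Such an exhaustive search can be sketched in R with regsubsets() from the leaps package (the data frame name houses is again an assumption):

library(leaps)
search <- regsubsets(Y ~ ., data = houses, nvmax = 8)  # all-subsets search over the 8 regressors
s <- summary(search)
s$adjr2             # R2a of the best model for each number of regressors
which.max(s$adjr2)  # number of regressors of the overall best model according to R2a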

Stat. Mod. Giuliano Galimberti – 8

Overall best model selected according to R2a


Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.6212 3.6725 3.7090 0.0015
X1 2.4123 0.5225 4.6168 0.0002
X2 8.4589 3.3300 2.5402 0.0200
X5 2.0604 1.2235 1.6840 0.1085
X6 -2.2154 1.2901 -1.7173 0.1022
Residual standard error: 2.711 on 19 degrees of freedom
Multiple R-squared: 0.8321, Adjusted R-squared: 0.7968
F-statistic: 23.54 on 4 and 19 DF, p-value: 3.866e-07

        X1      X2      X5      X6
VIFk    2.1389  2.0124  1.7127  1.6611

Stat. Mod. Giuliano Galimberti – 9

Best subset selection - exhaustive search based on AIC

p Best model (R formula) R2 AIC


1 Y~1 0 87.0845
2 Y~X1 0.7637 54.4587
3 Y~X1+X2 0.7981 52.6892
4 Y~X1+X2+X8 0.8112 53.0752
5 Y~X1+X2+X5+X6 0.8321 52.2572
6 Y~X1+X2+X5+X6+X8 0.8396 53.1627
7 Y~X1+X2+X4+X5+X6+X8 0.8456 54.2435
8 Y~X1+X2+X4+X5+X6+X7+X8 0.8497 55.6052
9 Y~. 0.8506 57.4532

■ Also according to AIC, the overall best model is Y~X1+X2+X5+X6
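The AIC values above appear to follow the n·ln(SSE/n) + 2p convention used by R's extractAIC(); a sketch for the selected model (the data frame houses is an assumption):

fit <- lm(Y ~ X1 + X2 + X5 + X6, data = houses)
extractAIC(fit)                         # returns (p, AIC) with AIC = n*log(SSE/n) + 2*p
extractAIC(fit, k = log(nrow(houses)))  # replacing the penalty weight 2 with log(n) gives SBC/BIC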

Stat. Mod. Giuliano Galimberti – 10

Best subset selection - exhaustive search based on SBC/BIC

p Best model (R formula) R2 SBC/BIC


1 Y~1 0 88.2626
2 Y~X1 0.7637 56.8148
3 Y~X1+X2 0.7981 56.2234
4 Y~X1+X2+X8 0.8112 57.7874
5 Y~X1+X2+X5+X6 0.8321 58.1474
6 Y~X1+X2+X5+X6+X8 0.8396 60.2310
7 Y~X1+X2+X4+X5+X6+X8 0.8456 62.4899
8 Y~X1+X2+X4+X5+X6+X7+X8 0.8497 65.0297
9 Y~. 0.8506 68.0557

■ According to SBC/BIC, the overall best model is Y~X1+X2
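With the regsubsets() summary sketched earlier (the hypothetical object s), the same criterion is available directly, up to an additive constant that does not affect the ranking:

s$bic             # SBC/BIC of the best model for each size (up to a constant)
which.min(s$bic)  # number of regressors of the overall best model according to SBC/BIC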

Stat. Mod. Giuliano Galimberti – 11

Overall best model selected according to SBC/BIC
Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.1120 2.9961 3.3750 0.0029
X1 2.7170 0.4911 5.5320 0.0000
X2 6.0985 3.2271 1.8898 0.0727
Residual standard error: 2.828 on 21 degrees of freedom
Multiple R-squared: 0.7981, Adjusted R-squared: 0.7788
F-statistic: 41.5 on 2 and 21 DF, p-value: 5.067e-08

        X1      X2
VIFk    1.7366  1.7366

Stat. Mod. Giuliano Galimberti – 12

Comparison between AIC and SBC/BIC

Both criteria combine the same lack-of-fit term with a penalty for model complexity:

\[ \mathrm{AIC} = n \ln\!\left(\frac{SSE}{n}\right) + 2\,p, \qquad \mathrm{SBC/BIC} = n \ln\!\left(\frac{SSE}{n}\right) + p\,\ln(n) \]

[Figure: AIC and SBC/BIC plotted against the number of parameters p, for n > 8]

When n > 8, ln(n) > 2, so SBC/BIC gives more weight to the parsimony term. Thus SBC/BIC tends to select
multiple linear regression models that are simpler (with fewer regressors) than the ones selected by AIC.

Stat. Mod. Giuliano Galimberti – 13
