Ch5 Slides: Variable Selection
04/2025
Chapter 5
References:
Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to Linear Regression Analysis (5th ed.). Wiley. Chapter 10: Variable Selection and Model Building.
Outline
Introduction
Goals of model selection
Criteria to compare models
Model-building problem
Consequences of model misspecification
Criteria for evaluating equations
Model selection: goals
Model selection: strategies
Possible criteria
Model-building problem
• Building a regression model that includes only a subset of the available regressors involves two conflicting objectives:
1. The model should include as many regressors as possible, so that the information content in these factors can influence the predicted value of y;
2. The model should include as few regressors as possible, because the variance of the prediction ŷ increases as the number of regressors increases. Also, the more regressors, the greater the costs of data collection and model maintenance.
The process of finding a model that compromises between these two objectives is called selecting the “best” regression equation.
• None of the variable selection procedures is guaranteed to produce the best regression equation for a given data set.
Consequences of model misspecification
• Assume that there are K candidate regressors x_1, ..., x_K and n ≥ K + 1 observations on these regressors and the response y:

y_i = β_0 + Σ_{j=1}^{K} β_j x_{ij} + ε_i,  i = 1, ..., n,   or   y = Xβ + ε   (1)

• Let r be the number of regressors that are deleted from (1). Then the number of variables that are retained is p = K + 1 − r, i.e. the subset model contains p − 1 = K − r of the original regressors:

y = X_p β_p + X_r β_r + ε   (2)
• The properties of the subset-model estimates β̂_p and σ̂², compared with the full-model estimates β̂* and σ̂*²:
◦ E(β̂_p) = β_p + (X_p′X_p)⁻¹X_p′X_r β_r = β_p + Aβ_r, where A = (X_p′X_p)⁻¹X_p′X_r;
◦ Var(β̂_p) = σ²(X_p′X_p)⁻¹ and Var(β̂*) = σ²(X′X)⁻¹. Also, Var(β̂*_p) − Var(β̂_p) is positive semidefinite, where β̂*_p denotes the full-model estimates of the coefficients in β_p; that is, the subset-model estimates are at least as precise as the corresponding full-model estimates;
◦ Since β̂_p is a biased estimate of β_p and β̂*_p is not, it is more reasonable to compare the precision of the parameter estimates from the full and subset models in terms of mean square error (see the sketch after this list);
◦ The estimate σ̂*² from the full model is an unbiased estimate of σ². However, for the subset model,
  E(σ̂²) = σ² + β_r′X_r′[I − X_p(X_p′X_p)⁻¹X_p′]X_r β_r / (n − p).
  That is, σ̂² is generally biased upward as an estimate of σ²;
◦ Suppose we wish to predict the response at the point x′ = [x_p′, x_r′]. If we use the full model, the predicted value is ŷ* = x′β̂*, with mean x′β and prediction variance Var(ŷ*) = σ²[1 + x′(X′X)⁻¹x].
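A minimal simulation sketch (not from the slides; the model, coefficient values, and variable names are invented for illustration) of the trade-off above: deleting a regressor biases the estimate of the retained coefficient but reduces its variance, so its mean square error can still come out smaller than that of the full-model estimate.

```r
## Hypothetical model: compare the full-model and subset-model estimates of
## beta1 when the (weak, correlated) regressor x2 is deleted from the fit.
set.seed(1)
n <- 30; b0 <- 1; b1 <- 2; b2 <- 0.1          # true coefficients; b2 is small
reps <- 2000
est_full <- est_sub <- numeric(reps)
for (r in seq_len(reps)) {
  x1 <- rnorm(n)
  x2 <- 0.6 * x1 + rnorm(n)                   # x2 correlated with x1
  y  <- b0 + b1 * x1 + b2 * x2 + rnorm(n)
  est_full[r] <- coef(lm(y ~ x1 + x2))["x1"]  # full model
  est_sub[r]  <- coef(lm(y ~ x1))["x1"]       # subset model (x2 deleted)
}
rbind(full   = c(bias = mean(est_full) - b1, var = var(est_full),
                 mse  = mean((est_full - b1)^2)),
      subset = c(bias = mean(est_sub)  - b1, var = var(est_sub),
                 mse  = mean((est_sub  - b1)^2)))
```

With these invented settings the subset estimate is biased (toward β_1 + Aβ_2) but less variable, so its MSE can come out smaller here, which is the mean-square-error comparison described in the bullets above.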
Motivation for variable selection
• Deleting variables from the model can improve the precision of the parameter estimates of the retained variables, even though some of the deleted variables are not negligible.
◦ The same is true for the variance of a predicted response;
◦ Deleting variables potentially introduces bias into the estimates of the coefficients of the retained variables and of the response;
◦ However, if the deleted variables have small effects, the MSE of the biased estimates will be less than the variance of the unbiased estimates;
◦ There is also a danger in retaining negligible variables, that is, variables with zero coefficients or with coefficients smaller than their corresponding standard errors from the full model: the variances of the parameter estimates and of the predicted response are increased.
Criteria for evaluating subset regression models
• Two key aspects of variable selection:
◦ Generating the subset models;
◦ Deciding if one subset is better than another.
Computational methods for variable selection
• (Coefficient of multiple determination R²) A measure of the adequacy of a regression model. Let R_p² denote the coefficient of multiple determination for a subset regression model with p terms, that is, p − 1 regressors and an intercept term β_0:

R_p² = SS_R(p) / SS_T = 1 − SS_Res(p) / SS_T   (4)
◦ There are (K choose p − 1) values of R_p² for each value of p; R_p² increases as p increases and is a maximum when p = K + 1;
◦ The analyst uses this criterion by adding regressors to the model up to the point where an additional variable is not useful, in that it provides only a small increase in R_p².
• Since we cannot find an “optimum” value of R² for a subset regression model, we must look for a “satisfactory” value.
◦ R_0² = 1 − (1 − R²_{K+1})(1 + d_{α,n,K}),   (5)
  where d_{α,n,K} = K F_{α,K,n−K−1} / (n − K − 1) and R²_{K+1} is the value of R² for the full model;
◦ Any subset of regressor variables producing an R² greater than R_0² is called an R²-adequate (α) subset (a short computational sketch follows at the end of this slide).
• (Adjusted R²) The adjusted R² statistic, defined for a p-term equation as

R²_{Adj,p} = 1 − ((n − 1) / (n − p)) (1 − R_p²)   (6)

◦ The R²_{Adj,p} statistic does not necessarily increase as additional regressors are introduced into the model;
◦ In fact, if s regressors are added to the model, R²_{Adj,p+s} will exceed R²_{Adj,p} if and only if the partial F statistic for testing the significance of the s additional regressors exceeds 1;
◦ Consequently, one criterion for selection of an optimum subset model is to choose the model that has a maximum R²_{Adj,p}.
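A small computational sketch of the R²-adequate cutoff in (5); α, K, n, and the full-model R² below are placeholder values, not taken from any data set in the slides.

```r
## R^2-adequate cutoff from (5); all inputs are placeholder values.
alpha   <- 0.05
K       <- 5            # number of candidate regressors in the full model
n       <- 40           # number of observations
R2_full <- 0.85         # R^2 of the full (K + 1 term) model
d  <- K * qf(1 - alpha, K, n - K - 1) / (n - K - 1)   # d_{alpha,n,K}
R0 <- 1 - (1 - R2_full) * (1 + d)
R0   # any subset with R_p^2 above this value is R^2-adequate at level alpha
```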
• (Residual mean square) The residual mean square for a subset regression model is

MS_Res(p) = SS_Res(p) / (n − p)   (7)

(Remark: the eventual increase in MS_Res(p) occurs when the reduction in SS_Res(p) from adding a regressor to the model is not sufficient to compensate for the loss of one degree of freedom in the denominator of (7).)
◦ The subset regression model that minimizes MS_Res(p) will also maximize R²_{Adj,p}. Thus, the criteria minimum MS_Res(p) and maximum adjusted R² are equivalent (see the comparison sketch below).
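An illustrative comparison of the three criteria so far; the data set (mtcars) and regressor names are only a stand-in, not an example from the slides.

```r
## R_p^2, adjusted R^2, and MS_Res(p) for a few nested candidate models
## (mtcars used purely as a stand-in data set).
fits <- list(p2 = lm(mpg ~ wt,             data = mtcars),
             p3 = lm(mpg ~ wt + hp,        data = mtcars),
             p4 = lm(mpg ~ wt + hp + qsec, data = mtcars))
t(sapply(fits, function(f) {
  s <- summary(f)
  c(p     = length(coef(f)),   # number of terms, including the intercept
    R2    = s$r.squared,
    adjR2 = s$adj.r.squared,
    MSRes = s$sigma^2)         # SS_Res(p) / (n - p)
}))
```

Whichever candidate has the smallest MSRes also has the largest adjR2, consistent with the equivalence noted above.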
• (Mallows’s C_p statistic) For a candidate model M with p(M) terms,

C_p(M) = SS_Res(M) / σ̂² − n + 2·p(M)   (8)

where σ̂² is usually the residual mean square from the full model.
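A minimal sketch of computing (8) directly, reusing the stand-in mtcars example, with σ̂² taken from the full model.

```r
## Mallows's C_p from (8), with sigma^2 estimated by MS_Res of the full model
## (mtcars again as a stand-in data set).
full   <- lm(mpg ~ wt + hp + qsec, data = mtcars)
sub    <- lm(mpg ~ wt + hp,        data = mtcars)
sigma2 <- summary(full)$sigma^2               # hat(sigma)^2 from the full model
n      <- nrow(mtcars)
Cp_sub <- sum(resid(sub)^2) / sigma2 - n + 2 * length(coef(sub))
Cp_sub   # values near p(M) suggest little bias in the subset model
```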
• (AIC & BIC)
? Mallow’s Cp is (almost) a special case of Akaike Information Criterion(AIC)
AIC (M) = −2logL(M) + 2 · p (M).
? L(M) is the likelihood function of the parameters in model M evaluated
at the MLE (Maximum Likelihood Estimators).
? Schwarz’s Bayesian Information Criterion (BIC)
BIC (M) = −2logL(M) + p (M) · log n.
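Both criteria are available directly through the stats generics AIC() and BIC(); a short sketch on the stand-in example (note that step() internally uses extractAIC(), which for linear models differs from AIC() by an additive constant, so only comparisons within one criterion matter).

```r
## AIC and BIC for two candidate models (mtcars as a stand-in data set).
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)
sub  <- lm(mpg ~ wt + hp,        data = mtcars)
AIC(sub, full)   # -2 log L(M) + 2 * (number of estimated parameters)
BIC(sub, full)   # -2 log L(M) + (number of estimated parameters) * log n
```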
Search strategies
• “Best subset”: search all possible models and take the one with the highest R²_Adj or lowest C_p.
• Stepwise (forward, backward or both): useful when the number of predictors is
large. Choose an initial model and be “greedy”.
• “Greedy” means always take the biggest jump (up or down) in your selected
criterion.
Implementations in R
• “Best subset”: use the function leaps. Works only for multiple linear regression models.
• Stepwise: use the function step. Works for any model that has an Akaike Information Criterion (AIC). In multiple linear regression, AIC is (almost) a linear function of C_p (see the sketch below).
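A sketch of the two routes named above. The slide refers to the function leaps; regsubsets() from the same leaps package is a commonly used formula interface to the same search, and the mtcars variables are, as before, only a stand-in example.

```r
## Best-subset search (leaps package) and stepwise search (step), sketched
## on the stand-in mtcars example.
library(leaps)

# Exhaustive ("best subset") search over the candidate regressors
all_sub <- regsubsets(mpg ~ wt + hp + qsec + disp, data = mtcars)
summary(all_sub)$adjr2   # adjusted R^2 of the best model of each size
summary(all_sub)$cp      # Mallows's C_p of the best model of each size

# Greedy stepwise search driven by AIC, starting from the full model
full <- lm(mpg ~ wt + hp + qsec + disp, data = mtcars)
step(full, direction = "both", trace = FALSE)
```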
Introduction
• Cases: these procedures are used when there is a large number of potential explanatory variables (say q); they do not involve computing all 2^q possible equations.
• Common feature: variables are introduced into or deleted from the equation one at a time, so only a subset of all possible equations is examined (at most about q + 1 equations are evaluated).
• Procedure categories:
◦ Forward selection (FS) procedure
  – works through the full set of variables and produces up to q possible equations;
◦ Backward elimination procedure
  – involves fitting at most q regression equations;
◦ Stepwise regression (a modification of the FS procedure)
  – allows a number of possible combinations of the two procedures above.
• (Forward selection) Step 1: the first variable entered is the one with the highest simple correlation with y; it is retained if its coefficient is significantly different from zero.
• Step 2: the second variable is the one which has the highest correlation with y after y has been adjusted for the effect of the first variable,
◦ i.e. the variable with the highest simple correlation coefficient with the residuals from step 1 (see the sketch after this list);
◦ retain x_2 when β_2 is significantly different from zero;
◦ then search for the next variable in the same way.
• · · ·
• Terminate the procedure when the newly entered variable has an insignificant coefficient β_q:
◦ judged by the standard t-statistic computed from the latest equation;
◦ mostly by a low t cutoff value for testing the coefficient of the newly entered variable.
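A sketch of the selection step just described (variable names from the stand-in mtcars example; in practice add1() or step() automate this): the next variable entered is the one most correlated with the residuals of the current fit, and it is kept only if its t-statistic is significant.

```r
## Forward-selection step 2, as described above (mtcars as a stand-in):
## pick the candidate most correlated with the step-1 residuals, then
## check the t-statistic of its coefficient in the enlarged model.
fit1 <- lm(mpg ~ wt, data = mtcars)                   # step 1 fit
r1   <- resid(fit1)
candidates <- c("hp", "qsec", "disp")
cors <- sapply(mtcars[candidates], function(x) cor(r1, x))
best <- names(which.max(abs(cors)))                   # highest |correlation|
fit2 <- update(fit1, as.formula(paste(". ~ . +", best)))
summary(fit2)$coefficients[best, ]                    # estimate, SE, t, p-value
```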
Strategy for variable selection and model building
[Flowchart (figure not reproduced); recoverable steps include: perform all possible regressions, perform residual analysis, select models for further analysis, and make recommendations.]
Variable selection for high-dimensional data