
AMA3602

Applied Linear Models


Department of Applied Mathematics

04/2025

0/35
Chapter 5

Variable Selection in Regression Equations

References:
Chapter 10: Variable Selection and Model Building
Montgomery, D.C., Peck, E.A., & Vining, G.G. (2012) Introduction to Linear Regression Analysis.
5th Ed. Wiley.

1/35
Outline

Introduction
Goals of model selection
Criteria to compare models
Model-building problem
Consequences of model misspecification
Criteria for evaluating equations

Computational Techniques for Variables Selection


All possible regression
Stepwise Regression Methods

Strategy for variable selection and model building

2/35
Introduction
Goals of model selection
Criteria to compare models
Model-building problem
Consequences of model misspecification
Criteria for evaluating equations

Computational Techniques for Variables Selection

Strategy for variable selection and model building

Introduction 3/35
Model selection: goals

• When we have many predictors (with many possible interactions), it can be difficult to find a good model.
• Which main effects do we include?
• Which interactions do we include?
• Model selection tries to “simplify” this task.
• This is an “unsolved” problem in statistics: there are no magic procedures to get
you the “best model”.
• In some sense, model selection is “data mining”.
• Data miners / machine learners often work with very many predictors.

Introduction 4/35
Model selection: strategies

• To “implement” this, we need:


? a criterion or benchmark to compare two models.
? a search strategy.
• With a limited number of predictors, it is possible to search all possible models.

Introduction 5/35
Possible criteria

• R²: not a good criterion. It always increases with model size, so the “optimum” is to take the biggest model.
• Adjusted R²: better. It “penalizes” bigger models.
• Mallow’s Cp.
• Akaike’s Information Criterion (AIC), Schwarz’s BIC.

Introduction 6/35
Model-building problem

• Previously our concern was whether the functional specification was correct and whether the underlying assumptions about the error term were valid.
• We have employed the classical approach to regression model selection, which
assumes that we have a very good idea of the basic form of the model and that
we know all (or nearly all) of the regressors that should be used.
• In practice, particularly in retrospective studies, we face the variable selection problem: find an appropriate subset of regressors for the model when
1. there is no clear-cut theory to determine the variables;
2. there is a rather large pool of candidate variables;
3. only a few are likely to be important.
• A good variable selection method is very important in the presence of multicollinearity:
? It helps to justify the presence of these highly related regressors in the final model;
? It does not guarantee elimination of multicollinearity.

Introduction 7/35
• Building a regression model that includes only a subset of the available
regressors involves two conflicting objectives.
1. The model includes as many regressors as possible so that the information
content in these factors can influence the predicted value of y ;
2. The model includes as few regressors as possible because the variance of
the prediction ŷ increases as the number of regressors increases. Also the
more regressors, the greater the costs of data collection and model
maintenance.
The process of finding a model that is a compromise between these two
objectives is called selecting the “best” regression equation.
• None of the variable selection procedures are guaranteed to produce the best
regression equation for a given data set.

Introduction 8/35
Consequences of model misspecification
Consequences of incorrect model specification
• Assume that there are K candidate regressors x1, . . . , xK and n ≥ K + 1 observations on these regressors and the response y:
      yᵢ = β₀ + Σⱼ₌₁ᴷ βⱼ xᵢⱼ + εᵢ,  i = 1, . . . , n,   or   y = Xβ + ε   (1)

• Let r be the number of regressors that are deleted from (1). Then the number of variables that are retained is p = K + 1 − r, i.e. the subset model contains p − 1 = K − r of the original regressors. Write the full model as
      y = Xₚβₚ + Xᵣβᵣ + ε   (2)
? The least-squares estimate of β is β̂∗ = (X′X)⁻¹X′y;
? The estimate of the residual variance σ² is
      σ̂²∗ = (y′y − β̂∗′X′y)/(n − K − 1) = y′[I − X(X′X)⁻¹X′]y/(n − K − 1)
• For the subset model
      y = Xₚβₚ + ε   (3)
? The least-squares estimate of βₚ is β̂ₚ = (Xₚ′Xₚ)⁻¹Xₚ′y;
? The estimate of the residual variance σ² is
      σ̂² = (y′y − β̂ₚ′Xₚ′y)/(n − p) = y′[I − Xₚ(Xₚ′Xₚ)⁻¹Xₚ′]y/(n − p)
Introduction 9/35
• The properties of the estimates β̂ₚ and σ̂²:
? E(β̂ₚ) = βₚ + (Xₚ′Xₚ)⁻¹Xₚ′Xᵣβᵣ = βₚ + Aβᵣ, where A = (Xₚ′Xₚ)⁻¹Xₚ′Xᵣ;
? Var(β̂ₚ) = σ²(Xₚ′Xₚ)⁻¹ and Var(β̂∗) = σ²(X′X)⁻¹. Also Var(β̂ₚ∗) − Var(β̂ₚ) is positive semidefinite, where β̂ₚ∗ denotes the components of β̂∗ corresponding to Xₚ;
? Since β̂ₚ is a biased estimate of βₚ and β̂ₚ∗ is not, it is more reasonable to compare the precision of the parameter estimates from the full and subset models in terms of mean square error;
? The estimate σ̂²∗ from the full model is an unbiased estimate of σ². However, for the subset model
      E(σ̂²) = σ² + βᵣ′Xᵣ′[I − Xₚ(Xₚ′Xₚ)⁻¹Xₚ′]Xᵣβᵣ/(n − p).
  That is, σ̂² is generally biased upward as an estimate of σ²;
? Suppose we wish to predict the response at the point x′ = [xₚ′, xᵣ′]. If we use the full model, the predicted value is ŷ∗ = x′β̂∗, with mean x′β and prediction variance Var(ŷ∗) = σ²[1 + x′(X′X)⁻¹x].

Introduction 10/35
Motivation for variable selection
• Improve the precision of the parameter estimates of the retained variables by
deleting variables from the model, even though some of the deleted variables are
not negligible.
? This is also true for the variance of a predicted response;
? Deleting variables potentially introduces bias into the estimates of the
coefficients of retained variables and the response.
? However, if the deleted variables have small effects, the MSE of the biased
estimates will be less than the variance of the unbiased estimates.
? There is danger in retaining negligible variables, that is, variables with zero
coefficients or coefficients less than their corresponding standard errors
from the full model. This danger is that the variances of the estimates of
the parameters and the predicted response are increased.

Introduction 11/35
Criteria for evaluating subset regression models
• Two key aspects of variable selection:
? Generating the subset models (the computational techniques for variable selection, covered in the next section);
? Deciding if one subset is better than another (the evaluation criteria below).
• (Coefficient of Multiple Determination R²) A measure of the adequacy of a regression model. Let R²ₚ denote the coefficient of multiple determination for a subset regression model with p terms, that is, p − 1 regressors and an intercept term β₀:
      R²ₚ = SSR(p)/SST = 1 − SSRes(p)/SST   (4)
? There are (K choose p − 1) values of R²ₚ for each value of p, and R²ₚ increases as p increases and is a maximum when p = K + 1;
? The analyst uses this criterion by adding regressors to the model up to the point where an additional variable is not useful in that it provides only a small increase in R²ₚ.

Introduction 12/35
• Since we cannot find an “optimum” value of R² for a subset regression model, we must look for a “satisfactory” value.
?
      R₀² = 1 − (1 − R²ₖ₊₁)(1 + dα,n,K),   (5)
  where dα,n,K = K Fα,K,n−K−1 / (n − K − 1) and R²ₖ₊₁ is the value of R² for the full model;
? Any subset of regressor variables producing an R² greater than R₀² is called an R²-adequate (α) subset.
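A small R sketch of (5), assuming only n, K, α, and the full-model R² are known; the Hald-data values used later in this chapter serve as a numerical check.

    ## Hedged sketch of (5): the R^2-adequate cutoff
    R0_squared <- function(R2_full, n, K, alpha = 0.05) {
      d <- K * qf(1 - alpha, K, n - K - 1) / (n - K - 1)   # d_{alpha,n,K}
      1 - (1 - R2_full) * (1 + d)
    }
    R0_squared(R2_full = 0.98238, n = 13, K = 4)   # about 0.9486 for the Hald data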
• (Adjusted R²) The adjusted R² statistic, defined for a p-term equation as
      R²Adj,p = 1 − ((n − 1)/(n − p))(1 − R²ₚ)   (6)
? The R²Adj,p statistic does not necessarily increase as additional regressors are introduced into the model;
? In fact, if s regressors are added to the model, R²Adj,p+s will exceed R²Adj,p if and only if the partial F statistic for testing the significance of the s additional regressors exceeds 1;
? Consequently, one criterion for selection of an optimum subset model is to choose the model that has a maximum R²Adj,p.

Introduction 13/35
• (Residual mean square) The residual mean square for a subset regression model is
      MSRes(p) = SSRes(p)/(n − p)   (7)
? The general behavior of MSRes(p) as p increases: it initially decreases, then stabilizes, and eventually may increase.
  (Remark: the eventual increase in MSRes(p) occurs when the reduction in SSRes(p) from adding a regressor to the model is not sufficient to compensate for the loss of one degree of freedom in the denominator of (7).)
? The subset regression model that minimizes MSRes(p) will also maximize R²Adj,p, since
      R²Adj,p = 1 − ((n − 1)/(n − p))(1 − R²ₚ) = 1 − ((n − 1)/(n − p)) · SSRes(p)/SST = 1 − MSRes(p)/(SST/(n − 1))
  Thus, the criteria minimum MSRes(p) and maximum adjusted R² are equivalent.
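As a quick numerical check of this equivalence, the following R sketch (assuming a hypothetical data frame `dat` with response y and candidate regressors x1-x4; the names are illustrative) tabulates MSRes(p) and the adjusted R² for a few nested models: the model with the smallest MSRes also has the largest adjusted R².

    ## Sketch: MSRes(p) and adjusted R^2 move together (hypothetical `dat`)
    fits <- list(lm(y ~ x1, data = dat),
                 lm(y ~ x1 + x2, data = dat),
                 lm(y ~ x1 + x2 + x3 + x4, data = dat))
    data.frame(p     = sapply(fits, function(f) length(coef(f))),
               MSRes = sapply(fits, function(f) sum(resid(f)^2) / f$df.residual),
               adjR2 = sapply(fits, function(f) summary(f)$adj.r.squared))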

Introduction 14/35
• (Mallow’s Cp Statistic)
      Cp(M) = SSRes(M)/σ̂² − n + 2·p(M)   (8)
? σ̂² = SSRes(F)/dfF is the “best” estimate of σ² we have (use the fullest model);
? SSRes(M) = ‖Y − ŶM‖² is the SSRes of the model M;
? p(M) is the number of predictors in M, or the degrees of freedom used up by the model;
? Cp is based on an estimate of the scaled mean square error of prediction,
      (1/σ²) Σᵢ₌₁ⁿ E[(Ŷᵢ − E(Yᵢ))²] = (1/σ²) Σᵢ₌₁ⁿ {[E(Ŷᵢ) − E(Yᵢ)]² + Var(Ŷᵢ)},
  i.e. squared bias plus variance of the fitted values.
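A minimal R sketch of (8), assuming a hypothetical data frame `dat` with response y and regressors x1-x4; σ̂² is taken from the fullest model, and p(M) here counts the intercept as well.

    ## Sketch of (8): Cp for a candidate subset model against the full model
    full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)         # the fullest model F
    M    <- lm(y ~ x1 + x2, data = dat)                   # a candidate subset model
    sigma2_hat <- sum(resid(full)^2) / full$df.residual   # SSRes(F) / df_F
    Cp <- sum(resid(M)^2) / sigma2_hat - nrow(dat) + 2 * length(coef(M))
    Cp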

Introduction 15/35
• (AIC & BIC)
? Mallow’s Cp is (almost) a special case of Akaike Information Criterion(AIC)
AIC (M) = −2logL(M) + 2 · p (M).
? L(M) is the likelihood function of the parameters in model M evaluated
at the MLE (Maximum Likelihood Estimators).
? Schwarz’s Bayesian Information Criterion (BIC)
BIC (M) = −2logL(M) + p (M) · log n.
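In R, AIC and BIC of fitted linear models can be compared directly; a sketch with the same hypothetical data frame `dat`. Note that for lm objects the parameter count includes the error variance.

    ## Sketch: comparing two candidate models by AIC and BIC (smaller is better)
    m1 <- lm(y ~ x1 + x2, data = dat)
    m2 <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
    AIC(m1, m2)   # -2 log L + 2 * (number of estimated parameters)
    BIC(m1, m2)   # -2 log L + log(n) * (number of estimated parameters)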

Introduction 16/35
Search strategies

• “Best subset”: search all possible models and take the one with the highest R²Adj or lowest Cp.
• Stepwise (forward, backward or both): useful when the number of predictors is
large. Choose an initial model and be “greedy”.
• “Greedy” means always take the biggest jump (up or down) in your selected
criterion.

Introduction 17/35
Implementations in R

• “Best subset”: use the function leaps. Works only for multiple linear regression
models.
• Stepwise: use the function step. Works for any model to which the Akaike Information Criterion (AIC) applies. In multiple linear regression, AIC is (almost) a linear function of Cp.
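A sketch of both implementations, assuming a hypothetical data frame `dat` with response y and candidate regressors x1-x4 (the names are illustrative):

    ## "Best subset" via the leaps package, stepwise via step()
    library(leaps)                                   # exhaustive all-subsets search
    X <- as.matrix(dat[, c("x1", "x2", "x3", "x4")])
    best <- leaps(x = X, y = dat$y, method = "Cp", nbest = 3)
    cbind(best$which, Cp = best$Cp)                  # candidate subsets and their Cp

    full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
    step(full, direction = "both")                   # greedy AIC-based stepwise search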

Introduction 18/35
Introduction

Computational Techniques for Variables Selection


All possible regression
Stepwise Regression Methods

Strategy for variable selection and model building

Computational Techniques for Variables Selection 19/35


Selection for variables

• To find the subset of variables to use in the final equation, it is natural to consider fitting models with various combinations of the candidate regressors.
• Computational techniques for generating subset regression models
? All possible regression
? Stepwise regression methods
I Forward selection;
I Backward elimination;
I Stepwise regression.

Computational Techniques for Variables Selection 20/35


All possible regression
• This procedure requires that the analyst fit all the regression equations involving
one candidate regressor, two candidate regressors, and so on.
• If there are K candidate regressors, there are 2^K total equations to be estimated and examined. Clearly the number of equations to be examined increases rapidly as the number of candidate regressors increases.
• Example 1: The Hald Cement Data
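A sketch of the all-possible-regressions computation in R, assuming the Hald cement data sit in a data frame `hald` with columns x1, x2, x3, x4 and response y (the intercept-only model is omitted):

    ## All 2^K - 1 = 15 non-empty subset regressions for the Hald data
    regs <- c("x1", "x2", "x3", "x4")
    subsets <- unlist(lapply(1:4, function(k) combn(regs, k, simplify = FALSE)),
                      recursive = FALSE)
    summary_tab <- t(sapply(subsets, function(s) {
      fit <- lm(reformulate(s, response = "y"), data = hald)
      c(p     = length(coef(fit)),
        R2    = summary(fit)$r.squared,
        MSRes = sum(resid(fit)^2) / fit$df.residual)
    }))
    rownames(summary_tab) <- sapply(subsets, paste, collapse = ",")
    summary_tab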

Computational Techniques for Variables Selection 21/35


• Clearly the least-squares estimate of an individual regression coefficient depends heavily on the other regressors in the model.
• The large changes in the regression coefficients observed in the Hald cement data are consistent with a serious problem with multicollinearity.

Computational Techniques for Variables Selection 22/35


Rp2 criterion
• From examining this display it is clear that
after two regressors are in the model, there
is little to be gained in terms of R 2 by
introducing additional variables;
• Both of the two regressor models (x1 , x2 )
and (x1 , x4 ) have essentially the same R 2
values, and in terms of this criterion, it
would make little difference which model is
selected as the final regression equation;
• If we take α = 0.05,
      R₀² = 1 − (1 − R₅²)(1 + 4F0.05,4,8/8)
          = 1 − (1 − 0.98238)(1 + 4(3.84)/8)
          = 0.94855
  Therefore, any subset regression model for which R²ₚ > R₀² = 0.94855 is R²-adequate (0.05), that is, its R² is not significantly different from R²ₖ₊₁.

Computational Techniques for Variables Selection 23/35


Simple correlations

It is instructive to examine the pairwise correlations between xi and xj and between xi and y.
• The pairs of regressors (x1, x3) and (x2, x4) are highly correlated;
• Consequently, adding further regressors when x1 and x2 or when x1 and x4 are
already in the model will be of little use since the information content in the
excluded regressors is essentially present in the regressors that are in the model.
• This correlative structure is partially responsible for the large changes in the
regression coefficients.

Computational Techniques for Variables Selection 24/35


MSRes (p ) vs. p
• The minimum residual mean square model is (x1, x2, x4), with MSRes(4) = 5.3303 (R²Adj = 0.97645);
• As expected, the model that minimizes
MSRes (p ) also maximizes the adjusted R 2 ;

• However, two of the other three-regressor models [(x1, x2, x3) and (x1, x3, x4)] and the two-regressor models [(x1, x2) and (x1, x4)] have comparable values of the residual mean square.
1. If either (x1 , x2 ) or (x1 , x4 ) is in the
model, there is little reduction in
residual mean square by adding
further regressors.
2. The subset model (x1 , x2 ) may be
more appropriate than (x1 , x4 )
because it has a smaller value of the
residual mean square.

Computational Techniques for Variables Selection 25/35


Cp (p ) vs. p
• Suppose we take σ̂² = 5.9829 (MSRes from the full model), then
      C₃ = SSRes(3)/σ̂² − n + 2p
         = 74.7621/5.9829 − 13 + 2(3) = 5.50
• From examination of this plot we find that
there are four models that could be
acceptable: (x1 , x2 ), (x1 , x2 , x3 ),
(x1 , x2 , x4 ), and (x1 , x3 , x4 );
• Without considering additional factors
such as technical information about the
regressors or the costs of data collection, it
may be appropriate to choose the simplest
model (x1 , x2 ) as the final model because
it has the smallest Cp .

Computational Techniques for Variables Selection 26/35


• This example has illustrated the computational procedure associated with model
building with all possible regressions.
• Note that there is no clear-cut choice of the best regression equation.
• Very often we find that different criteria suggest different equations. For
example, the minimum Cp equation is (x1 , x2 ) and the minimum MSRes equation
is (x1 , x2 , x4 ).
• All “final” candidate models should be subjected to the usual tests for adequacy,
including investigation of leverage points, influence, and multicollinearity.

? This table examines the two models (x1, x2) and (x1, x2, x4) with respect to PRESS and their variance inflation factors (VIFs).
? Both models have very similar values of PRESS (roughly twice the residual sum of squares for the minimum MSRes equation), and the R² for prediction computed from PRESS is similar for both models.
? However, x2 and x4 are highly multicollinear, as evidenced by the larger variance inflation factors in (x1, x2, x4).
? Since both models have equivalent PRESS statistics, we would recommend the model with (x1, x2) based on the lack of multicollinearity in this model.
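A sketch of these diagnostics in R (same hypothetical `hald` data frame as above); PRESS uses the leave-one-out identity eᵢ/(1 − hᵢᵢ), and the VIFs are computed by regressing each regressor on the others rather than calling an extra package.

    ## PRESS and variance inflation factors for the two finalists
    press <- function(fit) sum((resid(fit) / (1 - hatvalues(fit)))^2)
    vifs  <- function(fit) {
      X <- model.matrix(fit)[, -1, drop = FALSE]
      sapply(colnames(X), function(v)
        1 / (1 - summary(lm(X[, v] ~ X[, colnames(X) != v]))$r.squared))
    }
    m12  <- lm(y ~ x1 + x2,      data = hald)
    m124 <- lm(y ~ x1 + x2 + x4, data = hald)
    c(PRESS_12 = press(m12), PRESS_124 = press(m124))
    vifs(m12); vifs(m124)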

Computational Techniques for Variables Selection 27/35


Selection of variables: stepwise-type procedures

• Use case: when there is a large number of potential explanatory variables (say q), and we want to avoid computing all 2^q possible equations.
• Common feature: the variables are introduced into or deleted from the equation one at a time; only a subset of all possible equations is examined (evaluate at most q + 1 equations).
• Procedure categories:
? Forward selection procedure
I goes through the full set of variables and provides q possible equations.
? Backward elimination procedure
I involves fitting at most q regression equations.
? Stepwise regression (a modification of the FS procedure)
I a number of possible combinations of the two procedures above.

Computational Techniques for Variables Selection 28/35


Forward selection procedure

• Starts with an equation containing only a constant term but no regressors.


• Step 1: The first variable x1 included in the equation is the one which has the highest simple correlation with y.
? Retain it when β1 is significantly different from zero;
? Search for a second variable.

• Step 2: The second variable is the one which has the highest correlation with y ,
after y has been adjusted for the effect of the first variable.
? i.e. the variable has the highest simple correlation coefficient with the residuals from
step 1;
? Retain x2 when β 2 is significantly different from zero;
? Search for a third variable.
• ···
• Terminate the procedure: when the coefficient of the newly entered variable is insignificant.
? Judged by the standard t-statistic computed from the latest equation;
? Mostly by a low t cutoff value for testing the coefficient of the newly entered variable.
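A practical sketch of forward selection using step() in R (hypothetical data frame `dat` with regressors x1-x4); note that step() adds variables greedily by an AIC-type penalty rather than by the t cutoff described above.

    ## Forward selection, starting from the constant-only model
    null <- lm(y ~ 1, data = dat)
    step(null, scope = ~ x1 + x2 + x3 + x4, direction = "forward")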

Computational Techniques for Variables Selection 29/35


Backward elimination procedure
• Starts with the full equation and successively drops one variable at a time;
  the variables are dropped on the basis of their contribution to the reduction of SSRes.
• Step 1: The first variable x1 deleted from the equation is the one which contributes the least to the reduction of SSRes.
? i.e. the variable with the smallest t ratio (the ratio of the regression coefficient to the standard error of the coefficient).
? If all the t ratios are significant, all q regressors will be retained in the equation.
? If there are one or more variables with insignificant t ratios, drop the variable with the smallest (insignificant) t ratio.
• Step 2: The equation with the remaining q − 1 variables is fitted and the t ratios
for the new regression coefficients are examined.
• ···
• Terminate the procedure: when all the t ratios are significant or all but one
variable has been deleted.
? Judged by the standard t-statistic computed from the latest equation.
? Mostly by a high t cutoff value so that the procedure runs through the whole set of
variables.
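The corresponding backward-elimination sketch with step(), starting from the full equation (again AIC-driven rather than t-ratio-driven, and using the same hypothetical `dat`):

    ## Backward elimination, starting from the full model
    full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
    step(full, direction = "backward")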

Computational Techniques for Variables Selection 30/35


Stepwise method

• Essentially a forward selection procedure plus the proviso that at each stage the possibility of deleting a variable is considered.
? A variable that entered in the earlier stages of selection may be eliminated
at later stages.
? The calculations made for inclusion and deletion of variables are the same as in the two methods above.
? Requires two cutoff values, one for entering and one for removing variables.
I Frequently we choose tIN > tOUT, making it relatively more difficult to add a regressor than to delete one.
? Often different levels of significance are assumed for inclusion and exclusion
of variables from the equation.
? Caution: the order in which the variables enter or leave the equation
should not be interpreted as reflecting the relative importance of the
variables. (intercorrelation affects!)
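A sketch of the combined procedure with step() (hypothetical `dat`); R uses a single penalty k in place of separate tIN/tOUT cutoffs, so this only approximates the textbook scheme.

    ## Stepwise search in both directions between the null and full models
    null <- lm(y ~ 1, data = dat)
    step(null, scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4),
         direction = "both")      # k = log(nrow(dat)) would give a BIC-type search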

Computational Techniques for Variables Selection 31/35


Comparison and comments

• FS and BE work in opposite directions.


• A stopping rule: In FS, stop if minimum t ratio is less than 1; In BE, stop if
minimum t ratio is greater than 1.
• The partial F statistic is an alternative to the t statistic, since t²α/2,ν = Fα,1,ν.
• BE is particularly favored by analysts who like to see the effect of including all
the candidate regressors, so that nothing obvious will be missed.
• With noncollinear data all three procedures often work out nearly the same selection of variables, but they do not necessarily lead to the same choice of the final model.
• FS: once a regressor has been added, it cannot be removed at a later step.
• BE is better able to handle multicollinearity than FS because it is often less
adversely affected by the correlative structure of the regressors than is FS.
• NONE generally guarantees that the best subset regression model of any size will
be identified.

Computational Techniques for Variables Selection 32/35


Introduction

Computational Techniques for Variables Selection

Strategy for variable selection and model building

Strategy for variable selection and model building 33/35


Strategy for variable selection and model building
1. Fit the full model;
2. Perform residual analysis;
3. Do we need a transformation? If yes, transform the data and return to step 1; if no, continue;
4. Perform all possible regressions;
5. Select models for further analysis;
6. Make recommendations.
Strategy for variable selection and model building 34/35
Variable selection for high-dim data

When high-dimensional or ultrahigh-dimensional data appear in regression, i.e. p >> n, what can we do? Commonly used approaches include:
• LASSO
• SCAD
• Elastic-net
• MCP
• ···
• SIS
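As an illustration only, the LASSO and elastic net are available in the glmnet package; a sketch assuming a predictor matrix X (n × p) and response vector y:

    ## Penalized variable selection when p >> n
    library(glmnet)
    cvfit <- cv.glmnet(X, y, alpha = 1)     # alpha = 1: LASSO; 0 < alpha < 1: elastic net
    coef(cvfit, s = "lambda.min")           # variables with nonzero coefficients are selected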

Strategy for variable selection and model building 35/35
