Lecture 2 - Model Selection and Regularisation
Statistical Learning (CFAS420)
Alex Gibberd
Lancaster University
18th Feb 2020
Outline
Learning Outcomes:
I Understand methods for performing model selection in linear
regression
I Understand the difference between convex and non-convex
optimisation and estimation problems
I Know when different variable selection or regularisation methods
may be appropriate
I Know how to form, and/or stabilise estimates when the number of
covariates is large
Adding Covariates
I In the linear model f (X; β ) we can include (add) or exclude (remove) covariates
I This lecture is all about how to decide which of X1 , . . . , Xp to include in a model, and how to estimate the resulting coefficients
I As we know, adding a parameter βi increases the model complexity
– This can increase variance
– but decrease bias
– Trade-off: we can compare models via AIC/BIC (see the sketch after this list)
I Let's look at some simple ways to add or remove covariates from a model
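For example, a minimal R sketch of comparing a smaller and a larger linear model by AIC/BIC (and an F-test); the data frame dat and the variable names y, x1, x2 are placeholders, not objects from the Lab:

# Compare two nested linear models by information criteria.
m_small <- lm(y ~ x1, data = dat)
m_large <- lm(y ~ x1 + x2, data = dat)
AIC(m_small, m_large)    # lower AIC is preferred
BIC(m_small, m_large)    # BIC penalises extra parameters more heavily
anova(m_small, m_large)  # F-test for the added covariate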
Forward Selection
I Forward selection is where we start off with an “empty” model
I May start with just a constant term Y = α + ε
I Now, we sequentially add new variables X1 , . . .
I At each stage, we test for the significance of the new variable:
Y = α + ε   vs.   Y = α + βi Xi + ε ?
I There are p choices of variable to add next
I Add the one which is most significant
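A rough sketch of this procedure in base R is given below using step(); note that step() ranks candidate additions by AIC rather than an explicit F-test, and the data frame dat is again a placeholder:

# Forward selection: start from the intercept-only model and add one
# covariate at a time, choosing the addition that most improves AIC.
null_model <- lm(y ~ 1, data = dat)   # "empty" model: constant only
full_model <- lm(y ~ ., data = dat)   # scope: all p candidate covariates
fwd <- step(null_model, scope = formula(full_model), direction = "forward")
summary(fwd)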
Backwards Selection
I A similar procedure can be followed in reverse
– Start with all p covariates
– Remove one at a time until we don’t see any improvement
I Again, we can use the F-test, or AIC/BIC to assess improvement
I Pros: more likely to include covariates of interest
I Cons: unstable estimates of β̂ if p is large; the full starting model cannot be estimated classically if p ≥ n
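The corresponding backward sketch, again via step() and AIC with placeholder names:

# Backward selection: start from the full model and drop covariates
# one at a time while AIC keeps improving.
full_model <- lm(y ~ ., data = dat)
bwd <- step(full_model, direction = "backward")
summary(bwd)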
Subset Selection
I This is similar in notion to backwards/forwards selection
I We can decide whether to add or remove a covariate at each step.
I Example:
– We may add X1 at one step, but then later decide to remove it once
the model has expanded
I Pros: More parsimonious (simple) model, if some variables are not
required
I Cons: Computationally expensive. Infeasible for more than p ≈ 10.
Non-convex problem!! (we will see what this means later)
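One common implementation of best-subset selection in R is regsubsets() from the leaps package, sketched below with the same placeholder data frame; the exhaustive search over subsets is what makes the approach infeasible for large p:

# Best subset selection via an exhaustive search (leaps package).
library(leaps)
subsets <- regsubsets(y ~ ., data = dat, nvmax = 10)  # best model of each size up to 10
summary(subsets)$bic                                  # compare the sizes by BIC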
What if we have large p?
I In Lecture 1, we saw that the variance increases as p increases
I What if we are in an extreme situation where p ≥ n?
I Recall (or take my word for it) that minimising the OLS loss gives¹
β̂ = arg min_β ‖y − Xβ‖₂²
   = (XᵀX)⁻¹ Xᵀ y ,
where the (XᵀX)⁻¹ term is the questionable part if p ≥ n
I What happens to XᵀX when p ≥ n?
– Ans: no inverse exists (the matrix is referred to as rank-deficient); see the numerical illustration below
¹ To find this, just differentiate the loss function and set it equal to zero
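A quick numerical illustration of this point (simulated data, not from the Lab):

# When p >= n, X^T X is a p x p matrix of rank at most n < p,
# so it is rank-deficient and cannot be inverted.
set.seed(1)
n <- 5; p <- 10
X <- matrix(rnorm(n * p), n, p)
XtX <- crossprod(X)      # X^T X, a 10 x 10 matrix
qr(XtX)$rank             # rank is at most n = 5
## solve(XtX)            # uncommenting this line throws an error: the matrix is singular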
What if we have large p?
I To understand this, consider the following diagram: we have 2 covariates, but they are perfectly correlated. Attempting to regress Y onto X looks like:
Regularisation (Bayesian)
I One way to solve this identification challenge is to invoke so-called prior knowledge on the parameters β
I Usually (in statistics) cast as a Bayesian approach, this involves assuming that β follows a distribution, even before we see any data:
β ∼ πprior (γ)
– The parameters of the prior, γ, are known as hyper-parameters
I In the simplest case, we may assume that
βi ∼ N (0, 1/γ)
independently for each i = 1, . . . , p
I We can think of the prior as being based on an estimate over some imaginary data.
Regularisation (Maximum a Posteriori)
I This extra imaginary data helps us solve the problem with (XᵀX)⁻¹
I The posterior P(β | y, X; γ) is the distribution of β after updating for the observed data
– We can find this using Bayes' rule, but don't worry about the details here
I Now consider selecting our (non-random) estimate β̂ according to
β̂γ = arg max_β P(β | y, X; γ)
– This is known as the maximum a posteriori (MAP) estimator
– Rather than a whole distribution, this gives a single value, known as a point estimate
Ridge Regression
I It turns out², by assuming βi ∼ N (0, 1/γ), that
β̂ := arg max_β P(β | y, X; γ)
   = arg min_β (1/n) ‖y − Xβ‖₂² + λγ ‖β‖₂²
– where there is some mapping γ ↦ λγ > 0
– smaller γ ⟹ bigger λγ
– Recall: ‖β‖₂² = |β1 |² + . . . + |βp |²
² See Section 3.4.1 of [2] for details
Ridge Regression
I How does this help us with (XᵀX)⁻¹ ∈ ℝp×p ?
I Ans: taking derivatives again w.r.t. β gives
β̂ = (XᵀX + λ I)⁻¹ Xᵀ y
I Adding a quantity to the diagonal of XᵀX stabilises the inversion even when p > n
I This method is known as ridge regression
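A direct translation of this closed form into R (simulated placeholder data; scaling conventions for λ differ between textbooks and packages such as glmnet, so treat this as a sketch of the algebra rather than a drop-in replacement for a package fit):

# Ridge estimate via the closed form (X^T X + lambda I)^{-1} X^T y.
set.seed(2)
n <- 5; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                   # placeholder response
lambda <- 1
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
beta_ridge                      # well-defined even though p > n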
A Geometric View
I Essentially, the penalty λ ‖β‖₂² acts to shrink the estimate β̂ towards zero
I We can interpret this geometrically as in the figure below
I The ball at the origin acts to constrain the un-penalised OLS estimator (the centre of the ellipses): the regularised estimate lies where they intersect
Convex Sets
I A set C is convex if the line segment between any two points in C
lies in C
I Consider two sets for (x1 , x2 ) ∈ ℝ²:
– A = {(x1 , x2 ) : ‖x‖₂ ≤ 1}
– B = {x1 = 0, |x2 | > 0} ∪ {|x1 | > 0, x2 = 0}
I One of these is convex, the other is not. (A is convex; B is not: the midpoint of (1, 0) and (0, 1) is (0.5, 0.5), which lies in neither part of B.)
Convex Sets
I These sets have important relationships to ridge-regression and
subset selection
– The ridge-regression constraint is over a set of a form similar to A
– The AIC penalty, where k = #{i : βi ≠ 0}, is a constraint of the form of B
I One problem is convex, the other is not...
Consequences of Non-Convexity
I A function f is convex if, for any pair of points (x1 , f (x1 )) and (x2 , f (x2 )), the line segment connecting them lies on or above the graph of f , i.e. f (t x1 + (1 − t) x2 ) ≤ t f (x1 ) + (1 − t) f (x2 ) for all t ∈ [0, 1]
I If a function is non-convex it can have multiple local minima
I Trying to optimise a non-convex function can leave us trapped in a local minimum; we may not escape to the global optimum.
I Consequences: If an estimation problem is non-convex, it is sensitive
to the initial parameterisation (starting point).
I Need to have a start-value strategy for reproducibility!
More info..
I For more details on convex optimisation the book by Boyd et al [1]
is highly recommended!
I Much (most) work in statistical/machine learning is in the formation
and optimisation of interesting cost functions
I Whether to use a convex/non-convex method will depend on the
application
– How much computing power do we have?
– How important is computational stability, i.e. finding global minima
– In practice, local minima can be very close together
Least Absolute Shrinkage and Selection Operator
(LASSO)
I It turns out that we can actually perform selection while maintaining a convex optimisation problem.
I The idea, originally put forward in [3] and called the lasso, is to utilise a different type of prior knowledge and instead penalise the OLS estimator with an ℓ1 norm:
β̂ = arg min_β (1/n) ‖y − Xβ‖₂² + λ ‖β‖₁ .
– Recall: ‖β‖₁ = |β1 | + . . . + |βp |, i.e. the sum of the absolute coefficient values.
I This optimisation problem is still convex, i.e. it has a single global solution. However, it also selects a subset of the parameters, setting many β̂i exactly to 0
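A minimal glmnet sketch of the lasso (alpha = 1 gives the pure ℓ1 penalty; the design matrix and response are simulated placeholders):

# Lasso via glmnet: alpha = 1 selects the pure l1 penalty.
library(glmnet)
set.seed(3)
x <- matrix(rnorm(100 * 20), 100, 20)               # placeholder design
y <- drop(x[, 1:3] %*% c(3, -2, 1) + rnorm(100))    # only 3 covariates truly active
fit <- glmnet(x, y, alpha = 1)
coef(fit, s = 0.1)   # coefficients at lambda = 0.1: many entries are exactly zero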
Geometric Interpretation
I This is like ridge regression, but now the constraint set is not smooth
I The sharp corners of the constraint enable selection: in this case, we have regions (the grey areas) in ℝ² where β̂1 = 0 or β̂2 = 0.
Lasso in Practice
I In practice, for instance model of Lab 1. We can extend the model
by adding non-informative covariates X11,... Xp=100 where the true
coefficients are zero, i.e. β11 , . . . , β100 = 0
I Performing lasso regression via caret gives us
– Remarkably, even in the case where n = 50 < p we still recover close
to the true parameters.
– However, some shrinkage (bias) towards zero in active parameters
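A self-contained sketch of this kind of experiment (simulated data standing in for the Lab 1 model; the dimensions and true coefficient values are illustrative only):

# Simulate n = 50 observations with p = 100 covariates, only the first 10 active.
library(glmnet)
set.seed(420)
n <- 50; p <- 100
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(rep(2, 10), rep(0, p - 10))    # beta_11, ..., beta_100 are zero
y <- as.numeric(X %*% beta_true + rnorm(n))
cvfit <- cv.glmnet(X, y, alpha = 1)           # lasso, lambda chosen by cross-validation
beta_hat <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]   # drop the intercept
sum(beta_hat != 0)     # number of selected covariates, close to the 10 active ones
# The non-zero estimates are shrunk towards zero relative to beta_true (bias).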
The Elastic Net
I The selection properties of the lasso only go so far.
I If s ≥ n, where s is the number of non-zero parameters, the estimator can become unstable (the loss function becomes very flat)
I To avoid this, it has been suggested [4] to combine the lasso and ridge regression penalties:
β̂ := arg min_β (1/n) ‖y − Xβ‖₂² + λ (1 − α)‖β‖₁ + λ α ‖β‖₂² .
– The ℓ2 ridge-regression penalty adds curvature to ensure the problem is always convex
– The additional parameter α ∈ (0, 1] selects the combination of lasso and ridge
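A sketch of an elastic net fit via glmnet, reusing the simulated X and y from the previous sketch. Note that glmnet's alpha argument is the weight on the ℓ1 (lasso) part of its penalty, so there alpha = 1 is the pure lasso and alpha = 0 is pure ridge, the opposite orientation to the α in the formula above:

# Elastic net via glmnet: an even lasso/ridge mix, with lambda chosen by CV.
library(glmnet)
enet <- cv.glmnet(X, y, alpha = 0.5)
coef(enet, s = "lambda.min")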
Selecting Tuning Parameters
I In order to select good values of λ and α we generally use cross-validation
I This is a powerful tool to help us evaluate how well the models strike the bias-variance trade-off
I An example from the Lab is shown below; in this case the pure lasso performs best
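A sketch of tuning λ and α jointly by cross-validation through caret (method = "glmnet"); the tuning grid is arbitrary and X, y are the simulated data from the earlier sketch:

# Tune alpha and lambda by 10-fold cross-validation with caret.
library(caret)
colnames(X) <- paste0("X", seq_len(ncol(X)))   # caret expects named columns
ctrl <- trainControl(method = "cv", number = 10)
grid <- expand.grid(alpha  = c(0, 0.5, 1),
                    lambda = 10^seq(-3, 0, length.out = 20))
cv_fit <- train(x = X, y = y, method = "glmnet", trControl = ctrl, tuneGrid = grid)
cv_fit$bestTune        # the selected (alpha, lambda) pair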
Summary
I Introduced several methods to select which covariates to include, by deciding whether β̂i = 0 or not
I Discussed the difference between convex and non-convex cost functions and why it matters
I Introduced regularisation (ridge, lasso, elastic net) as motivated by
placing priors on parameters when p is large
I Demonstrated that lasso, ridge and elastic-net form convex
optimisation problems =⇒ global minima
I If the parameterisation is sparse (lots of true βi = 0) then the lasso can recover the true structure
In The Lab
1. Implement forward/backward/subset selection in R and caret
2. Use elastic net to implement ridge regression, lasso, and their
combination
3. Use cross-validation to select tuning parameters via caret
4. Demonstrate the high-dimensional estimation properties of
regularised estimators
References
[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[2] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2009.
[3] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[4] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.