Lecture 2 - Model Selection and Regularisation
Statistical Learning (CFAS420)
Alex Gibberd
Lancaster University
18th Feb 2020
Outline
Learning Outcomes:
I Understand methods for performing model selection in linear
regression
I Understand the difference between convex and non-convex
optimisation and estimation problems
I Know when different variable selection or regularisation methods
may be appropriate
I Know how to form, and/or stabilise estimates when the number of
covariates is large
Adding Covariates
I In the linear model f (X; β ) we can include (add) or exclude (remove) covariates
I This lecture is all about how to decide which of X1 , . . . , Xp to include in a model, and how to estimate the resulting coefficients
I As we know, adding a parameter βi increases the model complexity
– This can increase variance
– but decrease bias
– Trade-off: we can compare models via AIC/BIC (see the sketch after this list)
I Let's look at some simple ways to add or remove covariates from a model
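For example, a minimal R sketch of comparing a smaller and a larger linear model by AIC/BIC (and an F-test); the data frame dat and the variable names y, x1, x2 are placeholders, not objects from the Lab:

# Compare two nested linear models by information criteria.
m_small <- lm(y ~ x1, data = dat)
m_large <- lm(y ~ x1 + x2, data = dat)
AIC(m_small, m_large)    # lower AIC is preferred
BIC(m_small, m_large)    # BIC penalises extra parameters more heavily
anova(m_small, m_large)  # F-test for the added covariate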
Forward Selection
I Forward selection is where we start off with an “empty” model
I May start with just a constant term Y = α + ε
I Now, we sequentially add new variables X1 , . . .
I At each stage, we test for the significance of the new variable:
Y = α + ε   vs.   Y = α + βi Xi + ε ?
I There are p choices of variable to add next
I Add the one which is most significant
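A rough sketch of this procedure in base R is given below using step(); note that step() ranks candidate additions by AIC rather than an explicit F-test, and the data frame dat is again a placeholder:

# Forward selection: start from the intercept-only model and add one
# covariate at a time, choosing the addition that most improves AIC.
null_model <- lm(y ~ 1, data = dat)   # "empty" model: constant only
full_model <- lm(y ~ ., data = dat)   # scope: all p candidate covariates
fwd <- step(null_model, scope = formula(full_model), direction = "forward")
summary(fwd)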
Backwards Selection
I A similar procedure can be followed in reverse
– Start with all p covariates
– Remove one at a time until we don’t see any improvement
I Again, we can use the F-test, or AIC/BIC to assess improvement
I Pros: more likely to include covariates of interest
I Cons: unstable estimates of β̂ if p is large; the full starting model cannot be estimated classically if p ≥ n
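The corresponding backward sketch, again via step() and AIC with placeholder names:

# Backward selection: start from the full model and drop covariates
# one at a time while AIC keeps improving.
full_model <- lm(y ~ ., data = dat)
bwd <- step(full_model, direction = "backward")
summary(bwd)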
Subset Selection
I This is similar in notion to backwards/forwards selection
I We can decide whether to add or remove a covariate at each step.
I Example:
– We may add X1 at one step, but then later decide to remove it once
the model has expanded
I Pros: More parsimonious (simple) model, if some variables are not
required
I Cons: Computationally expensive. Infeasible for more than p ≈ 10.
Non-convex problem!! (we will see what this means later)
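One common implementation of best-subset selection in R is regsubsets() from the leaps package, sketched below with the same placeholder data frame; the exhaustive search over subsets is what makes the approach infeasible for large p:

# Best subset selection via an exhaustive search (leaps package).
library(leaps)
subsets <- regsubsets(y ~ ., data = dat, nvmax = 10)  # best model of each size up to 10
summary(subsets)$bic                                  # compare the sizes by BIC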
What if we have large p?
I In Lecture 1, we saw that the variance increases as p increases
I What if we are in an extreme situation where p ≥ n?
I Recall (or take my word for it) that minimising the OLS loss gives¹
β̂ = arg min_β ‖y − Xβ‖₂²
   = (XᵀX)⁻¹ Xᵀ y ,
where the (XᵀX)⁻¹ term is the questionable part if p ≥ n
I What happens to XᵀX when p ≥ n?
– Ans: no inverse exists (the matrix is referred to as rank-deficient); see the numerical illustration below
¹ To find this, just differentiate the loss function and set it equal to zero
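A quick numerical illustration of this point (simulated data, not from the Lab):

# When p >= n, X^T X is a p x p matrix of rank at most n < p,
# so it is rank-deficient and cannot be inverted.
set.seed(1)
n <- 5; p <- 10
X <- matrix(rnorm(n * p), n, p)
XtX <- crossprod(X)      # X^T X, a 10 x 10 matrix
qr(XtX)$rank             # rank is at most n = 5
## solve(XtX)            # uncommenting this line throws an error: the matrix is singular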
What if we have large p?
I To understand this, consider the following diagram: we have 2 covariates, but they are perfectly correlated. Attempting to regress Y onto X looks like:
Regularisation (Bayesian)
I One way to solve this identification challenge is to invoke so-called prior knowledge on the parameters β
I Usually (in statistics) cast as a Bayesian approach, this involves assuming that β follows a distribution, even before we see any data:
β ∼ πprior (γ)
– The parameters of the prior, γ, are known as hyper-parameters
I In the simplest case, we may assume that
βi ∼ N (0, 1/γ)
independently for each i = 1, . . . , p
I We can think of the prior as being based on an estimate over some imaginary data.
Regularisation (Maximum a Posteriori)
I This extra imaginary data helps us solve the problem with (XᵀX)⁻¹
I The posterior P(β | y, X; γ) is the distribution of β after updating for the observed data
– We can find this using Bayes' rule, but don't worry about the details here
I Now consider selecting our (non-random) estimate β̂ according to
β̂γ = arg max_β P(β | y, X; γ)
– This is known as the maximum a posteriori (MAP) estimator
– Rather than a whole distribution, this gives a single value, known as a point estimate
Ridge Regression
I It turns out², by assuming βi ∼ N (0, 1/γ), that
β̂ := arg max_β P(β | y, X; γ)
   = arg min_β (1/n) ‖y − Xβ‖₂² + λγ ‖β‖₂²
– where there is some mapping γ ↦ λγ > 0
– smaller γ ⟹ bigger λγ
– Recall: ‖β‖₂² = |β1 |² + . . . + |βp |²
² See Section 3.4.1 of [2] for details
Ridge Regression
I How does this help us with (XᵀX)⁻¹ ∈ ℝp×p ?
I Ans: taking derivatives again w.r.t. β gives
β̂ = (XᵀX + λ I)⁻¹ Xᵀ y
I Adding a quantity to the diagonal of XᵀX stabilises the inversion even when p > n
I This method is known as ridge regression
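A direct translation of this closed form into R (simulated placeholder data; scaling conventions for λ differ between textbooks and packages such as glmnet, so treat this as a sketch of the algebra rather than a drop-in replacement for a package fit):

# Ridge estimate via the closed form (X^T X + lambda I)^{-1} X^T y.
set.seed(2)
n <- 5; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)                   # placeholder response
lambda <- 1
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
beta_ridge                      # well-defined even though p > n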
A Geometric View
I Essentially, the penalty λ ‖β‖₂² acts to shrink the estimate β̂ towards zero
I We can interpret this geometrically as in the figure below
I The ball at the origin acts to constrain the un-penalised OLS estimator (the centre of the ellipses): the regularised estimate lies where they intersect
Convex Sets
I A set C is convex if the line segment between any two points in C
lies in C
I Consider two sets for (x1 , x2 ) ∈ ℝ²:
– A = {(x1 , x2 ) : ‖x‖₂ ≤ 1}
– B = {x1 = 0, |x2 | > 0} ∪ {|x1 | > 0, x2 = 0}
I One of these is convex, the other is not. (A is convex; B is not: the midpoint of (1, 0) and (0, 1) is (0.5, 0.5), which lies in neither part of B.)
Convex Sets
I These sets have important relationships to ridge-regression and
subset selection
– The ridge-regression constraint is over a set of a form similar to A
– The AIC penalty, where k = #{i : βi ≠ 0}, is a constraint of the form of B
I One problem is convex, the other is not...
Consequences of Non-Convexity
I A function f is convex if, for any pair of points (x1 , f (x1 )) and (x2 , f (x2 )), the line segment connecting them lies on or above the graph of f , i.e. f (t x1 + (1 − t) x2 ) ≤ t f (x1 ) + (1 − t) f (x2 ) for all t ∈ [0, 1]
I If a function is non-convex it can have multiple local minima
I Trying to optimise a non-convex function can leave us trapped in a local minimum; we may not escape to the global optimum.
I Consequences: If an estimation problem is non-convex, it is sensitive
to the initial parameterisation (starting point).
I Need to have a start-value strategy for reproducibility!
More info..
I For more details on convex optimisation the book by Boyd et al [1]
is highly recommended!
I Much (most) work in statistical/machine learning is in the formation
and optimisation of interesting cost functions
I Whether to use a convex/non-convex method will depend on the
application
– How much computing power do we have?
– How important is computational stability, i.e. finding global minima
– In practice, local minima can be very close together
Least Absolute Shrinkage and Selection Operator
(LASSO)
I It turns out that we can actually perform selection while maintaining a convex optimisation problem.
I The idea, originally put forward in [3] and called the lasso, is to utilise a different type of prior knowledge and instead penalise the OLS estimator with an ℓ1 norm:
β̂ = arg min_β (1/n) ‖y − Xβ‖₂² + λ ‖β‖₁ .
– Recall: ‖β‖₁ = |β1 | + . . . + |βp |, i.e. the sum of the absolute coefficient values.
I This optimisation problem is still convex, i.e. it has a single global solution. However, it also selects a subset of the parameters, setting many β̂i exactly to 0
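A minimal glmnet sketch of the lasso (alpha = 1 gives the pure ℓ1 penalty; the design matrix and response are simulated placeholders):

# Lasso via glmnet: alpha = 1 selects the pure l1 penalty.
library(glmnet)
set.seed(3)
x <- matrix(rnorm(100 * 20), 100, 20)               # placeholder design
y <- drop(x[, 1:3] %*% c(3, -2, 1) + rnorm(100))    # only 3 covariates truly active
fit <- glmnet(x, y, alpha = 1)
coef(fit, s = 0.1)   # coefficients at lambda = 0.1: many entries are exactly zero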
Geometric Interpretation
I This is like ridge regression, but now the constraint set is not smooth
I The sharp corners of the constraint enable selection: in this case, we have regions (the grey areas) in ℝ² where β̂1 = 0 or β̂2 = 0.
Lasso in Practice
I In practice, for instance model of Lab 1. We can extend the model
by adding non-informative covariates X11,... Xp=100 where the true
coefficients are zero, i.e. β11 , . . . , β100 = 0
I Performing lasso regression via caret gives us
– Remarkably, even in the case where n = 50 < p we still recover close
to the true parameters.
– However, some shrinkage (bias) towards zero in active parameters
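A self-contained sketch of this kind of experiment (simulated data standing in for the Lab 1 model; the dimensions and true coefficient values are illustrative only):

# Simulate n = 50 observations with p = 100 covariates, only the first 10 active.
library(glmnet)
set.seed(420)
n <- 50; p <- 100
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(rep(2, 10), rep(0, p - 10))    # beta_11, ..., beta_100 are zero
y <- as.numeric(X %*% beta_true + rnorm(n))
cvfit <- cv.glmnet(X, y, alpha = 1)           # lasso, lambda chosen by cross-validation
beta_hat <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]   # drop the intercept
sum(beta_hat != 0)     # number of selected covariates, close to the 10 active ones
# The non-zero estimates are shrunk towards zero relative to beta_true (bias).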
The Elastic Net
I The selection properties of the lasso only go so far.
I If s ≥ n, where s is the number of non-zero parameters, the estimator can become unstable (the loss function becomes very flat)
I To avoid this, it has been suggested [4] to combine the lasso and ridge regression penalties:
β̂ := arg min_β (1/n) ‖y − Xβ‖₂² + λ (1 − α)‖β‖₁ + λ α ‖β‖₂² .
– The ℓ2 ridge-regression penalty adds curvature to ensure the problem is always convex
– The additional parameter α ∈ (0, 1] selects the combination of lasso and ridge
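A sketch of an elastic net fit via glmnet, reusing the simulated X and y from the previous sketch. Note that glmnet's alpha argument is the weight on the ℓ1 (lasso) part of its penalty, so there alpha = 1 is the pure lasso and alpha = 0 is pure ridge, the opposite orientation to the α in the formula above:

# Elastic net via glmnet: an even lasso/ridge mix, with lambda chosen by CV.
library(glmnet)
enet <- cv.glmnet(X, y, alpha = 0.5)
coef(enet, s = "lambda.min")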
Selecting Tuning Parameters
I In order to select good values of λ and α we generally use cross-validation
I This is a powerful tool to help us evaluate how well the models strike the bias-variance trade-off
I An example from the Lab is shown below; in this case the pure lasso performs best
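A sketch of tuning λ and α jointly by cross-validation through caret (method = "glmnet"); the tuning grid is arbitrary and X, y are the simulated data from the earlier sketch:

# Tune alpha and lambda by 10-fold cross-validation with caret.
library(caret)
colnames(X) <- paste0("X", seq_len(ncol(X)))   # caret expects named columns
ctrl <- trainControl(method = "cv", number = 10)
grid <- expand.grid(alpha  = c(0, 0.5, 1),
                    lambda = 10^seq(-3, 0, length.out = 20))
cv_fit <- train(x = X, y = y, method = "glmnet", trControl = ctrl, tuneGrid = grid)
cv_fit$bestTune        # the selected (alpha, lambda) pair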
Summary
I Introduced several methods to select which covariates to include, by deciding whether β̂i = 0 or not
I Discussed the difference between convex and non-convex cost functions and why it matters
I Introduced regularisation (ridge, lasso, elastic net) as motivated by
placing priors on parameters when p is large
I Demonstrated that lasso, ridge and elastic-net form convex
optimisation problems =⇒ global minima
I If the parameterisation is sparse (lots of true βi = 0) then the lasso can recover the true structure
In The Lab
1. Implement forward/backward/subset selection in R and caret
2. Use elastic net to implement ridge regression, lasso, and their
combination
3. Use cross-validation to select tuning parameters via caret
4. Demonstrate the high-dimensional estimation properties of
regularised estimators
References
[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[2] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2009.
[3] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[4] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.