L1L2_regularization_comparison
Introduction
This report is dedicated to the differences between L1 and L2 regularization and the reasons behind them.
Regularization is a technique used to prevent overfitting and improve the generalization ability of models. When a model is overfitting, it has adapted to the training data too well and may not perform well on new, unseen data. More formally, a model's error can be decomposed into bias and variance, where bias is the error from erroneous assumptions and variance is the error from sensitivity to small fluctuations (noise) in the training set. A model is considered overfit if the variance becomes too high while the bias is low. Overfitting is often accompanied by weights with large absolute values, as such weights tend to cause large changes in the output for small changes in the inputs. Regularization modifies the loss function in order to limit weight growth.
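For reference, the decomposition mentioned above can be written as follows (standard notation that the report itself does not introduce: f is the true function, f̂ the fitted model, σ² the irreducible noise):

\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}} + \sigma^2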
L1 regularization adds the sum of the absolute values of the model's weights to the loss function:

Loss_L1 = Loss + λ ∑ᵢ |wᵢ|

And L2 regularization adds the sum of the squared values of the model's weights:

Loss_L2 = Loss + λ ∑ᵢ wᵢ²
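As a minimal illustration (a sketch with assumed details such as the mean-squared-error base loss and the value of lam, not code from the report), the two penalties can be attached to a loss in a few lines of Python:

import numpy as np

def regularized_loss(w, X, y, lam, kind="l2"):
    """Mean-squared-error loss plus an L1 or L2 penalty on the weights w."""
    mse = np.mean((X @ w - y) ** 2)            # base loss (assumed to be MSE)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))      # L1: lambda * sum of |w_i|
    else:
        penalty = lam * np.sum(w ** 2)         # L2: lambda * sum of w_i^2
    return mse + penalty

# Usage on random data, just to show the call:
rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(20, 3)), rng.normal(size=20), rng.normal(size=3)
print(regularized_loss(w, X, y, lam=0.1, kind="l1"))
print(regularized_loss(w, X, y, lam=0.1, kind="l2"))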
Feature selection
L1, contrary to L2, tends to perform feature selection: when L1 is used, some weights associated with irrelevant or less important features are set exactly to zero. To see why L1 encourages this and L2 does not, consider an example of linear regression: there are two features x1 and x2, β1 and β2 are the corresponding weights, and the loss function is quadratic, so its contour lines are ellipses. The contour lines of the L1 regularizer (λ ∑ᵢ |wᵢ|) are rhombuses, and those of the L2 regularizer (λ ∑ᵢ wᵢ²) are circles.
Figure 1: contour lines of the loss and the regularizer (L1 on the left, L2 on the right)
On this illustration β represents the minimum of the loss function without regularization. With regularization the minimum, denoted β^λ, will actually be closer to (0, 0); in the pictures below we assume it lies on the only drawn contour line of the regularizer. Let's take a closer look.
We can see that for the same example L1 will set β2 to zero and L2 won't. Obviously, there are cases when L1 won't set a weight to zero, but it is intuitively clear for geometrical reasons that, for contour lines of a fixed shape, the set of points for which L1 sets one feature to zero is larger than for L2. Figure 4 shows an example for circular contour lines: the red axes and the grey area represent the points β^λ for which one feature will be set to zero by L2 and L1, respectively.
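This difference is easy to observe numerically as well. Below is a small sketch (assumed synthetic data and penalty strengths; scikit-learn's Lasso and Ridge stand in for L1- and L2-regularized linear regression, and are not referenced in the report) fitting both models on data where the second feature is nearly irrelevant:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(scale=0.5, size=200)

# L1 typically drives the nearly irrelevant weight to exactly zero,
# while L2 only shrinks it towards zero.
print("L1 (Lasso):", Lasso(alpha=0.1).fit(X, y).coef_)
print("L2 (Ridge):", Ridge(alpha=0.1).fit(X, y).coef_)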
A number of experiments justify L1's tendency to favor sparsity, for example this one [1]: for two weights β1 and β2, a quadratic loss function is generated:

Loss = a(β1 − c1)² + b(β2 − c2)² + c(β1 − c1)(β2 − c2),

where every parameter is uniformly distributed (a ~ U(0, 10), b ~ U(0, 10), c ~ U(-2, 2), c1 ~ U(-10, 10), c2 ~ U(-10, 10)). Next, the sum of the loss and the regularizer is minimized. After 5000 trials for each regularizer the following results were obtained: one of the coefficients became zero in 72% of runs for L1 and in 5% of runs for L2. Such loss functions may not represent real data, but they support the intuition described above.
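A similar experiment can be reproduced with the sketch below (this is not the code from [1]; the regularization strength, the derivative-free optimizer, the zero threshold, and the convexity guard are assumptions, so the exact percentages may differ):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
LAM, N_TRIALS, ZERO_TOL = 5.0, 5000, 1e-3   # assumed constants

def fraction_with_zero(kind):
    zeros, runs = 0, 0
    for _ in range(N_TRIALS):
        # Random quadratic loss: a(b1-c1)^2 + b(b2-c2)^2 + c(b1-c1)(b2-c2)
        a, b = rng.uniform(0, 10, size=2)
        c = rng.uniform(-2, 2)
        c1, c2 = rng.uniform(-10, 10, size=2)
        if 4 * a * b <= c ** 2:
            continue  # skip non-convex draws so the minimum exists (our own guard)

        def objective(beta):
            d1, d2 = beta[0] - c1, beta[1] - c2
            loss = a * d1 ** 2 + b * d2 ** 2 + c * d1 * d2
            reg = LAM * (np.abs(beta).sum() if kind == "l1" else (beta ** 2).sum())
            return loss + reg

        beta_opt = minimize(objective, x0=np.array([c1, c2]), method="Nelder-Mead").x
        runs += 1
        zeros += np.any(np.abs(beta_opt) < ZERO_TOL)
    return zeros / runs

for kind in ("l1", "l2"):
    print(kind, fraction_with_zero(kind))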
Conclusion: the reasoning above has been given for a model with a quadratic loss function and two features. While it carries over to feature spaces with more dimensions, a different loss function may not have ellipses as contour lines. However, loss functions are usually convex, resulting in convex contour lines, which means the same logic can still be applied.
It should be noted that L1 regularization's tendency to favor sparsity can be a disadvantage. A 'less important' feature may still matter, or, even worse, when two features are equally important, L1 may arbitrarily choose one of them and set its weight to zero (see the sketch below).
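As an illustration of the last point (a sketch with assumed data; which of the two identical features survives depends on the solver):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=300)
X = np.column_stack([x, x])                  # two equally important (identical) features
y = 2.0 * x + rng.normal(scale=0.1, size=300)

# L1 typically keeps one of the duplicated features and zeroes the other.
print(Lasso(alpha=0.05).fit(X, y).coef_)     # e.g. [~2.0, 0.0]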
Summary
The biggest advantage of L1 regularization is its ability to perform feature selection, but it can be excessive, which leads to the following conclusion: L1 regularization should be used when you explicitly need sparse features; otherwise L2 is preferred. Feature sparsity may be a desired outcome when feature engineering hasn't been performed beforehand (so there is a chance that some features are irrelevant), or when reducing feature dimensionality is required, for example to decrease computational costs.
References
[1]: https://github.com/parrt/website-explained.ai/tree/master/regularization/code – code for the experiment, written by Terence Parr, former professor of computer/data science at the University of San Francisco.