
Report on L1 / L2 regularization difference

Introduction
This report is dedicated to the difference between L1 and L2 regularization and the reasons behind it.
Regularization is a technique used to prevent overfitting and improve the generalization ability of models. When a model is overfitting, it has adapted to the training data too well and may not perform well on new, unseen data. More formally, a model's error can be decomposed into bias and variance, where the bias error comes from erroneous assumptions and the variance error comes from sensitivity to small fluctuations (noise) in the training set. A model is considered overfit if the variance becomes too high while the bias is low. This is often accompanied by weights with large absolute values, as such weights tend to cause large changes in the output for small changes in the inputs. Regularization modifies the loss function in order to limit weight growth.
L1 regularization adds the sum of the absolute values of the model's weights to the loss
function:

ModifiedLoss(w, X, y) = Loss(w, X, y) + λ ∑_i |w_i|    (1.1)

And L2 regularization adds the sum of the squared values of the model's weights:

ModifiedLoss(w, X, y) = Loss(w, X, y) + λ ∑_i w_i²    (1.2)
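As a small illustrative sketch (not part of the original report; the function names and the use of a mean-squared-error base loss are assumptions made here), the two modified losses from equations (1.1) and (1.2) could look as follows for a linear model:

```python
import numpy as np

def base_loss(w, X, y):
    """Unregularized loss: here, mean squared error of a linear model."""
    return np.mean((X @ w - y) ** 2)

def l1_modified_loss(w, X, y, lam):
    """Equation (1.1): base loss plus lambda * sum of |w_i|."""
    return base_loss(w, X, y) + lam * np.sum(np.abs(w))

def l2_modified_loss(w, X, y, lam):
    """Equation (1.2): base loss plus lambda * sum of w_i squared."""
    return base_loss(w, X, y) + lam * np.sum(w ** 2)
```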

Difference between L1 and L2 regularization


Optimisation
The most popular optimisation method in machine learning is, without a doubt, gradient descent. One of the conditions for it to work is that the function being minimised is differentiable. The L2 regularizer meets this condition, whereas the L1 term is not differentiable at zero, which requires handling this case separately, for example by taking zero as the (sub)gradient at that point.
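A minimal sketch of this point (plain NumPy is assumed; this is not code from the report): the L2 penalty has a gradient everywhere, while for the L1 penalty a subgradient has to be chosen at zero, and `np.sign` conveniently returns 0 there, matching the convention mentioned above.

```python
import numpy as np

def l2_penalty_grad(w, lam):
    # Gradient of lam * sum(w_i**2) is 2 * lam * w_i, defined for every w_i.
    return 2 * lam * w

def l1_penalty_subgrad(w, lam):
    # Gradient of lam * sum(|w_i|) is lam * sign(w_i) for w_i != 0.
    # At w_i == 0 any value in [-lam, lam] is a valid subgradient;
    # np.sign(0) == 0, i.e. zero is assigned as the result.
    return lam * np.sign(w)
```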

Feature selection
L1, contrary to L2, tends to perform feature selection. This means that with L1 some weights associated with irrelevant or less important features will be set exactly to zero. To see why L1 encourages this and L2 does not, consider an example of linear regression: there are two features x_1 and x_2, β_1 and β_2 are the corresponding weights, and the loss function is quadratic:

Loss(w, X, y) = ∑ (y − β_0 − β_1 x_1 − β_2 x_2)²    (1.3)
Next, let's take a look at the plane formed by β_1 and β_2. The loss function's contour lines (sets of points where the function has the same value) will be ellipses; for the L1 regularizer (λ ∑_i |w_i|) they will be rhombuses, and for the L2 regularizer (λ ∑_i w_i²) they will be circles.

Figure 1: loss and regularizer (L1 on the left, L2 on the right) contour lines
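Contour pictures like the one in figure 1 can be reproduced with a short script; the following sketch (with arbitrarily chosen example coefficients, not the ones behind the report's figures) draws elliptic loss contours together with one rhombus-shaped L1 contour and one circular L2 contour.

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid over the (beta_1, beta_2) plane.
b1, b2 = np.meshgrid(np.linspace(-3, 3, 400), np.linspace(-3, 3, 400))

# An arbitrary quadratic loss centred away from the origin -> elliptic contours.
loss = 2.0 * (b1 - 1.5) ** 2 + 0.5 * (b2 - 1.0) ** 2
l1_reg = np.abs(b1) + np.abs(b2)   # rhombus-shaped contours
l2_reg = b1 ** 2 + b2 ** 2         # circular contours

fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharey=True)
for ax, reg, name in [(axes[0], l1_reg, "L1"), (axes[1], l2_reg, "L2")]:
    ax.contour(b1, b2, loss, levels=8, colors="tab:blue")
    ax.contour(b1, b2, reg, levels=[1.0], colors="tab:red")  # one painted regularizer contour
    ax.set_title(f"loss contours and one {name} contour")
    ax.set_xlabel("beta_1")
axes[0].set_ylabel("beta_2")
plt.show()
```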
In figure 1, β^λ represents the minimum of the loss function without regularization. With regularization the minimum will actually be closer to (0, 0); in the pictures below we suppose it lies on the only painted contour line of the regularizer. Let's take a closer look.

Figure 2: example for L1


Figure 3: example for L2

We can see that for the same example L1 will set β_2 to zero while L2 won't. Obviously, there are cases when L1 won't set a weight to zero, but it is intuitively clear for geometrical reasons that, for contour lines of a fixed shape, the set of points for which L1 sets one feature to zero is larger than for L2. Figure 4 shows an example for circular contour lines: the red axes and the grey area represent the points β^λ where one feature will be set to zero by L2 and L1 respectively.

Figure 4: example for circular contour lines


Another question is: why will the irrelevant feature be set to zero, and not the important one? The importance of a feature influences how exactly the loss function's contour lines are stretched. In figures 2 and 3 the ellipses are stretched along the β_2 axis, meaning that large changes of β_2 lead to small changes of the loss function compared to β_1 (thus, β_2 is less important than β_1). Again, for geometrical reasons it is clear that with such an ellipse there are more points β^λ on the plane for which the regularizer's contour line will be crossed at its left or right corner rather than at its upper or lower one. L1 also tends to sparsify heavily correlated features: consider, for example, two (almost) identical features, e.g. x_1 = x_2. Then the loss depends (almost) only on the sum β_1 + β_2, its contour lines degenerate into (nearly) parallel lines, and the minimum β^λ may be chosen (almost) arbitrarily from a line β_1 + β_2 = const, which is likely to result in one of the features being set to zero. In figure 5 the green lines represent possible positions of β^λ, and the red axes and grey area represent the positions of β^λ where one feature will be set to zero by L2 and L1 respectively.

Figure 5: example for 2 identical features
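To make the geometric argument concrete, here is a small sketch (not from the report; it assumes scikit-learn, an arbitrarily chosen regularization strength alpha=0.1, and a fixed random seed) that fits L1- and L2-regularized linear regression on data with two nearly identical features and one irrelevant feature. Exact values depend on the seed and on alpha, but Lasso typically drives some coefficients exactly to zero while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)        # almost identical to x1 (heavily correlated)
x3 = rng.normal(size=n)                    # irrelevant feature
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2, x3])

lasso = Lasso(alpha=0.1).fit(X, y)         # L1-regularized linear regression
ridge = Ridge(alpha=0.1).fit(X, y)         # L2-regularized linear regression

print("Lasso coefficients:", lasso.coef_)  # typically zeros out x3 and one of x1/x2
print("Ridge coefficients:", ridge.coef_)  # spreads weight over x1 and x2, no exact zeros
```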

A number of experiments justify L1's tendency to favor sparsity, for example this one [1]: given two weights β_1 and β_2, a quadratic loss function is generated:
Loss = a(β_1 − c_1)² + b(β_2 − c_2)² + c(β_1 − c_1)(β_2 − c_2),
where every parameter is drawn from a uniform distribution (a ~ U(0, 10), b ~ U(0, 10), c ~ U(−2, 2), c_1 ~ U(−10, 10), c_2 ~ U(−10, 10)). Next, the sum of the loss and the regularizer is minimised. After 5000 trials for each regularizer the following results were obtained: one of the coefficients became zero in 72% of the runs for L1 and in 5% of the runs for L2. Such loss functions may not represent real data, but they provide grounds for the intuition described above.
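The referenced experiment could be roughly reproduced with a sketch like the one below. This is a reconstruction under assumptions, not the code from [1]: the regularization strength λ, the choice of numerical optimiser, and the zero-detection tolerance are all choices of this sketch, so the exact percentages will differ.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
lam = 5.0         # assumed regularization strength (not stated in the report)
trials = 5000
zero_tol = 1e-3   # a coefficient this close to zero is counted as zero

def zero_rate(penalty):
    zeros = 0
    for _ in range(trials):
        # Random quadratic loss: a*(b1-c1)^2 + b*(b2-c2)^2 + c*(b1-c1)*(b2-c2)
        a, b = rng.uniform(0, 10, size=2)
        c = rng.uniform(-2, 2)
        c1, c2 = rng.uniform(-10, 10, size=2)

        def objective(beta):
            d1, d2 = beta[0] - c1, beta[1] - c2
            return a * d1 ** 2 + b * d2 ** 2 + c * d1 * d2 + lam * penalty(beta)

        # Nelder-Mead copes with the non-differentiable L1 term at zero.
        result = minimize(objective, x0=np.zeros(2), method="Nelder-Mead")
        if np.any(np.abs(result.x) < zero_tol):
            zeros += 1
    return zeros / trials

print("L1: fraction of runs with a zero coefficient:", zero_rate(lambda w: np.sum(np.abs(w))))
print("L2: fraction of runs with a zero coefficient:", zero_rate(lambda w: np.sum(w ** 2)))
```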
Conclusion: the reasoning above has been given for a model with a quadratic loss function and two features. While it carries over to feature spaces with more dimensions, a different loss function may not have ellipses as contour lines. However, loss functions are usually convex, resulting in convex contour lines, which means the same logic can still be applied.
It should be noted that L1 regularization's tendency to favor sparsity can be a disadvantage. A 'less important' feature may still matter or, even worse, when two features are equally important, L1 may arbitrarily choose one of them and set its weight to zero.

Summary
The biggest advantage of L1 regularization is its ability to perform feature selection, but it can be excessive, which leads to the following conclusion: L1 regularization should be used when you explicitly need sparse weights; otherwise L2 is favored. Feature sparsity may be a desired outcome when feature engineering hasn't been performed beforehand (so there is a chance that some features are irrelevant), or when reducing the feature dimensionality is required, for example in order to decrease computational costs.

References
[1]: https://github.com/parrt/website-explained.ai/tree/master/regularization/code – code for the experiment, written by Terence Parr, ex-Professor of computer/data science at the University of San Francisco.
