L1L2_regularization_comparison
Introduction
This report is dedicated to the differences between L1 and L2 regularization and the reasons behind them.
Regularization is a technique used to prevent overfitting and improve the generalization ability of models. When a model is overfitting, it has adapted to the training data too well and may not perform well on new, unseen data. More formally, a model's error can be decomposed into bias and variance, where bias is the error from erroneous assumptions and variance is the error from sensitivity to small fluctuations (noise) in the training set. A model is considered overfit if the variance becomes too high while the bias is low. Overfitting is often accompanied by weights with large absolute values, as such weights tend to cause large changes in the output for small changes in the inputs. Regularization modifies the loss function in order to limit weight growth.
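For reference, the decomposition mentioned above can be written as follows (standard notation that the report itself does not introduce: f is the true function, f̂ the fitted model, σ² the irreducible noise):

\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}} + \sigma^2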
L1 regularization adds the sum of the absolute values of the model's weights to the loss function:

Loss_L1 = Loss + λ ∑ᵢ |wᵢ|

And L2 regularization adds the sum of the squared values of the model's weights:

Loss_L2 = Loss + λ ∑ᵢ wᵢ²
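As a minimal illustration (a sketch with assumed details such as the mean-squared-error base loss and the value of lam, not code from the report), the two penalties can be attached to a loss in a few lines of Python:

import numpy as np

def regularized_loss(w, X, y, lam, kind="l2"):
    """Mean-squared-error loss plus an L1 or L2 penalty on the weights w."""
    mse = np.mean((X @ w - y) ** 2)            # base loss (assumed to be MSE)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))      # L1: lambda * sum of |w_i|
    else:
        penalty = lam * np.sum(w ** 2)         # L2: lambda * sum of w_i^2
    return mse + penalty

# Usage on random data, just to show the call:
rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(20, 3)), rng.normal(size=20), rng.normal(size=3)
print(regularized_loss(w, X, y, lam=0.1, kind="l1"))
print(regularized_loss(w, X, y, lam=0.1, kind="l2"))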
Feature selection
L1, contrary to L2, tends to perform feature selection: when L1 is used, some weights associated with irrelevant or less important features are set exactly to zero. To see why L1 encourages this and L2 does not, consider an example of linear regression: there are two features x1 and x2, β1 and β2 are the corresponding weights, and the loss function is quadratic, so its contour lines are ellipses. The contour lines of the L1 regularizer (λ ∑ᵢ |wᵢ|) are rhombuses, and those of the L2 regularizer (λ ∑ᵢ wᵢ²) are circles.
Figure 1: contour lines of the loss and the regularizer (L1 on the left, L2 on the right)
On this illustration β represents the minimum of the loss function without regularization. With regularization the minimum, denoted β^λ, will actually be closer to (0, 0); in the pictures below we assume it lies on the only drawn contour line of the regularizer. Let's take a closer look.
We can see that for the same example L1 will set β2 to zero and L2 won't. Obviously, there are cases when L1 won't set a weight to zero, but it is intuitively clear for geometrical reasons that, for contour lines of a fixed shape, the set of points for which L1 sets one feature to zero is larger than for L2. Figure 4 shows an example for circular contour lines: the red axes and the grey area represent the points β^λ for which one feature will be set to zero by L2 and L1, respectively.
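This difference is easy to observe numerically as well. Below is a small sketch (assumed synthetic data and penalty strengths; scikit-learn's Lasso and Ridge stand in for L1- and L2-regularized linear regression, and are not referenced in the report) fitting both models on data where the second feature is nearly irrelevant:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(scale=0.5, size=200)

# L1 typically drives the nearly irrelevant weight to exactly zero,
# while L2 only shrinks it towards zero.
print("L1 (Lasso):", Lasso(alpha=0.1).fit(X, y).coef_)
print("L2 (Ridge):", Ridge(alpha=0.1).fit(X, y).coef_)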
A number of experiments justify L1's tendency to favor sparsity, for example this one [1]: for two weights β1 and β2, a quadratic loss function is generated:

Loss = a(β1 − c1)² + b(β2 − c2)² + c(β1 − c1)(β2 − c2),

where every parameter is uniformly distributed (a ~ U(0, 10), b ~ U(0, 10), c ~ U(-2, 2), c1 ~ U(-10, 10), c2 ~ U(-10, 10)). Next, the sum of the loss and the regularizer is minimized. After 5000 trials for each regularizer the following results were obtained: one of the coefficients became zero in 72% of runs for L1 and in 5% of runs for L2. Such loss functions may not represent real data, but they support the intuition described above.
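A similar experiment can be reproduced with the sketch below (this is not the code from [1]; the regularization strength, the derivative-free optimizer, the zero threshold, and the convexity guard are assumptions, so the exact percentages may differ):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
LAM, N_TRIALS, ZERO_TOL = 5.0, 5000, 1e-3   # assumed constants

def fraction_with_zero(kind):
    zeros, runs = 0, 0
    for _ in range(N_TRIALS):
        # Random quadratic loss: a(b1-c1)^2 + b(b2-c2)^2 + c(b1-c1)(b2-c2)
        a, b = rng.uniform(0, 10, size=2)
        c = rng.uniform(-2, 2)
        c1, c2 = rng.uniform(-10, 10, size=2)
        if 4 * a * b <= c ** 2:
            continue  # skip non-convex draws so the minimum exists (our own guard)

        def objective(beta):
            d1, d2 = beta[0] - c1, beta[1] - c2
            loss = a * d1 ** 2 + b * d2 ** 2 + c * d1 * d2
            reg = LAM * (np.abs(beta).sum() if kind == "l1" else (beta ** 2).sum())
            return loss + reg

        beta_opt = minimize(objective, x0=np.array([c1, c2]), method="Nelder-Mead").x
        runs += 1
        zeros += np.any(np.abs(beta_opt) < ZERO_TOL)
    return zeros / runs

for kind in ("l1", "l2"):
    print(kind, fraction_with_zero(kind))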
Conclusion: the reasoning above has been given for a model with a quadratic loss function and two features. While it carries over to feature spaces with more dimensions, a different loss function may not have ellipses as contour lines. However, loss functions are usually convex, resulting in convex contour lines, which means the same logic can still be applied.
It should be noted that L1 regularization's tendency to favor sparsity can be a disadvantage. A 'less important' feature may still matter, or, even worse, when two features are equally important, L1 may arbitrarily choose one of them and set its weight to zero (see the sketch below).
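As an illustration of the last point (a sketch with assumed data; which of the two identical features survives depends on the solver):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=300)
X = np.column_stack([x, x])                  # two equally important (identical) features
y = 2.0 * x + rng.normal(scale=0.1, size=300)

# L1 typically keeps one of the duplicated features and zeroes the other.
print(Lasso(alpha=0.05).fit(X, y).coef_)     # e.g. [~2.0, 0.0]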
Summary
The biggest advantage of L1 regularization is its ability to perform feature selection, but it can be excessive, which leads to the following conclusion: L1 regularization should be used when you explicitly need sparse features; otherwise L2 is preferred. Feature sparsity may be a desired outcome when feature engineering hasn't been performed beforehand (so there is a chance that some features are irrelevant), or when reducing feature dimensionality is required, for example to decrease computational costs.
References
[1]: https://github.com/parrt/website-explained.ai/tree/master/regularization/code – code for the experiment, written by Terence Parr, former professor of computer/data science at the University of San Francisco.