1
Regularization and
polynomial regression
CSCI-P 556
ZORAN TIGANJ
2
Reminders/Announcements
u Don’t forget the Quiz deadline on Wednesday
u More office hours available (check Canvas homepage)
u Make sure to follow instructions when signing up for groups in HW1
u Don’t create your own groups – those are not visible to us when grading, use our
groups
u Make sure to be in a group even if you’re doing the assignment alone
(otherwise, we won’t see your submission on Canvas)
3
Today
u Polynomial regression
u Regularization (Lasso, Ridge, Elastic Net, Early stopping)
4
Examples of linear models
Linear models have a linear relationship between the dependent variable (y) and the model parameters
u Simple Linear Regression
Description: This model predicts a response based on a single predictor variable.
Example Application: Predicting the salary of an employee based on years of experience.
u Multiple Linear Regression
Description: This model uses multiple predictor variables to predict a response.
Example Application: Estimating the price of a house based on its size, age, and location.
u Polynomial Regression
Description: Although it models non-linear relationships, it is considered a linear model because it is linear in the parameters.
Example Application: Modeling the growth rate of bacteria depending on the nutrient concentration.
5
Examples of non-linear models
Non-linear models have a non-linear relationship between the dependent variable (y or p) and the model parameters
u Logistic Regression
Description: Despite its name, logistic regression is a non-linear model used for binary classification.
Example Application: Predicting whether a patient has a disease (1) or not (0) based on their test results.
u Exponential Growth Model
Description: Used for growth processes where the increase is proportional to the current amount.
Example Application: Predicting population growth.
u Function fitting
Description: Fitting some known functional relationship; e.g., a sinusoid is appropriate for data exhibiting periodic fluctuations.
Example Application: Seasonal variations in temperature or other cyclic phenomena.
6
Why does the difference between
linear and non-linear models matter?
u Linear models usually have a closed-form solution given by the Normal Equation, and the loss function is convex (when MSE is the loss function, this is also called the Ordinary Least Squares method).
u A closed-form solution might not exist if we cannot compute the inverse in the Normal Equation (shown below).
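For reference, the Normal Equation mentioned above is the standard OLS closed form (the equation itself is not written out on the slide):

$$\hat{\boldsymbol{\theta}} = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}$$

It exists only when $\mathbf{X}^{\top}\mathbf{X}$ is invertible, which is exactly the failure case mentioned in the previous bullet.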
u Non-linear models can represent more complex relationships. This ability
makes them suitable for dealing with real-world phenomena that
inherently exhibit non-linear dynamics, such as exponential growth,
saturation effects, and threshold effects.
u Non-linear models typically require iterative, numerical approaches, such as
gradient descent, which can be computationally intensive and require more
data to achieve stable estimates (loss function often not convex)
7
Polynomial Regression
u What if your data is more complex than a straight line?
u Surprisingly, you can use a linear model to fit nonlinear data.
u A simple way to do this is to add powers of each feature as new features,
then train a linear model on this extended set of features.
u This technique is called Polynomial Regression
8
Polynomial Regression
u Let’s generate some nonlinear data, based on a simple quadratic
equation:
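A minimal sketch of this data-generation step; the coefficients and noise level below are illustrative assumptions for a simple quadratic relationship:

```python
import numpy as np

m = 100
X = 6 * np.random.rand(m, 1) - 3                 # 100 points spread over [-3, 3)
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)   # quadratic signal plus Gaussian noise
```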
9
Polynomial Regression
u Let’s use Scikit-Learn’s PolynomialFeatures class to transform our training
data, adding the square (second-degree polynomial) of each feature in
the training set as a new feature (in this case there is just one feature):
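A sketch of the transformation described above, assuming the X and y arrays from the previous snippet:

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)    # each row is now [x, x^2]

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)                     # plain Linear Regression on the extended features
print(lin_reg.intercept_, lin_reg.coef_)   # estimates of the quadratic's coefficients
```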
10
Polynomial Regression
11
Polynomial Regression
u Note that when there are multiple features, Polynomial Regression is
capable of finding relationships between features.
u This is made possible by the fact that PolynomialFeatures also adds all
combinations of features up to the given degree.
u For example, if there were two features a and b, PolynomialFeatures with
degree=3 would not only add the features a², a³, b², and b³, but also the
combinations ab, a²b, and ab².
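A small sketch showing which terms PolynomialFeatures generates for two features a and b with degree=3 (the sample values are arbitrary; get_feature_names_out assumes a reasonably recent Scikit-Learn version):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_ab = np.array([[2.0, 3.0]])    # one sample with features a=2, b=3
poly = PolynomialFeatures(degree=3, include_bias=False).fit(X_ab)
print(poly.get_feature_names_out(["a", "b"]))
# ['a' 'b' 'a^2' 'a b' 'b^2' 'a^3' 'a^2 b' 'a b^2' 'b^3']
```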
12
Learning Curves
u If you perform high-degree Polynomial
Regression, you will likely fit the training data
much better than with plain Linear
Regression.
u This high-degree Polynomial Regression
model is severely overfitting the training
data, while the linear model is underfitting it.
u How can you tell that your model is
overfitting or underfitting the data?
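One way to tell is to plot learning curves: train the model on increasingly large subsets of the training set and track the training and validation errors. A minimal sketch (the plotting details are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    """Plot train/validation RMSE as a function of training-set size."""
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    plt.plot(np.sqrt(train_errors), "r-+", label="train RMSE")
    plt.plot(np.sqrt(val_errors), "b-", label="validation RMSE")
    plt.xlabel("training set size")
    plt.legend()
    plt.show()

# e.g., with the quadratic data generated earlier:
# plot_learning_curves(LinearRegression(), X, y)
```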
13
Learning Curves: Underfitting
u Linear model: These learning curves are
typical of a model that’s underfitting. Both
curves have reached a plateau; they are
close and fairly high.
Adding more training instances cannot correct underfitting; we need a more complex model.
14
Learning Curves: Overfitting
u 10th degree polynomial:
u The error on the training data is much lower
than with the Linear Regression model.
u There is a gap between the curves. This means
that the model performs significantly better on
the training data than on the validation data,
which is the hallmark of an overfitting model.
u If you used a much larger training set,
however, the two curves would continue to
get closer.
15
Linear Regression Model: high bias (underfitting), so reduce α.
10th Degree Polynomial Model: high variance (overfitting), so increase α.
16
Regularized Linear Models
u A good way to reduce overfitting is to regularize the model (i.e., to constrain it):
the fewer degrees of freedom it has, the harder it will be for it to overfit the
data.
u A simple way to regularize a polynomial model is to reduce the polynomial degree.
u For a linear model, regularization is typically achieved by constraining the
weights of the model.
u We will now look at three different ways to constrain the weights:
u Ridge Regression,
u Lasso Regression,
u Elastic Net
17
Ridge Regression
u Ridge Regression (also called Tikhonov regularization) is a regularized
version of Linear Regression: a regularization term is added to the cost
function.
u This forces the learning algorithm to not only fit the data but also keep the
model weights as small as possible.
u Loss function with regularization:
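The slide shows this equation as an image; written out, the standard Ridge cost (consistent with the later reference to half the squared ℓ2 norm) is

$$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \, \frac{1}{2} \sum_{i=1}^{n} \theta_i^{2}$$

Note that the sum starts at i = 1, so the bias term θ₀ is not regularized.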
18
Ridge Regression
u Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear Regression: a regularization term is added to the cost function.
Note that the regularization term should only be added to the cost function during training. Once the model is trained, you want to evaluate the model’s performance using the unregularized performance measure.
u This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.
u Loss function with regularization (same as on the previous slide).
u If α is very large, then all weights end up very close to zero and the result is a flat line going through the data’s mean (the sum starts from 1, so the bias term is not regularized). The higher the α, the flatter the line.
19
Ridge Regression
20
Ridge Regression
u Ridge Regression closed-form solution
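The closed-form solution referenced above appears as an image on the slide; it is typically written as

$$\hat{\boldsymbol{\theta}} = \left(\mathbf{X}^{\top}\mathbf{X} + \alpha \mathbf{A}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}$$

where A is the identity matrix except for a 0 in the top-left cell, so the bias term is not regularized.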
21
Ridge Regression
u And using Stochastic Gradient Descent:
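A sketch of Ridge-style training with SGD, assuming the X and y arrays from the earlier polynomial-regression snippets (the alpha value is an arbitrary assumption); penalty="l2" adds the ℓ2 regularization term to the SGD updates:

```python
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(penalty="l2", alpha=0.1, max_iter=1000, tol=1e-3)
sgd_reg.fit(X, y.ravel())                 # ravel: SGDRegressor expects a 1-D target
print(sgd_reg.intercept_, sgd_reg.coef_)
```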
22
Lasso Regression
u Least Absolute Shrinkage and Selection Operator Regression (usually simply called Lasso Regression) is another regularized version of Linear Regression.
u Just like Ridge Regression, it adds a regularization term to the cost function, but it uses the ℓ1 norm of the weight vector instead of half the square of the ℓ2 norm.
Use Lasso if some features are useless: Lasso drives some feature weights to zero or close to zero, whereas Ridge only shrinks the weights of all features.
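Written out, the Lasso cost (a standard form consistent with the ℓ1 description above) is

$$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \sum_{i=1}^{n} \left|\theta_i\right|$$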
23
Lasso Regression
24
Lasso Regression
u An important characteristic of Lasso Regression is that it tends to eliminate the weights of the least important features (i.e., set them to zero).
u In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).
https://explained.ai/regularization/index.html
25
Lasso Regression
26
Elastic Net
u Elastic Net is a middle ground between Ridge Regression and Lasso Regression.
u The regularization term is a simple mix of both Ridge and Lasso’s
regularization terms, and you can control the mix ratio r.
u when r = 0, Elastic Net is equivalent to Ridge Regression,
u when r = 1, it is equivalent to Lasso Regression
Ridge: weights as small as possible
Lasso: weights zero for unimportant features
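With mix ratio r, the Elastic Net cost is conventionally written as (the slide’s equation is an image, so this is the standard form)

$$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + r\,\alpha \sum_{i=1}^{n} \left|\theta_i\right| + \frac{1-r}{2}\,\alpha \sum_{i=1}^{n} \theta_i^{2}$$

so r = 0 recovers the Ridge term and r = 1 recovers the Lasso term.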
27
Which regularization to choose?
u So when should you use plain Linear Regression (i.e., without any
regularization), Ridge, Lasso, or Elastic Net?
u It is almost always preferable to have at least a little bit of regularization, so
generally you should avoid plain Linear Regression.
u Ridge is a good default, but if you suspect that only a few features are useful,
you should prefer Lasso or Elastic Net because they tend to reduce the useless
features’ weights down to zero.
u In general, Elastic Net is preferred over Lasso because Lasso may behave
erratically when the number of features is greater than the number of training
instances or when several features are strongly correlated.
Summary: prefer Elastic Net over Lasso, since Lasso behaves erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
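For concreteness, the three regularized models in Scikit-Learn share the same fit/predict interface (the alpha and l1_ratio values here are arbitrary assumptions):

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

ridge = Ridge(alpha=1.0)                        # l2 penalty: shrinks all weights
lasso = Lasso(alpha=0.1)                        # l1 penalty: can zero out weights entirely
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)   # l1_ratio plays the role of the mix ratio r
```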
28
Early Stopping
u A very different way to regularize iterative learning algorithms such as
Gradient Descent is to stop training as soon as the validation error reaches
a minimum.
u This is called early stopping.
29
Early Stopping
u With Stochastic and Mini-batch Gradient Descent, the curves are not so
smooth, and it may be hard to know whether you have reached the
minimum or not.
u One solution is to stop only after the validation error has been above the
minimum for some time (when you are confident that the model will not
do any better), then roll back the model parameters to the point where
the validation error was at a minimum.
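A minimal sketch of early stopping as described above: train an SGD regressor one epoch at a time (warm_start=True continues from the previous epoch) and keep a copy of the model with the lowest validation error. The data generation and hyperparameters are illustrative assumptions; in practice you would also add polynomial features and scale them as on the earlier slides.

```python
from copy import deepcopy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_epoch, best_model = None, None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)                # warm start: continues where it left off
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < minimum_val_error:            # roll-back point: remember the best model
        minimum_val_error = val_error
        best_epoch, best_model = epoch, deepcopy(sgd_reg)
```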
30
Next time
u Logistic regression, from Chapter 4 of the Hands-On Machine Learning textbook