Programming for Data Science
Lecture 7 – Supervised Learning, Continued.
Thomas Lavastida
University of Texas at Dallas
[email protected]
Spring 2023
Agenda
• Assignment 2 Review
• Quick review of Supervised Learning and Linear Regression
• Linear Regression in Python
• Start Regularization and Cross Validation
Assignment 2 Review
Supervised Learning and Regression Review
Supervised Learning
• Given – labelled data points $(x_1, y_1), \dots, (x_n, y_n)$
• $x_i$ – features, independent variables, predictors, columns, etc.
• $y_i$ – target, dependent variable, outcome, etc.
• Continuous $y$ -> then we call this regression
• Discrete/categorical $y$ -> then we call this classification
• Goal: Find a mapping/function $f$ from $x$’s to $y$’s such that $f(x_i) \approx y_i$
Linear Regression
• Simple class of regression models
• Let $x_1, \dots, x_k$ be the independent variables
• Model parameters $\omega_1, \dots, \omega_k$ (one for each indep. variable), plus an intercept $b$
• Predicted outcome computed via a linear function: $\hat{y} = \omega_1 x_1 + \dots + \omega_k x_k + b$
• Compute the $\omega$’s (and $b$) by minimizing the average squared error
Overfitting
• As a model gets more complex, it can fit the data in hand more closely
• New data we see (and want to make predictions about) may not be fit well (i.e., high error)
• This is called overfitting
• Main idea to deal with this -> split into a training set and a test set
• Training set – used to compute the model parameters
• Test set – used to estimate the accuracy of the model on new data
PYTHON PRACTICE
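A minimal sketch of what this practice could look like (the data here are synthetic and the variable names are illustrative, not the actual in-class demo): fit scikit-learn's LinearRegression after a train/test split and compare training vs. test MSE.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))                      # two predictors
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out 30% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)           # estimates intercept b and coefficients w
print("intercept:", model.intercept_, "coefficients:", model.coef_)
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test MSE: ", mean_squared_error(y_test, model.predict(X_test)))

The training MSE is computed on the same data used to estimate the parameters; the test MSE is the honest estimate of performance on new data.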
Review: Overfitting
• Model with an overfitting problem
• Nice performance on the data in hand
• Poor predictive accuracy on new data
• Solution 1 – Splitting data
• Training set: train the model (get parameters)
• Test set: evaluate performance
• Solution 2 – Regularization
Regularization – Intuition
• Overfitting occurrence: too many variables
• True relationship: $y = \beta_0 + \beta_1 x + \varepsilon$
• Fit the data with a 10th-degree polynomial instead:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_{10} x^{10} + \varepsilon$
• Remedy: fewer variables
Regularization – Intuition (Cont.)
• Overfitting occurrence: Large variance/fluctuation
• Large coefficient => large fluctuation
• Under the same scale
• Green: $f(x) = -x^4 + 7x^3 - 5x^2 - 31x + 30$
• Blue: $g(x) = -\tfrac{1}{5} f(x)$
• Remedy: smaller coefficients
https://www.datacamp.com/community/tutorials/towards-preventing-overfitting-regularization
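A small illustration of the same intuition (synthetic data, not the tutorial's figures): fitting a 10th-degree polynomial to data generated from a linear relationship drives the in-sample error toward zero while inflating the coefficients.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 15)
y = 2.0 + 3.0 * x + rng.normal(scale=0.3, size=x.size)     # true relationship is linear

coef_deg1 = np.polyfit(x, y, deg=1)                        # 2 coefficients
coef_deg10 = np.polyfit(x, y, deg=10)                      # 11 coefficients

mse1 = np.mean((np.polyval(coef_deg1, x) - y) ** 2)
mse10 = np.mean((np.polyval(coef_deg10, x) - y) ** 2)
print("degree-1 in-sample MSE: ", mse1)                    # roughly the noise level
print("degree-10 in-sample MSE:", mse10)                   # much smaller: the curve chases the noise
print("largest degree-10 coefficient magnitude:", np.abs(coef_deg10).max())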
Regularization – Intuition (Cont.)
• What we need
• Smaller coefficients (coefficients closer to 0)
• Fewer variables (some coefficients exactly 0)
• Penalize the magnitude of coefficients
• Regularization
• Modify our original linear regression model
• Add terms to penalize the magnitude of coefficients
Regularization
• Linear regression (fit only)
• Minimize the error between the actual and predicted values:
$f(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left( y_i - (\boldsymbol{\omega} \cdot x_i + b) \right)^2$
• Regularization (fit + control of overfitting)
• Minimize the error between the predicted and actual values
• Penalize the magnitude of the feature coefficients:
$f(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left( y_i - (\boldsymbol{\omega} \cdot x_i + b) \right)^2 + \mathrm{Penalty}(\boldsymbol{\omega})$
Regularization – Two Methods
$f(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left( y_i - (\boldsymbol{\omega} \cdot x_i + b) \right)^2 + \mathrm{Penalty}(\boldsymbol{\omega})$
• The $\mathrm{Penalty}(\boldsymbol{\omega})$ term is the shrinkage penalty
• Two formulations of the shrinkage penalty
• L2 regularization: penalty equal to the sum of squared coefficient magnitudes => Ridge regression
• L1 regularization: penalty equal to the sum of absolute coefficient magnitudes => Lasso regression
Ridge Regression
• Linear regression with L2 regularization (squares of the parameters)
• Minimize the function:
$f(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left( y_i - (\boldsymbol{\omega} \cdot x_i + b) \right)^2 + \lambda \sum_{j=1}^{k} \omega_j^2$, where $\lambda \ge 0$
• The second term is the shrinkage penalty
• Larger coefficient magnitudes increase the amount of penalty
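A minimal ridge regression sketch, assuming scikit-learn (its alpha argument plays the role of $\lambda$; the data and alpha value are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)                        # larger alpha => stronger shrinkage
print("OLS coefficients:  ", np.round(ols.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))     # shrunk toward 0, but rarely exactly 0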
Ridge Regression – Tuning Parameter
$f(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left( y_i - (\boldsymbol{\omega} \cdot x_i + b) \right)^2 + \lambda \sum_{j=1}^{k} \omega_j^2$
• $\lambda$ controls the amount of penalty
• $\lambda = 0$ => plain linear regression
• $\lambda \to \infty$ => all coefficients pushed toward zero
• Higher $\lambda$, more penalty, smaller coefficients
• $\lambda$ – a hyperparameter
• NOT estimated together with the other parameters
• Set “manually” before model estimation (the effect of different values is sketched below)
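An illustrative look at the $\lambda$ effect (synthetic data, an arbitrary alpha grid): as alpha grows, the ridge coefficients shrink toward zero.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([4.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=100)

for alpha in [0.001, 1.0, 100.0, 10000.0]:                 # alpha near 0 behaves like plain least squares
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:>8}: {np.round(coefs, 3)}")
# Coefficients shrink toward (but not exactly to) zero as alpha grows.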
LASSO
• Linear regression with L1 regularization (absolute values of the parameters)
$f(\boldsymbol{\omega}) = \sum_{i=1}^{n} \left( y_i - (\boldsymbol{\omega} \cdot x_i + b) \right)^2 + \lambda \sum_{j=1}^{k} |\omega_j|$, where $\lambda \ge 0$
• The L1 penalty can force some coefficient estimates to be exactly zero
• Combines the shrinkage advantage of ridge regression with variable selection
• LASSO: Least Absolute Shrinkage and Selection Operator
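A hedged LASSO sketch on synthetic data (the alpha value is arbitrary), showing the exact zeros that give the variable-selection effect:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)   # only 2 of 8 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(lasso.coef_, 3))    # most of the 8 entries come out exactly 0.0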
Hyperparameter Tuning and Cross Validation
Hyperparameter Tuning
• Hyperparameters – set before running the model
• Examples
• LASSO and Ridge – the penalty weight $\lambda$
• Polynomial regression – the degree of the polynomial ($n$)
• Intuition of tuning (polynomial case; see the sketch after this list)
• Start with some potential values, e.g., $n = 1, 2, 3, \dots$
• For each $n$, run the model
• Select the model with the best performance
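One possible version of that tuning loop (synthetic data, an illustrative degree grid, and a simple held-out split rather than CV):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(150, 1))
y = 1.0 + 2.0 * x[:, 0] - 1.5 * x[:, 0] ** 2 + rng.normal(scale=0.3, size=150)

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in [1, 2, 3, 5, 10]:                            # candidate hyperparameter values
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    mse = mean_squared_error(y_val, model.predict(x_val))
    print(f"degree {degree:2d}: held-out MSE = {mse:.4f}")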
Tuning Method – Grid Search
• Try all possible hyperparameters of interest
• Most commonly used method for hyperparameter tuning
• Polynomial regression case
• Define a set of potential polynomial degrees
• Estimate, evaluate, choose
[Table: candidate polynomial degrees and their MSE values – the degree with the lowest MSE is the selected model]
• Select the model with best performance … on which dataset?
Data Splitting – Model Training
• [Diagram: the labeled data are split into a training set and a test set; the training set feeds model training (parameter estimates), the test set feeds prediction and evaluation]
• The performance measure (e.g., MSE) on the test set is unbiased, since the test set is untouched new data
• Model selection?
• For each model, get the performance measure on the test set
• Select the model with the best performance on the test data
• Problem
• The “best model”? Really just the best fit for the test set!
• We end up overfitting the test set
• Solution: more splits
Data Splitting – Model Selection
• [Diagram: the original training set is split again into a training set and a validation set; the test set stays aside]
• Training set: used for model training (parameter estimates)
• Validation set:
• Used for model selection (e.g., hyperparameter tuning)
• Test set:
• Untouched during training and selection
• Used for model assessment (generalizability)
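One common way to produce this three-way split with scikit-learn (the 60/20/20 proportions are only an example):

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.normal(size=1000)

# First carve off the untouched test set, then split the rest into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))               # 600 / 200 / 200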
Limitations of Single Splitting (Partition)
• Data waste: the model is trained and evaluated on less of the data
• If there is not enough data – unreliable results
• Small training set
• Small test set
• Solution: Cross Validation
K-Fold Cross Validation
• Randomly cut the dataset into $K$ segments (folds)
• Use the $k$-th segment as the test set, the rest as the training set
• Obtain $MSE_k$, the mean squared error on the $k$-th segment (test set)
• After $K$ iterations, calculate the mean of $MSE_1, \dots, MSE_K$
• [Diagram: with $K = 5$, each of the five segments takes a turn as the test fold, yielding $MSE_1, MSE_2, \dots, MSE_5$]
K-Fold Cross Validation
• No data put to waste
• Works even with a small dataset
• Involves more of the data in training the model
• Reliable: takes the mean of multiple $MSE_k$ estimates
• Model selection
• Uses more data to evaluate the performance of each model
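A minimal K-fold CV sketch using scikit-learn's cross_val_score (synthetic data, $K = 5$):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")
fold_mse = -scores                 # scikit-learn reports negative MSE so that higher is better
print("per-fold MSE:", np.round(fold_mse, 3))
print("mean CV MSE: ", fold_mse.mean())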
CV for Model Selection
• Combine CV with grid search
• Example: polynomial regression, grid search over the degree, selection by CV
• Leave a portion aside as the test set
• Set a grid for the hyperparameter (let $n$ be the polynomial degree)
• Select the model via CV
• [Table: candidate degrees and their CV (MSE) scores – the degree with the lowest CV score is selected, and that model is then applied to the test set]
Grid Search with CV
• Manually set a grid of discrete hyperparameter values
• Set a metric for model performance
• Search exhaustively through the grid
• For each set of hyperparameters, evaluate each model’s CV score
• The optimal hyperparameters are those of the model achieving the best CV score
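A hedged GridSearchCV sketch, here tuning ridge regression's alpha over an illustrative grid with 5-fold CV and synthetic data:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}      # the grid is set manually
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)                               # 5 alphas x 5 folds = 25 model fits
print("best alpha:", search.best_params_["alpha"])
print("test-set MSE of refit model:", -search.score(X_test, y_test))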
Tuning is expensive
• The model is run repeatedly
• A grid of N values with K-fold CV => N·K model fits
• Example: a grid of 20 values with 5-fold CV => 100 fits
• Computationally expensive
• Sometimes for only a very slight improvement
PYTHON PRACTICE