CSC380: Principles of Data Science
Linear Models
Prof. Jason Pacheco
TA: Enfa Rose George TA: Saiful Islam Salim
Outline
Linear Regression
Least Squares Estimation
Regularized Least Squares
Logistic Regression
Linear Regression
Regression Learn a function that predicts outputs from inputs,
    y = f(x)
Outputs y are real-valued.
Linear Regression As the name suggests, uses a linear function:
    y = w^T x + b
We will add noise later…
[ Figure: linear fit of output Y vs. input X ]
Linear Regression
Where is linear regression useful?
Examples: trendlines, stock prediction, climate models (Massie and Rose, 1997)
Used anywhere a linear relationship is assumed between continuous inputs / outputs
Line Equation
Recall the equation for a line has a
slope and an intercept,
Slope Intercept
• Intercept (b) indicates where line crosses y-axis
• Slope controls angle of line
• Positive slope (w) Line goes up left-to-right
• Negative slope Line goes down left-to-right
Moving to higher dimensions…
In higher dimensions a line becomes a plane.
There are multiple ways to define a plane; we will use:
    w^T (x - x_0) = 0
with a normal vector w (controls orientation) and an in-plane vector x_0 (handles the offset).
The regression weights will take the place of the normal vector.
Source: [Link]
Inner Products
Recall the definition of an inner product:
    w^T x = \sum_d w_d x_d
Equivalently, the projection of one vector onto another,
    w^T x = ||w|| \, ||x|| \cos(\theta)
where ||w|| = \sqrt{w^T w} is the vector norm.
Linear Regression
[ Image: Murphy, K. (2012) ]
For a D-dimensional input vector x = (x_1, …, x_D), the plane equation is
    y = w^T x + b = \sum_d w_d x_d + b
Often we simplify this by including the intercept into the weight vector,
    y = w^T x,   with   w = (b, w_1, …, w_D),   x = (1, x_1, …, x_D)
since the first weight now multiplies a constant feature equal to 1.
Linear Regression
The input-output mapping is not exact, so we will add zero-mean Gaussian noise,
    y = w^T x + \epsilon,   where \epsilon ~ N(0, \sigma^2)
(across the N outputs the noise is multivariate Normal with uncorrelated components)
This is equivalent to the likelihood function,
    p(y | x, w) = N(y | w^T x, \sigma^2)
because adding a constant to a Normal RV is still a Normal RV; in the case of linear regression the "constant" is $w^T x$, so the mean is $w^T x$ and the variance is $\sigma^2$.
[ Figure: noisy linear fit, output Y vs. input X ]
Great, we’re done right?
Not quite. In the model y = w^T x + \epsilon:
• Data (x, y) – we have this
• Noise \epsilon – random; can’t do anything about it
• Weights w – we don’t know these; we need to learn them
We need to fit the model to data by learning the regression weights.
How do we do this? What makes good weights?
Learning Linear Regression Models
There are several ways to think about fitting regression:
• Intuitive Find a plane/line that is close to data
• Functional Find a line that minimizes the least squares loss
• Estimation Find maximum likelihood estimate of parameters
They are all the same thing…
Fitting Linear Regression
Intuition Find a line that is as close as possible to every training data point.
The distance from each point to the line is the residual,
    r_n = y_n - \hat{y}(x_n)    (training output minus prediction)
[Link]
Outline
Linear Regression
Least Squares Estimation
Regularized Least Squares
Logistic Regression
Least Squares Solution
Functional Find a line that minimizes the sum of squared residuals,
    r_n^2 = (y_n - w^T x_n)^2
Over all the training data,
    L(w) = \sum_{n=1}^N (y_n - w^T x_n)^2
Least squares regression
[Link]
Least Squares
This is just a quadratic function…
• Convex, unique minimum
• Minimum given by zero-derivative
• Can find a closed-form solution
Let’s see for scalar case with no bias,
Least Squares : Simple Case
    L(w) = \sum_n (y_n - w x_n)^2
Derivative (+ chain rule):
    dL/dw = -2 \sum_n x_n (y_n - w x_n) = 0
Distributive property:
    \sum_n x_n y_n = w \sum_n x_n^2
Algebra:
    \hat{w} = \sum_n x_n y_n / \sum_n x_n^2
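A quick numerical check of the scalar solution above, on synthetic data (the data and the true slope of 2.5 are illustrative, not from the slides):

```python
import numpy as np

# Synthetic 1-D data with no intercept; true slope is 2.5 (illustrative).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.5 * x + rng.normal(scale=0.3, size=100)

# Closed-form scalar least squares: w = sum(x*y) / sum(x^2).
w_closed = np.sum(x * y) / np.sum(x * x)

# Brute-force check: the squared-error loss is minimized at (nearly) the same w.
ws = np.linspace(0.0, 5.0, 1001)
losses = np.array([np.sum((y - w * x) ** 2) for w in ws])
w_grid = ws[np.argmin(losses)]

print(w_closed, w_grid)   # both should be close to 2.5
```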
Least Squares in Higher Dimensions
[ Image: Murphy, K. (2012) ]
Things are a bit more complicated in higher dimensions and involve more linear algebra,
    X = the N x D design matrix (each training input $x_n^T$ is a row)
    y = the N x 1 vector of training labels
Can write regression over all training data more compactly,
    \hat{y} = X w    (an N x 1 vector)
Least Squares in Higher Dimensions
[ Image: Murphy, K. (2012) ]
Least squares can also be written more compactly,
    L(w) = || y - X w ||_2^2
Some slightly more advanced linear algebra gives us a solution,
    \hat{w}_{OLS} = (X^T X)^{-1} X^T y
Derivation a bit advanced for this class, but…
• We know it has a closed-form and why
• We can evaluate it
• Generally know where it comes from
Ordinary Least Squares (OLS) solution
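A minimal NumPy sketch of the OLS solution above, on synthetic data (N, D, and the true weights are made up for illustration); it solves the normal equations and cross-checks against NumPy's built-in least-squares routine:

```python
import numpy as np

# Synthetic regression problem (sizes and true weights are illustrative).
rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))                     # design matrix, one input per row
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)

# OLS via the normal equations: (X^T X) w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's built-in least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)
print(w_lstsq)
```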
Learning Linear Regression Models
There are several ways to think about fitting regression:
• Intuitive Find a plane/line that is close to data
• Functional Find a line that minimizes the least squares loss
• Estimation Find maximum likelihood estimate of parameters
They are all the same thing…
MLE for Linear Regression
Given training data {(x_n, y_n)}_{n=1}^N the likelihood function is given by,
    p(y | X, w) = \prod_{n=1}^N N(y_n | w^T x_n, \sigma^2)
Recall that the likelihood is Gaussian.
So MLE maximizes the log-likelihood over the whole data as,
    \hat{w}_{MLE} = argmax_w \sum_{n=1}^N \log N(y_n | w^T x_n, \sigma^2)
[ Figure: output Y vs. input X with fitted line ]
Univariate Gaussian (Normal) Distribution
Gaussian (a.k.a. Normal) distribution with mean (location) $\mu$ and variance (scale) $\sigma^2$ parameters,
PDF:
    N(x | \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp( -(x - \mu)^2 / (2\sigma^2) )
The logarithm of the PDF is just a negative quadratic,
Log-PDF:
    \log N(x | \mu, \sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - (x - \mu)^2 / (2\sigma^2)
    [ constant in the mean ]   [ quadratic function of the mean ]
Notation The likelihood of the basic linear regression model is $N(y | w^T x, \sigma^2)$…
…we will just look at learning the mean parameter for now.
MLE of Gaussian Mean
Assume data are i.i.d. univariate Gaussian,
    x_n ~ N(\mu, \sigma^2),   n = 1, …, N    (variance $\sigma^2$ is known)
Log-likelihood function:
    \log p(x_1, …, x_N | \mu) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^N (x_n - \mu)^2
The first term is a constant that doesn’t depend on the mean.
The MLE doesn’t change when we:
1) Drop constant terms (in $\mu$)
2) Minimize the negative log-likelihood
So the MLE estimate is the least squares estimator:
    \hat{\mu}_{MLE} = argmin_\mu \sum_{n=1}^N (x_n - \mu)^2 = \frac{1}{N}\sum_{n=1}^N x_n
MLE of Linear Regression
Substitute the linear regression prediction $w^T x_n$ for the mean in the MLE solution and we have,
    \hat{w}_{MLE} = argmin_w \sum_{n=1}^N (y_n - w^T x_n)^2
So for Linear Regression, MLE = Least Squares Estimation
[Link]
Multivariate Gaussian Distribution
We have only seen scalar (1-dimensional) X, but MLE is still least squares for higher-dimensional X…
Let $x \in R^D$ with mean $\mu$ and positive semidefinite covariance matrix $\Sigma$, then the PDF is,
    N(x | \mu, \Sigma) = (2\pi)^{-D/2} |\Sigma|^{-1/2} \exp( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) )
Again, the logarithm is a negative quadratic form,
    \log N(x | \mu, \Sigma) = const - \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)
    [ constant (in the mean) ]   [ quadratic function of the mean ]
Multivariate Quadratic Form
The quadratic form for vectors is given by an inner product,
    (x - \mu)^T \Sigma^{-1} (x - \mu)
For iid data the MLE of the Gaussian mean is once again least squares,
    \hat{\mu}_{MLE} = argmin_\mu \sum_{n=1}^N (x_n - \mu)^T \Sigma^{-1} (x_n - \mu)
• Strongly convex
• Differentiable
• Unique optimizer at zero gradient
Notation Substituting the multi-dimensional linear regression mean $w^T x_n$…
…brings us back to the least squares solution.
MLE of Linear Regression
[ Image: Murphy, K. (2012) ]
Using previous results, MLE is equivalent to minimizing the squared residuals,
    \hat{w}_{MLE} = argmin_w || y - X w ||_2^2
Some slightly more advanced linear algebra gives us a solution,
    \hat{w}_{OLS} = (X^T X)^{-1} X^T y
Derivation a bit advanced for this class, but…
• We know it has a closed-form and why
• We can evaluate it
• Generally know where it comes from
Ordinary Least Squares (OLS) solution
Linear Regression Summary
1. Definition of the linear regression model,
    y = w^T x + \epsilon,   where \epsilon ~ N(0, \sigma^2)
2. For N iid training data, fit using least squares,
    \hat{w} = argmin_w \sum_{n=1}^N (y_n - w^T x_n)^2
3. Equivalent to the maximum likelihood solution
Linear Regression Summary
The ordinary least squares solution is solved in closed-form using the Normal equations,
    \hat{w}_{OLS} = (X^T X)^{-1} X^T y
where X is the design matrix (each training input is a row) and y is the vector of training labels.
QUESTIONS?
A word on matrix inverses…
The least squares solution requires inversion of the term $X^T X$.
What are some issues with this?
1. Requires O(D^3) time for D input features
2. May be numerically unstable (or even non-invertible)
Small numerical errors in the input can lead to large errors in the solution.
Pseudoinverse
The Moore-Penrose pseudoinverse is denoted $X^\dagger$,
• Generalization of the standard matrix inverse
• Exists even for non-invertible $X^T X$
• Directly computable in most libraries
• In NumPy it is numpy.linalg.pinv: [Link]
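A small sketch of the pseudoinverse route; the toy matrix below, with nearly collinear columns, is illustrative:

```python
import numpy as np

# Nearly collinear columns make X^T X badly conditioned; the pseudoinverse
# still returns a sensible least-squares solution.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.1]])
y = np.array([1.0, 2.0, 3.0])

w_pinv = np.linalg.pinv(X) @ y    # Moore-Penrose pseudoinverse solution
print(w_pinv)
```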
Linear Regression in Scikit-Learn
For Evaluation
Load your libraries,
Load data,
Train / Test Split:
Linear Regression in Scikit-Learn
Train (fit) and predict,
Plot regression line with the test set,
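The code on these two slides is not reproduced here, so the following is a sketch of the same workflow; scikit-learn's diabetes dataset stands in for the slide's data, and a single feature is kept so the fitted line can be plotted:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load data (stand-in dataset) and keep one feature for a 2-D plot.
X, y = load_diabetes(return_X_y=True)
X = X[:, [2]]

# Train / test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train (fit) and predict.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# For evaluation.
print("Test MSE:", mean_squared_error(y_test, y_pred))
print("Test R^2:", r2_score(y_test, y_pred))

# Plot the regression line with the test set.
plt.scatter(X_test, y_test, label="test data")
plt.plot(X_test, y_pred, color="red", label="fitted line")
plt.xlabel("input x")
plt.ylabel("output y")
plt.legend()
plt.show()
```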
Outline
Linear Regression
Least Squares Estimation
Regularized Least Squares
Logistic Regression
Outliers
How does an outlier affect the estimator?
[ Figure: the squared error penalty grows quadratically with the residual, so a single outlier contributes a very large loss ]
Outliers in Linear Regression
An outlier “pulls” the regression line away from the inlier data.
We need a way to ignore or to down-weight the impact of the outlier.
[ Figure: Y vs. X scatter with the regression line pulled toward a single outlier ]
[Link]
Dealing with Outliers
Too many outliers can indicate many things: non-Gaussian
(heavy-tailed) data, corrupt data, bad data collection, …
A few ways to handle outliers…
1. Use a heavy-tailed noise distribution (Student’s T)
   – Fitting the regression becomes difficult
2. Identify outliers and discard them
   – NP-Hard, and throwing away data is generally bad
3. Penalize large weights to avoid overfitting (Regularization)
Regularization
Recall, regularization helps avoid overfitting the training data…
    L(w) = \sum_{n=1}^N (y_n - w^T x_n)^2 + \lambda R(w)
where $\lambda$ is the regularization strength and $R(w)$ is the regularization penalty.
[ Figure: Y vs. X; the red model is without regularization, the green model includes regularization ]
Regularized Least Squares
Ordinary least-squares estimation (no regularizer),
    \hat{w} = argmin_w \sum_{n=1}^N (y_n - w^T x_n)^2
We already know how to solve this…
L2-regularized Least-Squares (Ridge) adds a quadratic penalty,
    \hat{w} = argmin_w \sum_{n=1}^N (y_n - w^T x_n)^2 + \lambda ||w||_2^2
L1-regularized Least-Squares (LASSO) adds an absolute value (L1) penalty,
    \hat{w} = argmin_w \sum_{n=1}^N (y_n - w^T x_n)^2 + \lambda ||w||_1
A word on vector norms…
The L2-norm (Euclidean norm) of a vector w is,
    ||w||_2 = \sqrt{ \sum_d w_d^2 }
The L1-norm (absolute value) of a vector w is,
    ||w||_1 = \sum_d |w_d|
They are not the same functions…
Other Regularization Terms
A more general regularization penalty,
    R(w) = \sum_d |w_d|^q
• q < 1 is not a norm, and thus not convex
• q = 1 (L1) is convex but non-differentiable
• q = 2 (L2) is smooth and convex
Administrative Items
• HW7 out Thursday (Due next Thursday)
• HW6 due tonight
Regularized Least Squares
A couple regularizers are so common they have specific names
L2 Regularized Linear Regression
• Ridge Regression
• Tikhonov Regularization
L1 Regularized Linear Regression
• LASSO
• Stands for: Least Absolute Shrinkage and Selection Operator
L2 Regularized Least Squares
    L(w) = || y - X w ||_2^2 + \lambda ||w||_2^2
The squared-error term is quadratic and the penalty is quadratic, and
Quadratic + Quadratic = Quadratic
• Differentiable
• Convex
• Unique optimum
• Closed form solution
L2 Regularized Least Squares : Simple Case
    L(w) = \sum_n (y_n - w x_n)^2 + \lambda w^2
Derivative (+ chain rule):
    dL/dw = -2 \sum_n x_n (y_n - w x_n) + 2 \lambda w = 0
Distributive property:
    \sum_n x_n y_n = w ( \sum_n x_n^2 + \lambda )
Algebra:
    \hat{w} = \sum_n x_n y_n / ( \sum_n x_n^2 + \lambda )
L2 Regularized Linear Regression – Ridge Regression
Source: Kevin Murphy’s Textbook
After some algebra…
    \hat{w}_{ridge} = (X^T X + \lambda I)^{-1} X^T y
Compare to ordinary least squares:
    \hat{w}_{OLS} = (X^T X)^{-1} X^T y
Regularized least-squares includes $\lambda$ as a pseudocount in the weighting, similar to the Gaussian mean estimator.
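A NumPy sketch of the ridge closed form above on synthetic data (the value of $\lambda$ and the data are illustrative); note how the ridge weights shrink relative to OLS:

```python
import numpy as np

# Synthetic data (illustrative).
rng = np.random.default_rng(1)
N, D = 50, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + rng.normal(scale=0.5, size=N)

lam = 10.0
# Ridge: solve (X^T X + lambda * I) w = X^T y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
# Ordinary least squares for comparison.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))   # ridge norm is smaller
```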
Notes on L2 Regularization
• Feature weights are “shrunk” towards zero (and each other) –
statisticians often call this a “shrinkage” method
• Typically do not penalize the bias (y-intercept, w0) parameter,
• Penalizing w0 would make the solution depend on the origin chosen for Y – adding a
constant c to Y would not simply shift the predictions by the same constant
• Can fit the bias in a two-step procedure: center the features, then the bias estimate is
    \hat{w}_0 = \bar{y} = \frac{1}{N} \sum_n y_n
• Solutions are not invariant to scaling, so typically we standardize (e.g.
Z-score) features before fitting model ( Sklearn StandardScaler )
Scikit-Learn : L2 Regularized Regression
Alpha is what we have been calling the regularization strength $\lambda$
Scikit-Learn : L2 Regularized Regression
Define and fit OLS and L2 regression,
Plot results,
L2 (Ridge) reduces impact of any single data point
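The slide's code is an image, so here is a hedged sketch of the comparison: synthetic data with one injected outlier, fit with LinearRegression and Ridge (alpha plays the role of $\lambda$):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic 1-D data with one large outlier (illustrative).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(30, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=30)
y[0] += 10.0                           # the outlier

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)     # alpha is the regularization strength

print("OLS slope:  ", ols.coef_[0])
print("Ridge slope:", ridge.coef_[0])  # shrunk, reducing the pull of any single point
```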
Choosing Regularization Strength
We need to tune regularization strength to avoid over/under fitting…
Recall bias/variance tradeoff
Error = Irreducible error + Bias^2 + Variance
High regularization reduces model
complexity: increases bias / decreases
variance
How should we properly tune $\lambda$?
Cross-Validation
N-fold Cross Validation Partition training
data into N “chunks” and for each run
select one chunk to be validation data
For each run, fit to training data (N-1
chunks) and measure accuracy on
validation set. Average model error
across all runs.
Drawback Need to perform training N times.
Source: Bishop, C. PRML
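A sketch of using cross-validation to tune the ridge strength (the alpha grid, the 5 folds, and the stand-in dataset are illustrative; the slide describes the general N-fold procedure):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Try a grid of regularization strengths; score each by 5-fold CV (R^2 by default).
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
mean_scores = [cross_val_score(Ridge(alpha=a), X, y, cv=5).mean() for a in alphas]

best = int(np.argmax(mean_scores))
print("best alpha:", alphas[best], "mean CV R^2:", mean_scores[best])
```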
Model Selection for Linear Regression
A couple of common metrics for model selection…
Residual Sum-of-squared Errors The total squared residual error on the held-out validation set,
    RSS = \sum_n (y_n - \hat{y}(x_n))^2
Coefficient of Determination Also called R-squared or R2.
Fraction of variation explained by the model.
Model selection metrics are known as “goodness of fit” measures
Coefficient of Determination R^2
    R^2 = 1 - RSS / TSS
where RSS = \sum_n (y_n - \hat{y}(x_n))^2 is the residual sum-of-squares, TSS = \sum_n (y_n - \bar{y})^2 is the total variance in the dataset (the variance using the average prediction), and $\bar{y}$ is the average output.
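A small worked check of the $R^2$ formula above against scikit-learn's r2_score (the numbers are made up):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 2.0, 4.0, 5.0, 6.0])
y_pred = np.array([2.8, 2.3, 3.9, 4.6, 6.2])

rss = np.sum((y_true - y_pred) ** 2)           # residual sum-of-squares
tss = np.sum((y_true - y_true.mean()) ** 2)    # total variance around the mean
r2 = 1.0 - rss / tss

print(r2, r2_score(y_true, y_pred))            # the two values agree
```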
Coefficient of Determination R^2
The maximum value R^2 = 1.0 means the model explains all variation in the data.
R^2 = 0 means the model is only as good as predicting the average response.
R^2 < 0 means the model is worse than predicting the average output.
[ Figure: example fits with R^2 > 0 and R^2 = 0 ]
“Shrinkage” Feature Selection
Down-weight features that are not useful for prediction…
Quadratic penalty down-weights
(shrinks) features that are not useful for
prediction
Example The Prostate Cancer Dataset measures
prostate-specific antigen (PSA) with features:
age, log-prostate weight (lweight), log-benign
prostate hyperplasia (lbph), Gleason score
(gleason), seminal vesicle invasion (svi), etc.
L2 regularization learns a weight of (nearly) zero
for log capsular penetration (lcp)
[ Source: Hastie et al. (2001) ]
Constrained Optimization Perspective
Intuition Find the best model (lowest RSS) given a constraint on the total feature weight…
    min_w \sum_n (y_n - w^T x_n)^2   subject to   ||w||_2^2 \le t
There exists a mathematically equivalent penalized formulation for some $\lambda$.
L2 penalized regression rarely learns feature weights that are exactly zero…
[ Figure: squared-error contours and the L2 norm ball; the optimal model sits where they touch ]
[ Source: Hastie et al. (2001) ]
Regularized Least Squares
Ordinary least-squares estimation (no regularizer),
    \hat{w} = argmin_w \sum_{n=1}^N (y_n - w^T x_n)^2
L2-regularized Least-Squares (Ridge) adds a quadratic penalty,
    \hat{w} = argmin_w \sum_{n=1}^N (y_n - w^T x_n)^2 + \lambda ||w||_2^2
L1-regularized Least-Squares (LASSO) adds an absolute value (L1) penalty,
    \hat{w} = argmin_w \sum_{n=1}^N (y_n - w^T x_n)^2 + \lambda ||w||_1
L1 Regularized Least-Squares
[ Figure: squared-error contours and the L1 constraint region; the optimal model sits at a corner, learning w2 = 0 ]
Able to zero-out weights that are not predictive…
Feature Weight Profiles
Varying regularization
parameter moderates
shrinkage factor
For moderate regularization
strength weights for many
features go to zero
• Induces feature sparsity
• Ideal for high-dimensional settings
• Gracefully handles p>N case, for p
features and N training data
Feature Weight Profiles
[ Figure: weight profiles vs. regularization strength, L1 penalty (left) and L2 penalty (right) ]
Learning L1 Regularized Least-Squares
The absolute value penalty $|w|$ is not differentiable…
…its derivative doesn’t exist at w = 0.
Can’t set derivatives to zero as in the L2 case!
Learning L1 Regularized Least-Squares
• Not differentiable, no closed-form solution
• But it is convex! Can be solved by quadratic programming
(beyond the scope of this class…)
• Efficient optimization algorithms exist
• Least Angle Regression (LAR) computes the full solution path for
a range of $\lambda$ values
• Can be solved as efficiently as L2 regression
Specialized methods for cross-validation…
Computes solution using coordinate descent
Uses least angle regression (LARS) to compute solution path
L1 Regression Cross-Validation
Perform L1 Least Squares (LASSO) 20-fold cross-validation,
Plot the solution path for a range of alphas,
• alphas_ – the full grid of candidate alphas
• alpha_ – the learned alpha (no “s”… annoying…)
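A sketch of the cross-validated LASSO fit described above (20 folds), with the diabetes data standing in for the slide's dataset; LassoLarsCV is the LARS-based alternative:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV   # coordinate descent; LassoLarsCV uses LARS

X, y = load_diabetes(return_X_y=True)

lasso = LassoCV(cv=20).fit(X, y)

print("candidate alphas:", lasso.alphas_.shape)   # alphas_ (with the 's')
print("selected alpha:  ", lasso.alpha_)          # alpha_ (no 's')
print("weights:         ", lasso.coef_)           # some weights may be exactly zero
```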
Example: Prostate Cancer Dataset
Best LASSO model learns to
ignore several features (age, lcp,
gleason, pgg45).
Wait…Is age really not a
significant predictor of prostate
cancer? What’s going on here?
Age is highly correlated with other
factors and thus not significant in
the presence of those factors
Administrative Items
HW7 will be posted tonight
• Ordinary least squares regression
• Ridge regression
• Lasso
• Feature selection
Due next Thursday (11/11)
• A bit more is left up to the student compared to HW5 / HW6
Best-Subset Selection
L1 / L2 shrinkage offer approximate feature selection…
The optimal strategy for p features looks at models over all possible
combinations of features (a runnable sketch follows the pseudocode below),
For k in 1,…,p:
    subsets = compute all subsets of k features (p-choose-k of them)
    For kfeat in subsets:
        model = train model on the features in kfeat
        score = evaluate model using cross-validation
Choose the model with the best cross-validation score
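A runnable sketch of the pseudocode above (diabetes data as a stand-in; with p features this trains an exponential number of models, so it is only practical for small p):

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
p = X.shape[1]

best_score, best_subset = -np.inf, None
for k in range(1, p + 1):
    for feats in combinations(range(p), k):          # all p-choose-k subsets
        score = cross_val_score(LinearRegression(),
                                X[:, list(feats)], y, cv=10).mean()
        if score > best_score:
            best_score, best_subset = score, feats

print("best subset:", best_subset, "CV R^2:", best_score)
```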
Best-Subset Selection : Prostate Cancer Dataset
Each marker is the cross-val
R2 score of a trained model
for a subset of features
Data have 8 features, there
are 8-choose-k subsets for
each k=1,…,8 for a total of
255 models
Using 10-fold cross-val
requires 10 x 255 = 2,550
training runs!
Feature Selection: Prostate Cancer Dataset
Best subset has highest test accuracy (lowest
variance) with just 2 features
[ Source: Hastie et al. (2001) ]
Comparing Feature Selection Methods
Notation Change In this figure (from Hastie et al.) the least
squares weights are denoted $\beta$ rather than $w$.
Forward Sequential Selection
An efficient method adds the most predictive feature one-by-one
featSel = empty
featUnsel = all features
For iter in 1,…,p:
    For kfeat in featUnsel:
        thisFeat = featSel + kfeat
        model = train model on thisFeat features
        score = evaluate model using cross-validation
    featSel = featSel + best scoring feature
    featUnsel = featUnsel - best scoring feature
Choose the model with the best cross-validation score
Backward Sequential Selection
Backwards approach starts with all features and removes one-by-one
featSel = all features
For iter in 1,…,p:
    For kfeat in featSel:
        thisFeat = featSel - kfeat
        model = train model on thisFeat features
        score = evaluate model using cross-validation
    featSel = featSel - worst scoring feature
Choose the model with the best cross-validation score
Comparing Feature Selection Methods
Sequential selection is greedy, but often performs well…
Example Feature selection on synthetic
model with p=30 features with pairwise
correlations (0.85). True feature
weights are all zero except for 10
features, with weights drawn from
N(0,6.25).
Sequential selection with p features
takes O(p^2) time, compared to
exponential time for best subset
Sequential feature selection available in Scikit-Learn under:
feature_selection.SequentialFeatureSelector
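A sketch of the scikit-learn selector named above (the estimator, the number of features to select, and the stand-in dataset are illustrative choices):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=4,
                                direction="forward",    # or "backward"
                                cv=10)
sfs.fit(X, y)

print(sfs.get_support())            # boolean mask over the original features
```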
Outline
Linear Regression
Least Squares Estimation
Regularized Least Squares
Logistic Regression
Classification as Regression
Suppose our response variables are binary y={0,1}. How can we use
linear regression ideas to solve this classification problem?
[Link]
Classification as Regression
Idea Fit a regression function to the
data (red). Classify points based on
whether they are above or below the
midpoint (green).
• This is a discriminant function, since it discriminates between classes
• It is a linear function and so is a linear discriminant
• Green line is the decision boundary (also linear)
[Link]
Multiclass Classification as Regression
Suppose we have K classes. Training outputs for each class are a set of indicator vectors,
    y_n = (y_{n1}, …, y_{nK}),   with y_{nk} = 1 if x_n belongs to class k, e.g. y = (0, 0, …, 1, 0, 0)
For N training inputs create the N x K matrix of outputs Y and solve,
    \hat{W} = (X^T X)^{-1} X^T Y
$\hat{W}$ is a D x K matrix holding K linear regression models, one for each class.
This is an instance of multi-output linear regression.
• Compute the fitted output $\hat{y}(x) = \hat{W}^T x$, a K-vector
• Identify the largest component and classify as $\hat{G}(x) = argmax_k \hat{y}_k(x)$
[ Image: Hastie et al. (2001) ]
Linear Probability Models
Binary Classification A linear model approximates the probability of class assignment,
    P(y = 1 | x) \approx w^T x
Multiclass Classification Multiple decision boundaries, each approximated by a class-specific linear model,
    \hat{y}_k(x) = w_k^T x,   where w_k is the weight vector for the kth class
This approximates the probability of class assignment,
    P(y = k | x) \approx w_k^T x
What’s the rationale?
Recall the linear regression model,
    y = w^T x + \epsilon,   \epsilon ~ N(0, \sigma^2)
So linear regression models the expected value,
    E[ y | x ] = w^T x
For discrete (0/1) values we have that,
    E[ y | x ] = P(y = 1 | x)
We can call this approach least squares classification.
Can easily verify that the estimated class probabilities sum to 1,
but they are not guaranteed to be positive!
Logistic Regression
Idea Distort the response variable in some way to map to [0,1] so that it is actually a probability.
Uses the logistic function,
    \sigma(a) = 1 / (1 + e^{-a})
• The logistic function is a type of sigmoid or squashing function, since it maps any value to the range [0,1]
• The predictor variable now actually maps to a valid probability mass function (PMF),
    P(y = 1 | x) = \sigma(w^T x),   P(y = 0 | x) = 1 - \sigma(w^T x)
[Link]
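A minimal sketch of the logistic (sigmoid) squashing function:

```python
import numpy as np

def logistic(a):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

scores = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(logistic(scores))     # logistic(0) = 0.5; large scores approach 0 or 1
```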
Logistic Regression : Decision Boundary
Binary classification decisions are based on the posterior odds ratio,
    P(C = 1 | x) / P(C = 0 | x)
If this ratio is greater than 1.0 then classify as C = 1, otherwise C = 0.
In practice, we use the (natural) logarithm of the posterior odds ratio,
    \log [ P(C = 1 | x) / P(C = 0 | x) ] = w^T x
This is a linear decision boundary, so logistic regression is a linear classifier.
Logistic vs. Logit Transformations
Logistic Function Maps (-\infty, \infty) to [0,1]
Logit Function Maps [0,1] to (-\infty, \infty)
The logistic is also widely used in Neural Networks – for classification the last layer is typically just a logistic regression.
Logistic vs. Logit Transformations
The logistic function maps the linear regression output to the interval [0,1],
    p = \sigma(w^T x) = 1 / (1 + e^{-w^T x})
The logit function is defined for probability values p in [0,1] as,
    logit(p) = \log( p / (1 - p) )
The logit is the inverse of the logistic function. The logit is also the log posterior odds, and thus the decision boundary for our binary classifier.
Multiclass Logistic Regression
Classification decisions are based on the log-ratio compared to a final reference class,
    \log [ P(C = k | x) / P(C = K | x) ] = w_k^T x,   k = 1, …, K-1
The K-1 log-odds (or logit) transformations ensure that the probabilities sum to 1.
The choice of denominator class is arbitrary, but we use K by convention.
Least Squares vs. Logistic Regression
Least Squares
Logistic Regression
• Both models learn a linear decision boundary
• Least squares can be solved in closed-form (convex objective)
• Least squares is sensitive to outliers (need to do regularization)
[Source: Bishop “PRML”]
Least Squares vs. Logistic Regression
Similar results in 1-dimension
[Link]
Least Squares vs. Logistic Regression
Least Squares Logistic Regression
[Source: Bishop “PRML”]
Fitting Logistic Regression
Fit by maximum likelihood—start with the binary case.
The posterior probability of class assignment is Bernoulli,
    p(y | x, w) = \mu(x)^y (1 - \mu(x))^{1-y},   where \mu(x) = \sigma(w^T x)
Given N iid training data pairs the log-likelihood function is,
    \ell(w) = \sum_{n=1}^N [ y_n \log \mu(x_n) + (1 - y_n) \log(1 - \mu(x_n)) ]
Fitting Logistic Regression
Computing the derivatives with respect to each element w_d,
    \partial\ell / \partial w_d = \sum_{n=1}^N x_{nd} ( y_n - \mu(x_n) )
• For D features this gives us D equations and D unknowns
• But the equations are nonlinear and can’t be solved in closed form
• Need to use gradient-based optimization to solve (e.g. Newton’s method)
• Beyond the scope of this class; but know that it is an iterative process
Iteratively Reweighted Least Squares
• Given some estimate of the weights, update by solving a weighted least squares problem,
    w \leftarrow (X^T W X)^{-1} X^T W z
where X is the N x D design matrix, W is an N x N diagonal weight matrix with entries \mu_n (1 - \mu_n),
z is the working response (a step in the gradient direction), and \mu_n = P(y = 1 | x_n) for each training point.
• Essentially solving a reweighted version of least squares,
    w \leftarrow argmin_w (z - X w)^T W (z - X w)
Each iteration changes W and the probabilities \mu_n, so we need to re-solve.
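A hedged NumPy sketch of the IRLS update above for binary logistic regression (no regularization, bias folded into the design matrix; the synthetic data and iteration count are illustrative):

```python
import numpy as np

def irls_logistic(X, y, num_iters=20):
    """Fit logistic regression weights by iteratively reweighted least squares."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        mu = 1.0 / (1.0 + np.exp(-X @ w))                  # P(y=1 | x_n) for each point
        W = np.diag(mu * (1.0 - mu))                       # N x N diagonal weight matrix
        z = X @ w + (y - mu) / (mu * (1.0 - mu) + 1e-12)   # working response
        w = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)      # weighted least squares
    return w

# Synthetic binary data (illustrative) with a bias column of ones.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

print(irls_logistic(X, y))    # should land near w_true
```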
Choice of Optimizer
Since Logistic regression
requires an optimizer, there are
more parameters to consider
The choice of optimizer and its
parameters can affect the time needed to
fit the model (especially if there are
many features)
[Link]
Scikit-Learn Logistic Regression
The function predict_proba(X) returns the predicted class-assignment probabilities,
one column per class (in the binary case you typically just need the P(y=1|x) column)
[Link]
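A sketch of the scikit-learn classifier and predict_proba (the breast cancer dataset and the max_iter value are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)   # give the iterative solver enough steps
clf.fit(X_train, y_train)

print(clf.predict_proba(X_test[:5]))      # shape (5, 2): columns P(y=0|x), P(y=1|x)
print(clf.predict(X_test[:5]))            # thresholded class labels
```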
Using Logistic Regression
The role of Logistic Regression differs in ML and Data Science,
• In Machine Learning we use Logistic Regression for building predictive
classification models
• In Data Science we use it for understanding how features relate to data
classes / categories
Example South African Heart Disease (Hastie et al. 2001)
Data result from the Coronary Risk-Factor Study in 3 rural areas of South
Africa. Data are from white men 15-64 yrs and the response is
presence/absence of myocardial infarction (MI). How predictive is
each of the features?
Looking at Data
Each scatterplot shows
pair of risk factors. Cases
with MI (red) and without
(cyan)
Features
• Systolic blood pressure
• Tobacco use
• Low density lipoprotein (ldl)
• Family history (discrete)
• Obesity
• Alcohol use
• Age
[Source: Hastie et al. (2001)]
Example: African Heart Disease
Fit logistic regression to the
data using MLE estimate via
iteratively reweighted least
squares
Standard error is estimated
standard deviation of the
learned coefficients
Recall, the Z-score of each weight, Z_d = \hat{w}_d / SE(\hat{w}_d), is approximately a standard Normal random variable,
so anything with |Z-score| > 2 is significant at (roughly) the 5% level
Example: African Heart Disease
Finding Systolic blood
pressure (sbp) is not a
significant predictor
Obesity is not significant and
negatively correlated with heart
disease in the model
Remember All correlations / significance of features are relative to
the other features present in the model. We must always consider that
features may be strongly correlated.
Example: African Heart Disease
Doing some feature selection
we find a model with 4
features: tobacco, ldl, family
history, and age
How to interpret coefficients?
(e.g. tobacco 0.081)
• Tobacco is measured in total lifetime usage (in kg)
• Thus, an increase of 1 kg of lifetime tobacco multiplies the odds by
    \exp(0.081) \approx 1.084
or an 8.4% increase in the odds of coronary heart disease
• The 95% CI is 3% to 14%, since \exp(0.081 \pm 2 \cdot SE) \approx (1.03, 1.14)