Lecture slides - Linear Regression (2025)
Amilcar Soares
9 April 2025
▶ Find a function y = f(x) that fits a given dataset and make a prediction yi = f(xi) for an unknown input xi
▶ (x1, y1), (x2, y2), ... is our training set
▶ y = f(x) is our model
[Figure: example dataset, y vs. x.]
In general
A regression problem aims to find a function y = f(X) that fits a given dataset (X, y), where X is an n × p matrix and y is an n × 1 vector (n = number of samples, p = number of features).
From the previous lecture: k-NN regression
▶ y: Price in dollars
▶ x: House Area (Square feet)
▶ Samples: 200
[Figure: house price vs. house area.]
Linear (1D) Regression: Find a function y = β1 + β2 x. For a single sample (xi, yi) the squared error is
(y − yi)^2 = ((β1 + β2 xi) − yi)^2.
Remaining Task: Find β1 and β2 that minimise the cost function J(β1, β2)
Basic Calculus - Find minimum values
From your basic calculus course
Problem: Find x that minimises a given function f (x)
Solution: Differentiate f(x) with respect to x and set the derivative to zero:
df/dx = 0
and solve the resulting equation for x ⇒ extreme values of f(x).
In our case
Problem: Find β1 and β2 that minimise the cost function J(β1, β2)
Solution: Differentiate J(β1 , β2 ) with respect to β1 and β2 and set them to zero:
∂J/∂β1 = 0 and ∂J/∂β2 = 0
Solution
▶ Linear fit on the training set
[Figure: training data with the fitted line.]
Linear (Degree 1) Regression – Summary
▶ A dataset (xi , yi ) with n samples
▶ Model (or hypothesis): y = β1 + β2 x (Polynomial of degree 1)
▶ Problem: Find β1 and β2 that minimise the cost function
J(β1, β2) = (1/n) Σ_{i=1}^{n} ((β1 + β2 xi) − yi)^2 (= MSE)
▶ Vectorised form: y = (y1, ..., yn)^T, β = (β1, β2)^T, and Xext is the matrix with rows (1, xi)
▶ y is an n × 1 vector, Xext is an n × 2 matrix and β is a 2 × 1 vector
▶ Model: y = Xext β (corresponds to y = β1 + β2 x)
▶ Problem: Find β that minimises the cost function
J(β) = (1/n) (Xext β − y)^T (Xext β − y)
▶ Solution (Normal Equation)
β = (Xext^T Xext)^{-1} Xext^T y
An easy-to-read reference for deriving the Normal Equation:
https://dzone.com/articles/derivation-normal-equation
▶ Computing β is O(p^3) ⇒ might be problematic for a large number of features p.
Vectorization in Python
Cost Function and Normal Equation from the previous slide, respectively:
J(β) = (1/n) (Xext β − y)^T (Xext β − y)
β = (Xext^T Xext)^{-1} Xext^T y
Implement your solution with Numpy
▶ Assume X , y are Numpy (np) arrays
▶ How to extend X?
Xe = np.c_[np.ones((n,1)),X]
▶ How to run the model?
np.dot(Xe,beta)
▶ How to implement the cost function?
J = (j.T.dot(j))/n, where j = np.dot(Xe,beta)-y
▶ How to implement the Normal Equation?
beta = np.linalg.inv(Xe.T.dot(Xe)).dot(Xe.T).dot(y)
The NumPy syntax is a bit strange at first, so take your time to learn and use it.
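Putting the snippets above together, a minimal end-to-end sketch might look like the following. The small 1D dataset is made up purely for illustration; the variable names follow the bullets above.

import numpy as np

# Made-up dataset: house area (square feet) -> price (arbitrary units)
X = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([2.1, 3.0, 4.2, 5.1, 5.9])
n = len(y)

# Extend X with a column of ones for the intercept
Xe = np.c_[np.ones((n, 1)), X]

# Normal Equation: beta = (Xe^T Xe)^{-1} Xe^T y
beta = np.linalg.inv(Xe.T.dot(Xe)).dot(Xe.T).dot(y)

# Cost function J(beta) = (1/n) (Xe beta - y)^T (Xe beta - y)
j = np.dot(Xe, beta) - y
J = (j.T.dot(j)) / n

print("beta:", beta, " J:", J)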
Gradient Descent - Motivation
We will often face the following minimization problem: find β that minimises the cost function J(β).
In general
▶ Analytical solutions like the Normal Equation are not always possible.
▶ In high-dimensional datasets, the matrix X^T X can become extremely large, making
the computation of its inverse computationally intensive and memory-intensive.
▶ Additionally, the inversion of such a large matrix may not even be possible due to
numerical instability or limited computational resources.
▶ Therefore, we need to use numerical methods to solve the problem
▶ In this lecture:
▶ Simplest possible algorithm - (Batch) Gradient Descent
▶ Feature normalization ⇒ a method to speed up the gradient descent
procedure
▶ Later on:
▶ Other variants of Gradient Descent
▶ Available support in Python
Gradient Descent - Introduction
Problem: Find x that minimises f (x)
Solution
1. Select a start value x^0 and a (small) learning rate γ
2. Apply repeatedly x^{j+1} = x^j − γ df/dx
3. Stop when |x^{j+1} − x^j| < ε or after a fixed number of iterations
Notice
▶ Will, in general, find a local min
▶ Will find global min if f (x) convex
▶ Pros: Simple and fast for strongly convex problems
▶ Cons: Slow ⇒ requires many iterations and a small γ in many realistic cases
f(x) is convex if the line segment between any two points on the function's graph lies on or above the graph.
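A minimal sketch of the 1D procedure above. The test function f(x) = (x − 3)^2 and all names are made up for illustration.

def gradient_descent_1d(df, x0, gamma=0.1, eps=1e-8, max_iter=10000):
    # Repeat x_{j+1} = x_j - gamma * df/dx until |x_{j+1} - x_j| < eps
    x = x0
    for _ in range(max_iter):
        x_new = x - gamma * df(x)
        if abs(x_new - x) < eps:
            break
        x = x_new
    return x

# f(x) = (x - 3)^2  =>  df/dx = 2(x - 3), minimum at x = 3
print(gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0))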
Selecting learning rate
Our choice of learning rate influences the convergence rate.
Gradient Descent for J(β1 , β2 )
The cost function as a function of β1 , β2
J(β1, β2) = (1/n) Σ_{i=1}^{n} ((β1 + β2 xi) − yi)^2
Thus
▶ J(β1 , β2 ) is an upward facing parabolic “bowl” (i.e. convex)
▶ ... ⇒ will have a unique minimum (β1^min, β2^min)
▶ ... ⇒ we can take any starting point (β1^0, β2^0) in gradient descent
2D Gradient Descent
β1^{j+1} = β1^j − γ ∂J/∂β1 = β1^j − (2γ/n) Σ_{i=1}^{n} (β1^j + β2^j xi − yi)
β2^{j+1} = β2^j − γ ∂J/∂β2 = β2^j − (2γ/n) Σ_{i=1}^{n} xi (β1^j + β2^j xi − yi)
Gradient Descent for J(β) – Vectorised
X is the matrix with rows (1, xi), y = (y1, y2, ..., yn)^T, β = (β1, β2)^T
Cost function: J(β) = (1/n) (Xβ − y)^T (Xβ − y)
2D Gradient Descent
β^{j+1} = β^j − γ∇J(β) = β^j − (2γ/n) X^T (Xβ^j − y)
where the gradient is defined as ∇J(β) = (∂J/∂β1, ∂J/∂β2)^T
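In code, the vectorised update above could be sketched as follows. The default values of gamma and the iteration count are placeholders, not values from the slides.

import numpy as np

def gradient_descent(Xe, y, gamma=0.01, n_iter=1000):
    # Vectorised update: beta_{j+1} = beta_j - (2*gamma/n) * Xe^T (Xe beta_j - y)
    n = len(y)
    beta = np.zeros(Xe.shape[1])
    for _ in range(n_iter):
        beta = beta - (2.0 * gamma / n) * Xe.T.dot(Xe.dot(beta) - y)
    return beta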
Gradient Descent for J(β)
Dataset:
x = [1, 2, 3]
y = [3, 4, 6]
Parameters:
β1 = 0
β2 = 0
Cost function:
J(β) = (1/3) Σ_{i=1}^{3} (β1 + β2 xi − yi)^2
Gradient Descent update rule:
β1^{j+1} = β1^j − γ ∂J(β)/∂β1
β2^{j+1} = β2^j − γ ∂J(β)/∂β2
Gradient Descent for J(β) (continued)
Gradient Descent iteration 1:
Initial values: β1^0 = 0, β2^0 = 0, learning rate γ = 0.02
Residuals β1^0 + β2^0 xi − yi:
For i = 1: 0 + 0 · 1 − 3 = −3
For i = 2: 0 + 0 · 2 − 4 = −4
For i = 3: 0 + 0 · 3 − 6 = −6
∂J/∂β1 = (2/3)(−3 − 4 − 6) = (2/3)(−13) = −8.67
Residuals weighted by xi:
For i = 1: 1 · (−3) = −3
For i = 2: 2 · (−4) = −8
For i = 3: 3 · (−6) = −18
∂J/∂β2 = (2/3)(−3 − 8 − 18) = (2/3)(−29) = −19.33
β1^1 = 0 − 0.02 · (−8.67) = 0.1734
β2^1 = 0 − 0.02 · (−19.33) = 0.3866
Gradient Descent for J(β) (continued)
Gradient Descent iteration 2:
Given: β1^1 ≈ 0.17, β2^1 ≈ 0.39 (rounded), learning rate γ = 0.02
∂J/∂β1 = (2/3)(−2.44 − 3.05 − 4.66) = (2/3)(−10.15) = −6.77
∂J/∂β2 = (2/3)(−2.44 − 6.10 − 13.98) = (2/3)(−22.52) = −15.01
β1^2 = 0.17 − 0.02 · (−6.77) = 0.17 + 0.1354 = 0.3054
β2^2 = 0.39 − 0.02 · (−15.01) = 0.39 + 0.3002 = 0.6902
Gradient Descent for J(β) (continued)
Gradient Descent for 9 iterations...
The small differences in the β values are because the code works in full floating-point precision, whereas the hand calculation on the previous slides used rounded intermediate values.
Gradient Descent in Practice
Initial Steps
1. Number of iterations N = 10, α = 0.00001, β^0 = (0, 0)
2. Repeat β^{j+1} = β^j − α X^T (Xβ^j − y)
3. Print/plot J(β) vs N to make sure it is decreasing for each iteration
That is, we select a small α = 2γ/n and make a few iterations.
- J(β) steadily decreasing ⇒ α small enough (maybe too small)
- J(β) fluctuating or increasing ⇒ must decrease α
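A sketch of these initial steps with cost monitoring. The function name is made up; Xe and y are assumed to be the extended data matrix and target vector.

import numpy as np

def gd_with_monitoring(Xe, y, alpha=0.00001, n_iter=10):
    # Batch gradient descent; track J(beta) to verify it decreases each iteration
    beta = np.zeros(Xe.shape[1])
    costs = []
    for _ in range(n_iter):
        r = Xe.dot(beta) - y
        costs.append(float(r.T.dot(r)) / len(y))   # J(beta) = (1/n) r^T r
        beta = beta - alpha * Xe.T.dot(r)           # beta_{j+1} = beta_j - alpha X^T (X beta_j - y)
    return beta, costs

# Print (or plot) costs: steadily decreasing => alpha small enough;
# fluctuating or increasing => decrease alpha.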
House Prices in Oregon – Gradient Descent
[Figure, left: training data with the Gradient Descent and Normal Equation fits; House Price (Dollar) vs. House Area (Square feet). Right: Gradient Descent cost vs. iterations with α = 1.0E-07.]
The update rule
β^{j+1} = β^j − α X^T (Xβ^j − y)
makes use of the entire dataset X, y in each iteration, with α = 2γ/n.
Multivariate Linear Regression – Introduction
Previously: y (price) as a function of x (area, one feature)
Model: y = β1 + β2 x
Now: y as a function of multiple features x1 , x2 , ...xp
Model: y = β0 + β1 x1 + β2 x2 + ... + βp xp (multiple features, still linear)
Example: A girl's height as a function of mom and dad heights (two features)
Dataset: 214 observations of mom, dad, and girl heights
Problem: Find β minimising the cost function J(β) = (1/n) (Xβ − y)^T (Xβ − y)
Gradient Descent
▶ α = 0.0002, Niter = 20 000 000
▶ Predicted height for parents (65, 70): 65.425
Notice
▶ All feature values centered around 0
▶ All features have the same spread (standard deviation 1)
Feature Normalization
▶ Feature normalization turns every “bowl” into a uniform one (the yellow)
that is strongly convex ⇒ the gradient descent iteration converges rapidly
▶ The girls-height “bowl” looks a bit like the black one: rather strongly convex for β2, β3 and very flat for the intercept coefficient β1
▶ Strongly convex for β2 , β3 ⇒ must use a small learning rate (step size) α
▶ Small α ⇒ very slow convergence for β1 ⇒ We need 20 million iterations
Normalized Girls Height – Result
▶ J = 4.048
▶ Height of girl with parents (65, 70): 65.425
[Figure: cost J vs. iterations (0 to 1000), comparing Gradient Descent with the Normal Equation solution.]
Notice: Gradient descent now requires just 1000 iterations ⇒ less than a second.
Also, parents (65,70) must be normalized to (0.4898, 0.1821) before computing
the height using our new β.
Polynomial – Dataset
Entire dataset, 1000 samples
[Figure: the polynomial dataset, y vs. x (x from 0 to 25, y from 0 to 80).]
Algorithm
1. Read data ⇒ vectors X and y
2. Extend X ⇒ X = [1, X, X^2, X^3]
3. Compute β as β = (X^T X)^{-1} X^T y
4. Plot Xβ vs x (see the code sketch below)
[Figure: the fitted degree-3 polynomial plotted over the data.]
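A sketch of these steps in NumPy. The file polynomial.csv is mentioned later in the lecture, but its layout (two comma-separated columns x, y) is an assumption here.

import numpy as np
import matplotlib.pyplot as plt

# Assumed layout: two comma-separated columns x, y
data = np.loadtxt('polynomial.csv', delimiter=',')
x, y = data[:, 0], data[:, 1]

# Extend x to [1, x, x^2, x^3]
Xe = np.c_[np.ones(len(x)), x, x**2, x**3]

# Normal Equation
beta = np.linalg.inv(Xe.T.dot(Xe)).dot(Xe.T).dot(y)

# Plot the fitted polynomial over the data
xs = np.sort(x)
Xs = np.c_[np.ones(len(xs)), xs, xs**2, xs**3]
plt.scatter(x, y, s=5)
plt.plot(xs, Xs.dot(beta), color='red')
plt.show()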
The Mean Squared Error of a fit is MSE = (1/n) Σ_{i=1}^{n} (f(xi) − yi)^2, or vectorized: MSE = (1/n) (Xβ − y)^T (Xβ − y)
Training vs Test
▶ The Mean Squared Error (MSE) computed using the training set is
referred to as the Training MSE estimation.
▶ The MSE computed using a separate test set is the Test MSE estimation.
▶ A test set resembles the training set but is not utilized in model
construction, remaining unseen by the model during training.
▶ A high-performing model exhibits a low Test MSE, indicating strong
performance on unseen data.
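A sketch of how such a train/test comparison can be set up for a polynomial model of a given degree. The helper names are illustrative and not the exact code behind the table on the next slide.

import numpy as np

def poly_extend(x, degree):
    # Build the extended matrix [1, x, x^2, ..., x^degree]
    return np.column_stack([x**d for d in range(degree + 1)])

def mse(Xe, y, beta):
    r = Xe.dot(beta) - y
    return float(r.T.dot(r)) / len(y)

def train_test_mse(x_train, y_train, x_test, y_test, degree):
    # Fit on the training set only (Normal Equation), evaluate on both sets
    Xtr = poly_extend(x_train, degree)
    beta = np.linalg.inv(Xtr.T.dot(Xtr)).dot(Xtr.T).dot(y_train)
    return mse(Xtr, y_train, beta), mse(poly_extend(x_test, degree), y_test, beta)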
Training vs Test – polynomial.csv
▶ We used a small sample of 50 points to train various polynomial models
▶ We used the remaining 950 points to evaluate the models
Degree MSEtrain MSEtest
1 55.7 69.1
2 53.4 75.4
3 22.3 27.6
4 21.4 30.6
5 21.3 31.3
6 18.5 39.4
7 17.7 67.6
8 16.7 197.7
9 16.6 284.5
10 16.5 558.7
Conclusions
▶ MSEtrain is steadily decreasing ⇒ higher order polynomials can always
better adapt to the training data
▶ MSEtest has a minimum at degree 3 ⇒ best to handle unseen data
⇒ Degree 3 gives the best fit!
Reducible and Irreducible Errors
We generated data from
y = f(x) = 5 + 12x − x^2 + 0.025x^3 + normrnd(0, 5), where the last term is the irreducible noise ε
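For reference, generating data of this form in NumPy could look like the sketch below; np.random.normal plays the role of normrnd, while the x-range and sample size are assumptions based on the earlier plots.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 25, size=1000)        # assumed range and sample size
eps = rng.normal(0, 5, size=x.shape)     # irreducible noise term epsilon
y = 5 + 12 * x - x**2 + 0.025 * x**3 + eps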
▶ The models cannot capture the true underlying relationship between the
features (X) and the target variable (y).
▶ The models exhibit high variability in their predictions when trained on different subsets of
the data.
▶ Despite using the same degree of polynomial features, the models show significant
differences in their shapes and predictions across different training datasets.
▶ This variability arises from the sensitivity of the models to the specific samples in the training
data, leading to different learned patterns and resulting in a wide range of predictions.
▶ Consequently, the models demonstrate high sensitivity to variations in the training data,
indicating high variance.
▶ In other words, the models tend to overfit the training data, capturing noise and
idiosyncrasies specific to each training set rather than the true underlying relationship
between the features and the target variable.
Regression Summary
All scenarios (linear, multivariate, polynomial) are treated the same way
▶ Dataset: [X , y ]
▶ Extend X to fit scenario Xext = [1, X , ...]
▶ Model: y = Xext β
▶ Problem: Find β that minimises the cost function
J(β) = (1/n) (Xext β − y)^T (Xext β − y)
▶ Exact Solution (Normal Equation)
β = (Xext^T Xext)^{-1} Xext^T y
▶ Approximative Solution (Gradient Descent)
β^{j+1} = β^j − γ∇J(β) = β^j − (2γ/n) Xext^T (Xext β^j − y)
▶ Apply feature normalization of each feature Xi to speed up gradient descent (see the sketch below)
1. Compute the mean μi and standard deviation σi of Xi
2. Compute the normalized feature Xn,i = (Xi − μi)/σi
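A minimal sketch of the normalization step, column-wise in NumPy; the function name is made up. Remember that a new input must be normalized with the training μ and σ before predicting, as in the girls-height example.

import numpy as np

def normalize_features(X):
    # Column-wise: zero mean, unit standard deviation
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Usage sketch: Xn, mu, sigma = normalize_features(X)
# A new sample x_new is normalized as (x_new - mu) / sigma before applying beta.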