Lecture slides - Linear Regression (2025)

This lecture focuses on linear and polynomial regression as part of an introduction to machine learning. It covers the mathematical concepts, the least squares method for fitting models, and the use of gradient descent for optimization. Datasets and practical examples, such as predicting house prices, are also discussed to illustrate the application of these techniques.

Introduction to Machine Learning

Lecture 2 - Linear and Polynomial Regression

Amilcar Soares

[email protected]

Slides and used datasets are available in Moodle


A big thanks to Dr. Jonas Lundberg for providing most of the slides for this Lecture.

9 April 2025

Introduction to Machine Learning 1(45)


Agenda - Linear Regression
Reading Instructions

▶ Lindholm, A., Wahlström, N., Lindsten, F., & Schön, T. B. (2022).
  Machine Learning: A First Course for Engineers and Scientists.
  Cambridge University Press.
▶ Chapter 3: Basic Parametric Models and a Statistical Perspective on
  Learning. Pages 37 to 45.
▶ This lecture and slides focus on mathematics (high-level), concepts, and
understanding of concepts.

Datasets: house_prices.csv, girls_height.csv, polynomial.csv

Introduction to Machine Learning 2(45)


Regression – Introduction
[Figure: polynomial fit, 1000 samples — fitted curve over the data, y vs x]

▶ Find a function y = f (x) that fits a given dataset (x1 , y1 ), (x2 , y2 ), ...
▶ Once f (x) is found, make a prediction yi = f (xi ) for an unknown input xi
▶ (x1 , y1 ), (x2 , y2 ), ... is our training set
▶ y = f (x) is our model

In general
A regression problem aims to find a function y = f (X ) that fits a given dataset (X , y ),
where X is an n × p matrix and y is an n × 1 vector, with n = number of samples and
p = number of features.
Introduction
Introduction to Machine Learning 3(45)
From the previous lecture: k-NN regression

▶ The figure above shows a k = 5 fit to a given polynomial dataset (xi , yi )


▶ It looks ugly, but it serves its purpose: to compute y for an arbitrary X .
▶ To build the plot:
▶ Divide x-axis interval [1, 25] into e.g., 200 equidistant points Xj
▶ For each Xj : find the 5 data points in the dataset closest to Xj
▶ Compute the average of the corresponding y-values for the 5 selected
data points ⇒ Yj
▶ Plot the points (Xj , Yj ); a minimal sketch of this procedure follows below
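A minimal NumPy sketch of this plotting procedure (assuming x and y are 1-D arrays already loaded from polynomial.csv; the function name is illustrative):

import numpy as np

def knn_regression_curve(x, y, k=5, num_points=200):
    """Approximate y = f(x) by averaging the k nearest y-values at each grid point."""
    grid = np.linspace(x.min(), x.max(), num_points)   # e.g. 200 equidistant points X_j
    preds = np.empty(num_points)
    for j, xj in enumerate(grid):
        nearest = np.argsort(np.abs(x - xj))[:k]        # indices of the k closest samples
        preds[j] = y[nearest].mean()                    # Y_j = average of their y-values
    return grid, preds

# Usage (assuming matplotlib and the data are available):
# Xj, Yj = knn_regression_curve(x, y, k=5)
# plt.plot(Xj, Yj)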
Introduction
Introduction to Machine Learning 4(45)
Regression Example: House Prices in Oregon
Dataset
▶ X: House area in square feet
▶ y: Price in dollars
▶ Samples: 200

[Figure: scatter plot of house price (dollars, ×10^5) vs house area (square feet)]

Q: What is the price for a 3500 square feet house in Oregon?

Solution: Find a function

y = f (X )

that fits the data and then compute f (3500) to find the predicted price.

Linear (1D) Regression: Find a function

y = β1 + β2 X

that fits the data.


Introduction
Introduction to Machine Learning 5(45)
Linear (1D) Regression - Introduction

▶ Assumption: y = β1 + β2 x (our model)


▶ Goal: Find β1 , β2 making y = β1 + β2 x the best possible fit

The Least Squares Method

The squared vertical distance between (xi , yi ) and the assumption y = β1 + β2 x is

(y − yi )^2 = ((β1 + β2 xi ) − yi )^2 .

The mean squared distance over the entire training set is

J(β1 , β2 ) = (1/n) Σ_{i=1}^{n} ((β1 + β2 xi ) − yi )^2 .

Remaining Task: Find β1 and β2 that minimise the cost function J(β1 , β2 )

Introduction
Introduction to Machine Learning 6(45)
Basic Calculus - Find minimum values
From your basic calculus course
Problem: Find x that minimises a given function f (x)
Solution: Differentiate f (x) with respect to x and set the derivative to zero:

df/dx = 0

and solve the resulting equation for x ⇒ extreme values for f (x).

In our case
Problem: Find β1 and β2 that minimise the cost function J(β1 , β2 )
Solution: Differentiate J(β1 , β2 ) with respect to β1 and β2 and set the derivatives to zero:

∂J/∂β1 = 0
∂J/∂β2 = 0

and solve the resulting system of equations for β1 and β2 .


Introduction
Introduction to Machine Learning 7(45)
Linear (1D) Regression - Finding β
J(β1 , β2 ) = (1/n) Σ_{i=1}^{n} ((β1 + β2 xi ) − yi )^2 .

We differentiate J(β1 , β2 ) with respect to β1 and β2 and set the derivatives to zero:

∂J/∂β1 = (2/n) Σ_{i=1}^{n} (β1 + β2 xi − yi ) = 0
∂J/∂β2 = (2/n) Σ_{i=1}^{n} xi (β1 + β2 xi − yi ) = 0

Simplifying

β1 Σ_{i=1}^{n} 1 + β2 Σ_{i=1}^{n} xi = Σ_{i=1}^{n} yi
β1 Σ_{i=1}^{n} xi + β2 Σ_{i=1}^{n} xi^2 = Σ_{i=1}^{n} xi yi

Solving for β1 , β2 gives

β1 = (Sxx Sy − Sx Sxy ) / (n Sxx − Sx Sx ),    β2 = (n Sxy − Sx Sy ) / (n Sxx − Sx Sx )

where

Sx = Σ_{i=1}^{n} xi ,   Sy = Σ_{i=1}^{n} yi ,   Sxx = Σ_{i=1}^{n} xi^2 ,   Sxy = Σ_{i=1}^{n} xi yi
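As a sanity check, the closed-form expressions above translate directly into NumPy. A minimal sketch (the function name and the toy data are illustrative, not part of the lecture code):

import numpy as np

def fit_line_least_squares(x, y):
    """Closed-form least-squares fit of y = beta1 + beta2 * x (1-D case)."""
    n = len(x)
    Sx, Sy = x.sum(), y.sum()
    Sxx, Sxy = (x ** 2).sum(), (x * y).sum()
    denom = n * Sxx - Sx * Sx
    beta1 = (Sxx * Sy - Sx * Sxy) / denom
    beta2 = (n * Sxy - Sx * Sy) / denom
    return beta1, beta2

# Illustrative toy data:
# x = np.array([1.0, 2.0, 3.0]); y = np.array([3.0, 4.0, 6.0])
# fit_line_least_squares(x, y)   # ~ (1.33, 1.5)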
Introduction
Introduction to Machine Learning 8(45)
Example: House Prices in Oregon
[Figure: Result – Normal Equation: training data with the linear regression fit, house price (dollars, ×10^5) vs house area (square feet)]

Q: What is the price for a 3500 square feet house in Oregon?

Solution
▶ Linear fit on the training set
▶ price = β1 + β2 × area

Computing β1 and β2 as in the previous slide
▶ β1 = −40259, β2 = 223.77

Answer
The price for a 3500 sqft house in Oregon is $742,946

The exact solution to the minimisation problem presented in the previous slide is
called the Normal Equation.

Introduction
Introduction to Machine Learning 9(45)
Linear (Degree 1) Regression – Summary
▶ A dataset (xi , yi ) with n samples
▶ Model (or hypothesis): y = β1 + β2 x (Polynomial of degree 1)
▶ Problem: Find β1 and β2 that minimise the cost function

J(β1 , β2 ) = (1/n) Σ_{i=1}^{n} ((β1 + β2 xi ) − yi )^2   (= MSE)

▶ The Normal Equation Solution

β1 = (Sxx Sy − Sx Sxy ) / (n Sxx − Sx Sx ),    β2 = (n Sxy − Sx Sy ) / (n Sxx − Sx Sx )

where

Sx = Σ_{i=1}^{n} xi ,   Sy = Σ_{i=1}^{n} yi ,   Sxx = Σ_{i=1}^{n} xi^2 ,   Sxy = Σ_{i=1}^{n} xi yi

Linear regression is a parametric approach since it boils down to finding a few


parameters β1 and β2 .
Introduction
Introduction to Machine Learning 10(45)
Linear (Degree 1) Regression – Vectorised
   
y = [y1 ; y2 ; ... ; yn ],    Xext = [1 x1 ; 1 x2 ; ... ; 1 xn ],    β = [β1 ; β2 ]

▶ y is an n × 1 vector, Xext is an n × 2 matrix and β is a 2 × 1 vector
▶ Model: y = Xext β (corresponds to y = β1 + β2 x)
▶ Problem: Find β that minimises the cost function

J(β) = (1/n) (Xext β − y )^T (Xext β − y )

▶ Solution (Normal Equation)

β = (Xext^T Xext )^{−1} Xext^T y

An easy-to-read reference for deriving the Normal Equation:
https://dzone.com/articles/derivation-normal-equation
▶ Computing β is O(p^3 ) ⇒ might be problematic for a large number of features p.

Introduction
Introduction to Machine Learning 11(45)
Vectorization in Python
Cost Function and Normal Equation from the previous slide, respectively:
J(β) = (1/n) (Xext β − y )^T (Xext β − y )

β = (Xext^T Xext )^{−1} Xext^T y
Implement your solution with Numpy
▶ Assume X , y are Numpy (np) arrays
▶ How to extend X?
Xe = np.c_[np.ones((n,1)),X]
▶ How to run the model?
np.dot(Xe,beta)
▶ How to implement the cost function?
J = (j.T.dot(j))/n, where j = np.dot(Xe,beta)-y
▶ How to implement the Normal Equation?
beta = np.linalg.inv(Xe.T.dot(Xe)).dot(Xe.T).dot(y)

The NumPy syntax is a bit strange at first, so take your time to learn and use it.
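Putting the snippets above together, a minimal end-to-end sketch might look as follows (the toy data is illustrative; in the lecture X and y would come from a dataset such as house_prices.csv):

import numpy as np

# Illustrative toy data; replace with the real dataset
X = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 4.0, 6.0])
n = len(y)

# Extend X with a column of ones
Xe = np.c_[np.ones((n, 1)), X]

# Normal Equation: beta = (Xe^T Xe)^{-1} Xe^T y
beta = np.linalg.inv(Xe.T.dot(Xe)).dot(Xe.T).dot(y)

# Cost function J(beta) = (1/n) (Xe beta - y)^T (Xe beta - y)
j = np.dot(Xe, beta) - y
J = (j.T.dot(j)) / n

print(beta, J)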
Introduction
Introduction to Machine Learning 12(45)
Gradient Descent - Motivation
We will often face the following minimization problem:

Find β that minimises a given cost function J(β).

Gradient descent is a numerical method for solving minimization problems.

In general
▶ Analytical solutions like the Normal Equation are not always possible.
▶ In high-dimensional datasets, the matrix X T X can become extremely large, making
the computation of its inverse computationally intensive and memory-intensive.
▶ Additionally, the inversion of such a large matrix may not even be possible due to
numerical instability or limited computational resources.
▶ Therefore, we need to use numerical methods to solve the problem
▶ In this lecture:
▶ Simplest possible algorithm - (Batch) Gradient Descent
▶ Feature normalization ⇒ a method to speed up the gradient descent
procedure
▶ Later on:
▶ Other variants of Gradient Descent
▶ Available support in Python
Gradient Descent
Introduction to Machine Learning 13(45)
Gradient Descent - Introduction
Problem: Find x that minimises f (x)
Solution
1. Select a start value x^0 and a (small) learning rate γ
2. Repeatedly apply x^{j+1} = x^j − γ df/dx
3. Stop when |x^{j+1} − x^j | < ε or after a fixed number of iterations

Notice
▶ Will, in general, find a local min
▶ Will find global min if f (x) convex
▶ Pros: Simple and fast for strongly convex problems
▶ Cons: Slow ⇒ requires many iterations and a small γ in many realistic cases

f (x) is convex if the line segment between any two points on the function’s graph lies
above the graph.
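A minimal sketch of the three-step procedure above for a one-dimensional function (the function name, tolerance, and example derivative are illustrative):

def gradient_descent_1d(df, x0, gamma=0.1, eps=1e-8, max_iter=10000):
    """Minimise f by repeatedly stepping against its derivative df."""
    x = x0
    for _ in range(max_iter):
        x_new = x - gamma * df(x)          # x^{j+1} = x^j - gamma * df/dx
        if abs(x_new - x) < eps:           # stop when the step becomes tiny
            return x_new
        x = x_new
    return x

# Example: f(x) = (x - 2)^2 is convex, df/dx = 2(x - 2), minimum at x = 2
# gradient_descent_1d(lambda x: 2 * (x - 2), x0=0.0)   # ~2.0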

Gradient Descent
Introduction to Machine Learning 14(45)
Selecting learning rate
Our choice of learning rate influences the convergence rate:

▶ learning rate too small ⇒ slow convergence ⇒ many iterations


▶ learning rate too large ⇒ no convergence!

Gradient Descent
Introduction to Machine Learning 15(45)
Gradient Descent for J(β1 , β2 )
The cost function as a function of β1 , β2

J(β1 , β2 ) = (1/n) Σ_{i=1}^{n} ((β1 + β2 xi ) − yi )^2
            = a + b β1 + c β2 + d β1^2 + e β1 β2 + f β2^2 .

for some constants a, b, c, d, e, f that depend on xi , yi .
Also, d = 1 and f = (1/n) Σ_{i=1}^{n} xi^2 will both be positive.

Thus
▶ J(β1 , β2 ) is an upward facing parabolic “bowl” (i.e. convex)
▶ ... ⇒ will have a unique min (β1min , β2min )
▶ ... ⇒ we can take any starting point (β10 , β20 ) in gradient descent

2D Gradient Descent
β1^{j+1} = β1^j − γ ∂J/∂β1 = β1^j − (2γ/n) Σ_{i=1}^{n} (β1^j + β2^j xi − yi )
β2^{j+1} = β2^j − γ ∂J/∂β2 = β2^j − (2γ/n) Σ_{i=1}^{n} xi (β1^j + β2^j xi − yi )

Gradient Descent
Introduction to Machine Learning 16(45)
Gradient Descent for J(β) – Vectorised

X = [1 x1 ; 1 x2 ; ... ; 1 xn ],    y = [y1 ; y2 ; ... ; yn ],    β = [β1 ; β2 ]

Cost function: J(β) = (1/n) (X β − y )^T (X β − y )

2D Gradient Descent

β^{j+1} = β^j − γ ∇J(β) = β^j − (2γ/n) X^T (X β^j − y )

where the gradient ∇J(β) is defined as

∇J(β) = [ ∂J/∂β1 ; ∂J/∂β2 ]

Gradient Descent
Introduction to Machine Learning 17(45)
Gradient Descent for J(β)
Dataset:
x = [1, 2, 3]
y = [3, 4, 6]
Parameters:
β1 = 0
β2 = 0
Cost function:

J(β) = (1/3) Σ_{i=1}^{3} (β1 + β2 xi − yi )^2

Gradient Descent update rule:

β1^{j+1} = β1^j − γ ∂J(β)/∂β1
β2^{j+1} = β2^j − γ ∂J(β)/∂β2

Gradient Descent
Introduction to Machine Learning 18(45)
Gradient Descent for J(β) (continued)
Gradient Descent iteration 1:
Initial values: β1^0 = 0, β2^0 = 0, learning rate γ = 0.02

Residuals β1 + β2 xi − yi :
For i = 1 : 0 + 0 · 1 − 3 = −3
For i = 2 : 0 + 0 · 2 − 4 = −4
For i = 3 : 0 + 0 · 3 − 6 = −6

∂J/∂β1 = (2/3)(−3 − 4 − 6) = (2/3)(−13) = −8.67

Weighted residuals xi (β1 + β2 xi − yi ):
For i = 1 : 1 · (−3) = −3
For i = 2 : 2 · (−4) = −8
For i = 3 : 3 · (−6) = −18

∂J/∂β2 = (2/3)(−3 − 8 − 18) = (2/3)(−29) = −19.33

β1^1 = 0 − 0.02 · (−8.67) = 0.1734
β2^1 = 0 − 0.02 · (−19.33) = 0.3866

Gradient Descent
Introduction to Machine Learning 19(45)
Gradient Descent for J(β) (continued)
Gradient Descent iteration 2:
Given: β1^1 = 0.17, β2^1 = 0.39, learning rate γ = 0.02

Residuals β1 + β2 xi − yi :
For i = 1 : 0.17 + 0.39 · 1 − 3 = −2.44
For i = 2 : 0.17 + 0.39 · 2 − 4 = −3.05
For i = 3 : 0.17 + 0.39 · 3 − 6 = −4.66

∂J/∂β1 = (2/3)(−2.44 − 3.05 − 4.66) = (2/3)(−10.15) = −6.77

Weighted residuals xi (β1 + β2 xi − yi ):
For i = 1 : 1 · (−2.44) = −2.44
For i = 2 : 2 · (−3.05) = −6.10
For i = 3 : 3 · (−4.66) = −13.98

∂J/∂β2 = (2/3)(−2.44 − 6.10 − 13.98) = (2/3)(−22.52) = −15.01

β1^2 = 0.17 − 0.02 · (−6.77) = 0.17 + 0.1354 = 0.3054
β2^2 = 0.39 − 0.02 · (−15.01) = 0.39 + 0.3002 = 0.6902

Gradient Descent
Introduction to Machine Learning 20(45)
Gradient Descent for J(β) (continued)
Gradient Descent for 9 iterations...
The small differences in the β values are due to the floating-point implementation in the code,
which is more precise than the rounded hand calculations on the previous slides.
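A short vectorised sketch that reproduces these hand calculations for the toy dataset x = [1, 2, 3], y = [3, 4, 6] with γ = 0.02 (variable names are illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 4.0, 6.0])
n = len(y)
Xe = np.c_[np.ones((n, 1)), x]                       # extended matrix [1, x]

gamma = 0.02
beta = np.zeros(2)                                   # start at beta1 = beta2 = 0
for j in range(9):
    grad = (2.0 / n) * Xe.T.dot(Xe.dot(beta) - y)    # gradient of J(beta)
    beta = beta - gamma * grad
    print(j + 1, beta)                               # iteration 1 gives ~[0.1733, 0.3867]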

Gradient Descent
Introduction to Machine Learning 21(45)
Gradient Descent in Practice
Initial Steps
1. Number of iterations N = 10, α = 0.00001, β^0 = (0, 0)
2. Repeat β^{j+1} = β^j − α X^T (X β^j − y )
3. Print/plot J(β) vs N to make sure it is decreasing for each iteration
That is, we select a small α = 2γ/n and make a few iterations.
- J(β) steadily decreasing ⇒ α small enough (maybe too small)
- J(β) fluctuating or increasing ⇒ must decrease α

Fine Tuning (Trial and error)


1. Modify N and α such that J(β) rapidly decreases
2. ... and finally stabilizes at a certain minimum value
3. Stable J(β) ⇒ You have found β that minimises J(β)
A plot J(β) vs N is a good way to manually see if J(β) has stabilized.

Gradient Descent
Introduction to Machine Learning 22(45)
House Prices in Oregon – Gradient Descent
[Figure: left — Result, Gradient Descent: training data with the Gradient Descent and Normal Equation fits, house price (dollars, ×10^5) vs house area (square feet); right — Gradient Descent cost with alpha = 1.0E-07, cost function vs iterations (0 to 200)]
β1 , β2 = −0.01996, 203.54 (compared with −40259, 223.77)

Price for a 3500 sq-ft house: $712,404 (compared with $742,946)
Increase the number of iterations ⇒ we get closer to the Normal Equation result.
Skip the first 10-20 iterations in the J(β) vs N plot to better capture the asymptotic details.
Gradient Descent
Introduction to Machine Learning 23(45)
Gradient Descent Alternatives
We used Batch Gradient Descent

β j+1 = β j − αX T (X β j − y )

that makes use of the entire dataset X , y in each iteration, with α = 2γ/n.

Alternative Optimization Approaches


▶ Mini-batch gradient descent uses only a sample of the dataset to
speed up the computations
▶ Stochastic gradient descent uses a randomly chosen single
observation to speed up the computations
▶ Adaptive methods vary the step length γ depending on the gradient
▶ .. and many more fancy approaches that are designed to handle
special cases
Python also comes with several predefined optimization methods. A few
of these will be presented later on.
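As an illustration of the first two alternatives above (a sketch, not code from the lecture), a mini-batch update might look like this; batch_size=1 corresponds to stochastic gradient descent:

import numpy as np

def minibatch_gradient_descent(Xe, y, alpha=0.01, batch_size=32, n_iter=1000, seed=0):
    """Gradient descent where each step uses only a random subset of the rows of Xe."""
    rng = np.random.default_rng(seed)
    n = len(y)
    beta = np.zeros(Xe.shape[1])
    for _ in range(n_iter):
        idx = rng.choice(n, size=min(batch_size, n), replace=False)   # random mini-batch
        Xb, yb = Xe[idx], y[idx]
        beta = beta - alpha * (2.0 / len(idx)) * Xb.T.dot(Xb.dot(beta) - yb)
    return beta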
Gradient Descent
Introduction to Machine Learning 24(45)
10 minute break

Coffee Break ...

Gradient Descent
Introduction to Machine Learning 25(45)
Multivariate Linear Regression – Introduction
Previously: y (price) as a function of x (area, one feature)
Model: y = β1 + β2 x
Now: y as a function of multiple features x1 , x2 , ...xp
Model: y = β0 + β1 x1 + β2 x2 + ... + βp xp (multiple features, still linear)

Example: A girl's height as a function of her mom's and dad's heights (two features)
Dataset: 214 observations of mom, dad, and girl heights

Linear Multivariate Regression and Feature Normalization


Introduction to Machine Learning 26(45)
Girls Height – Dataset
▶ Dataset: girls_height.csv, Samples: 214 girls
▶ Q: What is the predicted height for a girl whose mom is 65 inches and dad is 70 inches tall?
[Figure: dataset plot]

Linear Multivariate Regression and Feature Normalization


Introduction to Machine Learning 27(45)
Multivariate Linear Regression – Setup
Assumption: Height = a + b × MomHeight + c × DadHeight
Model: y = β1 + β2 X1 + β3 X2

Vectorised Approach: Model y = X β

X = [1 x11 x12 ; 1 x21 x22 ; ... ; 1 xn1 xn2 ] (n × 3),   y = [y1 ; y2 ; ... ; yn ],   β = [β1 ; β2 ; β3 ]

Problem: Find β minimising the cost function J(β) = (1/n) (X β − y )^T (X β − y )

Exact Solution: β = (X^T X )^{−1} X^T y (Normal Equation)

Approximative Solution (Gradient Descent):

β^{j+1} = β^j − γ ∇J(β) = β^j − (2γ/n) X^T (X β^j − y )
Notice
▶ The vectorized Cost Function, Normal Equation, and Gradient Descent are
identical to the case y = β1 + β2 x ⇒ A proper Python solution can be reused
Linear Multivariate Regression and Feature Normalization
Introduction to Machine Learning 28(45)
Girls Height – Result
Normal Equation
▶ β = 18.50, 0.303, 0.388
▶ Height of girl with parents (65, 70): 65.425

Gradient Descent
▶ α = 0.0002, Niter = 20000000
▶ β = 18.48, 0.304, 0.388
▶ Height of girl with parents (65, 70): 65.426

[Figure: 3D plot of the dataset with the fitted model — girl height vs mom and dad heights]

Notice: Similar results, but gradient descent required 20 million iterations!
⇒ a few minutes to compute!

Linear Multivariate Regression and Feature Normalization


Introduction to Machine Learning 29(45)
Feature Normalization
▶ Gradient descent required 20 million iterations to compute β in the Girls
Height example.
▶ This is quite common on multivariate problems. Especially if the values in
different features are vastly different. E.g. in range [0, 1] in one feature,
and in range [1000, 5000] in another.
▶ Typical solutions
▶ Normalize data ⇒ all features of similar size
▶ Replace Gradient Descent with more advanced optimization methods
Feature Normalization: for each feature X^i (but not for the ones-column)
1. Compute the mean µi
2. Compute the standard deviation σi
3. Compute the normalized feature Xn^i = (X^i − µi )/σi
4. Build the extended matrix Xne = [1, Xn ] and continue
After normalization each feature Xn^i will have a mean value of 0 and a
standard deviation of 1.
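A minimal sketch of these normalization steps (assuming X is an n × p NumPy array of raw feature columns; names are illustrative):

import numpy as np

def normalize_features(X):
    """Z-score normalize each feature column (the ones-column is added afterwards)."""
    mu = X.mean(axis=0)                   # per-feature mean
    sigma = X.std(axis=0)                 # per-feature standard deviation
    Xn = (X - mu) / sigma                 # each column now has mean 0, std 1
    return Xn, mu, sigma

# Usage (hypothetical): X holds the mom and dad heights as two columns
# Xn, mu, sigma = normalize_features(X)
# Xne = np.c_[np.ones((X.shape[0], 1)), Xn]   # extended matrix [1, Xn]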
Feature Normalization
Introduction to Machine Learning 30(45)
Girls Height – Feature Normalization
After normalization with µ = [63.63, 69.41] and σ = [2.79, 3.22]

Notice
▶ All feature values centered around 0
▶ All features have the same spread (standard deviation 1)
Feature Normalization
Introduction to Machine Learning 31(45)
Feature Normalization

▶ Feature normalization turns every “bowl” into a uniform one (the yellow)
that is strongly convex ⇒ the gradient descent iteration converges rapidly
▶ The girls height “bowl” looks a bit like the black one. Rather strongly
convex for β2 , β3 and very flat for the intercept coefficient β1
▶ Strongly convex for β2 , β3 ⇒ must use a small learning rate (step size) α
▶ Small α ⇒ very slow convergence for β1 ⇒ We need 20 million iterations
Feature Normalization
Introduction to Machine Learning 32(45)
Normalized Girls Height – Result
Normal Equation
▶ β = [64.8, 0.845, 1.26]
▶ J = 4.048
▶ Height of girl with parents (65, 70): 65.425

Gradient Descent
▶ α = 0.01, Niter = 1000
▶ β = [64.8, 0.845, 1.26]
▶ J = 4.048
▶ Height of girl with parents (65, 70): 65.422

[Figure: cost J vs number of iterations (0 to 1000), dropping from about 2500 towards its minimum]

Notice: Gradient descent requires just 1000 iterations! ⇒ less than a second.
Also, the parents' heights (65, 70) must be normalized to (0.4898, 0.1821) before computing
the height using our new β.
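A small sketch of that last step, using the rounded values quoted on these slides (so the result only approximately matches 65.422):

import numpy as np

mu = np.array([63.63, 69.41])            # training means (mom, dad)
sigma = np.array([2.79, 3.22])           # training standard deviations
beta = np.array([64.8, 0.845, 1.26])     # coefficients from the normalized fit

parents = np.array([65.0, 70.0])
parents_n = (parents - mu) / sigma       # roughly (0.49, 0.18)
height = beta[0] + beta[1:].dot(parents_n)
print(height)                            # roughly 65.4 inches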
Feature Normalization
Introduction to Machine Learning 33(45)
Polynomial – Dataset
[Figure: scatter plot of the entire dataset, 1000 samples — y vs x, x from 0 to 25]

▶ Dataset: polynomial.csv with 1000 observations


▶ Artificial dataset generated by Jonas Lundberg
▶ Q: What is a suitable model?
Polynomial Linear Regression
Introduction to Machine Learning 34(45)
Polynomial Regression – Setup
Observation: A polynomial of degree 3 could handle the up-down-up scenario
Model: y = β1 + β2 X + β3 X^2 + β4 X^3

Vectorised Approach: Model y = X β

X = [1 x1 x1^2 x1^3 ; 1 x2 x2^2 x2^3 ; ... ; 1 xn xn^2 xn^3 ],   y = [y1 ; y2 ; ... ; yn ],   β = [β1 ; β2 ; β3 ; β4 ]

In Python X is replaced by X = np.c_[np.ones((n,1)),X,X**2,X**3]

Problem: Find β minimising the cost function J(β) = (1/n) (X β − y )^T (X β − y )

Exact Solution: β = (X^T X )^{−1} X^T y (Normal Equation)

Approximative Solution (Gradient Descent):

β^{j+1} = β^j − γ ∇J(β) = β^j − (2γ/n) X^T (X β^j − y )
Notice
▶ Once again, no change in the vectorised versions
▶ ⇒ A proper Python solution can be reused
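A minimal sketch of fitting this degree-3 model with the Normal Equation (the function name is illustrative; x and y are assumed to be loaded from polynomial.csv):

import numpy as np

def fit_polynomial(x, y, degree=3):
    """Fit y = beta1 + beta2*x + ... + beta_{d+1}*x^d via the Normal Equation."""
    Xe = np.column_stack([x ** p for p in range(degree + 1)])   # [1, x, x^2, ..., x^d]
    return np.linalg.inv(Xe.T.dot(Xe)).dot(Xe.T).dot(y)

# beta = fit_polynomial(x, y, degree=3)
# The fit reported later for the full dataset is roughly [5.50, 11.7, -0.981, 0.0246]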
Polynomial Linear Regression
Introduction to Machine Learning 35(45)
Polynomial – Result
[Figure: polynomial fit, 1000 samples — fitted degree-3 curve over the data, y vs x]

Algorithm
1. Read data ⇒ vectors X and y
2. Extend X ⇒ X = [1, X , X^2 , X^3 ]
3. Compute β as β = (X^T X )^{−1} X^T y
4. Plot X β vs x

The dataset was generated from:

y = f (x) = 5 + 12x − x^2 + 0.025x^3 + normrnd(0, 5),  where the last term is ε (noise)

Polynomial fit: β = [5.50, 11.7, −0.981, 0.0246]. OK or ...?


Polynomial Linear Regression
Introduction to Machine Learning 36(45)
Which one is the best fit?
[Figure: six panels showing polynomial fits of degree 1, 3, 5, 7, 9, and 11 to the same data, x from −2 to 2]

A small feature-normalized subsample of 50 observations from polynomial.csv


▶ Question: Which one is the best fit?
▶ Question: How do we compare different models?

Over- and underfitting


Introduction to Machine Learning 37(45)
Mean Square Error
▶ In regression we use a training set X ,y to compute a β
that makes y = X β a good fit to the training data.

▶ y = X β is our model, the result of the training phase

▶ The mean square error (MSE) for a given β is defined as

MSE = (1/n) Σ_{i=1}^{n} (yi − f (xi ))^2   where f (xi ) = xi β

or vectorized

MSE = (1/n) (X β − y )^T (X β − y )

▶ That is, the mean of the vertical squared distances


▶ Notice that MSE is the same as the cost function J(β) for linear regression.
▶ Obviously, low MSE is a good fit and high MSE is a bad fit

Over- and underfitting


Introduction to Machine Learning 38(45)
Train and Test Error
Assume a training set Xtrain , ytrain and a test set Xtest , ytest
▶ Training phase: Xtrain , ytrain ⇒ Estimate β ⇒ create a model y = X β

▶ Training error: Compute MSE = (1/n) (X β − y )^T (X β − y ) with Xtrain , ytrain
using β from the training phase.

▶ Test error: Compute MSE = (1/n) (X β − y )^T (X β − y ) with Xtest , ytest
using β from the training phase.

Training vs Test
▶ The Mean Squared Error (MSE) computed using the training set is
referred to as the Training MSE estimation.
▶ The MSE computed using a separate test set is the Test MSE estimation.
▶ A test set resembles the training set but is not utilized in model
construction, remaining unseen by the model during training.
▶ A high-performing model exhibits a low Test MSE, indicating strong
performance on unseen data.
Over- and underfitting
Introduction to Machine Learning 39(45)
Training vs Test – polynomial.csv
▶ We used a small sample of 50 points to train various polynomial models
▶ We used the remaining 950 points to evaluate each model
Degree MSEtrain MSEtest
1 55.7 69.1
2 53.4 75.4
3 22.3 27.6
4 21.4 30.6
5 21.3 31.3
6 18.5 39.4
7 17.7 67.6
8 16.7 197.7
9 16.6 284.5
10 16.5 558.7
Conclusions
▶ MSEtrain is steadily decreasing ⇒ higher order polynomials can always
better adapt to the training data
▶ MSEtest has a minimum at degree 3 ⇒ best to handle unseen data
⇒ Degree 3 gives the best fit!
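A sketch of how such a table could be produced (assuming the 50/950 split already exists as x_train, y_train and x_test, y_test; names are illustrative):

import numpy as np

def poly_design(x, degree):
    """Design matrix [1, x, x^2, ..., x^degree]."""
    return np.column_stack([x ** p for p in range(degree + 1)])

def mse(Xe, y, beta):
    r = Xe.dot(beta) - y
    return r.dot(r) / len(y)

# for degree in range(1, 11):
#     Xtr, Xte = poly_design(x_train, degree), poly_design(x_test, degree)
#     beta = np.linalg.inv(Xtr.T.dot(Xtr)).dot(Xtr.T).dot(y_train)
#     print(degree, mse(Xtr, y_train, beta), mse(Xte, y_test, beta))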
Over- and underfitting
Introduction to Machine Learning 40(45)
Reducible and Irreducible Errors
We generated data from

y = f (x) = 5 + 12x − x^2 + 0.025x^3 + normrnd(0, 5),  where the last term is ε (noise)

Our polynomial fit gave the model

ŷ = fˆ(x) = 5.5 + 11.7x − 0.98x^2 + 0.0246x^3

▶ The error can be divided into two parts

E (y − ŷ )^2 = [f (X ) − fˆ(X )]^2 + Var (ε)

▶ E (y − ŷ )^2 is the total error of our regression
▶ [f (X ) − fˆ(X )]^2 is the error due to our model fˆ(X ) (reducible)
▶ Var (ε) is the error due to the noise (irreducible)
▶ A better model can reduce the error; a worse one (e.g., β1 + β2 x) may
increase it.
▶ We can never eliminate the irreducible error Var (ε) due to noise.
Over- and underfitting
Introduction to Machine Learning 41(45)
Variance and Bias, Over- and Underfitting
The reducible error for model fˆ(x) can further be divided into two parts
Reducible Error = Bias + Variance
▶ Bias refers to the error introduced by approximating a real-world problem
with a simplified model.
▶ It measures how far off the model’s predictions are from the true
values.
▶ A model with high bias pays little attention to the training data and
oversimplifies the problem, leading to underfitting.
▶ Variance refers to the amount the model’s prediction would change if we
trained it on a different dataset.
▶ It measures the model’s sensitivity to the fluctuations in the training
data.
▶ A model with high variance fits the training data too closely, capturing noise rather
than just the underlying pattern, leading to overfitting.
▶ Variance and Bias can not be avoided for realistic datasets
▶ Achieving a balance between Variance and Bias is essential to minimize
errors, known as the Bias-Variance Trade-off.
Over- and underfitting
Introduction to Machine Learning 42(45)
Bias

▶ The models cannot capture the true underlying relationship between the
features (X) and the target variable (y).
Over- and underfitting
Introduction to Machine Learning 43(45)
Variance

▶ The models exhibit high variability in their predictions when trained on different subsets of
the data.
▶ Despite using the same degree of polynomial features, the models show significant
differences in their shapes and predictions across different training datasets.
▶ This variability arises from the sensitivity of the models to the specific samples in the training
data, leading to different learned patterns and resulting in a wide range of predictions.
▶ Consequently, the models demonstrate high sensitivity to variations in the training data,
indicating high variance.
▶ In other words, the models tend to overfit the training data, capturing noise and
idiosyncrasies specific to each training set rather than the true underlying relationship
between the features and the target variable.
Over- and underfitting
Introduction to Machine Learning 44(45)
Regression Summary
All scenarios (linear, multivariate, polynomial) are treated the same way
▶ Dataset: [X , y ]
▶ Extend X to fit scenario Xext = [1, X , ...]
▶ Model: y = Xext β
▶ Problem: Find β that minimises the cost function

J(β) = (1/n) (Xext β − y )^T (Xext β − y )

▶ Exact Solution (Normal Equation)

β = (Xext^T Xext )^{−1} Xext^T y

▶ Approximative Solution (Gradient Descent)

β^{j+1} = β^j − γ ∇J(β) = β^j − (2γ/n) Xext^T (Xext β^j − y )

▶ Apply feature normalization of X^i to speed up gradient descent
1. Compute the mean µi and standard deviation σi
2. Compute the normalized feature Xn^i = (X^i − µi )/σi
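A compact sketch of this whole recipe, combining normalization, the extended matrix, and batch gradient descent (assuming X is an n × p NumPy array; the function name and defaults are illustrative, not the lecture's reference code):

import numpy as np

def fit_linear_regression(X, y, gamma=0.01, n_iter=1000):
    """Normalize features, extend with a ones-column, run batch gradient descent."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    Xe = np.c_[np.ones((len(y), 1)), (X - mu) / sigma]       # Xext = [1, Xn]
    beta = np.zeros(Xe.shape[1])
    costs = []
    for _ in range(n_iter):
        r = Xe.dot(beta) - y                                 # residuals Xext*beta - y
        costs.append(r.dot(r) / len(y))                      # J(beta); should decrease
        beta = beta - gamma * (2.0 / len(y)) * Xe.T.dot(r)   # gradient step
    return beta, mu, sigma, costs

# Exact alternative on the same extended matrix (Normal Equation):
# beta = np.linalg.inv(Xe.T.dot(Xe)).dot(Xe.T).dot(y)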

Over- and underfitting


Introduction to Machine Learning 45(45)
