Lecture slides - Linear Regression (2025)
Amilcar Soares
9 April 2025
▶ Find a function y = f(x) that fits a given dataset and make a prediction yi = f(xi) for an unknown input xi
▶ (x1, y1), (x2, y2), ... is our training set
▶ y = f(x) is our model
[Figure: example dataset, y vs. x.]
In general
A regression problem aims to find a function y = f(X) that fits a given dataset (X, y), where X is an n × p matrix and y is an n × 1 vector (n = number of samples, p = number of features).
From the previous lecture: k-NN regression
▶ y: Price in dollars
▶ x: House Area (Square feet)
▶ Samples: 200
[Figure: house price vs. house area.]
Linear (1D) Regression: Find a function y = β1 + β2 x. For a single sample (xi, yi) the squared error is
(y − yi)^2 = ((β1 + β2 xi) − yi)^2.
Remaining Task: Find β1 and β2 that minimise the cost function J(β1, β2)
Basic Calculus - Find minimum values
From your basic calculus course
Problem: Find x that minimises a given function f (x)
Solution: Differentiate f(x) with respect to x and set the derivative to zero:
df/dx = 0
and solve the resulting equation for x ⇒ extreme values of f(x).
In our case
Problem: Find β1 and β2 that minimise the cost function J(β1, β2)
Solution: Differentiate J(β1 , β2 ) with respect to β1 and β2 and set them to zero:
∂J/∂β1 = 0 and ∂J/∂β2 = 0
Solution
▶ Linear fit on the training set
[Figure: training data with the fitted line.]
Linear (Degree 1) Regression – Summary
▶ A dataset (xi , yi ) with n samples
▶ Model (or hypothesis): y = β1 + β2 x (Polynomial of degree 1)
▶ Problem: Find β1 and β2 that minimise the cost function
J(β1, β2) = (1/n) Σ_{i=1}^{n} ((β1 + β2 xi) − yi)^2 (= MSE)
▶ Vectorised form: y = (y1, ..., yn)^T, β = (β1, β2)^T, and Xext is the matrix with rows (1, xi)
▶ y is an n × 1 vector, Xext is an n × 2 matrix and β is a 2 × 1 vector
▶ Model: y = Xext β (corresponds to y = β1 + β2 x)
▶ Problem: Find β that minimises the cost function
J(β) = (1/n) (Xext β − y)^T (Xext β − y)
▶ Solution (Normal Equation)
β = (Xext^T Xext)^{-1} Xext^T y
An easy-to-read reference for deriving the Normal Equation:
https://dzone.com/articles/derivation-normal-equation
▶ Computing β is O(p^3) ⇒ might be problematic for a large number of features p.
Vectorization in Python
Cost Function and Normal Equation from the previous slide, respectively:
J(β) = (1/n) (Xext β − y)^T (Xext β − y)
β = (Xext^T Xext)^{-1} Xext^T y
Implement your solution with Numpy
▶ Assume X , y are Numpy (np) arrays
▶ How to extend X?
Xe = np.c_[np.ones((n,1)),X]
▶ How to run the model?
np.dot(Xe,beta)
▶ How to implement the cost function?
J = (j.T.dot(j))/n, where j = np.dot(Xe,beta)-y
▶ How to implement the Normal Equation?
beta = np.linalg.inv(Xe.T.dot(Xe)).dot(Xe.T).dot(y)
The NumPy syntax is a bit strange at first, so take your time to learn and use it.
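Putting the snippets above together, a minimal end-to-end sketch might look like the following. The small 1D dataset is made up purely for illustration; the variable names follow the bullets above.

import numpy as np

# Made-up dataset: house area (square feet) -> price (arbitrary units)
X = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([2.1, 3.0, 4.2, 5.1, 5.9])
n = len(y)

# Extend X with a column of ones for the intercept
Xe = np.c_[np.ones((n, 1)), X]

# Normal Equation: beta = (Xe^T Xe)^{-1} Xe^T y
beta = np.linalg.inv(Xe.T.dot(Xe)).dot(Xe.T).dot(y)

# Cost function J(beta) = (1/n) (Xe beta - y)^T (Xe beta - y)
j = np.dot(Xe, beta) - y
J = (j.T.dot(j)) / n

print("beta:", beta, " J:", J)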
Gradient Descent - Motivation
We will often face the following minimization problem: find β that minimises the cost function J(β).
In general
▶ Analytical solutions like the Normal Equation are not always possible.
▶ In high-dimensional datasets, the matrix X^T X can become extremely large, making
the computation of its inverse computationally intensive and memory-intensive.
▶ Additionally, the inversion of such a large matrix may not even be possible due to
numerical instability or limited computational resources.
▶ Therefore, we need to use numerical methods to solve the problem
▶ In this lecture:
▶ Simplest possible algorithm - (Batch) Gradient Descent
▶ Feature normalization ⇒ a method to speed up the gradient descent
procedure
▶ Later on:
▶ Other variants of Gradient Descent
▶ Available support in Python
Gradient Descent - Introduction
Problem: Find x that minimises f (x)
Solution
1. Select a start value x^0 and a (small) learning rate γ
2. Apply repeatedly x^{j+1} = x^j − γ df/dx
3. Stop when |x^{j+1} − x^j| < ε or after a fixed number of iterations
Notice
▶ Will, in general, find a local min
▶ Will find global min if f (x) convex
▶ Pros: Simple and fast for strongly convex problems
▶ Cons: Slow ⇒ requires many iterations and a small γ in many realistic cases
f(x) is convex if the line segment between any two points on the function's graph lies on or above the graph.
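A minimal sketch of the 1D procedure above. The test function f(x) = (x − 3)^2 and all names are made up for illustration.

def gradient_descent_1d(df, x0, gamma=0.1, eps=1e-8, max_iter=10000):
    # Repeat x_{j+1} = x_j - gamma * df/dx until |x_{j+1} - x_j| < eps
    x = x0
    for _ in range(max_iter):
        x_new = x - gamma * df(x)
        if abs(x_new - x) < eps:
            break
        x = x_new
    return x

# f(x) = (x - 3)^2  =>  df/dx = 2(x - 3), minimum at x = 3
print(gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0))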
Selecting learning rate
Our choice of learning rate influences the convergence rate.
Gradient Descent for J(β1 , β2 )
The cost function as a function of β1 , β2
J(β1, β2) = (1/n) Σ_{i=1}^{n} ((β1 + β2 xi) − yi)^2
Thus
▶ J(β1 , β2 ) is an upward facing parabolic “bowl” (i.e. convex)
▶ ... ⇒ will have a unique minimum (β1^min, β2^min)
▶ ... ⇒ we can take any starting point (β1^0, β2^0) in gradient descent
2D Gradient Descent
β1^{j+1} = β1^j − γ ∂J/∂β1 = β1^j − (2γ/n) Σ_{i=1}^{n} (β1^j + β2^j xi − yi)
β2^{j+1} = β2^j − γ ∂J/∂β2 = β2^j − (2γ/n) Σ_{i=1}^{n} xi (β1^j + β2^j xi − yi)
Gradient Descent for J(β) – Vectorised
X is the matrix with rows (1, xi), y = (y1, y2, ..., yn)^T, β = (β1, β2)^T
Cost function: J(β) = (1/n) (Xβ − y)^T (Xβ − y)
2D Gradient Descent
β^{j+1} = β^j − γ∇J(β) = β^j − (2γ/n) X^T (Xβ^j − y)
where the gradient is defined as ∇J(β) = (∂J/∂β1, ∂J/∂β2)^T
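In code, the vectorised update above could be sketched as follows. The default values of gamma and the iteration count are placeholders, not values from the slides.

import numpy as np

def gradient_descent(Xe, y, gamma=0.01, n_iter=1000):
    # Vectorised update: beta_{j+1} = beta_j - (2*gamma/n) * Xe^T (Xe beta_j - y)
    n = len(y)
    beta = np.zeros(Xe.shape[1])
    for _ in range(n_iter):
        beta = beta - (2.0 * gamma / n) * Xe.T.dot(Xe.dot(beta) - y)
    return beta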
Gradient Descent for J(β)
Dataset:
x = [1, 2, 3]
y = [3, 4, 6]
Parameters:
β1 = 0
β2 = 0
Cost function:
J(β) = (1/3) Σ_{i=1}^{3} (β1 + β2 xi − yi)^2
Gradient Descent update rule:
β1^{j+1} = β1^j − γ ∂J(β)/∂β1
β2^{j+1} = β2^j − γ ∂J(β)/∂β2
Gradient Descent for J(β) (continued)
Gradient Descent iteration 1:
Initial values: β1^0 = 0, β2^0 = 0, learning rate γ = 0.02
Residuals β1^0 + β2^0 xi − yi:
For i = 1: 0 + 0 · 1 − 3 = −3
For i = 2: 0 + 0 · 2 − 4 = −4
For i = 3: 0 + 0 · 3 − 6 = −6
∂J/∂β1 = (2/3)(−3 − 4 − 6) = (2/3)(−13) = −8.67
Residuals weighted by xi:
For i = 1: 1 · (−3) = −3
For i = 2: 2 · (−4) = −8
For i = 3: 3 · (−6) = −18
∂J/∂β2 = (2/3)(−3 − 8 − 18) = (2/3)(−29) = −19.33
β1^1 = 0 − 0.02 · (−8.67) = 0.1734
β2^1 = 0 − 0.02 · (−19.33) = 0.3866
Gradient Descent for J(β) (continued)
Gradient Descent iteration 2:
Given: β1^1 ≈ 0.17, β2^1 ≈ 0.39 (rounded), learning rate γ = 0.02
∂J/∂β1 = (2/3)(−2.44 − 3.05 − 4.66) = (2/3)(−10.15) = −6.77
∂J/∂β2 = (2/3)(−2.44 − 6.10 − 13.98) = (2/3)(−22.52) = −15.01
β1^2 = 0.17 − 0.02 · (−6.77) = 0.17 + 0.1354 = 0.3054
β2^2 = 0.39 − 0.02 · (−15.01) = 0.39 + 0.3002 = 0.6902
Gradient Descent for J(β) (continued)
Gradient Descent for 9 iterations...
The small differences in the β values are because the code works in full floating-point precision, whereas the hand calculation on the previous slides used rounded intermediate values.
Gradient Descent in Practice
Initial Steps
1. Number of iterations N = 10, α = 0.00001, β^0 = (0, 0)
2. Repeat β^{j+1} = β^j − α X^T (Xβ^j − y)
3. Print/plot J(β) vs N to make sure it is decreasing for each iteration
That is, we select a small α = 2γ/n and make a few iterations.
- J(β) steadily decreasing ⇒ α small enough (maybe too small)
- J(β) fluctuating or increasing ⇒ must decrease α
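A sketch of these initial steps with cost monitoring. The function name is made up; Xe and y are assumed to be the extended data matrix and target vector.

import numpy as np

def gd_with_monitoring(Xe, y, alpha=0.00001, n_iter=10):
    # Batch gradient descent; track J(beta) to verify it decreases each iteration
    beta = np.zeros(Xe.shape[1])
    costs = []
    for _ in range(n_iter):
        r = Xe.dot(beta) - y
        costs.append(float(r.T.dot(r)) / len(y))   # J(beta) = (1/n) r^T r
        beta = beta - alpha * Xe.T.dot(r)           # beta_{j+1} = beta_j - alpha X^T (X beta_j - y)
    return beta, costs

# Print (or plot) costs: steadily decreasing => alpha small enough;
# fluctuating or increasing => decrease alpha.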
House Prices in Oregon – Gradient Descent
[Figure, left: training data with the Gradient Descent and Normal Equation fits; House Price (Dollar) vs. House Area (Square feet). Right: Gradient Descent cost vs. iterations with α = 1.0E-07.]
The update rule
β^{j+1} = β^j − α X^T (Xβ^j − y)
makes use of the entire dataset X, y in each iteration, with α = 2γ/n.
Multivariate Linear Regression – Introduction
Previously: y (price) as a function of x (area, one feature)
Model: y = β1 + β2 x
Now: y as a function of multiple features x1 , x2 , ...xp
Model: y = β0 + β1 x1 + β2 x2 + ... + βp xp (multiple features, still linear)
Example: A girl's height as a function of mom and dad heights (two features)
Dataset: 214 observations of mom, dad, and girl heights
Problem: Find β minimising the cost function J(β) = (1/n) (Xβ − y)^T (Xβ − y)
Gradient Descent
▶ α = 0.0002, Niter = 20 000 000
▶ Predicted height for parents (65, 70): 65.425
Notice
▶ All feature values centered around 0
▶ All features have the same spread (standard deviation 1)
Feature Normalization
▶ Feature normalization turns every “bowl” into a uniform one (the yellow)
that is strongly convex ⇒ the gradient descent iteration converges rapidly
▶ The girls-height “bowl” looks a bit like the black one: rather strongly convex for β2, β3 and very flat for the intercept coefficient β1
▶ Strongly convex for β2 , β3 ⇒ must use a small learning rate (step size) α
▶ Small α ⇒ very slow convergence for β1 ⇒ We need 20 million iterations
Normalized Girls Height – Result
▶ J = 4.048
▶ Height of girl with parents (65, 70): 65.425
[Figure: cost J vs. iterations (0 to 1000), comparing Gradient Descent with the Normal Equation solution.]
Notice: Gradient descent now requires just 1000 iterations ⇒ less than a second.
Also, parents (65,70) must be normalized to (0.4898, 0.1821) before computing
the height using our new β.
Polynomial – Dataset
Entire dataset, 1000 samples
[Figure: the polynomial dataset, y vs. x (x from 0 to 25, y from 0 to 80).]
Algorithm
1. Read data ⇒ vectors X and y
2. Extend X ⇒ X = [1, X, X^2, X^3]
3. Compute β as β = (X^T X)^{-1} X^T y
4. Plot Xβ vs x (see the code sketch below)
[Figure: the fitted degree-3 polynomial plotted over the data.]
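A sketch of these steps in NumPy. The file polynomial.csv is mentioned later in the lecture, but its layout (two comma-separated columns x, y) is an assumption here.

import numpy as np
import matplotlib.pyplot as plt

# Assumed layout: two comma-separated columns x, y
data = np.loadtxt('polynomial.csv', delimiter=',')
x, y = data[:, 0], data[:, 1]

# Extend x to [1, x, x^2, x^3]
Xe = np.c_[np.ones(len(x)), x, x**2, x**3]

# Normal Equation
beta = np.linalg.inv(Xe.T.dot(Xe)).dot(Xe.T).dot(y)

# Plot the fitted polynomial over the data
xs = np.sort(x)
Xs = np.c_[np.ones(len(xs)), xs, xs**2, xs**3]
plt.scatter(x, y, s=5)
plt.plot(xs, Xs.dot(beta), color='red')
plt.show()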
The Mean Squared Error of a fit is MSE = (1/n) Σ_{i=1}^{n} (f(xi) − yi)^2, or vectorized: MSE = (1/n) (Xβ − y)^T (Xβ − y)
Training vs Test
▶ The Mean Squared Error (MSE) computed using the training set is
referred to as the Training MSE estimation.
▶ The MSE computed using a separate test set is the Test MSE estimation.
▶ A test set resembles the training set but is not utilized in model
construction, remaining unseen by the model during training.
▶ A high-performing model exhibits a low Test MSE, indicating strong
performance on unseen data.
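A sketch of how such a train/test comparison can be set up for a polynomial model of a given degree. The helper names are illustrative and not the exact code behind the table on the next slide.

import numpy as np

def poly_extend(x, degree):
    # Build the extended matrix [1, x, x^2, ..., x^degree]
    return np.column_stack([x**d for d in range(degree + 1)])

def mse(Xe, y, beta):
    r = Xe.dot(beta) - y
    return float(r.T.dot(r)) / len(y)

def train_test_mse(x_train, y_train, x_test, y_test, degree):
    # Fit on the training set only (Normal Equation), evaluate on both sets
    Xtr = poly_extend(x_train, degree)
    beta = np.linalg.inv(Xtr.T.dot(Xtr)).dot(Xtr.T).dot(y_train)
    return mse(Xtr, y_train, beta), mse(poly_extend(x_test, degree), y_test, beta)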
Training vs Test – polynomial.csv
▶ We used a small sample of 50 points to train various polynomial models
▶ We used the remaining 950 points to evaluate the models
Degree MSEtrain MSEtest
1 55.7 69.1
2 53.4 75.4
3 22.3 27.6
4 21.4 30.6
5 21.3 31.3
6 18.5 39.4
7 17.7 67.6
8 16.7 197.7
9 16.6 284.5
10 16.5 558.7
Conclusions
▶ MSEtrain is steadily decreasing ⇒ higher order polynomials can always
better adapt to the training data
▶ MSEtest has a minimum at degree 3 ⇒ best to handle unseen data
⇒ Degree 3 gives the best fit!
Reducible and Irreducible Errors
We generated data from
y = f(x) = 5 + 12x − x^2 + 0.025x^3 + normrnd(0, 5), where the last term is the irreducible noise ε
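For reference, generating data of this form in NumPy could look like the sketch below; np.random.normal plays the role of normrnd, while the x-range and sample size are assumptions based on the earlier plots.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 25, size=1000)        # assumed range and sample size
eps = rng.normal(0, 5, size=x.shape)     # irreducible noise term epsilon
y = 5 + 12 * x - x**2 + 0.025 * x**3 + eps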
▶ The models cannot capture the true underlying relationship between the
features (X) and the target variable (y).
▶ The models exhibit high variability in their predictions when trained on different subsets of
the data.
▶ Despite using the same degree of polynomial features, the models show significant
differences in their shapes and predictions across different training datasets.
▶ This variability arises from the sensitivity of the models to the specific samples in the training
data, leading to different learned patterns and resulting in a wide range of predictions.
▶ Consequently, the models demonstrate high sensitivity to variations in the training data,
indicating high variance.
▶ In other words, the models tend to overfit the training data, capturing noise and
idiosyncrasies specific to each training set rather than the true underlying relationship
between the features and the target variable.
Regression Summary
All scenarios (linear, multivariate, polynomial) are treated the same way
▶ Dataset: [X , y ]
▶ Extend X to fit scenario Xext = [1, X , ...]
▶ Model: y = Xext β
▶ Problem: Find β that minimises the cost function
J(β) = (1/n) (Xext β − y)^T (Xext β − y)
▶ Exact Solution (Normal Equation)
β = (Xext^T Xext)^{-1} Xext^T y
▶ Approximative Solution (Gradient Descent)
β^{j+1} = β^j − γ∇J(β) = β^j − (2γ/n) Xext^T (Xext β^j − y)
▶ Apply feature normalization of each feature Xi to speed up gradient descent (see the sketch below)
1. Compute the mean μi and standard deviation σi of Xi
2. Compute the normalized feature Xn,i = (Xi − μi)/σi
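A minimal sketch of the normalization step, column-wise in NumPy; the function name is made up. Remember that a new input must be normalized with the training μ and σ before predicting, as in the girls-height example.

import numpy as np

def normalize_features(X):
    # Column-wise: zero mean, unit standard deviation
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Usage sketch: Xn, mu, sigma = normalize_features(X)
# A new sample x_new is normalized as (x_new - mu) / sigma before applying beta.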