Lecture 7 - Perceptron and Linear Regression
Intro to AI Systems

6.1 Overview
This week: Two main themes
• Perceptron (classifier): assign a label (class) from a finite set of labels to an observation
• Linear regression: assign a numerical value to an observation
In addition: Vectors, matrices, NumPy
• Efficient code: both writing and execution
• A@B can replace three nested loops (see the sketch below)
• GPUs – parallel processing
• NumPy:
• Based on vectors and matrices
• Used by Marsland
• Libraries for ML, including Deep Learning
• Necessary for a deeper understanding of complex neural networks
• Tensors generalize vectors and matrices
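As a small illustration (not from the slides), the following sketch compares the three-nested-loop matrix product with NumPy's A @ B; the array sizes are arbitrary toy values.

```python
# A minimal sketch: A @ B computes the same result as three nested Python
# loops, but as optimized, vectorized code.
import numpy as np

A = np.random.rand(40, 30)
B = np.random.rand(30, 20)

# Naive version: three nested loops
C_loops = np.zeros((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        for k in range(A.shape[1]):
            C_loops[i, j] += A[i, k] * B[k, j]

# Vectorized version: one line, much faster
C_numpy = A @ B

print(np.allclose(C_loops, C_numpy))  # True
```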
Overview
• 6.2 The brain and the perceptron
• 6.3 The perceptron algorithm
• 6.4 Linear regression
• A.1 Vectors
• A.2 Matrices
• Readings: Compendium; slides Lecture 06 V&M; slides "Geometry and Linear Algebra"
6.2 The Brain and the Perceptron
Inspiration for AI and ML
• Psychology
• Ask people how they think
• Observe how humans behave
• Logic
• How should you think?
• "Hardware":
• "If we want to make a machine replicating humans, it should be built on similar hardware"
• … and more …
The human brain
Rough figures:
• 1.5 kilos
• $10^{11}$ neurons (nerve cells)
• $10^{14}$ synapses
• Compared to a computer: 1 GHz = $10^{-9}$ seconds
Image: "Medical gallery of Blausen Medical 2014". WikiJournal of Medicine 1 (2). DOI:10.15347/wjm/2014.010. ISSN 2002-4436.
Neuron
• Axon
• Transports signals to other cells
• Dendrites
• Receive signals from other cells' axons at the synapses
• Soma (cell body):
• "Sums" the signals from the dendrites
• When the membrane potential passes a threshold, the neuron fires a signal along its axon
Image: https://simple.wikipedia.org/wiki/Neuron#/media/File:Neuron.svg

The Perceptron
Image: https://en.wikipedia.org/wiki/File:Mark_I_perceptron.jpeg
6.3 The Perceptron Algorithm
The Perceptron
The Perceptron is a type of artificial neuron used in machine learning for binary classification tasks. It serves as the foundational building block for more complex neural network architectures.
• We see that the positive class corresponds to $x_1 > 3$, or $\frac{1}{3}x_1 > 1$
• With the threshold fixed at 1, the weight $w_1 = \frac{1}{3}$ yields the desired outcome: $w_1 x_1 > 1$ iff $x_1 > 3$
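A minimal sketch of the single-feature decision above; the values $w_1 = \frac{1}{3}$ and $\theta = 1$ are taken from the example, the function name is arbitrary.

```python
# Classify a one-feature observation with a fixed threshold.
def classify(x1, w1=1/3, theta=1.0):
    """Return 1 (positive class) if w1*x1 exceeds the threshold, else 0."""
    return 1 if w1 * x1 > theta else 0

for x1 in [2, 3.5, 6]:
    print(x1, classify(x1))   # 0 for x1 <= 3, 1 for x1 > 3
```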
Adjusting the threshold – example (contd.)
• Assume the same fixed threshold, say $\theta = 1$
• Let blue be the positive class, and red the negative class
The Bias Term
• The bias term is a constant added to the output of a model, allowing it to shift the function up or down.
It ensures that the model can better fit the training data by adjusting the output independently of the
input features.
• Including a bias term is essential for capturing patterns in the data that do not pass through the origin.
For instance, if you have a linear relationship but the line of best fit does not start at (0,0), the bias term
enables the model to accommodate that offset.
• In the context of linear regression:
• A simple linear regression model is represented by the equation y = w₀ + w₁ * x
• Here, w₀ is the bias term (intercept), and w₁ is the weight for the input feature x.
• Without the bias term (w₀), the model would be forced to pass through the origin (0,0), which may not accurately reflect the relationship in the data.
The Bias Term
• $\sum_{i=1}^{m} w_i x_i + w_0 x_0 = \sum_{i=0}^{m} w_i x_i$
• Provided $x_0 = -1$ (and $w_0 = \theta$), the condition $\sum_{i=1}^{m} w_i x_i > \theta$ is the same as $\sum_{i=0}^{m} w_i x_i > 0$
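A minimal sketch (with assumed toy values) of the bias trick above: prepending a constant input $x_0 = -1$ and letting $w_0$ play the role of the threshold $\theta$.

```python
import numpy as np

theta = 1.0                               # assumed threshold
w = np.array([1/3])                       # weights for the "real" features
x = np.array([4.0])                       # one observation, single feature x1

# Original formulation: fire if the weighted sum exceeds the threshold
fires_threshold = (w @ x) > theta

# Bias formulation: augment x with x0 = -1 and absorb theta into w0
w_aug = np.concatenate(([theta], w))      # w0 = theta
x_aug = np.concatenate(([-1.0], x))       # x0 = -1
fires_bias = (w_aug @ x_aug) > 0

print(fires_threshold, fires_bias)        # both True, since x1 = 4 > 3
```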
Inductive Bias
• Inductive bias refers to the set of assumptions a model uses to generalize beyond the training data. It affects how a model makes predictions on unseen data.
Examples:
• Linear Models:
• Assume relationships between variables are linear, meaning the change in the output is
proportional to the change in the input.
• Decision Trees:
• Assume the data can be split hierarchically into subsets, which allows the model to capture
complex interactions between features through a series of binary decisions.
Redefined objective
• Label: $t$, which is 1 or 0
• Calculate the output of the perceptron: $y = o = g\left(\sum_{i=0}^{m} w_i x_i\right)$
• If $y = t$, do nothing; if $y \neq t$, update the weights
Update weights
• $\eta > 0$ is the fixed learning rate
• If $t = 1$ and $y = 0$: increase $\sum_{i=0}^{m} w_i x_i$ by increasing each $w_i x_i$
• If $t = 0$ and $y = 1$: decrease $\sum_{i=0}^{m} w_i x_i$ by decreasing each $w_i x_i$
• This gives the update rule $w_i \leftarrow w_i - \eta\,(y - t)\,x_i$ (used in the example below; a sketch of the full training loop follows)
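A minimal sketch of a perceptron training loop using the update rule above, under the slide's conventions ($x_0 = -1$ prepended to each input, labels in {0, 1}); the toy data, epoch count, and function name are assumptions for illustration.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20):
    """X: N x m feature matrix, t: length-N array of 0/1 labels."""
    N, m = X.shape
    X_aug = np.hstack([-np.ones((N, 1)), X])   # prepend bias input x0 = -1
    w = np.zeros(m + 1)
    for _ in range(epochs):
        for x_j, t_j in zip(X_aug, t):
            y = 1 if w @ x_j > 0 else 0        # perceptron output
            w -= eta * (y - t_j) * x_j         # no change when y == t_j
    return w

# Tiny separable 1-D example: points 4 and 5 are the positive class
X = np.array([[1.0], [2.0], [4.0], [5.0]])
t = np.array([0, 0, 1, 1])
print(train_perceptron(X, t))   # weights of a separating boundary
```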
Example
• We are in the middle of training
• Learning rate: $\eta = 0.1$
• Current weights: $(w_0, w_1) = (-1, -1)$
• $h = -w_0 + w_1 x_1 = 1 - x_1$; i.e., a point is classified as positive provided $1 - x_1 > 0$
• Consider the point P = (-1, 2):
• $h(P) = 1 - 2 < 0$, i.e., classified as negative, but P belongs to the positive class
• Wrongly classified, so update:
• $w_0 \leftarrow w_0 - \eta\,(y - t)\,x_0 = -1 - 0.1\,(0 - 1)(-1) = -1.1$
• $w_1 \leftarrow w_1 - \eta\,(y - t)\,x_1 = -1 - 0.1\,(0 - 1)\cdot 2 = -0.8$
• Consider the point T = (-1, 4.6):
• $h(T) = 1 - 4.6 < 0$, correctly classified
• Do nothing
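As a check (not part of the slides), this sketch reproduces the update from the worked example above, processing the two points one at a time.

```python
import numpy as np

eta = 0.1
w = np.array([-1.0, -1.0])        # current weights (w0, w1)

def output(w, x):
    """Perceptron output: 1 if the weighted sum is positive, else 0."""
    return 1 if w @ x > 0 else 0

# Point P: bias input x0 = -1, feature x1 = 2, true label t = 1 (positive)
x_P, t_P = np.array([-1.0, 2.0]), 1
y = output(w, x_P)                # 0, so P is wrongly classified
w = w - eta * (y - t_P) * x_P
print(w)                          # [-1.1, -0.8], as in the example

# Point T: x1 = 4.6, true label t = 0 (negative)
x_T, t_T = np.array([-1.0, 4.6]), 0
print(output(w, x_T) == t_T)      # True: correctly classified, no update
```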
Observe
• Many possible weight settings give the same decision boundary (the same line):
• $w_0 = -1.1$, $w_1 = -0.8$
• $w_0 = -2.2$, $w_1 = -1.6$: same line
• $w_0 = -5.5$, $w_1 = -4.0$: same line
• But $w_0 = 1.1$, $w_1 = 0.8$: same line, yet it swaps the two classes!
Properties
• With only one training item, the algorithm will sooner or later classify the item correctly and no longer update
• When there are several training items, there might be disagreement:
• one item will increase a certain weight $w_i$, while another item will decrease it
• What then?
Linear separability
• A set is linearly separable if there is a straight line in the feature plane such that all points in one class fall on one side and all points in the other class fall on the other side
• For more than two features, this generalizes to a hyper-plane
• (With one dimension, to a point, cf. the example so far)
Linear classifier
• A linear classifier will always propose a linear decision boundary
• (point, line, plane, hyper-plane)
• whether the set is linearly separable or not
• The perceptron is a linear classifier
Perceptron Convergence Theorem
• If the training set is linearly separable, the perceptron algorithm converges: after a finite number of weight updates it classifies every training item correctly
Perceptron summary
• The brain, neurons, and synapses
• The perceptron
• The bias term
• The perceptron algorithm
• Linear classifiers
• Linear separability
6.4 Linear Regression
Supervised learning – two types
Classification
• Assign a label (class) from a finite set of labels to an observation
Regression
• Assign a numerical value to an observation
• e.g., the temperature tomorrow
Supervised learning
• Each observation (datapoint) is described as a feature vector $\boldsymbol{x}_j = (x_{j,1}, x_{j,2}, \ldots, x_{j,m})$, the "input"
• There is a well-defined set of possible target values, T
• The goal is, for an input, to predict a target value $f(\boldsymbol{x}_j)$ from T
• For supervised learning, we have a training set $(\boldsymbol{x}_1, t_1), (\boldsymbol{x}_2, t_2), \ldots, (\boldsymbol{x}_N, t_N)$
• We try to learn the function $f$ from the training set
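A small illustration (toy numbers assumed) of the notation above: a training set of $N = 3$ observations with $m = 2$ features each, plus their targets, stored as NumPy arrays.

```python
import numpy as np

X = np.array([[0.5, 1.2],      # x_1
              [1.0, 0.7],      # x_2
              [1.5, 2.0]])     # x_3
t = np.array([3.1, 2.4, 5.0])  # targets t_1, t_2, t_3

print(X.shape, t.shape)        # (3, 2) (3,)
```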
Linear Regression

An example
• Given the following data, can we find the value of the output when x = 0.44?
Linear regression
• We need some idea regarding the kind of functions we are looking for
• The simplest is to assume a linear function
• $f(\boldsymbol{x}) = f(x_1, x_2, \ldots, x_m) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_m x_m$
• Of course, this isn't always a good fit, but linear regression may also be adapted to some non-linear functions by feature engineering.
Inductive bias
• To learn from data, you must have some idea regarding how the data are distributed.
• You choose a model.
• You try to find parameters which make the model fit the training data well.
• Models carry with them inductive biases, e.g.,
• Linear regression can only learn straight lines.
• The perceptron can only learn linear decision boundaries.
Linear regression
• Simple linear regression (one input variable)
• The fit is measured by the error between predictions and targets, e.g., the mean squared error (MSE) or its square root (RMSE)
• RMSE has the "right scale" (the same units as the target), but is not suited for finding the minimum
• The mean (MSE or RMSE) is needed for comparison across different training/test sets
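A minimal sketch (toy numbers assumed) of the two error measures discussed above.

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0])     # targets
y = np.array([1.1, 1.8, 3.3])     # predictions

mse = np.mean((t - y) ** 2)       # mean squared error: smooth, easy to minimize
rmse = np.sqrt(mse)               # root MSE: same units ("scale") as the target
print(mse, rmse)
```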
Goal
• The goal is to find the $w_0, w_1, \ldots, w_m$ that minimize the MSE
Minimizing: one variable
• We assume you know how to minimize with one variable:
• $f(x) = (x - 3)^2 + 2$
• $f'(x) = 2(x - 3)$
• $f'(x) = 0$ iff $x = 3$
The minimization problem
• (Here as a maximization problem, "upside-down", easier to draw.)
• Looking for a point where the tangent plane is horizontal:
• Where all the derivatives are 0
• The MSE is convex:
• There is only one global minimum
• No problem with local optima
Tangent plane
• In 2D, the derivative of a function gives the slope of the tangent to the
curve at a specific point.
• In higher dimensions (more variables), partial derivatives determine
the slopes of tangents to the surface, but each tangent is parallel to one
of the axes (x, y, etc.).
• These tangents combine to form a tangent (hyper-)plane, representing
the local behavior of the surface around a point.
• The steepest direction in this plane is given by the gradient:
• In gradient ascent, we follow this direction to maximize the
function.
• In gradient descent, we move in the opposite direction to minimize
the function.
Partial derivatives
Many machine learning models (like linear regression, neural networks) involve
functions with several variables. For example, the error function (or cost
function) depends on multiple parameters (weights).
To minimize the error (i.e., to find the best model), we need to understand how
changes in each variable affect the error. Partial derivatives show how the
function changes with respect to one variable at a time while keeping the others constant.
In summary, partial derivatives are crucial for finding the optimal values of
parameters, guiding optimization algorithms, and analyzing how changes in one
variable affect the overall outcome in functions with multiple variables.
Partial derivatives
• We assume you know
• If $f(x) = (a + bx)^2$, then $\frac{df}{dx}(x) = 2(a + bx)\,b$
• Extended to more dimensions, we can construct partial derivatives, e.g.
• $g(x, y) = (a + bx + cy)^2$
• $\frac{\partial g}{\partial x}(x, y) = 2(a + bx + cy)\,b$
• $\frac{\partial g}{\partial y}(x, y) = 2(a + bx + cy)\,c$
https://www.wikihow.com/Image:OyXsh.png
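A small numerical check (assumed toy values) of the partial derivative formulas above for $g(x, y) = (a + bx + cy)^2$, comparing the analytic expression with a finite-difference estimate.

```python
a, b, c = 1.0, 2.0, 3.0
x, y, h = 0.5, -0.2, 1e-6

g = lambda x, y: (a + b * x + c * y) ** 2

dg_dx_analytic = 2 * (a + b * x + c * y) * b
dg_dx_numeric = (g(x + h, y) - g(x - h, y)) / (2 * h)   # central difference
print(dg_dx_analytic, dg_dx_numeric)                    # should agree closely
```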
Minimizing the MSE
•Minimizing the Mean Squared Error (MSE) is the process of finding the best parameters (weights) that make the
predicted values as close as possible to the actual values.
•Key steps:
• Calculate MSE: First, we compute the MSE, which measures the average squared differences between predicted
and actual values.
• Use Gradient Descent: To minimize the MSE, we apply an optimization algorithm like gradient descent. This
involves:
• Computing partial derivatives of the MSE with respect to each parameter.
• Updating the parameters in small steps in the direction that reduces the MSE (opposite to the gradient).
• Iterate until convergence: Continue updating the parameters until the MSE is minimized, meaning the model fits
the data as closely as possible.
Some good and some bad news
• This has a closed-form solution, i.e., there is a recipe for calculating the solution (for linear regression, the standard normal equation $\boldsymbol{w} = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}\boldsymbol{t}$)
• This works fine for low dimensions (few features for each observation), but the computation grows quickly with the number of dimensions
• In ML, we may have millions of features/dimensions
• And since it does not require ML, we will not investigate it
• But we can always use gradient descent
6.5 Applying Gradient Descent
Gradient descent in one variable
• (Lecture 2)
• Start with $x_0$
• Iteratively find $x_1, x_2, \ldots, x_i, \ldots$ with decreasing $f(x_i)$ by setting $x_{i+1} = x_i - \gamma f'(x_i)$
• $\gamma$ is the learning rate
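A minimal sketch of one-variable gradient descent on the earlier example $f(x) = (x-3)^2 + 2$, whose minimum is at $x = 3$; the start value, learning rate, and iteration count are assumptions.

```python
def f_prime(x):
    return 2 * (x - 3)          # derivative of f(x) = (x - 3)**2 + 2

x, gamma = 0.0, 0.1             # start value and learning rate
for _ in range(100):
    x = x - gamma * f_prime(x)  # step against the derivative
print(x)                        # close to 3.0
```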
The gradient in more dimensions
• We move in the opposite direction of the gradient.
Minimizing the MSE
• To minimize $f = \sum_{j=1}^{N}\left(t_j - \sum_{i=0}^{m} w_i x_{j,i}\right)^2$ with respect to $w_0, w_1, \ldots, w_m$ (the factor $\frac{1}{N}$ does not change where the minimum is), we can calculate the partial derivatives
• $\frac{\partial}{\partial w_k} f = 2\sum_{j=1}^{N}\left(t_j - \sum_{i=0}^{m} w_i x_{j,i}\right)(-x_{j,k})$ for $k = 0, 1, \ldots, m$
• $\frac{\partial}{\partial w_k} f = 2\sum_{j=1}^{N}(t_j - y_j)(-x_{j,k})$ for $k = 0, 1, \ldots, m$
Prediction (matrix form)
• $y_j = y_{j,1} = \sum_{i=0}^{m} w_i x_{j,i} = (x_{j,0}, x_{j,1}, \ldots, x_{j,m}) \cdot (w_{0,1}, w_{1,1}, \ldots, w_{m,1})$
• i.e., each prediction is the dot product of row $j$ of $X$ with the weight column vector $W$, so all predictions at once are $Y = XW$
The gradient
• Collecting the partial derivatives, the gradient can be written in matrix form as $\nabla f = 2\,X^{\mathsf{T}}(Y - T)$; the constant factor is absorbed into the learning rate in the update step below
Implementing the gradient descent
• Input:
• $X$, input, an $N \times m$ NumPy matrix: N items, m features
• $T$, corresponding target values, an $N \times 1$ column vector
• (maybe it is given as a vector of length $N$ and must be reshaped)
• $W$, the weights, an $m \times 1$ column vector
• ($\eta$, eta, is the learning rate)
• Forward step:
• Y = X @ W
• Update step:
• $W = W - \eta\nabla f$, i.e., W -= eta * X.T @ (Y - T)
• (A sketch of the full loop follows below.)
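A minimal sketch of the gradient-descent linear regression described above, following the slide's matrix conventions (X: N x m, T: N x 1, W: m x 1); the toy data, learning rate, and epoch count are assumptions for illustration, not the course's reference implementation.

```python
import numpy as np

def fit_linear_regression(X, T, eta=0.01, epochs=5000):
    """Learn weights W that (approximately) minimize the MSE."""
    N, m = X.shape
    W = np.zeros((m, 1))
    for _ in range(epochs):
        Y = X @ W                    # forward step: predictions, N x 1
        W -= eta * X.T @ (Y - T)     # update step: move against the gradient
    return W

# Toy data generated from t = 2 + 3*x; the first column of ones lets w0
# act as the bias term (intercept).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(20, 1))
X = np.hstack([np.ones((20, 1)), x])
T = 2 + 3 * x
print(fit_linear_regression(X, T).ravel())   # close to [2, 3]
```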
Understanding Linear Regression: A Python Implementation: https://colab.research.google.com/drive/1r7UyAJzubbhCncOiTFWEFIKS2LDw2lD0?usp=sharing
Summary