Lecture 7 - Perceptron and Linear Regression

This lecture covers the concepts of Perceptron and Linear Regression as foundational elements of AI systems. It discusses the structure and functioning of a Perceptron, including its components such as inputs, weights, bias, and activation functions, as well as the significance of inductive bias in model predictions. Additionally, it highlights the importance of efficient coding with libraries like NumPy for machine learning applications.


Introduction to AI Systems

Lecture 7
Perceptron and Linear Regression
6.1 Overview
Intro to AI Systems

2
This week: Two main themes
• Perceptron (classifier): assign a label (class) to an observation
• Linear Regression: assign a numerical value to an observation

3
In addition: Vectors, matrices, NumPy
• Efficient code: both writing and execution
  • A@B can replace three nested loops
  • GPUs – parallel processing
• NumPy:
  • Based on vectors and matrices
  • Used by Marsland
  • Libraries for ML, including Deep Learning
• Necessary for a deeper understanding of complex neural networks
• Tensor generalizes vectors and matrices
4
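To make the A@B point concrete, here is a small, hedged sketch comparing a pure-Python triple loop with NumPy's vectorized matrix product (the helper name matmul_loops is only illustrative):

```python
import numpy as np

# Three nested loops (slow, pure Python)
def matmul_loops(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

A = np.random.rand(50, 40)
B = np.random.rand(40, 30)

# One vectorized call does the same job, much faster
C_fast = A @ B
assert np.allclose(matmul_loops(A, B), C_fast)
```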
Overview
• 6.2 The brain and the perceptron
• 6.3 The perceptron algorithm
• 6.4 Linear regression
• 6.5 Gradient descent
• A.1 Vectors
• A.2 Matrices

Material: compendium, Lecture 06 slides, and the slides “Geometry, V&M and Linear Algebra”
5
6.2 The Brain and the Perceptron
Intro to AI Systems
Inspiration for AI and ML
• Psychology
  • Ask people how they think
  • Observe how humans behave
• Logic
  • How should you think?
• “Hardware”:
  • “If we want to make a machine replicating humans, it should be built on similar hardware”
• … and more …
The human brain
Rough figures:
• 1.5 kilos
• 10¹¹ neurons (cells)
• 10¹⁴ synapses (connections between neurons)
• Clock time: 10⁻³ seconds
• Compared to a computer: 1 GHz = 10⁻⁹ seconds

Image: "Medical gallery of Blausen Medical 2014". WikiJournal of Medicine 1 (2). DOI:10.15347/wjm/2014.010. ISSN 2002-4436.

8
Neuron
• Axon
  • Transports signals to other cells
• Dendrites
  • Receive signals from other cells' axons at the synapses
• Soma (cell body)
  • "Sums" the signals from the dendrites
  • When the membrane potential passes a threshold, an action potential is sent down the axon: the cell "spikes" or "fires"

Image: https://simple.wikipedia.org/wiki/Neuron#/media/File:Neuron.svg
9
The Perceptron
• Frank Rosenblatt, 1958
• A learning algorithm, which we will consider
• A custom-built machine based on this algorithm, for image recognition

Image: https://en.wikipedia.org/wiki/File:Mark_I_perceptron.jpeg
10
6.3 The Perceptron Algorithm
Intro to AI Systems
The Perceptron

The Perceptron is a type of artificial neuron used in machine learning for binary classification
tasks. It serves as the foundational building block for more complex neural network
architectures.

A Perceptron consists of:

• Inputs: Features from the dataset (x₁, …, xₙ)
• Weights: Each input has an associated weight (w₁, w₂, …, wₙ), which determines the influence of the input on the output.
• Bias: An additional parameter (b) that allows the model to fit the data better by shifting the activation function.
• Activation Function: A function (f) that processes the weighted sum of the inputs and produces the output (commonly a step function for binary classification); a small sketch follows below.
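A minimal sketch of these components in code (NumPy assumed; the names step and perceptron_output are only illustrative):

```python
import numpy as np

def step(z):
    """Step activation: fire (1) if the weighted sum exceeds 0, else 0."""
    return np.where(z > 0, 1, 0)

def perceptron_output(x, w, b):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    return step(np.dot(w, x) + b)

# Example: two inputs, fixed weights and bias
x = np.array([2.0, -1.0])
w = np.array([0.5, 0.5])
b = -0.2
print(perceptron_output(x, w, b))  # 1, since 0.5*2 + 0.5*(-1) - 0.2 > 0
```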
Adjusting the threshold – example
• Consider the simplest situation with only one input: x₁
• Assume we have a fixed threshold, say θ = 1
• Let blue be the positive class, and red the negative class
• We see that the positive class corresponds to x₁ > 3, or equivalently (1/3)·x₁ > 1
• So w₁ = 1/3 yields the desired outcome: w₁x₁ > 1 iff x₁ > 3
15
Adjusting the threshold – example contd.
• Assume the same fixed threshold, say θ = 1
• Let blue be the positive class, and red the negative class
• We see that now the positive class corresponds to x₁ < 3
• Is there a w₁ such that x₁ < 3 if and only if w₁x₁ > 1?
  • NO!
• If we instead can change the threshold to θ = −1, we see
  that x₁ < 3 if and only if w₁x₁ > −1, with w₁ = −1/3
• We must change the threshold as well as the weights!
16
The Bias Term

• The bias term is a constant added to the output of a model, allowing it to shift the function up or down. It ensures that the model can better fit the training data by adjusting the output independently of the input features.
• Including a bias term is essential for capturing patterns in the data that do not pass through the origin. For instance, if you have a linear relationship but the line of best fit does not start at (0,0), the bias term enables the model to accommodate that offset.
• In the context of linear regression:
  • A simple linear regression model is represented by the equation y = w₀ + w₁·x
  • Here, w₀ is the bias term (intercept), and w₁ is the weight for the input feature x.
  • Without the bias term (w₀), the model would be forced to pass through the origin (0,0), which may not accurately reflect the relationship in the data.
17
The Bias Term, x₀ = −1

Since:
• Σᵢ₌₁ᵐ wᵢxᵢ > θ  is the same as
• Σᵢ₌₁ᵐ wᵢxᵢ − θ > 0,  which is the same as
• Σᵢ₌₁ᵐ wᵢxᵢ + w₀x₀ = Σᵢ₌₀ᵐ wᵢxᵢ > 0
• provided x₀ = −1 (and w₀ = θ)

We can add a new feature x₀ = −1 for all items:
• Replace (x₁, x₂, …, xₘ) with (−1, x₁, x₂, …, xₘ)
• Replace Σᵢ₌₁ᵐ wᵢxᵢ > θ with h = Σᵢ₌₀ᵐ wᵢxᵢ > 0
18
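A minimal sketch of this trick in NumPy (the data and weights below are made up for illustration): prepend a constant −1 column so the threshold becomes an ordinary weight w₀.

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [0.5, 3.0]])              # N items, m features

# Prepend x0 = -1 to every item: (x1, ..., xm) -> (-1, x1, ..., xm)
X_bias = np.hstack([-np.ones((X.shape[0], 1)), X])

w = np.array([1.0, 0.3, 0.3])           # w0 plays the role of the threshold theta
h = X_bias @ w                          # sum_{i=0}^{m} w_i * x_i
y = (h > 0).astype(int)                 # classify against 0 instead of theta
print(X_bias, y)
```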
Inductive Bias

• Inductive bias refers to the set of assumptions a model uses to generalize beyond the training data.
It affects how a model makes predictions on unseen data.

Examples:
• Linear Models:
• Assume relationships between variables are linear, meaning the change in the output is
proportional to the change in the input.
• Decision Trees:
• Assume the data can be split hierarchically into subsets, which allows the model to capture
complex interactions between features through a series of binary decisions.

19
Redefined objective – geometric understanding

• Find a line w₁x₁ + w₀x₀ = 0
• such that h(−1, x₁) = w₁x₁ + w₀x₀ = w₁x₁ − w₀ > 0
• if and only if x₁ < 3
20
Training one perceptron

1. Initialize: set all weights w₀, w₁, … wₘ to small random numbers
2. Repeat until <some criterion>:
   Consider one training instance
   • Inputs: x₀, x₁, … xₘ
   • Label: t, which is 1 or 0
   • Calculate the output of the perceptron: y = o = g(Σᵢ₌₀ᵐ wᵢxᵢ)
   • If y = t, do nothing; if y ≠ t, update the weights
21
Update weights
η > 0 is the fixed learning rate

If t = 1 and y = 0:
• increase Σᵢ₌₀ᵐ wᵢxᵢ by increasing each wᵢxᵢ:
  • if xᵢ > 0: increase wᵢ
  • if xᵢ < 0: decrease wᵢ
• wᵢ = wᵢ + ηxᵢ

If t = 0 and y = 1:
• decrease Σᵢ₌₀ᵐ wᵢxᵢ by decreasing each wᵢxᵢ:
  • if xᵢ > 0: decrease wᵢ
  • if xᵢ < 0: increase wᵢ
• wᵢ = wᵢ − ηxᵢ

Both cases are covered by: wᵢ = wᵢ + η(t − y)xᵢ = wᵢ − η(y − t)xᵢ
22
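The update rule and the training loop above can be put together in a short sketch (assuming NumPy and inputs with x₀ = −1 already prepended; the function name train_perceptron and the tiny data set are only illustrative):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20):
    """X: N x (m+1) inputs with x0 = -1 in the first column; t: 0/1 labels."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.05, 0.05, X.shape[1])    # small random weights
    for _ in range(epochs):
        for xj, tj in zip(X, t):
            y = 1 if xj @ w > 0 else 0          # perceptron output
            w = w - eta * (y - tj) * xj         # w_i = w_i - eta*(y - t)*x_i
    return w

# Tiny linearly separable example in one feature (x1 < 3 is the positive class)
x1 = np.array([0.5, 1.0, 2.0, 4.0, 5.0, 6.0])
t = (x1 < 3).astype(int)
X = np.column_stack([-np.ones_like(x1), x1])    # prepend x0 = -1
print(train_perceptron(X, t))
```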
Example
• We are in the middle of training
• Learning rate: η = 0.1
• Current weights: (w₀, w₁) = (−1, −1)
• i.e., positive class provided h = −w₀ + w₁x₁ = 1 − x₁ > 0, i.e., x₁ < 1

• Consider the point P = (−1, 2):
  • h(P) = 1 − 2 < 0
  • Wrongly classified
  • Update:
    • w₀ = w₀ − η(y − t)x₀ = −1 − 0.1·(0 − 1)·(−1) = −1.1
    • w₁ = w₁ − η(y − t)x₁ = −1 − 0.1·(0 − 1)·2 = −0.8

• Consider the point T = (−1, 4.6):
  • h(T) = 1 − 4.6 < 0
  • Do nothing
Observe
• Many possible solutions:
  • w₀ = −1.1, w₁ = −0.8
  • w₀ = −2.2, w₁ = −1.6
  • w₀ = −5.5, w₁ = −4.0
  • All describe the same line
• But:
  • w₀ = 1.1, w₁ = 0.8
  • Same line, but swaps the two classes!
Properties
• With only one training item,
  • the algorithm will sooner or later classify the item correctly
  • and no longer update
• When there are several training items, there might be disagreement:
  • one item will increase a certain weight wᵢ
  • another item will decrease it
• What then?
26
Linear separability
• A set is linearly separable if there is a straight line in the feature plane such that all points in one class fall on one side and all points in the other class fall on the other side
• For more than two features, this generalizes to a hyper-plane
• (With one dimension, to a point, cf. the example so far)
27
Linear classifier
• A linear classifier will always propose a linear decision boundary
  • (point, line, plane, hyper-plane)
  • whether the set is linearly separable or not
• The perceptron is a linear classifier
28
Perceptron Convergence Theorem

• If the training set is linearly separable, the perceptron algorithm will (sooner or later)
  • find a linear decision boundary
  • stop updating
• Unless the learning rate η is too large
• Comment:
  • There is normally more than one solution, and different solutions generalize differently to test data
29
Perceptron summary
• The brain, neuron and synapse
• The perceptron
• The bias term
• The perceptron algorithm
• Linear classifiers
• Linear separability
30
6.4 Linear Regression
Intro to AI Systems

31
Supervised learning – two types

Classification
• Assign a label (class) from a finite set of labels to an observation

Regression
• Assign a numerical value to an observation
• e.g., the temperature tomorrow
32
Supervised learning
• Each observation (datapoint) is described as a feature vector
  • 𝒙ⱼ = (xⱼ,₁, xⱼ,₂, …, xⱼ,ₘ)
  • This is the "input"
• There is a well-defined set of possible target values, T
• The goal is, for an input, to predict a target value f(𝒙ⱼ) from T
• For supervised learning, we have a training set
  • (𝒙₁, t₁), (𝒙₂, t₂), … (𝒙ₙ, tₙ)
• We try to learn the function f from the training set
33
Linear Regression

• Definition: Linear regression is a method for modeling the relationship between a dependent variable (target) and one or more independent variables (features) using a linear equation.
• Formula: y = w₀ + w₁x₁ + w₂x₂ + … + wₘxₘ
• Simple Linear Regression: One feature, produces a straight line.
• Multiple Linear Regression: Multiple features, forms a hyperplane.
Regression

• In regression, the goal is to predict a continuous value (a number) instead of a category.
• The target set here consists of real numbers. For example, predicting the price of a house based on its features (like size, location, etc.) is a regression problem.
• We want the predicted value (let's call it yⱼ) to be as close as possible to the actual value (denoted by tⱼ).
• This is generally about function approximation: we are trying to find a function f(xⱼ) that can accurately estimate or predict the values in between the known data points, based on what we already know.
35
An example
• Given the following data, can we find the value of the output when x = 0.44?
36
Linear regression
• We need some idea regarding the kind of functions we are looking for
• The simplest is to assume a linear function
  • f(𝒙) = f(x₁, x₂, …, xₘ) = w₀ + w₁x₁ + w₂x₂ + … + wₘxₘ
• Of course, this isn't always a good fit, but linear regression may also be adapted to some non-linear functions by feature engineering.
37
Inductive bias
• To learn from data, you must have some idea regarding how the data are distributed.
• You choose a model.
• You try to find parameters which make the model fit the training data well.
• Models carry with them inductive biases, e.g.,
  • Linear regression can only learn straight lines.
  • The perceptron can only learn linear decision boundaries.
38
Linear regression
Simple Linear Regression (one input variable):

• This is the case where there is only one input variable (denoted x₁).
• The formula for the model is: y = w₀ + w₁x₁
• Here, w₀ is the intercept (the point where the line crosses the y-axis), and w₁ is the slope (how steep the line is).
• The model describes a straight line when you plot the input x₁ against the output (or predicted value). This is easy to visualize because it's just a line on a 2D graph.
39
Multiple Linear Regression (more than one input variable):
When you have multiple input variables, the formula becomes:
y = w₀ + w₁x₁ + w₂x₂ + … + wₘxₘ

• Here, each wᵢ corresponds to a weight or coefficient for the respective input xᵢ.
• 𝒙 represents the vector of input variables.
• Instead of a line, this model describes a hyperplane in a higher-dimensional space, which is much harder to visualize.
40
Notation
• f(𝒙) = w₀ + w₁x₁ + w₂x₂ + … + wₘxₘ
• Often used notation:
  • f(𝒙) = β₀ + β₁x₁ + β₂x₂ + … + βₘxₘ
  • The βᵢ-s are parameters of the model
  • β₁, β₂, …, βₘ are called coefficients
  • β₀ is called the intercept
• We have used wᵢ for weights
• We will assume x₀ = 1 (bias)
41
Mean Square Error (MSE)

Why do we need MSE?
• When we create a line (or a model) to predict values from data, we want to know how well this line fits the actual data points.
• We also want to compare different lines (or models) to see which one does a better job at predicting the data.
• MSE helps us quantify the "goodness" of the fit by telling us how far the predictions are from the actual values.

What is MSE?
• MSE stands for Mean Squared Error, and it measures the average squared difference between the actual data points and the predictions made by the line.
• Mathematically: MSE = (1/n) Σᵢ₌₁ⁿ (tᵢ − yᵢ)²
• where:
  • tᵢ is the actual value of the data point,
  • yᵢ is the predicted value (using the line or model),
  • n is the total number of data points.
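A small sketch of this definition in NumPy (the function name mse and the sample data are illustrative):

```python
import numpy as np

def mse(t, y):
    """Mean squared error between actual values t and predictions y."""
    t, y = np.asarray(t), np.asarray(y)
    return np.mean((t - y) ** 2)

t = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.4])
print(mse(t, y))  # ((0.1)^2 + (0.1)^2 + (0.4)^2) / 3 = 0.06
```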
Mean Square Error (MSE) II
• We need a way to tell whether a line is a good fit to the data, and whether one line is better than another
• The goal is to minimize this error.

MSE in 3D
• With two input variables, we are trying to find a plane.
• The MSE is similar.
Footnote: variants
• The variants all have their minimum for the same values of the xᵢ-s and wᵢ-s.
• RMSE has the "right scale", but is not suited for finding the minimum.
• The mean (MSE or RMSE) is needed for comparison across different training/test sets.
45
Goal
• The goal is to find the w₀, w₁, … wₘ that minimize the MSE
46
Minimizing: one variable
• We assume you know how to minimize with one variable:
• f(x) = (x − 3)² + 2
• f′(x) = 2(x − 3)
• f′(x) = 0 iff x = 3
47
The minimization problem
• (Here drawn as a maximization problem, "upside-down", because it is easier to draw.)
• Looking for a point where the tangent plane is horizontal:
  • Where all the derivatives are 0
• The MSE is convex:
  • There is only one global minimum
  • No problem with local optima
48
Tangent plane
• In 2D, the derivative of a function gives the slope of the tangent to the
curve at a specific point.
• In higher dimensions (more variables), partial derivatives determine
the slopes of tangents to the surface, but each tangent is parallel to one
of the axes (x, y, etc.).
• These tangents combine to form a tangent (hyper-)plane, representing
the local behavior of the surface around a point.
• The steepest direction in this plane is given by the gradient:
• In gradient ascent, we follow this direction to maximize the
function.
• In gradient descent, we move in the opposite direction to minimize
the function.
Partial derivatives
Many machine learning models (like linear regression, neural networks) involve functions with several variables. For example, the error function (or cost function) depends on multiple parameters (weights).

To minimize the error (i.e., to find the best model), we need to understand how changes in each variable affect the error. Partial derivatives show how the function changes with respect to one variable at a time while keeping the others constant.

In summary, partial derivatives are crucial for finding the optimal values of parameters, guiding optimization algorithms, and analyzing how changes in one variable affect the overall outcome in functions with multiple variables.
Partial derivatives
• We assume you know that
  • if f(x) = (a + bx)²
  • then df/dx = 2(a + bx)·b
• Extended to more dimensions, we can construct partial derivatives, e.g.
  • g(x, y) = (a + bx + cy)²
  • ∂g/∂x (x, y) = 2(a + bx + cy)·b
  • ∂g/∂y (x, y) = 2(a + bx + cy)·c
https://www.wikihow.com/Image:OyXsh.png
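As a quick numerical check of these formulas, here is a hedged sketch comparing the analytic partial derivative ∂g/∂x = 2(a + bx + cy)·b with a central-difference estimate (all constants and names are illustrative):

```python
import numpy as np

a, b, c = 1.0, 2.0, -0.5

def g(x, y):
    return (a + b * x + c * y) ** 2

def dg_dx(x, y):
    # Analytic partial derivative with respect to x
    return 2 * (a + b * x + c * y) * b

x, y, h = 0.7, 1.3, 1e-6
numeric = (g(x + h, y) - g(x - h, y)) / (2 * h)   # central difference
print(dg_dx(x, y), numeric)                        # the two should agree closely
```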
Minimizing the MSE

• Minimizing the Mean Squared Error (MSE) is the process of finding the best parameters (weights) that make the predicted values as close as possible to the actual values.
• Key steps:
  • Calculate MSE: First, we compute the MSE, which measures the average squared differences between predicted and actual values.
  • Use gradient descent: To minimize the MSE, we apply an optimization algorithm like gradient descent. This involves:
    • Computing partial derivatives of the MSE with respect to each parameter.
    • Updating the parameters in small steps in the direction that reduces the MSE (opposite to the gradient).
  • Iterate until convergence: Continue updating the parameters until the MSE is minimized, meaning the model fits the data as closely as possible.
Some good and some bad news
• This has a closed-form solution, i.e., there is a recipe for calculating the solution directly:
  • This works fine for low dimensions (few features for each observation)
  • But it gets slow for more dimensions:
    • Standard algorithms are O(m³), where m is the number of dimensions.
    • In ML, we may have millions of features/dimensions.
  • And since it does not require ML, we will not investigate it.
• But we can always use gradient descent.
53
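For completeness, a hedged sketch of the closed-form route using NumPy's least-squares solver; the slide does not spell out the recipe, so the synthetic data and the use of np.linalg.lstsq are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, m))])   # x0 = 1 bias column
true_w = np.array([2.0, -1.0, 0.5, 3.0])
t = X @ true_w + 0.01 * rng.normal(size=N)

# Closed-form least-squares solution: lstsq minimizes ||Xw - t||^2 directly
w_hat, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w_hat)   # should be close to true_w
```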
6.5 Applying Gradient Descent
Intro to AI Systems

54
Gradient descent in one variable
• (Lecture 2)
• Start with x₀
• Iteratively find x₁, x₂, …, xᵢ, … with decreasing f(xᵢ) by setting
  • xᵢ₊₁ = xᵢ − γ·f′(xᵢ)
• γ is the learning rate
55
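A minimal sketch of this rule applied to the earlier example f(x) = (x − 3)² + 2 (the starting point, learning rate and number of steps are illustrative choices):

```python
def f_prime(x):
    # Derivative of f(x) = (x - 3)**2 + 2
    return 2 * (x - 3)

x = 0.0          # x_0: starting point
gamma = 0.1      # learning rate
for _ in range(50):
    x = x - gamma * f_prime(x)   # x_{i+1} = x_i - gamma * f'(x_i)
print(x)         # converges towards the minimum at x = 3
```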
The gradient in more dimensions
• We move in the opposite direction of the gradient.
56
Minimizing the MSE
• To minimize f = Σⱼ₌₁ᴺ (tⱼ − Σᵢ₌₀ᵐ wᵢxⱼ,ᵢ)² with respect to w₀, w₁, … wₘ,
we can calculate the partial derivatives:
• ∂f/∂wₖ = 2 Σⱼ₌₁ᴺ (tⱼ − Σᵢ₌₀ᵐ wᵢxⱼ,ᵢ)·(−xⱼ,ₖ)  for k = 0, 1, …, m
• ∂f/∂wₖ = 2 Σⱼ₌₁ᴺ (tⱼ − yⱼ)·(−xⱼ,ₖ)  for k = 0, 1, …, m
• (Observe that this is just a generalization of ∂g/∂x (x, y) = 2(a + bx + cy)·b)
57
Prediction (matrix form)
• yⱼ = yⱼ,₁ = Σᵢ₌₀ᵐ wᵢxⱼ,ᵢ = (xⱼ,₀, xⱼ,₁, …, xⱼ,ₘ) ∙ (w₀,₁, w₁,₁, …, wₘ,₁)
58
The gradient
• Collecting the partial derivatives above into a vector, with the predictions written as Y = XW: ∇f = 2·Xᵀ(Y − T)
59
Implementing the gradient descent
• Input:
  • X, the input, an N × m NumPy matrix:
    • N items, m features
  • T, the corresponding target values, an N × 1 column vector
    • (maybe it is given as a vector of length N and must be transformed)
  • Make a weight matrix, W, an m × 1 column vector
• Forward step:
  • Y = X@W
• Update step:
  • W = W − η∇f
  • W −= η·X.T@(Y − T)
  • (η, eta, is a learning rate)
60
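A minimal runnable sketch of these steps (the synthetic data, learning rate and iteration count are assumptions for illustration; a bias column x₀ = 1 is prepended so w₀ is learned like any other weight):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, m))])    # N x (m+1), x0 = 1
true_W = np.array([[1.0], [2.0], [-3.0]])
T = X @ true_W + 0.1 * rng.normal(size=(N, 1))               # N x 1 targets

W = np.zeros((X.shape[1], 1))    # weight column vector
eta = 0.001                      # learning rate
for _ in range(500):
    Y = X @ W                    # forward step
    W -= eta * X.T @ (Y - T)     # update step
print(W.ravel())                 # approaches [1, 2, -3]
```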
Understanding Linear Regression: A Python Implementation:
https://colab.research.google.com/drive/1r7UyAJzubbhCncOiTFWEFIKS2LDw2lD0?usp=sharing
Summary
• Linear regression
• Inductive bias
• Mean Square Error
• Minimizing the MSE
• Using gradient descent
• In matrix form
62
