ML02

1) The document introduces linear regression with one variable (univariate linear regression) for solving supervised learning problems. In supervised learning, the training set provides examples of the "right answer" or target variable y for each value of the input or feature x. 2) Linear regression finds the best-fitting straight line to describe the relationship between x and y. This straight line is represented by a hypothesis function hθ(x) = θ0 + θ1x, where θ0 and θ1 are parameters to be estimated. 3) The parameters θ0 and θ1 are estimated by minimizing a cost function J(θ0, θ1), which measures the total deviation between the predicted values hθ(x) and the actual values y over the training set; the minimization is carried out with the gradient descent algorithm.



Introduction to
Machine Learning
Dr. Muhammad Amjad Iqbal
Associate Professor
University of Central Punjab, Lahore.
[email protected]
Slides adapted from Prof. Dr. Andrew Ng (Stanford) and Dr. Humayoun

Lecture 2:
Supervised Learning
Linear regression with one variable
Reading:
• Chapter 17, "Bayesian Reasoning and Machine Learning", pages 345–348
• Chapter 3, "Pattern Recognition and Machine Learning" by Christopher M. Bishop, page 137
• Chapter 11, "Data Mining: A Knowledge Discovery Approach", from page 346
• Chapter 18, "Artificial Intelligence: A Modern Approach", from page 718

Model representation


[Figure: Housing Prices (Portland, OR): Price (in 1000s of dollars) plotted against Size (feet²).]
Supervised Learning: the "right answer" is given for each example in the data.
Regression Problem: predict a real-valued output.
Classification Problem: predict a discrete-valued output.

Training set of housing prices (Portland, OR):

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Notation:
m = number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable
(x, y) – one training example
(x(i), y(i)) – the iᵗʰ training example, where i is an index into the training set
For the table above: x(1) = 2104, x(3) = 1534, y(2) = 232, y(4) = 178
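As a minimal sketch (not part of the slides; names are illustrative), the training set and notation can be written directly in NumPy:

import numpy as np

# Training set from the table above: size in feet^2 (x) and price in $1000s (y)
x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

m = len(x)           # m = number of training examples (here, 4)
print(x[0], y[1])    # x(1) = 2104 and y(2) = 232 (1-indexed on the slide, 0-indexed here)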

How do we represent h?

hθ(x) = θ0 + θ1x        (shorthand: h(x))

Training Set → Learning Algorithm → h (the hypothesis)
Size of house (x) → h → Estimated price (estimated value of y)

h is a function that maps from x's to y's.
Because the model is linear and uses a single input variable, this is linear regression with one variable, also called univariate linear regression.
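A minimal sketch of this hypothesis in Python (illustrative, not from the slides):

def h(x, theta0, theta1):
    # Univariate linear regression hypothesis: h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

# With theta0 = 50 and theta1 = 0.06 (values sketched later in the lecture),
# a 2104 ft^2 house is predicted at about 50 + 0.06 * 2104 ≈ 176 (in $1000s).
print(h(2104, 50, 0.06))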


In summary
• A hypothesis h takes in some variable(s)
• Uses parameters determined by a learning system
• Outputs a prediction based on that input

Cost function
• A cost function lets us figure out how to fit the best straight line to our data

Training Set:

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Hypothesis: hθ(x) = θ0 + θ1x
θ0, θ1: parameters
How do we choose θ0, θ1?


Different parameter values give different functions:

• θ0 = 1.5, θ1 = 0:    h(x) = 1.5 + 0·x   (a horizontal line)
• θ0 = 0,   θ1 = 0.5:  h(x) = 0 + 0.5·x
• θ0 = 1,   θ1 = 0.5:  h(x) = 1 + 0.5·x

The line has a positive slope if θ1 > 0.

Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).

• hθ(x) is a "y imitator": it tries to convert the x into y
• Since we already have the true y values, we can evaluate how well hθ(x) does this

We want minimal deviation of hθ(x) from y.

Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples.
This is a minimization problem.


hθ(x(i)) = θ0 + θ1x(i)

Cost function:
J(θ0, θ1) = (1/2m) Σᵢ₌₁ᵐ (hθ(x(i)) − y(i))²

Goal: minimize J(θ0, θ1) over θ0, θ1

J(θ0, θ1) = (1/2m) Σᵢ₌₁ᵐ (hθ(x(i)) − y(i))²

• This cost function is also called the squared error cost function
  – A reasonable choice for most regression problems
  – Probably the most commonly used cost function
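A short NumPy sketch of this squared error cost (illustrative, not from the slides):

import numpy as np

def compute_cost(x, y, theta0, theta1):
    # J(theta0, theta1) = 1/(2m) * sum((h_theta(x_i) - y_i)^2)
    m = len(x)
    predictions = theta0 + theta1 * x        # h_theta(x) for every training example
    return np.sum((predictions - y) ** 2) / (2 * m)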

Cost function intuition I

Simplified version (set θ0 = 0, so the line passes through the origin):
Hypothesis:     hθ(x) = θ1x
Parameter:      θ1
Cost function:  J(θ1) = (1/2m) Σᵢ₌₁ᵐ (hθ(x(i)) − y(i))²
Goal:           minimize J(θ1) over θ1


(Left plot: hθ(x) for a fixed θ1, as a function of x. Right plot: J(θ1), as a function of the parameter θ1.)

For the toy training set (1, 1), (2, 2), (3, 3), so m = 3:

J(θ1) = (1/2m) Σᵢ₌₁ᵐ (hθ(x(i)) − y(i))² = (1/2m) Σᵢ₌₁ᵐ (θ1x(i) − y(i))²

For θ1 = 1:
J(1) = (1/(2×3)) · (0² + 0² + 0²) = 0
For θ1 = 0.5:
J(0.5) = (1/(2×3)) · [(0.5 − 1)² + (1 − 2)² + (1.5 − 3)²] = (1/6) · (0.25 + 1 + 2.25) = 3.5/6 ≈ 0.58

For θ1 = 0:
J(0) = (1/(2×3)) · [(0×1 − 1)² + (0×2 − 2)² + (0×3 − 3)²] = (1/6) · (1 + 4 + 9) = 14/6 ≈ 2.3
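These values can be checked numerically with a few lines of NumPy (the toy data (1, 1), (2, 2), (3, 3) comes from the plots above; the code itself is only a sketch):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(theta1):
    # simplified cost with theta0 fixed at 0: J(theta1) = 1/(2m) * sum((theta1*x - y)^2)
    m = len(x)
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

print(J(1.0))    # 0.0
print(J(0.5))    # ~0.58
print(J(0.0))    # ~2.33
print(J(-0.5))   # ~5.25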


For θ1 = −0.5:
J(−0.5) = (1/(2×3)) · [(−0.5 − 1)² + (−1 − 2)² + (−1.5 − 3)²] = 31.5/6 ≈ 5.25

• If we compute J(θ1) for a range of θ1 values and plot J(θ1) vs. θ1, we get a polynomial that looks like a quadratic (a bowl-shaped curve).

Cost function intuition II

Back to the full version:
Hypothesis:     hθ(x) = θ0 + θ1x
Parameters:     θ0, θ1
Cost function:  J(θ0, θ1) = (1/2m) Σᵢ₌₁ᵐ (hθ(x(i)) − y(i))²
Goal:           minimize J(θ0, θ1) over θ0, θ1


(Left: hθ(x) for fixed θ0, θ1, as a function of x. Right: J(θ0, θ1), as a function of both parameters.)

[Figure: Price ($ in 1000's) vs. Size in feet² (x), with the hypothesis line for θ0 = 50, θ1 = 0.06 drawn through the housing data. What does J(θ0, θ1) look like for these parameter values?]


(Left: hθ(x) as a function of x. Right: J(θ0, θ1) drawn as a contour plot over the θ0, θ1 plane.)


[Figures: further examples pairing a hypothesis line hθ(x) with the corresponding point on the contour plot of J(θ0, θ1).]


• Doing this manually is painful
• What we really want is an efficient algorithm that finds the θ0 and θ1 minimizing J

Gradient descent algorithm

• Minimizes the cost function J
• Used all over machine learning for minimization

Gradient descent algorithm

Have some function J(θ0, θ1).
Want to minimize J(θ0, θ1) over θ0, θ1.

Outline:
• Start with some initial θ0, θ1
• Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum


• Local search methods for optimization:
  – hill climbing, simulated annealing, the gradient descent algorithm, etc.

Local Search Methods

• Applicable when we are seeking a goal state and don't care how we get there. E.g.:
  – N-queens
  – finding shortest/cheapest round trips (Travelling Salesman Problem, Vehicle Routing Problem)
  – finding models of propositional formulae (SAT solvers)
  – VLSI layout, planning, scheduling, time-tabling, …
  – map coloring
  – resource allocation
  – protein structure prediction
  – genome sequence assembly

Local search

Key idea (surprisingly simple):
1. Select a (random) initial state (generate an initial guess)
2. Make local modifications to improve the current state (evaluate the current state and move to other states)
3. Repeat step 2 until a goal state is found (or we run out of time)



J(0,1)

1
0

35

J(0,1)

1
0

36


Gradient descent algorithm

repeat until convergence {
  θj := θj − α · ∂/∂θj J(θ0, θ1)     (for j = 0 and j = 1)
}

∂/∂θj J(θ0, θ1) is the derivative term.
α is the learning rate and should be a small number:
• large α := huge steps
• small α := baby steps

Gradient descent algorithm

Correct (simultaneous update of θ0, θ1):
  temp0 := θ0 − α · ∂/∂θ0 J(θ0, θ1)
  temp1 := θ1 − α · ∂/∂θ1 J(θ0, θ1)
  θ0 := temp0
  θ1 := temp1

Incorrect: updating θ0 first and then using the new θ0 when computing the update for θ1.
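A small Python sketch of why the temporaries matter (the two partial-derivative functions are placeholders; their concrete form for linear regression is derived later):

def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # Both temporaries are computed from the *old* theta0, theta1 before either is overwritten.
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    return temp0, temp1    # simultaneous update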

Gradient descent intuition

• To understand the intuition, we'll return to the simpler case where we minimize over a single parameter:
  θ1 := θ1 − α · d/dθ1 J(θ1),   where θ1 ∈ ℝ


Two key terms in the algorithm
• the derivative term
• the learning rate α

Partial derivative vs. derivative

• Use a partial derivative (∂) when the function has multiple variables but we differentiate with respect to only one
• Use an ordinary derivative (d) when we differentiate with respect to all the variables (here, the single variable θ1)

θ1 := θ1 − α · d/dθ1 J(θ1)

If the slope is positive, d/dθ1 J(θ1) ≥ 0, then
  θ1 := θ1 − α · (positive number)
so θ1 decreases and moves toward the minimum.

The derivative takes the tangent to the curve at the point (the straight red line in the figure) and calculates the slope of that tangent line. Slope = vertical change / horizontal change.

θ1 := θ1 − α · d/dθ1 J(θ1)

If the slope is negative, d/dθ1 J(θ1) ≤ 0, then
  θ1 := θ1 − α · (negative number)
so θ1 increases and moves toward the minimum.


Slope
• Familiar meaning?
• The slope of a line is the change in y divided by the change in x.
• Slope: m = Δy/Δx = (y2 − y1)/(x2 − x1) = rise/run
• Pick any two points on the line: (x1, y1), (x2, y2)
• Example: find the slope of the line which passes through the points (2, 5) and (0, 1):
  m = (5 − 1)/(2 − 0) = 4/2 = 2, which is a positive number
• Meaning: every time x increases by 1 (anywhere on the line), y increases by 2, and whenever x decreases by 1, y decreases by 2.

Positive slope (i.e. m > 0)

• y always increases when x increases, and y always decreases when x decreases.
• The graph of the line starts at the bottom left and goes towards the top right.

Negative slope (i.e. m < 0)

Example: m = (3 − (−1)) / (−2 − 1) = 4 / (−3) ≈ −1.33

• y always decreases when x increases, and y always increases when x decreases.


Horizontal and Vertical Lines

• The slope of any horizontal line is 0: m = 0/Δx = 0
• The slope of any vertical line is undefined (Δx = 0, so the ratio involves division by zero)

The derivative can take a:
• positive value
• negative value
• zero value

At each point, the (red) line is tangent to the curve, and its slope is the derivative at that point.

α: Learning Rate

If α is too small, gradient descent can be slow.

If α is too large, gradient descent can overshoot the minimum. It may fail to converge.
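A toy sketch (an assumed example, not from the slides) on the one-parameter cost J(θ) = θ², whose gradient is 2θ, showing both failure modes:

def step(theta, alpha):
    return theta - alpha * 2 * theta     # gradient of J(theta) = theta**2 is 2*theta

for alpha in (0.01, 1.1):                # too small vs. too large
    theta = 1.0
    for _ in range(20):
        theta = step(theta, alpha)
    print(alpha, theta)                  # alpha=0.01 creeps slowly toward 0; alpha=1.1 diverges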


Question: what happens when you reach a local minimum?

At a local optimum, the current value of the derivative term is 0, so the update becomes
  θ1 := θ1 − α · 0
and θ1 remains the same.

Gradient descent can converge to a local minimum, even with the learning rate α fixed.

As we approach a local minimum, the derivative term shrinks, so gradient descent will automatically take smaller steps. So there is no need to decrease α over time.

Gradient descent for linear regression

Gradient descent algorithm:
  repeat until convergence { θj := θj − α · ∂/∂θj J(θ0, θ1)   for j = 0, 1 }

Linear regression model:
  hθ(x) = θ0 + θ1x
  J(θ0, θ1) = (1/2m) Σᵢ₌₁ᵐ (hθ(x(i)) − y(i))²


Working out the derivative term for the squared error cost:

∂/∂θj J(θ0, θ1) = ∂/∂θj [ (1/2m) Σᵢ₌₁ᵐ (hθ(x(i)) − y(i))² ]
               = ∂/∂θj [ (1/2m) Σᵢ₌₁ᵐ (θ0 + θ1x(i) − y(i))² ]

j = 0:  ∂/∂θ0 J(θ0, θ1) = (1/m) Σᵢ₌₁ᵐ (hθ(x(i)) − y(i))
j = 1:  ∂/∂θ1 J(θ0, θ1) = (1/m) Σᵢ₌₁ᵐ (hθ(x(i)) − y(i)) · x(i)

Gradient descent algorithm

repeat until convergence {
  θ0 := θ0 − α · (1/m) Σᵢ₌₁ᵐ (hθ(x(i)) − y(i))
  θ1 := θ1 − α · (1/m) Σᵢ₌₁ᵐ (hθ(x(i)) − y(i)) · x(i)
}

Update θ0 and θ1 simultaneously.
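A compact NumPy sketch of this update rule, reusing the housing numbers from the earlier table (the feature scaling, learning rate, and iteration count are illustrative choices, not from the slides):

import numpy as np

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
x = (x - x.mean()) / x.std()                     # scale the feature so this toy run stays stable

theta0, theta1 = 0.0, 0.0
alpha, iterations = 0.1, 1000
m = len(x)

for _ in range(iterations):
    error = (theta0 + theta1 * x) - y            # h_theta(x(i)) - y(i) for all i
    temp0 = theta0 - alpha * error.mean()        # alpha * (1/m) * sum of errors
    temp1 = theta1 - alpha * (error * x).mean()  # alpha * (1/m) * sum of errors * x(i)
    theta0, theta1 = temp0, temp1                # simultaneous update

print(theta0, theta1)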


J(0,1)

1
0

55

J(0,1)

1
0

56

57


[Figures: a sequence of gradient descent iterations; at each step the hypothesis line hθ(x) fits the housing data better, while the corresponding point on the contour plot of J(θ0, θ1) moves toward the minimum.]






Linear Regression with One Variable

• Writing the line as y = a + bx, a is the y-intercept while b is the slope
• The error being minimized is the SSE (sum of squared errors)
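For reference, the intercept a and slope b that minimize the SSE also have a standard closed form; this is a sketch of that textbook formula, not something spelled out on the slides:

import numpy as np

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)   # slope
a = y.mean() - b * x.mean()                                                 # y-intercept
print(a, b)    # the same line that gradient descent converges to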


Another name: "Batch" Gradient Descent
"Batch": each step of gradient descent uses all the training examples.

Another algorithm that solves the same minimization problem: the normal equations method.

The gradient descent algorithm scales better than the normal equations method to larger datasets.
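As a sketch of that alternative, the normal equation θ = (XᵀX)⁻¹Xᵀy solves for both parameters in a single step (a standard closed form, assumed here rather than taken from the slides):

import numpy as np

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

X = np.column_stack([np.ones_like(x), x])     # add a column of 1s for theta0
theta = np.linalg.solve(X.T @ X, X.T @ y)     # normal equations: (X^T X) theta = X^T y
print(theta)                                  # [theta0, theta1]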


Generalization of the gradient descent algorithm
• Learn with a larger number of features.
• Difficult to plot (more than two parameters).

With multiple features, the training data becomes a matrix whose columns are the features (size, number of bedrooms, number of floors, age of home), and the prices, shown as y, are collected in a single vector.

• We need linear algebra for more complex linear regression models
• Linear algebra is good for making computationally efficient models (we'll see this later)
  – It provides a good way to work with large data sets
  – Typically, vectorization of a problem is a common optimization technique


End

