ML02
Introduction to
Machine Learning
Dr. Muhammad Amjad Iqbal
Associate Professor
University of Central Punjab, Lahore.
[email protected]
Slides adapted from those of Prof. Dr. Andrew Ng, Stanford, and Dr. Humayoun
Lecture 2:
Supervised Learning
Linear regression with one variable
Reading:
• Chapter 17, “Bayesian Reasoning and Machine Learning”, pages 345-348
• Chapter 3, “Pattern Recognition and Machine Learning”, Christopher M. Bishop, page 137
• Chapter 11, “Data Mining: A Knowledge Discovery Approach”, from page 346
• Chapter 18, “Artificial Intelligence: A Modern Approach”, from page 718
Model representation
Figure: Housing Prices (Portland, OR), Price (in 1000s of dollars) vs. Size (feet²).
Supervised Learning: we are given the “right answer” for each example in the data.
Regression Problem: predict real-valued output.
Classification Problem: discrete-valued output.
$h_\theta(x) = \theta_0 + \theta_1 x$
Shorthand: $h(x)$
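To make the model representation concrete, here is a minimal Python sketch of the hypothesis; the parameter values below are illustrative assumptions, not fitted values:

```python
def h(x, theta0, theta1):
    """Hypothesis for linear regression with one variable: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Illustrative parameters only: theta0 = 1, theta1 = 0.5.
print(h(2, 1.0, 0.5))  # 2.0
```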
Learning Algorithm
In summary
• A hypothesis h takes in some variable(s)
• Uses parameters determined by a learning system
• Outputs a prediction based on that input
Cost function
• A cost function lets us figure out how to fit the best straight line to our data
Figure: example hypotheses plotted against x for different parameter choices, e.g. h(x) = 1 + 0.5x.
• hθ(x) is a "y imitator"
• It tries to convert x into y
• Since we already have y, we can evaluate how well hθ(x) does this
$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

minimize $J(\theta_0, \theta_1)$ over $\theta_0, \theta_1$
Cost function
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: minimize $J(\theta_0, \theta_1)$ over $\theta_0, \theta_1$
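A minimal sketch of this cost function in Python; the toy dataset in the example call is an assumption for illustration:

```python
def compute_cost(x, y, theta0, theta1):
    """Squared-error cost J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)."""
    m = len(x)
    return sum((theta0 + theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

# Toy data lying exactly on y = x, so the cost is 0 at theta0 = 0, theta1 = 1.
print(compute_cost([1, 2, 3], [1, 2, 3], 0.0, 1.0))  # 0.0
```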
Figure: left, the training data and the hypothesis $h_\theta(x)$ plotted against x; right, the corresponding value of $J(\theta_1)$ plotted against $\theta_1$.
With $\theta_0 = 0$ and the training set (1, 1), (2, 2), (3, 3):

$J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_1 x^{(i)} - y^{(i)} \right)^2$

$J(1) = \frac{1}{2 \times 3} \left( 0^2 + 0^2 + 0^2 \right) = 0$
$J(0.5) = \frac{1}{2 \times 3} \left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] = \frac{0.25 + 1 + 2.25}{6} = \frac{3.5}{6} \approx 0.58$
$J(0) = \frac{1}{2 \times 3} \left[ (0 \times 1 - 1)^2 + (0 \times 2 - 2)^2 + (0 \times 3 - 3)^2 \right] = \frac{1 + 4 + 9}{6} = \frac{14}{6} \approx 2.3$
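The worked values above can be reproduced in a few lines of Python, using the training set (1, 1), (2, 2), (3, 3) implied by the example and keeping $\theta_0 = 0$:

```python
x = [1, 2, 3]
y = [1, 2, 3]
m = len(x)

def J(theta1):
    """Cost with theta0 fixed at 0: J(theta1) = (1/2m) * sum((theta1*x_i - y_i)^2)."""
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

print(J(1.0))  # 0.0
print(J(0.5))  # 0.5833... ≈ 0.58
print(J(0.0))  # 2.3333... ≈ 2.3
```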
For $\theta_1 = -0.5$: $J(-0.5) = \frac{1}{2 \times 3} \left[ (-0.5 - 1)^2 + (-1 - 2)^2 + (-1.5 - 3)^2 \right] = \frac{31.5}{6} = 5.25$
Parameters: $\theta_0, \theta_1$
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
Goal: minimize $J(\theta_0, \theta_1)$ over $\theta_0, \theta_1$
Figure: Price ($ in 1000's) plotted against Size in feet² (x) for the housing data, with the hypothesis line for $\theta_0 = 50$, $\theta_1 = 0.06$. Which values of $\theta_0$ and $\theta_1$ fit the data best?
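As a quick check of what these parameters mean (the 1250 ft² input below is just an illustrative size, not a value from the slides): $h_\theta(1250) = 50 + 0.06 \times 1250 = 125$, i.e. a predicted price of about \$125,000.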
Contour Plot
Want: minimize $J(\theta_0, \theta_1)$ over $\theta_0, \theta_1$
Outline:
• Start with some $\theta_0, \theta_1$
• Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$, until we hopefully end up at a minimum
Local search
Key idea (surprisingly simple): from the current parameters, repeatedly take a small step that reduces $J$, and stop when no nearby step improves it.
Figures: surface plots of $J(\theta_0, \theta_1)$ over $\theta_0$ and $\theta_1$; starting from different initial points, gradient descent can end up at different local minima.
Gradient descent algorithm (repeat until convergence):

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ for $j = 0$ and $j = 1$

Derivative term: $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
$\alpha$: Learning Rate
Correct: simultaneous update of $\theta_0, \theta_1$ (compute both new values from the current ones, then assign both).
Incorrect: updating $\theta_0$ first and then using the already-updated $\theta_0$ when computing the new $\theta_1$.
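A minimal Python sketch of the difference; the gradient functions here are toy stand-ins, since the actual partial derivatives for linear regression are derived later in the lecture:

```python
# Toy gradient functions, for illustration only.
def grad0(t0, t1):
    return t0 + t1 - 1.0

def grad1(t0, t1):
    return t0 - t1 + 2.0

alpha = 0.1
theta0, theta1 = 0.0, 0.0

# Correct: simultaneous update -- both gradients are evaluated at the OLD (theta0, theta1).
temp0 = theta0 - alpha * grad0(theta0, theta1)
temp1 = theta1 - alpha * grad1(theta0, theta1)
theta0, theta1 = temp0, temp1

# Incorrect: sequential update -- the second line would already see the NEW theta0,
# so the two parameters are no longer updated from the same point:
#   theta0 = theta0 - alpha * grad0(theta0, theta1)
#   theta1 = theta1 - alpha * grad1(theta0, theta1)
print(theta0, theta1)
```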
Simplified case: a single parameter, minimize $J(\theta_1)$, where $\theta_1 \in \mathbb{R}$.
$\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$

If the slope at the current point is positive, $\frac{d}{d\theta_1} J(\theta_1) \geq 0$, then

$\theta_1 := \theta_1 - \alpha \cdot (\text{positive number})$,

so $\theta_1$ decreases and moves toward the minimum.

Derivative: take the tangent at the point (the straight red line) and calculate the slope of this tangent line. Slope = vertical change / horizontal change.
$\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$

If the slope is negative, $\frac{d}{d\theta_1} J(\theta_1) \leq 0$, then

$\theta_1 := \theta_1 - \alpha \cdot (\text{negative number})$,

so $\theta_1$ increases, again moving toward the minimum.
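A minimal sketch of this behaviour for the single-parameter case, reusing the toy cost $J(\theta_1)$ from the earlier worked example (so the minimum is at $\theta_1 = 1$); the learning rate and starting points are assumptions for illustration:

```python
x = [1, 2, 3]
y = [1, 2, 3]
m = len(x)

def dJ(theta1):
    """Derivative of J(theta1) = (1/2m) * sum((theta1*x_i - y_i)^2)."""
    return sum((theta1 * xi - yi) * xi for xi, yi in zip(x, y)) / m

alpha = 0.1
for start in (2.5, -0.5):                 # start on either side of the minimum
    theta1 = start
    for _ in range(25):
        theta1 -= alpha * dJ(theta1)      # positive slope: theta1 decreases; negative: it increases
    print(start, "->", round(theta1, 3))  # both runs approach theta1 = 1
```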
Slope
• Familiar meaning?
• The slope of a line is the change in y divided by the change in x.
• Slope: $m = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} = \frac{\text{rise}}{\text{run}}$
• Pick any two points on the line: $(x_1, y_1)$, $(x_2, y_2)$.
• Example: find the slope of the line which passes through the points (2, 5) and (0, 1):
  $m = \frac{5 - 1}{2 - 0} = \frac{4}{2} = 2$, which is a positive number.
• Meaning: every time x increases by 1 (anywhere on the line), y increases by 2, and whenever x decreases by 1, y decreases by 2.
A second example, with a line that slopes downward:

$m = \frac{3 + 1}{-2 - 1} = \frac{4}{-3} = -\frac{4}{3} \approx -1.33$, which is a negative number.
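The two slope calculations above, checked in Python:

```python
# Slope through (2, 5) and (0, 1): positive.
print((5 - 1) / (2 - 0))   # 2.0

# Slope of the second, downward-sloping line.
print((3 + 1) / (-2 - 1))  # -1.333...
```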
The derivative term $\frac{d}{d\theta_1} J(\theta_1)$ can take a:
• Positive value
• Negative value
• Zero value (at a local minimum, so the update leaves $\theta_1$ unchanged)
$\alpha$: Learning Rate
Gradient descent at local optima: as we approach a local minimum, gradient descent automatically takes smaller steps, so there is no need to decrease α over time.
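A small sketch of why the steps shrink with a fixed α: each step is α times the derivative, and the derivative itself shrinks as we approach the minimum. The toy cost is the one-parameter $J(\theta_1)$ from before (minimum at $\theta_1 = 1$); α and the starting point are assumptions:

```python
x = [1, 2, 3]
y = [1, 2, 3]
m = len(x)

def dJ(theta1):
    """Derivative of the one-parameter cost J(theta1)."""
    return sum((theta1 * xi - yi) * xi for xi, yi in zip(x, y)) / m

alpha = 0.1
theta1 = 3.0
for i in range(6):
    step = alpha * dJ(theta1)                   # step size = alpha * derivative
    theta1 -= step
    print(i, round(theta1, 4), round(step, 4))  # the step column shrinks even though alpha is fixed
```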
$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$

$j = 0: \quad \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$j = 1: \quad \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$
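These two partial derivatives translate directly into code. A minimal sketch (the example data is assumed for illustration):

```python
def gradients(x, y, theta0, theta1):
    """Partial derivatives of J(theta0, theta1) for one-variable linear regression."""
    m = len(x)
    errors = [theta0 + theta1 * xi - yi for xi, yi in zip(x, y)]   # h(x_i) - y_i
    d_theta0 = sum(errors) / m                                     # (1/m) * sum of errors
    d_theta1 = sum(e * xi for e, xi in zip(errors, x)) / m         # (1/m) * sum of errors * x_i
    return d_theta0, d_theta1

print(gradients([1, 2, 3], [1, 2, 3], 0.0, 0.0))  # (-2.0, -4.666...)
```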
Gradient descent for linear regression (repeat until convergence):

$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$

$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}$

Update $\theta_0$ and $\theta_1$ simultaneously.
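Putting the pieces together, a minimal batch gradient descent sketch for one-variable linear regression; the toy data, learning rate, and iteration count are assumptions for illustration:

```python
def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * xi - yi for xi, yi in zip(x, y)]
        d0 = sum(errors) / m
        d1 = sum(e * xi for e, xi in zip(errors, x)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
    return theta0, theta1

# Toy data lying on y = x, so the fitted parameters approach (0, 1).
print(gradient_descent([1, 2, 3], [1, 2, 3]))
```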
Figures: the path taken by gradient descent on the contour plot of $J(\theta_0, \theta_1)$, together with the hypothesis $h_\theta(x)$ it corresponds to at each step.
• SSE (sum of squared errors): the quantity the cost function $J(\theta_0, \theta_1)$ is built from.
Another name:
“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the training examples.
Generalization of the gradient descent algorithm
• Learn with a larger number of features.
• Difficult to plot.
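A minimal sketch of how the same update generalizes to n features, with the hypothesis $h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$; the two-feature data below is an assumption for illustration:

```python
def predict(theta, x_row):
    """h_theta(x) = theta[0] + theta[1]*x_1 + ... + theta[n]*x_n."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x_row))

def gradient_step(theta, X, y, alpha):
    """One batch gradient descent step with n features; all theta_j updated simultaneously."""
    m = len(X)
    errors = [predict(theta, row) - yi for row, yi in zip(X, y)]
    new_theta = [theta[0] - alpha * sum(errors) / m]
    for j in range(len(theta) - 1):
        grad_j = sum(e * row[j] for e, row in zip(errors, X)) / m
        new_theta.append(theta[j + 1] - alpha * grad_j)
    return new_theta

# Illustrative data only: two features per house (size in 1000s of ft^2, bedrooms).
X = [[2.1, 3], [1.6, 3], [2.4, 4]]
y = [400, 330, 540]
theta = [0.0, 0.0, 0.0]
for _ in range(200):
    theta = gradient_step(theta, X, y, alpha=0.05)
print([round(t, 2) for t in theta])
```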
End