Deep Learning Basics
Lecture 2: Backpropagation
Princeton University COS 495
Instructor: Yingyu Liang
How to train the dragon?
[Figure: a feedforward network mapping input 𝑥 through hidden layers ℎ1, ℎ2, …, ℎ𝐿 to output 𝑦]
How to get the expected output
• The network computes 𝑓𝜃(𝑥); on the current parameters the loss 𝑙(𝑥; 𝜃) ≠ 0
• Goal: find an update 𝑑 to the parameters so that the loss 𝑙(𝑥; 𝜃 + 𝑑) ≈ 0
How to get the expected output
• How to find 𝑑: 𝑙(𝑥; 𝜃 + 𝜖𝑣) ≈ 𝑙(𝑥; 𝜃) + 𝛻𝑙(𝑥; 𝜃) ⋅ 𝜖𝑣 for a small scalar 𝜖
• Want the updated loss 𝑙(𝑥; 𝜃 + 𝑑) ≈ 0
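As a quick numerical check of this first-order approximation (an illustration added here, not from the slides); the toy quadratic loss, the direction 𝑣, and the particular numbers are arbitrary choices:

```python
import numpy as np

def l(theta):
    # Toy quadratic loss over a 2-dimensional parameter vector
    return 0.5 * np.sum((theta - np.array([1.0, -2.0])) ** 2)

def grad_l(theta):
    # Its gradient
    return theta - np.array([1.0, -2.0])

theta = np.array([0.0, 0.0])
v = np.array([1.0, 1.0])        # an arbitrary direction
eps = 1e-3                      # small scalar
lhs = l(theta + eps * v)
rhs = l(theta) + grad_l(theta) @ (eps * v)   # first-order approximation
print(lhs, rhs)                 # the two values nearly agree
```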
How to get the expected output
• Conclusion: move 𝜃 along −𝛻𝑙(𝑥; 𝜃) for a small amount to decrease the loss 𝑙(𝑥; 𝜃 + 𝑑)
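One step of that conclusion in code, a minimal sketch on a toy loss (the step size 0.1 and the loss itself are illustrative choices, not from the slides):

```python
def loss(theta, x=2.0, y=1.0):
    # Toy loss l(x; theta) = (theta * x - y)^2 / 2 for a single example
    return 0.5 * (theta * x - y) ** 2

def grad_loss(theta, x=2.0, y=1.0):
    # Gradient of the toy loss with respect to theta
    return (theta * x - y) * x

theta = 0.0
eps = 0.1                            # "small amount"
d = -eps * grad_loss(theta)          # move along the negative gradient
print(loss(theta), loss(theta + d))  # the loss decreases after the step
```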
Neural Networks as real circuits
Pictorial illustration of gradient descent
Gradient
• Gradient of the loss is simple
• E.g., 𝑙(𝑓𝜃, 𝑥, 𝑦) = (𝑓𝜃(𝑥) − 𝑦)²/2
• 𝜕𝑙/𝜕𝜃 = (𝑓𝜃(𝑥) − 𝑦) 𝜕𝑓𝜃(𝑥)/𝜕𝜃
• Key part: gradient of the hypothesis
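To make the split concrete (the loss gradient is simple, the hard part is the gradient of the hypothesis), here is a minimal sketch; a linear hypothesis 𝑓𝜃(𝑥) = 𝜃𝑥 is assumed only to keep the example short:

```python
def f(theta, x):
    # Hypothesis; a linear model is assumed here purely for illustration
    return theta * x

def df_dtheta(theta, x):
    # Gradient of the hypothesis with respect to theta (the "key part")
    return x

def dl_dtheta(theta, x, y):
    # Chain rule for l = (f(x) - y)^2 / 2:  dl/dtheta = (f(x) - y) * df/dtheta
    return (f(theta, x) - y) * df_dtheta(theta, x)

# Numerical check with a finite difference
theta, x, y, eps = 1.5, 2.0, 1.0, 1e-6
numeric = (0.5 * (f(theta + eps, x) - y) ** 2
           - 0.5 * (f(theta - eps, x) - y) ** 2) / (2 * eps)
print(dl_dtheta(theta, x, y), numeric)  # the two values should match closely
```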
Open the box: real circuit
Single neuron
[Figure: a single neuron computing the difference of its two inputs 𝑥1 and 𝑥2, output 𝑓]
Function: 𝑓 = 𝑥1 − 𝑥2
Single neuron
[Figure: the same neuron, with local gradient 1 on the 𝑥1 edge and −1 on the 𝑥2 edge]
Function: 𝑓 = 𝑥1 − 𝑥2
Gradient: 𝜕𝑓/𝜕𝑥1 = 1, 𝜕𝑓/𝜕𝑥2 = −1
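In code, the subtraction neuron is a two-line forward pass plus a two-line backward pass; the sketch below is an added illustration, not from the slides:

```python
def forward_sub(x1, x2):
    # Subtraction neuron: f = x1 - x2
    return x1 - x2

def backward_sub(grad_f=1.0):
    # Local gradients df/dx1 = 1 and df/dx2 = -1,
    # each multiplied by the gradient flowing in from above
    return grad_f * 1.0, grad_f * (-1.0)

f = forward_sub(3.0, 5.0)     # f = -2
print(backward_sub())         # (1.0, -1.0)
```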
Two neurons
[Figure: an addition neuron computes 𝑥2 = 𝑥3 + 𝑥4, which feeds the subtraction neuron computing 𝑓 = 𝑥1 − 𝑥2]
Function: 𝑓 = 𝑥1 − 𝑥2 = 𝑥1 − (𝑥3 + 𝑥4)
Two neurons
[Figure: the two-neuron circuit, with local gradients 𝜕𝑥2/𝜕𝑥3 = 1 and 𝜕𝑥2/𝜕𝑥4 = 1 written on the addition neuron's edges]
Function: 𝑓 = 𝑥1 − 𝑥2 = 𝑥1 − (𝑥3 + 𝑥4)
Gradient: 𝜕𝑥2/𝜕𝑥3 = 1, 𝜕𝑥2/𝜕𝑥4 = 1. What about 𝜕𝑓/𝜕𝑥3?
Two neurons
[Figure: the two-neuron circuit, with the backpropagated gradient −1 written on the 𝑥3 and 𝑥4 edges]
Function: 𝑓 = 𝑥1 − 𝑥2 = 𝑥1 − (𝑥3 + 𝑥4)
Gradient: 𝜕𝑓/𝜕𝑥3 = (𝜕𝑓/𝜕𝑥2)(𝜕𝑥2/𝜕𝑥3) = −1
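This chain-rule step is exactly what a backward pass does node by node. A minimal sketch for the two-neuron circuit (illustrative code, not from the slides):

```python
def forward(x1, x3, x4):
    x2 = x3 + x4     # addition neuron
    f = x1 - x2      # subtraction neuron
    return f

def backward(grad_f=1.0):
    # Subtraction neuron: df/dx1 = 1, df/dx2 = -1
    dx1 = grad_f * 1.0
    dx2 = grad_f * (-1.0)
    # Addition neuron: dx2/dx3 = 1, dx2/dx4 = 1; the chain rule multiplies by dx2
    dx3 = dx2 * 1.0
    dx4 = dx2 * 1.0
    return dx1, dx3, dx4

print(forward(1.0, 2.0, 3.0))   # 1 - (2 + 3) = -4
print(backward())               # (1.0, -1.0, -1.0)
```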
Multiple input
[Figure: the addition neuron now has three inputs 𝑥3, 𝑥5, 𝑥4 feeding 𝑥2, with local gradient 𝜕𝑥2/𝜕𝑥5 = 1]
Function: 𝑓 = 𝑥1 − 𝑥2 = 𝑥1 − (𝑥3 + 𝑥5 + 𝑥4)
Gradient: 𝜕𝑥2/𝜕𝑥5 = 1
Multiple input
[Figure: the three-input circuit, with the backpropagated gradient −1 written on each input edge of the addition neuron]
Function: 𝑓 = 𝑥1 − 𝑥2 = 𝑥1 − (𝑥3 + 𝑥5 + 𝑥4)
Gradient: 𝜕𝑓/𝜕𝑥5 = (𝜕𝑓/𝜕𝑥2)(𝜕𝑥2/𝜕𝑥5) = −1
Weights on the edges
[Figure: the inputs 𝑥3 and 𝑥4 enter the addition neuron through edges with weights 𝑤3 and 𝑤4]
Function: 𝑓 = 𝑥1 − 𝑥2 = 𝑥1 − (𝑤3𝑥3 + 𝑤4𝑥4)
Weights on the edges
[Figure: the same circuit redrawn with the weights 𝑤3 and 𝑤4 shown as inputs to the circuit]
Function: 𝑓 = 𝑥1 − 𝑥2 = 𝑥1 − (𝑤3𝑥3 + 𝑤4𝑥4)
Weights on the edges
[Figure: the circuit with the backpropagated gradients −𝑥3 and −𝑥4 written on the weight edges]
Function: 𝑓 = 𝑥1 − 𝑥2 = 𝑥1 − (𝑤3𝑥3 + 𝑤4𝑥4)
Gradient: 𝜕𝑓/𝜕𝑤3 = (𝜕𝑓/𝜕𝑥2)(𝜕𝑥2/𝜕𝑤3) = −1 × 𝑥3 = −𝑥3
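Treating the weights as just more inputs to the circuit, the same backward pass gives their gradients; a minimal sketch (an added illustration):

```python
def forward(x1, x3, x4, w3, w4):
    x2 = w3 * x3 + w4 * x4   # weighted-sum neuron
    return x1 - x2           # f = x1 - x2

def backward(x3, x4, grad_f=1.0):
    dx2 = grad_f * (-1.0)    # subtraction neuron: df/dx2 = -1
    dw3 = dx2 * x3           # dx2/dw3 = x3, so df/dw3 = -x3
    dw4 = dx2 * x4           # dx2/dw4 = x4, so df/dw4 = -x4
    return dw3, dw4

print(backward(x3=2.0, x4=3.0))  # (-2.0, -3.0)
```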
Activation
[Figure: the weighted sum 𝑤3𝑥3 + 𝑤4𝑥4 passes through an activation 𝜎 before the subtraction neuron]
Function: 𝑓 = 𝑥1 − 𝑥2 = 𝑥1 − 𝜎(𝑤3𝑥3 + 𝑤4𝑥4)
Activation
[Figure: the same circuit, with the pre-activation value labeled 𝑛𝑒𝑡2]
Function: 𝑓 = 𝑥1 − 𝑥2 = 𝑥1 − 𝜎(𝑤3𝑥3 + 𝑤4𝑥4)
Let 𝑛𝑒𝑡2 = 𝑤3𝑥3 + 𝑤4𝑥4
Activation
[Figure: local gradients 𝜕𝑛𝑒𝑡2/𝜕𝑤3 = 𝑥3 and 𝜕𝑥2/𝜕𝑛𝑒𝑡2 = 𝜎′(𝑛𝑒𝑡2) written on the circuit]
Function: 𝑓 = 𝑥1 − 𝑥2 = 𝑥1 − 𝜎(𝑤3𝑥3 + 𝑤4𝑥4)
Gradient: 𝜕𝑓/𝜕𝑤3 = (𝜕𝑓/𝜕𝑥2)(𝜕𝑥2/𝜕𝑛𝑒𝑡2)(𝜕𝑛𝑒𝑡2/𝜕𝑤3) = −1 × 𝜎′(𝑛𝑒𝑡2) × 𝑥3 = −𝜎′(𝑛𝑒𝑡2)𝑥3
Activation
[Figure: the backpropagated gradients −𝜎′(𝑛𝑒𝑡2) on the 𝑛𝑒𝑡2 edge and −𝜎′(𝑛𝑒𝑡2)𝑥3 on the 𝑤3 edge]
Function: 𝑓 = 𝑥1 − 𝑥2 = 𝑥1 − 𝜎(𝑤3𝑥3 + 𝑤4𝑥4)
Gradient: 𝜕𝑓/𝜕𝑤3 = (𝜕𝑓/𝜕𝑥2)(𝜕𝑥2/𝜕𝑛𝑒𝑡2)(𝜕𝑛𝑒𝑡2/𝜕𝑤3) = −1 × 𝜎′(𝑛𝑒𝑡2) × 𝑥3 = −𝜎′(𝑛𝑒𝑡2)𝑥3
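The activation just adds one more factor to the chain. The sketch below assumes a sigmoid for 𝜎 purely for concreteness (the slides leave 𝜎 generic):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x3, x4, w3, w4):
    net2 = w3 * x3 + w4 * x4   # pre-activation net2
    x2 = sigmoid(net2)         # activation
    return x1 - x2, x2

def backward(x3, x4, x2, grad_f=1.0):
    dx2 = grad_f * (-1.0)        # df/dx2 = -1
    dnet2 = dx2 * x2 * (1 - x2)  # sigma'(net2) = sigma(net2) * (1 - sigma(net2))
    dw3 = dnet2 * x3             # dnet2/dw3 = x3
    dw4 = dnet2 * x4             # dnet2/dw4 = x4
    return dw3, dw4

f, x2 = forward(1.0, 2.0, 3.0, 0.1, -0.2)
print(backward(2.0, 3.0, x2))    # equals -sigma'(net2)*x3 and -sigma'(net2)*x4
```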
Multiple paths
• When a variable reaches the output through more than one path, the gradients along the different paths add up
[Figure: a circuit in which one input reaches the output 𝑓 through multiple paths]
[Figure: a small network where 𝑥1 and 𝑥2 feed two sigmoid hidden units ℎ11 = 𝜎(𝑛𝑒𝑡11) and ℎ12 = 𝜎(𝑛𝑒𝑡21), which are combined into the output 𝑓]
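A minimal sketch of the multiple-paths rule (an added illustration, not the slides' circuit): if a variable reaches 𝑓 along two paths, its gradient is the sum of the two path contributions.

```python
def forward(x):
    a = 2.0 * x        # path 1: a = 2x
    b = x * x          # path 2: b = x^2
    return a + b       # f = a + b

def backward(x, grad_f=1.0):
    da = grad_f * 1.0  # f = a + b, so df/da = 1
    db = grad_f * 1.0  # and df/db = 1
    # x reaches f through both paths, so the two contributions add up
    return da * 2.0 + db * 2.0 * x

print(forward(3.0))    # 2*3 + 3^2 = 15
print(backward(3.0))   # 2 + 2*3 = 8, matching d(2x + x^2)/dx
```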
Math form
Gradient descent
• Minimize loss 𝐿(𝜃), where the hypothesis is parametrized by 𝜃
• Gradient descent
• Initialize 𝜃0
• 𝜃𝑡+1 = 𝜃𝑡 − 𝜂𝑡 𝛻𝐿(𝜃𝑡)
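A minimal sketch of this update rule (a constant step size and a toy quadratic loss are assumed here for simplicity):

```python
import numpy as np

def gradient_descent(grad_L, theta0, eta=0.1, steps=100):
    # theta_{t+1} = theta_t - eta * grad L(theta_t)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad_L(theta)
    return theta

# Example: L(theta) = ||theta - 3||^2 / 2 has gradient theta - 3
print(gradient_descent(lambda th: th - 3.0, theta0=[0.0]))  # approaches [3.]
```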
Stochastic gradient descent (SGD)
• Suppose data points arrive one by one
• 𝐿(𝜃) = (1/𝑛) Σ𝑡=1…𝑛 𝑙(𝜃, 𝑥𝑡, 𝑦𝑡), but we only know 𝑙(𝜃, 𝑥𝑡, 𝑦𝑡) at time 𝑡
• Update rule (a mini-batch of size 𝑏; 𝑏 = 1 gives plain SGD):
  𝜃𝑡+1 = 𝜃𝑡 − 𝜂𝑡 (1/𝑏) Σ1≤𝑖≤𝑏 𝛻𝑙(𝜃𝑡, 𝑥𝑡𝑏+𝑖, 𝑦𝑡𝑏+𝑖)
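A minimal sketch of this mini-batch update (illustrative code, not from the slides); a constant step size and a squared loss on a synthetic dataset are assumed:

```python
import numpy as np

def sgd(grad_l, data, theta0, eta=0.5, b=4, epochs=20):
    # Mini-batch SGD: average the per-example gradients over each batch of size b
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        for start in range(0, len(data), b):
            batch = data[start:start + b]
            g = np.mean([grad_l(theta, x, y) for x, y in batch], axis=0)
            theta = theta - eta * g
    return theta

# Example: l(theta, x, y) = (theta*x - y)^2 / 2, so grad l = (theta*x - y) * x
grad_l = lambda th, x, y: (th * x - y) * x
data = [(x, 2.0 * x) for x in np.linspace(-1, 1, 32)]   # targets y = 2x
print(sgd(grad_l, data, theta0=0.0))                    # approaches 2.0
```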