02B DL2023 NN Backprop
Deep Learning
Artificial Neural Networks
(backpropagation)
Dr Bethany Chan
Professor Francis Chin
2023
1
System
Question X → (unknown function) → Answer Y
f: X → Y
3
Training is difficult
Training a neural network - finding the set of weights
that best maps inputs to outputs - is difficult
– How to adjust thousands or hundreds of thousands of
weights efficiently?
– The error surface is non-convex, high-dimensional, and
contains local minima, saddle points and flat spots.
– Overfitting – the network can fit the training data perfectly
(if there is not enough of it), as there are many parameters
(sometimes even more than training examples).
– High computational complexity for huge amounts of
training data and large neural networks
(GPUs are used)
4
Training NN with backpropagation
5
Why the recent breakthroughs?
• Algorithms : backprop,
CNN, LSTM, TensorFlow, …
• Big data: ImageNet, …
• Computing power: CPUs,
GPUs, …
• Dollars: Google, Facebook,
Amazon, …
6
The 1986 paper in Nature has over 30,000 citations!!!
It is one of the main contributors to the success of Deep Learning.
7
Backpropagation (BP)
• Training adjusts the weights of the network so as
to minimize the error (loss) at the output.
• BP is an algorithm to calculate how the loss is
affected by each weight of the network.
• For the Gradient Descent algorithm, BP efficiently
computes the gradient of the loss function J(W)
with respect to each weight $w_{ij}^{(l)}$ of the network,
i.e., $\frac{\partial J(\mathbf{W})}{\partial w_{ij}^{(l)}}$.

Chain rule: if $y = f(x)$ and $x = g(s)$ (so $s \to x \to y$), then
$\frac{dy}{ds} = \frac{dx}{ds}\,\frac{dy}{dx} = g'(s)\,f'(x)$
10
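To see the chain rule in action, here is a quick numeric sketch in Python, assuming the made-up choices f = sin and g = exp (not from the slides); the finite-difference derivative of the composition should match g'(s)·f'(x).

```python
# Numeric illustration of the chain rule: y = f(x), x = g(s), dy/ds = g'(s) * f'(x).
# f = sin and g = exp are arbitrary example functions.
import math

f, f_prime = math.sin, math.cos          # y = f(x) = sin(x)
g, g_prime = math.exp, math.exp          # x = g(s) = e^s

s = 0.3
x = g(s)
eps = 1e-6
numeric = (f(g(s + eps)) - f(g(s - eps))) / (2 * eps)   # dy/ds by central difference
chain   = g_prime(s) * f_prime(x)                       # g'(s) * f'(x)
print(numeric, chain)                                   # the two should agree closely
```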
A simple BP example
A chain of single-neuron layers with parameters w(1), b(1), …, w(L-1), b(L-1), w(L), b(L);
forward values s(L-1), x(L-1) = 0.4 and s(L), x(L) = 0.6; desired output y = 1.

$s^{(L)} = w^{(L)} x^{(L-1)} + b^{(L)}$
$x^{(L)} = g(s^{(L)})$
$J(w^{(1)}, b^{(1)}, \ldots, w^{(L)}, b^{(L)}) = \tfrac{1}{2}(x^{(L)} - y)^2 = \tfrac{1}{2}(0.6 - 1)^2$

How does J change w.r.t. $w^{(L)}$?
$\frac{\partial J}{\partial w^{(L)}} = \frac{\partial s^{(L)}}{\partial w^{(L)}}\,\frac{\partial x^{(L)}}{\partial s^{(L)}}\,\frac{\partial J}{\partial x^{(L)}}$   (chain rule)
$= x^{(L-1)}\, g'(s^{(L)})\,(x^{(L)} - y)$

$\frac{\partial J}{\partial b^{(L)}} = \frac{\partial s^{(L)}}{\partial b^{(L)}}\,\frac{\partial x^{(L)}}{\partial s^{(L)}}\,\frac{\partial J}{\partial x^{(L)}} = g'(s^{(L)})\,(x^{(L)} - y)$

$\frac{\partial J}{\partial s^{(L)}} = \frac{\partial x^{(L)}}{\partial s^{(L)}}\,\frac{\partial J}{\partial x^{(L)}} = g'(s^{(L)})\,(x^{(L)} - y)$   ← key value for BP
11
A Simple BP Example
Continuing the chain: … → s(L-2), x(L-2) → s(L-1), x(L-1) = 0.4 → s(L), x(L) = 0.6 → desired output y = 1, with
$s^{(L-1)} = w^{(L-1)} x^{(L-2)} + b^{(L-1)}$, $x^{(L-1)} = g(s^{(L-1)})$,
$s^{(L)} = w^{(L)} x^{(L-1)} + b^{(L)}$, $x^{(L)} = g(s^{(L)})$.

$\frac{\partial J}{\partial w^{(L)}} = \frac{\partial s^{(L)}}{\partial w^{(L)}}\,\frac{\partial J}{\partial s^{(L)}} = x^{(L-1)}\,\frac{\partial J}{\partial s^{(L)}}$
(output at the previous layer × BP value), as $\frac{\partial J}{\partial s^{(L)}} = g'(s^{(L)})(x^{(L)} - y)$.

BP step:
$\frac{\partial J}{\partial s^{(L-1)}} = \frac{\partial x^{(L-1)}}{\partial s^{(L-1)}}\,\frac{\partial s^{(L)}}{\partial x^{(L-1)}}\,\frac{\partial J}{\partial s^{(L)}} = g'(s^{(L-1)})\, w^{(L)}\,\frac{\partial J}{\partial s^{(L)}}$
(rate of change of the loss function w.r.t. the next input s).

$\frac{\partial J}{\partial w^{(L-1)}} = \frac{\partial s^{(L-1)}}{\partial w^{(L-1)}}\,\frac{\partial J}{\partial s^{(L-1)}} = x^{(L-2)}\,\frac{\partial J}{\partial s^{(L-1)}}$
(output at the previous layer × rate of change of the loss function w.r.t. the next input s).
12
How are the w's adjusted?
Layer L-2 → Layer L-1 → Layer L:
$s^{(L-2)} \to x^{(L-2)} \to s^{(L-1)} \to x^{(L-1)} \to s^{(L)} \to x^{(L)} \to y \to J$,
with weights $w^{(L-1)}$ and $w^{(L)}$.

Define the BP value of each layer as $\delta^{(l)} = \frac{\partial J}{\partial s^{(l)}}$,
so $\frac{\partial J}{\partial s^{(L-1)}} = \delta^{(L-1)}$ and $\frac{\partial J}{\partial s^{(L)}} = \delta^{(L)}$;
these multiply into the gradients $\frac{\partial J}{\partial w^{(L-1)}}$ and $\frac{\partial J}{\partial w^{(L)}}$.

Backpropagation
Backpropagate $\frac{\partial J}{\partial s^{(l)}} = \delta^{(l)}$; then
$\frac{\partial J}{\partial w^{(l)}} = x^{(l-1)}\,\delta^{(l)}$, based on the output $x^{(l-1)}$ and the BP value $\frac{\partial J}{\partial s^{(l)}} = \delta^{(l)}$.
13
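Below is a minimal runnable sketch of this scalar chain in Python, assuming a sigmoid activation and made-up parameters chosen only so that the forward outputs land near the 0.4 and 0.6 used above; none of the numbers come from the slides.

```python
# A minimal numeric sketch of the scalar BP example: forward pass, BP values
# delta^(l) = dJ/ds^(l), and weight gradients dJ/dw^(l) = x^(l-1) * delta^(l).
import math

def g(s):          # sigmoid activation (assumed)
    return 1.0 / (1.0 + math.exp(-s))

def g_prime(s):    # derivative of the sigmoid
    return g(s) * (1.0 - g(s))

# Arbitrary example parameters and inputs (not from the slides).
x_Lm2 = 0.5                     # x^(L-2): output of layer L-2
w_Lm1, b_Lm1 = 1.2, -1.0        # layer L-1 parameters
w_L,   b_L   = 2.0, -0.4        # layer L parameters
y = 1.0                         # desired output

# Forward pass.
s_Lm1 = w_Lm1 * x_Lm2 + b_Lm1
x_Lm1 = g(s_Lm1)                # ~0.4
s_L   = w_L * x_Lm1 + b_L
x_L   = g(s_L)                  # ~0.6
J = 0.5 * (x_L - y) ** 2

# Backward pass: delta^(l) = dJ/ds^(l).
delta_L   = g_prime(s_L) * (x_L - y)            # output layer
delta_Lm1 = g_prime(s_Lm1) * w_L * delta_L      # BP step to the previous layer

# Weight gradients: dJ/dw^(l) = x^(l-1) * delta^(l).
dJ_dw_L   = x_Lm1 * delta_L
dJ_dw_Lm1 = x_Lm2 * delta_Lm1

print(J, dJ_dw_L, dJ_dw_Lm1)
```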
Training NN with backpropagation
14
More complicated example
Let us define the notation when each layer
has more than one node
$w_{ij}^{(l)}$ connects the output $x_i^{(l-1)}$ of the i-th node (in layer l-1)
to the input $s_j^{(l)}$ of the j-th node (in layer l).
15
More complicated example
$s_j^{(L)} = \sum_i w_{ij}^{(L)} x_i^{(L-1)}$
$x_j^{(L)} = g(s_j^{(L)})$
$J = \tfrac{1}{2} \sum_j (x_j^{(L)} - y_j)^2$

$\frac{\partial J}{\partial w_{ij}^{(L)}} = \frac{\partial s_j^{(L)}}{\partial w_{ij}^{(L)}}\,\frac{\partial x_j^{(L)}}{\partial s_j^{(L)}}\,\frac{\partial J}{\partial x_j^{(L)}}$
$= x_i^{(L-1)}\, g'(s_j^{(L)})\,(x_j^{(L)} - y_j) = x_i^{(L-1)}\,\frac{\partial J}{\partial s_j^{(L)}}$

How does J change w.r.t. $s_i^{(L-1)}$?
Note that $s_i^{(L-1)}$ will affect $s_j^{(L)}$, j = 1, 2, …

$\frac{\partial J}{\partial s_i^{(L-1)}} = \sum_j \frac{\partial x_i^{(L-1)}}{\partial s_i^{(L-1)}}\,\frac{\partial s_j^{(L)}}{\partial x_i^{(L-1)}}\,\frac{\partial J}{\partial s_j^{(L)}}$   (sum over layer L)
$= g'(s_i^{(L-1)}) \sum_j w_{ij}^{(L)}\,\frac{\partial J}{\partial s_j^{(L)}}$

Backpropagation: reverse of feed forward.
16
Backpropagation
Layer l → Layer l+1, with BP values $\delta_i^{(l)}$ and $\delta_j^{(l+1)}$ and outputs $x_j^{(l)}$.
$s_j^{(l)} = \sum_i w_{ij}^{(l)} x_i^{(l-1)} + b_j^{(l)}$
$x_j^{(l)} = g(s_j^{(l)})$

Will prove: $\delta_i^{(l)} = \frac{\partial J}{\partial s_i^{(l)}}$, where
$\delta_i^{(l)} = g'(s_i^{(l)}) \sum_j w_{ij}^{(l+1)} \delta_j^{(l+1)}$
18
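A small NumPy sketch of this recursion, with made-up shapes (3 nodes in layer l, 2 nodes in layer l+1) and a sigmoid activation assumed for g:

```python
# delta_i^(l) = g'(s_i^(l)) * sum_j w_ij^(l+1) * delta_j^(l+1), vectorized over i.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_prime(s):
    return sigmoid(s) * (1.0 - sigmoid(s))

# Made-up example: layer l has 3 nodes, layer l+1 has 2 nodes.
s_l        = np.array([0.2, -0.7, 1.1])      # pre-activations s_i^(l)
W_next     = np.array([[ 0.5, -0.3],         # w_ij^(l+1), shape 3 x 2
                       [ 0.8,  0.1],
                       [-0.4,  0.9]])
delta_next = np.array([0.05, -0.2])          # delta_j^(l+1) from the layer above

delta_l = sigmoid_prime(s_l) * (W_next @ delta_next)   # delta_i^(l), shape (3,)
print(delta_l)
```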
Framework for training Neural Network
1. Initialize all weights $w_{ij}^{(l)}$ at random
2. Repeat until convergence:
3.   Pick a random example to feed into layer 0
4.   Forward: compute all $x_j^{(l)}$
5.   Backward: compute all gradients $\frac{\partial J(\mathbf{W})}{\partial w_{ij}^{(l)}}$, layer by layer from the output back towards layer 0
6.   Update each weight $w_{ij}^{(l)}$ using its gradient $\frac{\partial J(\mathbf{W})}{\partial w_{ij}^{(l)}}$
20
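As a concrete illustration, here is a runnable toy version of this framework for a single neuron (one weight, one bias) with a sigmoid activation and squared-error loss; the data set, learning rate, and iteration budget are made up for illustration, not taken from the slides.

```python
# Toy training loop following steps 1-6 of the framework, for a single neuron.
import random, math

def g(s):        # sigmoid activation
    return 1.0 / (1.0 + math.exp(-s))

# Made-up training set: x -> y.
data = [(0.0, 0.0), (0.5, 0.0), (1.5, 1.0), (2.0, 1.0)]

w, b = random.uniform(-1, 1), 0.0     # step 1: random initialization
alpha = 0.5                           # learning rate

for step in range(5000):              # step 2: repeat (fixed budget here)
    x, y = random.choice(data)        # step 3: pick a random example
    s = w * x + b                     # step 4: forward pass
    h = g(s)
    delta = h * (1 - h) * (h - y)     # step 5: backward pass, delta = g'(s)(h - y) = dJ/ds
    dJ_dw, dJ_db = x * delta, delta   #         gradients dJ/dw, dJ/db
    w, b = w - alpha * dJ_dw, b - alpha * dJ_db   # step 6: gradient-descent update

print(w, b, [round(g(w * x + b), 2) for x, _ in data])
```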
Training NN with backpropagation
21
Key equations for backpropagation: $\delta_i^{(l)}$
For the output layer L: $\delta_i^{(L)} = g'(s_i^{(L)})\,\frac{\partial J}{\partial h(x)}$
Note: J is the loss function between h(x) and y, where $h(x) = g(s^{(L)})$.
$\delta_i^{(L)} = \frac{\partial J}{\partial s_i^{(L)}}$   ( as $\frac{\partial J}{\partial s_i^{(L)}} = \frac{\partial h(x)}{\partial s_i^{(L)}}\,\frac{\partial J}{\partial h(x)}$ )
22
Training NN with backpropagation
23
GOAL: Want to show that $\frac{\partial J(\mathbf{w})}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} \times x_i^{(l-1)}$
(by induction, shown on the next slide).

$\frac{\partial J(\mathbf{w})}{\partial w_{ij}^{(l)}} = \frac{\partial J(\mathbf{w})}{\partial s_j^{(l)}} \times \frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}} = \frac{\partial J(\mathbf{w})}{\partial s_j^{(l)}} \times x_i^{(l-1)}$

since $\frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}} = x_i^{(l-1)}$.

GOAL now becomes: want to show that $\delta_j^{(l)} = \frac{\partial J(\mathbf{w})}{\partial s_j^{(l)}}$
24
NEW GOAL: want to show that $\delta_j^{(l)} = \frac{\partial J(\mathbf{w})}{\partial s_j^{(l)}}$

$\frac{\partial J}{\partial s_i^{(l)}} = \sum_j \frac{\partial J}{\partial s_j^{(l+1)}} \times \frac{\partial s_j^{(l+1)}}{\partial x_i^{(l)}} \times \frac{\partial x_i^{(l)}}{\partial s_i^{(l)}}$
$= \sum_j \delta_j^{(l+1)} \times \frac{\partial s_j^{(l+1)}}{\partial x_i^{(l)}} \times \frac{\partial x_i^{(l)}}{\partial s_i^{(l)}}$
$= g'(s_i^{(l)}) \sum_j w_{ij}^{(l+1)} \delta_j^{(l+1)} = \delta_i^{(l)}$,
since $\frac{\partial s_j^{(l+1)}}{\partial x_i^{(l)}} = w_{ij}^{(l+1)}$ and $\frac{\partial x_i^{(l)}}{\partial s_i^{(l)}} = g'(s_i^{(l)})$.

By induction (backward), with $\delta_j^{(k)} = \frac{\partial J(\mathbf{w})}{\partial s_j^{(k)}}$ for k = L, L-1, …, l+1.
25
𝒈 : Common Activation Functions
Sigmoid: $g(s) = \sigma(s) = \frac{1}{1 + e^{-s}}$

$\to g'(s) = \frac{dg(s)}{ds} = \frac{e^{-s}}{(1 + e^{-s})^2}$
$= \frac{(1 + e^{-s}) - 1}{1 + e^{-s}} \cdot \frac{1}{1 + e^{-s}}$
$= (1 - g(s))\,g(s)$
𝒈 : Common Activation Functions
Tanh: $g(s) = \frac{2}{1 + e^{-2s}} - 1 = \frac{1 - e^{-2s}}{1 + e^{-2s}} = \frac{1 - e^{-s}/e^{s}}{1 + e^{-s}/e^{s}} = \frac{e^{s} - e^{-s}}{e^{s} + e^{-s}}$

Quotient rule: $\left(\frac{u}{v}\right)' = \frac{u'v - uv'}{v^2}$

$g'(s) = \frac{dg(s)}{ds} = \frac{(e^{s} + e^{-s})(e^{s} + e^{-s}) - (e^{s} - e^{-s})(e^{s} - e^{-s})}{(e^{s} + e^{-s})^2}$
$= 1 - \frac{(e^{s} - e^{-s})^2}{(e^{s} + e^{-s})^2} = 1 - g(s)^2$

ReLU: $g(s) = \begin{cases} 0 & \text{if } s < 0 \\ s & \text{if } s \ge 0 \end{cases}$  $\to$  $g'(s) = \begin{cases} 0 & \text{if } s < 0 \\ 1 & \text{if } s \ge 0 \end{cases}$
31
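These activation functions and their derivative formulas can be checked numerically; the sketch below (an assumed NumPy implementation, not from the slides) compares each g'(s) against a central finite difference.

```python
# Activation functions from the slides and a finite-difference check of g'(s).
import numpy as np

def sigmoid(s):        return 1.0 / (1.0 + np.exp(-s))
def sigmoid_prime(s):  return sigmoid(s) * (1.0 - sigmoid(s))

def tanh(s):           return (np.exp(s) - np.exp(-s)) / (np.exp(s) + np.exp(-s))
def tanh_prime(s):     return 1.0 - tanh(s) ** 2

def relu(s):           return np.where(s < 0, 0.0, s)
def relu_prime(s):     return np.where(s < 0, 0.0, 1.0)

s = np.array([-2.0, -0.5, 0.3, 1.5])     # test points (avoiding the ReLU kink at 0)
eps = 1e-6
for g, g_prime in [(sigmoid, sigmoid_prime), (tanh, tanh_prime), (relu, relu_prime)]:
    numeric = (g(s + eps) - g(s - eps)) / (2 * eps)   # central difference
    print(g.__name__, np.allclose(numeric, g_prime(s), atol=1e-4))
```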
For the special case of the output layer
32
Output Layer
                        Size   Loss Function J         Activation Function g   δ^(L)
Regression               1     Mean square error       g                       g'(s)(h(x) − y)
Binary Classification    1     Binary cross entropy    Sigmoid                 h(x) − y
Multi Classification     C     Cross entropy           Softmax                 h(x) − y

Mean square error:  $J = \frac{1}{2} (h(x) - y)^2$
33
(")
Output Layer 𝛿 - Mean Square Error
Loss Function Activation
Size d(L)
J Function 𝑔
Regression 1 Mean square error 𝑔 𝑔′(𝑠)(ℎ 𝑥 − 𝑦)
Binary Classification 1 Binary cross entropy Sigmoid ℎ 𝑥 −𝑦
Multi Classification C Cross entropy Softmax 𝒉 𝒙 −𝒚
1 L
𝐽 = ℎ 𝑥 −𝑦
2
For output layer L (with one output):
(") $ "
𝜕𝐽
𝛿 =𝑔 𝑠
𝜕ℎ 𝑥
= 𝑔$ 𝑠 "
(ℎ 𝑥 − 𝑦)
34
(")
Output Layer 𝛿 - Binary Classification
Loss Function Activation
Size d(L)
J Function 𝑔
Regression 1 Mean square error 𝑔 𝑔′(𝑠)(ℎ 𝑥 − 𝑦)
Binary Classification 1 Binary cross entropy Sigmoid ℎ 𝑥 −𝑦
Multi Classification C Cross entropy Softmax 𝒉 𝒙 −𝒚
Note that ℎ 𝑥 = 𝑔 𝑠 = 𝜎 𝑠
𝐽 = −[𝑦 log ℎ 𝑥 + 1 − 𝑦 log 1 − ℎ(𝑥 )]
For output layer L (with one output):
&'
𝛿 (") = 𝑔2 𝑠 "
&3 ,
" 4 $#4
= 𝜎′ 𝑠 −3 , + $#3 ,
4#3 ,
= 1−𝜎 𝑠 𝜎(𝑠) − 3 , ($#3 , ) =ℎ 𝑥 −𝑦
35
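A quick numeric sanity check of this simplification, using an arbitrary pre-activation s and label y (assumed values): the finite-difference derivative dJ/ds should equal h(x) − y.

```python
# Check that for a sigmoid output with binary cross-entropy loss, dJ/ds = h(x) - y.
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

s, y, eps = 0.7, 1.0, 1e-6           # arbitrary pre-activation and label

def J(s):                            # binary cross entropy as a function of s
    h = sigmoid(s)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

numeric_delta = (J(s + eps) - J(s - eps)) / (2 * eps)   # dJ/ds numerically
closed_form   = sigmoid(s) - y                          # h(x) - y

print(numeric_delta, closed_form)    # the two values should agree closely
```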
Output Layer δ^(L) - Multi Classification
(Multi classification row of the table: size C, cross entropy loss, softmax activation, δ^(L) = h(x) − y.)

Softmax maps $(s_1, \ldots, s_C)$ to probabilities $(p_1, \ldots, p_C)$:
$p_i = g(s_i)$   ( note that g is softmax )

$\boldsymbol{\delta}^{(L)} = g'(\mathbf{s}^{(L)})\,\frac{\partial J}{\partial \mathbf{h}(x)}$
$\frac{\partial J}{\partial \mathbf{h}(x)} = \left( \frac{\partial J}{\partial p_1}, \frac{\partial J}{\partial p_2}, \ldots, \frac{\partial J}{\partial p_C} \right)$
$g'(\mathbf{s}^{(L)}) = \left( \frac{\partial p_i}{\partial s_j} \right)$, for $1 \le i, j \le C$, is a matrix.
36
Output Layer δ^(L) - Multi Classification
As $\boldsymbol{\delta}^{(L)} = g'(\mathbf{s}^{(L)})\,\frac{\partial J}{\partial \mathbf{h}(x)}$; let $s_i = s_i^{(L)}$ and $p_i = x_i^{(L)}$, so $\mathbf{h}(x) = g(\mathbf{s}^{(L)}) = \mathbf{x}^{(L)}$.

$\mathbf{h}(x) = (p_1, \ldots, p_C)$ where $g(s_i) = p_i = \frac{e^{s_i}}{\sum_k e^{s_k}}$   (softmax)

$g'(\mathbf{s}^{(L)}) = \left( \frac{\partial p_i}{\partial s_j} \right)$, for $1 \le i, j \le C$:

If $i = j$:  $\frac{\partial p_i}{\partial s_i} = \frac{e^{s_i} \sum_k e^{s_k} - e^{s_i} e^{s_i}}{(\sum_k e^{s_k})^2} = p_i (1 - p_i)$

If $i \ne j$:  $\frac{\partial p_i}{\partial s_j} = -\frac{e^{s_j} e^{s_i}}{(\sum_k e^{s_k})^2} = -p_j\, p_i$
BP Summary
$w_{ij}^{(l)} \to w_{ij}^{(l)} - \alpha\, x_i^{(l-1)} \delta_j^{(l)}$,  where $\frac{\partial J(\mathbf{w})}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} \times x_i^{(l-1)}$

Tanh:  $g(s) = \frac{e^{s} - e^{-s}}{e^{s} + e^{-s}}$,   $g'(s) = 1 - g(s)^2$
ReLU:  $g(s) = \begin{cases} 0 & \text{if } s < 0 \\ s & \text{if } s \ge 0 \end{cases}$,   $g'(s) = \begin{cases} 0 & \text{if } s < 0 \\ 1 & \text{if } s \ge 0 \end{cases}$

Output layer:
  g:        $\delta^{(L)} = g'(s)(h(x) - y)$
  Sigmoid:  $\delta^{(L)} = h(x) - y$
  Softmax:  $\boldsymbol{\delta}^{(L)} = \mathbf{h}(x) - \mathbf{y}$
39
Training NN with backpropagation
40
NN Classifier with one hidden layer
Layer 0 → Layer 1 (W1) → Layer 2 (W2)

Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Dimensions of matrices:
x: 1 × 2
W1: 2 × 5,  b1: 1 × 5,  s1, x1: 1 × 5
W2: 5 × 2,  b2: 1 × 2,  s2, x2: 1 × 2

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
41
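A minimal NumPy sketch of this forward pass for the 2-input, 5-hidden-unit, 2-class network; the weight values and the input example are random placeholders, not from the slides.

```python
# Forward pass of the one-hidden-layer classifier: tanh hidden layer, softmax output.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 5)), np.zeros((1, 5))   # layer 1 parameters
W2, b2 = rng.normal(size=(5, 2)), np.zeros((1, 2))   # layer 2 parameters

def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))     # row-wise, numerically stable
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[0.3, -1.2]])          # one input example, shape 1 x 2

s1 = x @ W1 + b1                     # 1 x 5
x1 = np.tanh(s1)                     # 1 x 5
s2 = x1 @ W2 + b2                    # 1 x 2
h = softmax(s2)                      # 1 x 2, class probabilities

print(h, h.sum())                    # probabilities sum to 1
```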
NN Classifier with one hidden layer
Layer 0 → Layer 1 (W1) → Layer 2 (W2), now with M input examples fed at once.

Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Dimensions of matrices (M = the number of input data):
x: M × 2
W1: 2 × 5,  b1: 1 × 5,  s1, x1: M × 5
W2: 5 × 2,  b2: 1 × 2,  s2, x2: M × 2

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
42
NN Classifier with one hidden layer
Layer 0 → Layer 1 (W1) → Layer 2 (W2)

Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Backward pass equations (general form: $\delta_i^{(l)} = g'(s_i^{(l)}) \sum_j w_{ij}^{(l+1)} \delta_j^{(l+1)}$):
δ2 = h − y
δ1 = g′(s1) δ2 W2ᵀ = (1 − tanh²(s1)) ∘ (δ2 W2ᵀ)     (∘ is the element-wise product)

Dimensions of matrices:
δ2, x2: M × 2
W2: 5 × 2
δ1, s1: M × 5

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
44
NN Classifier with one hidden layer
Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Backward pass equations:
δ2 = h − y
δ1 = g′(s1) δ2 W2ᵀ = (1 − tanh²(s1)) ∘ (δ2 W2ᵀ)

Relevant gradients:
∂J/∂W2 = x1ᵀ δ2
∂J/∂b2 = δ2
∂J/∂W1 = xᵀ δ1
∂J/∂b1 = δ1

Dimensions of matrices:
δ2, h: M × 2
W2: 5 × 2
δ1, s1: M × 5

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
45
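Putting the forward pass, backward pass, and these gradients together, here is a runnable NumPy sketch of one gradient-descent training loop for the same 2-5-2 classifier. The data set, learning rate, and iteration count are made-up placeholders, and the bias gradients are summed over the M examples (the slides list them per example).

```python
# Full training loop for the one-hidden-layer classifier (tanh + softmax).
import numpy as np

rng = np.random.default_rng(0)
M = 8                                            # number of input examples
x = rng.normal(size=(M, 2))                      # M x 2 inputs (toy data)
y = np.zeros((M, 2))                             # M x 2 one-hot labels:
y[np.arange(M), (x[:, 0] > 0).astype(int)] = 1   #   class = sign of the first feature

W1, b1 = rng.normal(scale=0.5, size=(2, 5)), np.zeros((1, 5))
W2, b2 = rng.normal(scale=0.5, size=(5, 2)), np.zeros((1, 2))
alpha = 0.1

def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for step in range(2000):
    # Forward pass
    s1 = x @ W1 + b1
    x1 = np.tanh(s1)
    s2 = x1 @ W2 + b2
    h = softmax(s2)

    # Backward pass
    d2 = h - y                                   # M x 2
    d1 = (1 - np.tanh(s1) ** 2) * (d2 @ W2.T)    # M x 5

    # Gradients and gradient-descent update (bias gradients summed over examples)
    W2 -= alpha * (x1.T @ d2)
    b2 -= alpha * d2.sum(axis=0, keepdims=True)
    W1 -= alpha * (x.T @ d1)
    b1 -= alpha * d1.sum(axis=0, keepdims=True)

print((h.argmax(axis=1) == y.argmax(axis=1)).mean())   # training accuracy
```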