02B DL2023 NN Backprop

The document discusses the fundamentals of artificial neural networks, focusing on the backpropagation algorithm for training these networks. It outlines the challenges of training, including adjusting weights and computational complexity, and provides a framework for effectively training neural networks. Key concepts such as the architecture of neural networks, the role of activation functions, and techniques to improve training are also covered.

DASC7606 -2B

Deep Learning
Artificial Neural Networks
(backpropagation)
Dr Bethany Chan
Professor Francis Chin
2023
1
System
Question X → (unknown function) → Answer Y
f: X → Y

How to find the unknown function?

Instead of a Linear Model, a Neural Network can model any function,
i.e., it can solve any Classification/Regression Problem.

Remaining Problems:
How to build the Neural Network for our problem?
What does the NN look like? (# layers, neurons, etc.)
Even given the right NN architecture, how to find the weights of the NN?
How to train the NN with past data?
2
Outline
Artificial Neural Networks
• The Perceptron
• Building Neural Networks with Perceptron
• Training Neural Networks with
backpropagation
• Techniques to improve training
• Theoretical Motivation
• Deep Learning vs. Machine Learning

3
Training is difficult
Training a neural network - finding a set of weights
that best maps inputs to outputs - is difficult:
- How to adjust thousands or hundreds of thousands of
weights efficiently
- The error surface is non-convex, high-dimensional, and
contains local minima, saddle points and flat spots.
- Overfitting - the network can fit the training data perfectly
(if there is not enough of it), as there are many parameters
(possibly even more than training examples).
- High computational complexity for huge amounts of
training data and large neural networks
(GPUs are used)
4
Training NN with backpropagation

• Efficient backpropagation → breakthrough
• A simple example
• Computation during the backward pass: δ
• The equations for δ
• Showing the gradient ∂J(W)/∂w_ij^(l) is indeed equal to x_i^(l−1) δ_j^(l)
• BP with different activation functions
• Coding NN with hidden layers

5
Why the recent breakthroughs?

• Algorithms: backprop,
CNN, LSTM, TensorFlow, …
• Big data: ImageNet, …
• Computing power: CPUs,
GPUs, …
• Dollars: Google, Facebook,
Amazon, …

6
1986 paper in Nature
with over 30,000 citations!!!
One of the main contributors
to the success of Deep Learning

7
Backpropagation (BP)
• Training adjusts the weights of the network so as
to give the minimum error (loss) at the output.
• BP is an algorithm to calculate how the loss is
affected by the weights of the network.
• For the Gradient Descent algorithm, BP efficiently
computes the gradient of the loss function J(W)
with respect to each weight w_ij^(l) of the network,
i.e., ∂J(W)/∂w_ij^(l),
where w_ij^(l) is the weight between neuron i (in layer l−1)
and neuron j in layer l.
8
Backpropagation is a way of
computing gradients of
expressions through recursive
application of the chain rule.

The Chain Rule

y = f(x), x = g(s)        s → x → y

dy/ds = (dx/ds)(dy/dx) = g′(s) f′(x)

By the chain rule, we iterate backwards one layer
at a time from the last layer.
9
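To make the chain rule concrete, here is a minimal numeric check (not from the slides): it compares the analytic derivative g′(s) f′(x) with a finite-difference estimate of dy/ds. The particular choices g(s) = s² and f(x) = sin(x) are illustrative assumptions.

```python
import math

# Illustrative choices (not from the slides): x = g(s) = s**2, y = f(x) = sin(x)
def g(s): return s * s
def f(x): return math.sin(x)

def dy_ds_analytic(s):
    x = g(s)
    g_prime = 2 * s            # g'(s)
    f_prime = math.cos(x)      # f'(x)
    return g_prime * f_prime   # chain rule: dy/ds = g'(s) f'(x)

def dy_ds_numeric(s, eps=1e-6):
    return (f(g(s + eps)) - f(g(s - eps))) / (2 * eps)

s = 0.7
print(dy_ds_analytic(s), dy_ds_numeric(s))  # the two values should agree closely
```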
Training NN with backpropagation

• Efficient backpropagation → breakthrough
• A simple example
• Computation during the backward pass: δ
• The equations for δ
• Showing the gradient ∂J(W)/∂w_ij^(l) is indeed equal to x_i^(l−1) δ_j^(l)
• BP with different activation functions
• Coding NN with hidden layers

10
A simple BP example
A chain of single-neuron layers with parameters w1, b1, …, w^(L−1), b^(L−1), w^(L), b^(L):
… → s^(L−1), x^(L−1) = 0.4 → s^(L), x^(L) = 0.6 → desired output y = 1

s^(L) = w^(L) x^(L−1) + b^(L)
x^(L) = g(s^(L))
J(w1, b1, …, w^(L), b^(L)) = ½ (x^(L) − y)² = ½ (0.6 − 1)²

How does J change w.r.t. w^(L)?
∂J/∂w^(L) = (∂s^(L)/∂w^(L)) (∂x^(L)/∂s^(L)) (∂J/∂x^(L))            (chain rule)
          = x^(L−1) g′(s^(L)) (x^(L) − y)

∂J/∂b^(L) = (∂s^(L)/∂b^(L)) (∂x^(L)/∂s^(L)) (∂J/∂x^(L))
          = g′(s^(L)) (x^(L) − y) = ∂J/∂s^(L)        (the key value for BP)
11
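A minimal sketch of this gradient in code. The sigmoid activation and the weight/bias values w, b are assumptions made for illustration (the slide only fixes x^(L−1) = 0.4 and y = 1); the analytic gradients from the slide are checked against finite differences.

```python
import math

def sigmoid(s):  # assumed activation g for this sketch
    return 1.0 / (1.0 + math.exp(-s))

x_prev, y = 0.4, 1.0          # x^(L-1) and desired output from the slide
w, b = 1.0, 0.0               # hypothetical weight/bias of the last layer

def loss(w, b):
    s = w * x_prev + b        # s^(L) = w^(L) x^(L-1) + b^(L)
    x = sigmoid(s)            # x^(L) = g(s^(L))
    return 0.5 * (x - y) ** 2 # J = 1/2 (x^(L) - y)^2

# Analytic gradients from the slide
s = w * x_prev + b
x = sigmoid(s)
dJ_dw = x_prev * x * (1 - x) * (x - y)   # x^(L-1) g'(s^(L)) (x^(L) - y)
dJ_db = x * (1 - x) * (x - y)            # the "key value for BP", dJ/ds^(L)

eps = 1e-6
print(dJ_dw, (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps))
print(dJ_db, (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps))
```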
A Simple BP Example (continued)
… → s^(L−2), x^(L−2) → s^(L−1), x^(L−1) → s^(L), x^(L) → desired output y        (same chain as before)

Forward equations:
s^(L−1) = w^(L−1) x^(L−2) + b^(L−1),   x^(L−1) = g(s^(L−1))
s^(L)   = w^(L) x^(L−1) + b^(L),       x^(L)   = g(s^(L))

∂J/∂w^(L) = (∂s^(L)/∂w^(L)) (∂J/∂s^(L)) = x^(L−1) · ∂J/∂s^(L)
(output at previous layer × BP value), where ∂J/∂s^(L) = g′(s^(L)) (x^(L) − y)

BP step (rate of change of the loss function w.r.t. the next layer's input s):
∂J/∂s^(L−1) = (∂x^(L−1)/∂s^(L−1)) (∂s^(L)/∂x^(L−1)) (∂J/∂s^(L)) = g′(s^(L−1)) w^(L) · ∂J/∂s^(L)

∂J/∂w^(L−1) = (∂s^(L−1)/∂w^(L−1)) (∂J/∂s^(L−1)) = x^(L−2) · ∂J/∂s^(L−1)
(output at previous layer × rate of change of the loss function w.r.t. the next input s)
12
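The same idea one layer further back, as a sketch: all numeric values and the sigmoid activation are hypothetical, but the backward recursion follows the slide, reusing ∂J/∂s^(L) to get ∂J/∂s^(L−1) and then the weight gradients.

```python
import math

def sigmoid(s): return 1.0 / (1.0 + math.exp(-s))

x_Lm2, y = 0.5, 1.0                        # hypothetical x^(L-2) and target
w_Lm1, b_Lm1 = 0.8, 0.1                    # hypothetical layer L-1 parameters
w_L,   b_L   = 1.2, -0.2                   # hypothetical layer L parameters

# Forward pass (as on the slide)
s_Lm1 = w_Lm1 * x_Lm2 + b_Lm1; x_Lm1 = sigmoid(s_Lm1)
s_L   = w_L   * x_Lm1 + b_L;   x_L   = sigmoid(s_L)

# Backward pass, reusing dJ/ds^(L)
dJ_ds_L   = sigmoid(s_L) * (1 - sigmoid(s_L)) * (x_L - y)          # g'(s^(L)) (x^(L) - y)
dJ_dw_L   = x_Lm1 * dJ_ds_L                                        # previous output x BP value
dJ_ds_Lm1 = sigmoid(s_Lm1) * (1 - sigmoid(s_Lm1)) * w_L * dJ_ds_L  # BP step
dJ_dw_Lm1 = x_Lm2 * dJ_ds_Lm1

print(dJ_dw_L, dJ_dw_Lm1)
```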
How w’s are adjusted?
Layer L-2 Layer L-1 Layer L

𝒔(𝑳1𝟐)
𝒙 (𝑳1𝟐) 𝒔 (𝑳1𝟏)
𝒙 (𝑳1𝟏) 𝒔(𝑳) 𝒙(𝑳) 𝒚
𝒘(𝑳(𝟏) $%
=𝛿 ("$%) 𝒘(𝑳) $%
=𝛿 (")
𝛿 ("$&) $&("$%) $& (")
~
𝐽
X X
𝜕𝐽 𝜕𝐽
𝜕𝑤 ("'() 𝜕𝑤 (")

Backpropagation
!3
Backpropagate !4(𝒍)
= 𝛿 (')
!3 !3
= 𝑥 ('()) 𝛿 (') based on output 𝑥 ('()) and BP value = 𝛿 (')
!$(𝒍) !4(𝒍)
13
Training NN with backpropagation

• Efficient backpropagation → breakthrough
• A simple example
• Computation during the backward pass: δ
• The equations for δ
• Showing the gradient ∂J(W)/∂w_ij^(l) is indeed equal to x_i^(l−1) δ_j^(l)
• BP with different activation functions
• Coding NN with hidden layers

14
More complicated example
Let us define the notation when each layer
has more than one node:

Layer l−1: i-th node, with output x_i^(l−1)
Layer l: j-th node, with input s_j^(l)
connected by the weight w_ij^(l)

15
More complicated example
s_j^(L) = Σ_i w_ij^(L) x_i^(L−1)
x_j^(L) = g(s_j^(L))
J = ½ Σ_j (x_j^(L) − y_j)²

∂J/∂w_ij^(L) = (∂s_j^(L)/∂w_ij^(L)) (∂x_j^(L)/∂s_j^(L)) (∂J/∂x_j^(L))
             = x_i^(L−1) g′(s_j^(L)) (x_j^(L) − y_j) = x_i^(L−1) · ∂J/∂s_j^(L)

How does J change w.r.t. s_i^(L−1)?
Note that s_i^(L−1) will affect s_j^(L) for every j = 1, 2, …

∂J/∂s_i^(L−1) = Σ_j (∂x_i^(L−1)/∂s_i^(L−1)) (∂s_j^(L)/∂x_i^(L−1)) (∂J/∂s_j^(L))        (sum over layer L)
              = g′(s_i^(L−1)) Σ_j w_ij^(L) · ∂J/∂s_j^(L)

Backpropagation: the reverse of the feed-forward pass.
16
Backpropagation
Layer l              Layer l+1
δ_i^(l)      ←      δ_j^(l+1)

δ_i^(l) = g′(s_i^(l)) ( Σ_j w_ij^(l+1) δ_j^(l+1) )

17
Forward Pass vs Backpropagation
Forward (Layer l−1 → Layer l):
s_j^(l) = Σ_i w_ij^(l) x_i^(l−1) + b
x_j^(l) = g(s_j^(l))

Backward (Layer l+1 → Layer l):
δ_i^(l) = g′(s_i^(l)) ( Σ_j w_ij^(l+1) δ_j^(l+1) )

We will prove that δ_i^(l) = ∂J/∂s_i^(l).
18
Framework for training a Neural Network
1. Initialize all weights w_ij^(l) at random
2. Repeat until convergence:
3.   Pick a random example to feed into layer 0
4.   Forward: compute all x_j^(l)
5.   Backward: compute all δ_j^(l)
6.   Update weights: w_ij^(l) → w_ij^(l) − α x_i^(l−1) δ_j^(l)
7. Return the final weights w_ij^(l)

(Here ∂J(W)/∂w_ij^(l) = x_i^(l−1) δ_j^(l) with δ_j^(l) = ∂J/∂s_j^(l); a code sketch of this loop follows below.)

Layer l−1: x_i^(l−1) = forward output of the i-th node
Layer l:   δ_j^(l) = backward output of the j-th node
connected by the weight w_ij^(l)
19
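A minimal sketch of steps 1-7 for a small fully connected network. The sigmoid activation, squared-error loss, layer sizes and toy data are assumptions for illustration, not prescribed by the slide; weights are stored as matrices W[l] of shape (width of layer l−1) × (width of layer l), matching the w_ij^(l) indexing.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [2, 5, 1]                       # layer widths (hypothetical)
alpha = 0.1                             # learning rate (hypothetical)

def g(s):  return 1.0 / (1.0 + np.exp(-s))   # sigmoid (assumed activation)
def g_prime(s): return g(s) * (1 - g(s))

# 1. Initialize all weights at random (biases start at zero here)
W = [rng.normal(0, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

def train_step(x0, y):
    # 4. Forward: compute all x^(l) (and cache s^(l))
    xs, ss = [x0], []
    for Wl, bl in zip(W, b):
        ss.append(xs[-1] @ Wl + bl)
        xs.append(g(ss[-1]))
    # 5. Backward: compute all delta^(l)
    deltas = [g_prime(ss[-1]) * (xs[-1] - y)]          # output layer, squared-error loss
    for l in range(len(W) - 2, -1, -1):
        deltas.insert(0, g_prime(ss[l]) * (deltas[0] @ W[l + 1].T))
    # 6. Update weights: w_ij^(l) -> w_ij^(l) - alpha * x_i^(l-1) * delta_j^(l)
    for l in range(len(W)):
        W[l] -= alpha * np.outer(xs[l], deltas[l])
        b[l] -= alpha * deltas[l]

# 2./3. Repeat, picking a random example each time (toy data, hypothetical)
X = rng.normal(size=(100, 2))
Y = (X.sum(axis=1, keepdims=True) > 0).astype(float)
for _ in range(5000):
    i = rng.integers(len(X))
    train_step(X[i], Y[i])
```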
Update rule:  w_ij^(l) → w_ij^(l) − α x_i^(l−1) δ_j^(l)

∂J(W)/∂w_ij^(l) is the product of the two quantities computed on either side of w_ij^(l):

Layer l−1: compute x_i^(l−1);   Layer l: compute δ_j^(l);   connected by w_ij^(l)

20
Training NN with backpropagation

• Efficient backpropagation → breakthrough
• A simple example
• Computation during the backward pass: δ
• The equations for δ
• Showing the gradient ∂J(W)/∂w_ij^(l) is indeed equal to x_i^(l−1) δ_j^(l)
• BP with different activation functions
• Coding NN with hidden layers

21
Key equations for backpropagation: δ_i^(l)

For the output layer L:  δ_i^(L) = g′(s_i^(L)) · ∂J/∂h(x)
Note: J = loss function between h(x) and y, where h(x) = g(s^(L)).
δ_i^(L) = ∂J/∂s_i^(L)        ( as ∂J/∂s_i^(L) = (∂h(x)/∂s_i^(L)) · (∂J/∂h(x)) )

For earlier layers l = L−1, …, 1:
δ_i^(l) = g′(s_i^(l)) ( Σ_j w_ij^(l+1) δ_j^(l+1) )

22
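A minimal numeric check of these key equations, assuming tanh activations and a squared-error loss (illustrative choices): the δ of the hidden layer computed by the recursion is compared with ∂J/∂s obtained by finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(s): return np.tanh(s)              # assumed activation
def g_prime(s): return 1 - np.tanh(s) ** 2

# A tiny 2-layer network with random weights (hypothetical sizes)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
x0, y = rng.normal(size=3), rng.normal(size=2)

def forward_from_s1(s1):
    """Run the rest of the network given the layer-1 pre-activation s1, return J."""
    x1 = g(s1)
    s2 = x1 @ W2 + b2
    x2 = g(s2)
    return 0.5 * np.sum((x2 - y) ** 2)   # squared-error loss (assumed)

# Forward pass
s1 = x0 @ W1 + b1; x1 = g(s1)
s2 = x1 @ W2 + b2; x2 = g(s2)

# Key equations: output layer, then one layer back
delta2 = g_prime(s2) * (x2 - y)          # delta^(L) = g'(s^(L)) dJ/dh(x)
delta1 = g_prime(s1) * (delta2 @ W2.T)   # delta^(l) = g'(s^(l)) sum_j w_ij^(l+1) delta_j^(l+1)

# Check delta1 against dJ/ds1 estimated by finite differences
eps, num = 1e-6, np.zeros_like(s1)
for i in range(len(s1)):
    e = np.zeros_like(s1); e[i] = eps
    num[i] = (forward_from_s1(s1 + e) - forward_from_s1(s1 - e)) / (2 * eps)
print(np.max(np.abs(delta1 - num)))      # should be close to zero
```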
Training NN with backpropagation

• Efficient backpropagation → breakthrough
• A simple example
• Computation during the backward pass: δ
• The equations for δ
• Showing the gradient ∂J(W)/∂w_ij^(l) is indeed equal to x_i^(l−1) δ_j^(l),
  and δ_j^(l) = ∂J(w)/∂s_j^(l)
• BP with different activation functions
• Coding NN with hidden layers

23
GOAL: want to show that  ∂J(w)/∂w_ij^(l) = δ_j^(l) × x_i^(l−1)

By induction (slides 26-29):
∂J(w)/∂w_ij^(l) = (∂J(w)/∂s_j^(l)) × (∂s_j^(l)/∂w_ij^(l)) = (∂J(w)/∂s_j^(l)) × x_i^(l−1)

since s_j^(l) = ( Σ_i w_ij^(l) x_i^(l−1) ) + b   ⟹   ∂s_j^(l)/∂w_ij^(l) = x_i^(l−1)

GOAL now becomes: want to show that  δ_j^(l) = ∂J(w)/∂s_j^(l)
24
NEW GOAL: want to show that  δ_j^(l) = ∂J(w)/∂s_j^(l)

∂J(w)/∂s_i^(l) = Σ_j (∂J(w)/∂s_j^(l+1)) × (∂s_j^(l+1)/∂x_i^(l)) × (∂x_i^(l)/∂s_i^(l))

               = Σ_j δ_j^(l+1) × (∂s_j^(l+1)/∂x_i^(l)) × (∂x_i^(l)/∂s_i^(l))

by induction (backward), with δ_j^(k) = ∂J(w)/∂s_j^(k) for k = L, L−1, …, l+1
25
NEW GOAL: want to show that  δ_j^(l) = ∂J(w)/∂s_j^(l)

∂J(w)/∂s_i^(l) = Σ_j (∂J(w)/∂s_j^(l+1)) × (∂s_j^(l+1)/∂x_i^(l)) × (∂x_i^(l)/∂s_i^(l))
               = Σ_j δ_j^(l+1) × (∂s_j^(l+1)/∂x_i^(l)) × (∂x_i^(l)/∂s_i^(l))

Since s_j^(l+1) = ( Σ_i w_ij^(l+1) x_i^(l) ) + b   ⟹   ∂s_j^(l+1)/∂x_i^(l) = w_ij^(l+1)

26
NEW GOAL: want to show that  δ_j^(l) = ∂J(w)/∂s_j^(l)

∂J(w)/∂s_i^(l) = Σ_j (∂J(w)/∂s_j^(l+1)) × (∂s_j^(l+1)/∂x_i^(l)) × (∂x_i^(l)/∂s_i^(l))
               = Σ_j δ_j^(l+1) × (∂s_j^(l+1)/∂x_i^(l)) × (∂x_i^(l)/∂s_i^(l))
               = Σ_j δ_j^(l+1) × w_ij^(l+1) × (∂x_i^(l)/∂s_i^(l))
               = Σ_j δ_j^(l+1) × w_ij^(l+1) × g′(s_i^(l))        [ since x_i^(l) = g(s_i^(l)) ]

27
NEW GOAL: want to show that  δ_j^(l) = ∂J(w)/∂s_j^(l)

∂J(w)/∂s_i^(l) = Σ_j (∂J(w)/∂s_j^(l+1)) × (∂s_j^(l+1)/∂x_i^(l)) × (∂x_i^(l)/∂s_i^(l))
               = Σ_j δ_j^(l+1) × (∂s_j^(l+1)/∂x_i^(l)) × (∂x_i^(l)/∂s_i^(l))
               = Σ_j δ_j^(l+1) × w_ij^(l+1) × (∂x_i^(l)/∂s_i^(l))
               = Σ_j δ_j^(l+1) × w_ij^(l+1) × g′(s_i^(l)) = δ_i^(l)

Recall:  δ_i^(l) = g′(s_i^(l)) ( Σ_j w_ij^(l+1) δ_j^(l+1) )

28
Training NN with backpropagation

• Efficient backpropagation → breakthrough
• A simple example
• Computation during the backward pass: δ
• The equations for δ
• Showing the gradient ∂J(W)/∂w_ij^(l) is indeed equal to x_i^(l−1) δ_j^(l)
• BP with different activation functions
• Coding NN with hidden layers

29
g : Common Activation Functions

Sigmoid:  g(s) = σ(s) = 1 / (1 + e^(−s))

→ g′(s) = dg(s)/ds = e^(−s) / (1 + e^(−s))²
        = [ (1 + e^(−s)) − 1 ] / (1 + e^(−s)) · 1 / (1 + e^(−s))
        = (1 − g(s)) g(s)

g : Common Activation Functions

Tanh:  g(s) = 2 / (1 + e^(−2s)) − 1 = (1 − e^(−2s)) / (1 + e^(−2s))
            = (e^s − e^(−s)) / (e^s + e^(−s))        (multiply top and bottom by e^s)

Quotient rule:  (u/v)′ = (u′v − uv′) / v²

g′(s) = dg(s)/ds = [ (e^s + e^(−s))(e^s + e^(−s)) − (e^s − e^(−s))(e^s − e^(−s)) ] / (e^s + e^(−s))²
      = 1 − (e^s − e^(−s))² / (e^s + e^(−s))² = 1 − g(s)²

ReLU:  g(s) = { 0 if s < 0;  s if s ≥ 0 }   →   g′(s) = { 0 if s < 0;  1 if s ≥ 0 }
31
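The three activations and their derivatives as code, with a finite-difference sanity check (a sketch; the test points are arbitrary and chosen away from the ReLU kink at s = 0, where the derivative is not defined).

```python
import numpy as np

# The three activations and their derivatives from the slides
def sigmoid(s):        return 1.0 / (1.0 + np.exp(-s))
def sigmoid_prime(s):  return (1.0 - sigmoid(s)) * sigmoid(s)

def tanh(s):           return (np.exp(s) - np.exp(-s)) / (np.exp(s) + np.exp(-s))
def tanh_prime(s):     return 1.0 - tanh(s) ** 2

def relu(s):           return np.where(s < 0, 0.0, s)
def relu_prime(s):     return np.where(s < 0, 0.0, 1.0)

# Quick finite-difference sanity check at a few points
s = np.array([-2.0, -0.5, 0.3, 1.7])
eps = 1e-6
for gfun, gprime in [(sigmoid, sigmoid_prime), (tanh, tanh_prime), (relu, relu_prime)]:
    numeric = (gfun(s + eps) - gfun(s - eps)) / (2 * eps)
    print(gfun.__name__, np.max(np.abs(numeric - gprime(s))))  # differences should be tiny
```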
For the special case of the output layer

For earlier layers:
δ_i^(l) = g′(s_i^(l)) ( Σ_j w_ij^(l+1) δ_j^(l+1) )

For the output layer L:
δ_i^(L) = g′(s_i^(L)) · ∂J/∂h(x)

The last-layer δ_i^(L) depends on (the derivative of):
• the activation function g in the last layer
• the loss function J

32
Output Layer
                        Size   Loss Function J        Activation Function g   δ^(L)
Regression              1      Mean square error      g                       g′(s)(h(x) − y)
Binary Classification   1      Binary cross entropy   Sigmoid                 h(x) − y
Multi Classification    C      Cross entropy          Softmax                 h(x) − y

Mean square error:         J = ½ (h(x) − y)²
Binary cross entropy loss: J = −[ y log h(x) + (1 − y) log(1 − h(x)) ]
Cross entropy loss:        J = −Σ_k y_k log h(x)_k

33
Output Layer δ^(L) - Mean Square Error
                        Size   Loss Function J        Activation Function g   δ^(L)
Regression              1      Mean square error      g                       g′(s)(h(x) − y)
Binary Classification   1      Binary cross entropy   Sigmoid                 h(x) − y
Multi Classification    C      Cross entropy          Softmax                 h(x) − y

J = ½ (h(x) − y)²
For the output layer L (with one output):
δ^(L) = g′(s^(L)) · ∂J/∂h(x) = g′(s^(L)) (h(x) − y)
34
Output Layer δ^(L) - Binary Classification
                        Size   Loss Function J        Activation Function g   δ^(L)
Regression              1      Mean square error      g                       g′(s)(h(x) − y)
Binary Classification   1      Binary cross entropy   Sigmoid                 h(x) − y
Multi Classification    C      Cross entropy          Softmax                 h(x) − y

Note that h(x) = g(s) = σ(s).
J = −[ y log h(x) + (1 − y) log(1 − h(x)) ]
For the output layer L (with one output):
δ^(L) = g′(s^(L)) · ∂J/∂h(x)
      = σ′(s^(L)) · [ −y/h(x) + (1 − y)/(1 − h(x)) ]
      = (1 − σ(s)) σ(s) · (h(x) − y) / ( h(x)(1 − h(x)) )
      = h(x) − y
35
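A quick numeric check of this cancellation (the values of s and y are arbitrary illustrations): σ′(s) times the derivative of the binary cross entropy with respect to h(x) equals h(x) − y.

```python
import math

def sigmoid(s): return 1.0 / (1.0 + math.exp(-s))

# Check the algebra numerically at hypothetical values of s and y
for s, y in [(0.7, 1.0), (-1.3, 0.0), (2.1, 1.0)]:
    h = sigmoid(s)
    dJ_dh = -y / h + (1 - y) / (1 - h)     # derivative of binary cross entropy w.r.t. h(x)
    sigma_prime = (1 - h) * h              # sigma'(s)
    print(sigma_prime * dJ_dh, h - y)      # the two printed values should match
```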
Output Layer δ^(L) - Multi Classification
                        Size   Loss Function J        Activation Function g   δ^(L)
Regression              1      Mean square error      g                       g′(s)(h(x) − y)
Binary Classification   1      Binary cross entropy   Sigmoid                 h(x) − y
Multi Classification    C      Cross entropy          Softmax                 h(x) − y

Cross entropy loss: J = −Σ_k y_k log p_k            (skip proof - next 2 slides)

Note that h(x), y and δ^(L) are arrays:
y = (y_1, …, y_C)
h(x) = x^(L) = (p_1, …, p_C),  where p_i = g(s_i)        (g is the softmax)

δ^(L) = g′(s^(L)) · ∂J/∂h(x)
∂J/∂h(x) = (∂J/∂p_1, ∂J/∂p_2, …, ∂J/∂p_C)
g′(s^(L)) = (∂p_i/∂s_j), for 1 ≤ i, j ≤ C, is a matrix
36
Output Layer δ^(L) - Multi Classification
As δ^(L) = g′(s^(L)) · ∂J/∂h(x); let s_i = s_i^(L) and p_i = x_i^(L).

h(x) = (p_1, …, p_C),  where g(s_i) = p_i = e^(s_i) / Σ_k e^(s_k)        (softmax)

g′(s^(L)) = (∂p_i/∂s_j), for 1 ≤ i, j ≤ C

If i = j:  ∂p_i/∂s_i = ( e^(s_i) Σ_k e^(s_k) − e^(s_i) e^(s_i) ) / ( Σ_k e^(s_k) )² = p_i (1 − p_i)
If i ≠ j:  ∂p_i/∂s_j = −( e^(s_j) e^(s_i) ) / ( Σ_k e^(s_k) )² = −p_j p_i

g′(s^(L)) is the C × C matrix:
[  p_1(1−p_1)   −p_1 p_2     −p_1 p_3   …   −p_1 p_C   ]
[ −p_2 p_1       p_2(1−p_2)  −p_2 p_3   …   −p_2 p_C   ]
[  …                                                   ]
[ −p_C p_1      −p_C p_2     −p_C p_3   …   p_C(1−p_C) ]
37
Output Layer δ^(L) - Multi Classification
                        Size   Loss Function J        Activation Function g   δ^(L)
Regression              1      Mean square error      g                       g′(s)(h(x) − y)
Binary Classification   1      Binary cross entropy   Sigmoid                 h(x) − y
Multi Classification    C      Cross entropy          Softmax                 h(x) − y

Cross entropy loss: J = −Σ_k y_k log p_k

∂J/∂h(x) = (∂J/∂p_1, ∂J/∂p_2, …, ∂J/∂p_C),  where ∂J/∂p_k = −y_k / p_k

δ^(L) = g′(s^(L)) · ∂J/∂h(x)
      = [ the C × C matrix above ] × ( −y_1/p_1, −y_2/p_2, …, −y_C/p_C )
      = ( −y_1 + p_1 Σ_k y_k,  −y_2 + p_2 Σ_k y_k,  …,  −y_C + p_C Σ_k y_k )
      = ( −y_1 + p_1,  −y_2 + p_2,  …,  −y_C + p_C )        ( as Σ_k y_k = 1 )
      = h(x) − y
38
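A numeric check of this result (the logits are random and C = 4 is an arbitrary choice): multiplying the softmax Jacobian by the cross-entropy gradient (−y_k/p_k) gives h(x) − y.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(s):
    e = np.exp(s - np.max(s))
    return e / e.sum()

s = rng.normal(size=4)                 # hypothetical logits, C = 4
y = np.zeros(4); y[2] = 1.0            # one-hot target, so sum(y) = 1
p = softmax(s)

# Softmax Jacobian g'(s): p_i(1 - p_i) on the diagonal, -p_i p_j off the diagonal
jac = np.diag(p) - np.outer(p, p)

dJ_dh = -y / p                         # dJ/dp_k = -y_k / p_k for cross entropy
delta = jac @ dJ_dh                    # delta^(L) = g'(s^(L)) dJ/dh(x)

print(delta)
print(p - y)                           # should match: delta^(L) = h(x) - y
```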
BP Summary
Layer l−1: x_i^(l−1) = forward output of the i-th node
Layer l:   δ_j^(l) = backward output of the j-th node
connected by the weight w_ij^(l)

• w_ij^(l) → w_ij^(l) − α x_i^(l−1) δ_j^(l),  where ∂J(w)/∂w_ij^(l) = δ_j^(l) × x_i^(l−1)

• δ_j^(l−1) = g′(s_j^(l−1)) ( Σ_k w_jk^(l) δ_k^(l) );  l = L, L−1, L−2, …

Sigmoid:  g(s) = 1 / (1 + e^(−s))                    g′(s) = (1 − g(s)) g(s)
Tanh:     g(s) = (e^s − e^(−s)) / (e^s + e^(−s))     g′(s) = 1 − g(s)²
ReLU:     g(s) = { 0 if s < 0;  s if s ≥ 0 }         g′(s) = { 0 if s < 0;  1 if s ≥ 0 }

Output layer:
g         δ^(L) = g′(s)(h(x) − y)
Sigmoid   δ^(L) = h(x) − y
Softmax   δ^(L) = h(x) − y
39
Training NN with backpropagation

• Efficient backpropagation → breakthrough
• A simple example
• Computation during the backward pass: δ
• The equations for δ
• Showing the gradient ∂J(W)/∂w_ij^(l) is indeed equal to x_i^(l−1) δ_j^(l)
• BP with different activation functions
• Coding NN with hidden layers

40
NN Classifier with one hidden layer
Layer 0 (x)   Layer 1 (s1, x1)   Layer 2 (s2, x2)

Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Dimensions of matrices:
x: 1 × 2
W1: 2 × 5
b1: 1 × 5
s1, x1: 1 × 5
W2: 5 × 2
b2: 1 × 2
s2, x2: 1 × 2

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
41
NN Classifier with one hidden layer
Layer 0 (x)   Layer 1 (s1, x1)   Layer 2 (s2, x2)

Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Dimensions of matrices (M = the number of input data examples):
x: M × 2
W1: 2 × 5
b1: 1 × 5
s1, x1: M × 5
W2: 5 × 2
b2: 1 × 2
s2, x2: M × 2

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
42
NN Classifier with one hidden layer
Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Backward pass equations (from δ_j^(l) = g′(s_j^(l)) ( Σ_k w_jk^(l+1) δ_k^(l+1) )):
δ2 = h − y
δ1 = g′(s1) ∘ (δ2 W2ᵀ) = (1 − tanh²(s1)) ∘ (δ2 W2ᵀ)        (∘ = element-wise product)

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
43
NN Classifier with one hidden layer
Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Backward pass equations:
δ2 = h − y
δ1 = g′(s1) ∘ (δ2 W2ᵀ) = (1 − tanh²(s1)) ∘ (δ2 W2ᵀ)

Dimensions of matrices:
δ2, x: M × 2
W2: 5 × 2
δ1, s1: M × 5

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
44
NN Classifier with one hidden layer
Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Backward pass equations:
δ2 = h − y
δ1 = g′(s1) ∘ (δ2 W2ᵀ) = (1 − tanh²(s1)) ∘ (δ2 W2ᵀ)

Relevant gradients:
∂J/∂W2 = x1ᵀ δ2
∂J/∂b2 = δ2
∂J/∂W1 = xᵀ δ1
∂J/∂b1 = δ1

Dimensions of matrices:
δ2, h: M × 2
W2: 5 × 2
δ1, s1: M × 5

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
45
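Putting the forward pass, the backward pass and the gradients together: a minimal NumPy sketch of one gradient step for this 2-5-2 classifier. The toy data, initialization and learning rate are assumptions; the bias gradients are summed over the M examples, which is one common reading of ∂J/∂b = δ for a batch. Repeating this step (recomputing the forward pass each time) implements the training framework from slide 19.

```python
import numpy as np

rng = np.random.default_rng(3)
M, alpha = 8, 0.1                       # batch size and learning rate (hypothetical)

# Toy data: M examples with 2 features, one-hot labels over 2 classes (hypothetical)
x = rng.normal(size=(M, 2))
labels = (x[:, 0] > 0).astype(int)
y = np.eye(2)[labels]                   # M x 2 one-hot targets

# Parameters with the dimensions from the slide
W1, b1 = rng.normal(0, 0.5, (2, 5)), np.zeros((1, 5))
W2, b2 = rng.normal(0, 0.5, (5, 2)), np.zeros((1, 2))

def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Forward pass equations
s1 = x @ W1 + b1            # M x 5
x1 = np.tanh(s1)            # M x 5
s2 = x1 @ W2 + b2           # M x 2
h  = softmax(s2)            # M x 2

# Backward pass equations
delta2 = h - y                                      # M x 2
delta1 = (1 - np.tanh(s1) ** 2) * (delta2 @ W2.T)   # M x 5, element-wise product

# Relevant gradients
dW2 = x1.T @ delta2                        # 5 x 2
db2 = delta2.sum(axis=0, keepdims=True)    # bias gradient summed over the batch (assumption)
dW1 = x.T @ delta1                         # 2 x 5
db1 = delta1.sum(axis=0, keepdims=True)

# One gradient-descent step
W2 -= alpha * dW2; b2 -= alpha * db2
W1 -= alpha * dW1; b1 -= alpha * db1
```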
