02B DL2023 NN Backprop
Deep Learning
Artificial Neural Networks
(backpropagation)
Dr Bethany Chan
Professor Francis Chin
2023
1
System
Question X → (unknown function) → Answer Y
f: X → Y
3
Training is difficult
Training a neural network - finding the set of weights
that best maps inputs to outputs - is difficult
– How to adjust thousands or hundreds of thousands of
weights efficiently?
– The error surface is non-convex, high-dimensional, and
contains local minima, saddle points and flat spots.
– Overfitting – the network can fit the training data perfectly
(if there is not enough of it), as there are many parameters
(sometimes even more than training examples).
– High computational complexity for huge amounts of
training data and large neural networks
(GPUs are used)
4
Training NN with backpropagation
5
Why the recent breakthroughs?
• Algorithms : backprop,
CNN, LSTM, TensorFlow, …
• Big data: ImageNet, …
• Computing power: CPUs,
GPUs, …
• Dollars: Google, Facebook,
Amazon, …
6
The 1986 paper in Nature has over 30,000 citations!!!
It is one of the main contributors to the success of Deep Learning.
7
Backpropagation (BP)
• Training adjusts the weights of the network so as
to minimize the error (loss) at the output.
• BP is an algorithm to calculate how the loss is
affected by each weight of the network.
• For the Gradient Descent algorithm, BP efficiently
computes the gradient of the loss function J(W)
with respect to each weight $w_{ij}^{(l)}$ of the network,
i.e., $\frac{\partial J(\mathbf{W})}{\partial w_{ij}^{(l)}}$.

Chain rule: if $y = f(x)$ and $x = g(s)$ (so $s \to x \to y$), then
$\frac{dy}{ds} = \frac{dx}{ds}\,\frac{dy}{dx} = g'(s)\,f'(x)$
10
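To see the chain rule in action, here is a quick numeric sketch in Python, assuming the made-up choices f = sin and g = exp (not from the slides); the finite-difference derivative of the composition should match g'(s)·f'(x).

```python
# Numeric illustration of the chain rule: y = f(x), x = g(s), dy/ds = g'(s) * f'(x).
# f = sin and g = exp are arbitrary example functions.
import math

f, f_prime = math.sin, math.cos          # y = f(x) = sin(x)
g, g_prime = math.exp, math.exp          # x = g(s) = e^s

s = 0.3
x = g(s)
eps = 1e-6
numeric = (f(g(s + eps)) - f(g(s - eps))) / (2 * eps)   # dy/ds by central difference
chain   = g_prime(s) * f_prime(x)                       # g'(s) * f'(x)
print(numeric, chain)                                   # the two should agree closely
```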
A simple BP example
A chain of single-neuron layers with parameters w(1), b(1), …, w(L-1), b(L-1), w(L), b(L);
forward values s(L-1), x(L-1) = 0.4 and s(L), x(L) = 0.6; desired output y = 1.

$s^{(L)} = w^{(L)} x^{(L-1)} + b^{(L)}$
$x^{(L)} = g(s^{(L)})$
$J(w^{(1)}, b^{(1)}, \ldots, w^{(L)}, b^{(L)}) = \tfrac{1}{2}(x^{(L)} - y)^2 = \tfrac{1}{2}(0.6 - 1)^2$

How does J change w.r.t. $w^{(L)}$?
$\frac{\partial J}{\partial w^{(L)}} = \frac{\partial s^{(L)}}{\partial w^{(L)}}\,\frac{\partial x^{(L)}}{\partial s^{(L)}}\,\frac{\partial J}{\partial x^{(L)}}$   (chain rule)
$= x^{(L-1)}\, g'(s^{(L)})\,(x^{(L)} - y)$

$\frac{\partial J}{\partial b^{(L)}} = \frac{\partial s^{(L)}}{\partial b^{(L)}}\,\frac{\partial x^{(L)}}{\partial s^{(L)}}\,\frac{\partial J}{\partial x^{(L)}} = g'(s^{(L)})\,(x^{(L)} - y)$

$\frac{\partial J}{\partial s^{(L)}} = \frac{\partial x^{(L)}}{\partial s^{(L)}}\,\frac{\partial J}{\partial x^{(L)}} = g'(s^{(L)})\,(x^{(L)} - y)$   ← key value for BP
11
A Simple BP Example
Continuing the chain: … → s(L-2), x(L-2) → s(L-1), x(L-1) = 0.4 → s(L), x(L) = 0.6 → desired output y = 1, with
$s^{(L-1)} = w^{(L-1)} x^{(L-2)} + b^{(L-1)}$, $x^{(L-1)} = g(s^{(L-1)})$,
$s^{(L)} = w^{(L)} x^{(L-1)} + b^{(L)}$, $x^{(L)} = g(s^{(L)})$.

$\frac{\partial J}{\partial w^{(L)}} = \frac{\partial s^{(L)}}{\partial w^{(L)}}\,\frac{\partial J}{\partial s^{(L)}} = x^{(L-1)}\,\frac{\partial J}{\partial s^{(L)}}$
(output at the previous layer × BP value), as $\frac{\partial J}{\partial s^{(L)}} = g'(s^{(L)})(x^{(L)} - y)$.

BP step:
$\frac{\partial J}{\partial s^{(L-1)}} = \frac{\partial x^{(L-1)}}{\partial s^{(L-1)}}\,\frac{\partial s^{(L)}}{\partial x^{(L-1)}}\,\frac{\partial J}{\partial s^{(L)}} = g'(s^{(L-1)})\, w^{(L)}\,\frac{\partial J}{\partial s^{(L)}}$
(rate of change of the loss function w.r.t. the next input s).

$\frac{\partial J}{\partial w^{(L-1)}} = \frac{\partial s^{(L-1)}}{\partial w^{(L-1)}}\,\frac{\partial J}{\partial s^{(L-1)}} = x^{(L-2)}\,\frac{\partial J}{\partial s^{(L-1)}}$
(output at the previous layer × rate of change of the loss function w.r.t. the next input s).
12
How are the w's adjusted?
Layer L-2 → Layer L-1 → Layer L:
$s^{(L-2)} \to x^{(L-2)} \to s^{(L-1)} \to x^{(L-1)} \to s^{(L)} \to x^{(L)} \to y \to J$,
with weights $w^{(L-1)}$ and $w^{(L)}$.

Define the BP value of each layer as $\delta^{(l)} = \frac{\partial J}{\partial s^{(l)}}$,
so $\frac{\partial J}{\partial s^{(L-1)}} = \delta^{(L-1)}$ and $\frac{\partial J}{\partial s^{(L)}} = \delta^{(L)}$;
these multiply into the gradients $\frac{\partial J}{\partial w^{(L-1)}}$ and $\frac{\partial J}{\partial w^{(L)}}$.

Backpropagation
Backpropagate $\frac{\partial J}{\partial s^{(l)}} = \delta^{(l)}$; then
$\frac{\partial J}{\partial w^{(l)}} = x^{(l-1)}\,\delta^{(l)}$, based on the output $x^{(l-1)}$ and the BP value $\frac{\partial J}{\partial s^{(l)}} = \delta^{(l)}$.
13
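Below is a minimal runnable sketch of this scalar chain in Python, assuming a sigmoid activation and made-up parameters chosen only so that the forward outputs land near the 0.4 and 0.6 used above; none of the numbers come from the slides.

```python
# A minimal numeric sketch of the scalar BP example: forward pass, BP values
# delta^(l) = dJ/ds^(l), and weight gradients dJ/dw^(l) = x^(l-1) * delta^(l).
import math

def g(s):          # sigmoid activation (assumed)
    return 1.0 / (1.0 + math.exp(-s))

def g_prime(s):    # derivative of the sigmoid
    return g(s) * (1.0 - g(s))

# Arbitrary example parameters and inputs (not from the slides).
x_Lm2 = 0.5                     # x^(L-2): output of layer L-2
w_Lm1, b_Lm1 = 1.2, -1.0        # layer L-1 parameters
w_L,   b_L   = 2.0, -0.4        # layer L parameters
y = 1.0                         # desired output

# Forward pass.
s_Lm1 = w_Lm1 * x_Lm2 + b_Lm1
x_Lm1 = g(s_Lm1)                # ~0.4
s_L   = w_L * x_Lm1 + b_L
x_L   = g(s_L)                  # ~0.6
J = 0.5 * (x_L - y) ** 2

# Backward pass: delta^(l) = dJ/ds^(l).
delta_L   = g_prime(s_L) * (x_L - y)            # output layer
delta_Lm1 = g_prime(s_Lm1) * w_L * delta_L      # BP step to the previous layer

# Weight gradients: dJ/dw^(l) = x^(l-1) * delta^(l).
dJ_dw_L   = x_Lm1 * delta_L
dJ_dw_Lm1 = x_Lm2 * delta_Lm1

print(J, dJ_dw_L, dJ_dw_Lm1)
```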
Training NN with backpropagation
14
More complicated example
Let us define the notation when each layer
has more than one node
$w_{ij}^{(l)}$ connects the output $x_i^{(l-1)}$ of the i-th node (in layer l-1)
to the input $s_j^{(l)}$ of the j-th node (in layer l).
15
More complicated example
$s_j^{(L)} = \sum_i w_{ij}^{(L)} x_i^{(L-1)}$
$x_j^{(L)} = g(s_j^{(L)})$
$J = \tfrac{1}{2} \sum_j (x_j^{(L)} - y_j)^2$

$\frac{\partial J}{\partial w_{ij}^{(L)}} = \frac{\partial s_j^{(L)}}{\partial w_{ij}^{(L)}}\,\frac{\partial x_j^{(L)}}{\partial s_j^{(L)}}\,\frac{\partial J}{\partial x_j^{(L)}}$
$= x_i^{(L-1)}\, g'(s_j^{(L)})\,(x_j^{(L)} - y_j) = x_i^{(L-1)}\,\frac{\partial J}{\partial s_j^{(L)}}$

How does J change w.r.t. $s_i^{(L-1)}$?
Note that $s_i^{(L-1)}$ will affect $s_j^{(L)}$, j = 1, 2, …

$\frac{\partial J}{\partial s_i^{(L-1)}} = \sum_j \frac{\partial x_i^{(L-1)}}{\partial s_i^{(L-1)}}\,\frac{\partial s_j^{(L)}}{\partial x_i^{(L-1)}}\,\frac{\partial J}{\partial s_j^{(L)}}$   (sum over layer L)
$= g'(s_i^{(L-1)}) \sum_j w_{ij}^{(L)}\,\frac{\partial J}{\partial s_j^{(L)}}$

Backpropagation: reverse of feed forward.
16
Backpropagation
Layer l → Layer l+1, with BP values $\delta_i^{(l)}$ and $\delta_j^{(l+1)}$ and outputs $x_j^{(l)}$.
$s_j^{(l)} = \sum_i w_{ij}^{(l)} x_i^{(l-1)} + b_j^{(l)}$
$x_j^{(l)} = g(s_j^{(l)})$

Will prove: $\delta_i^{(l)} = \frac{\partial J}{\partial s_i^{(l)}}$, where
$\delta_i^{(l)} = g'(s_i^{(l)}) \sum_j w_{ij}^{(l+1)} \delta_j^{(l+1)}$
18
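A small NumPy sketch of this recursion, with made-up shapes (3 nodes in layer l, 2 nodes in layer l+1) and a sigmoid activation assumed for g:

```python
# delta_i^(l) = g'(s_i^(l)) * sum_j w_ij^(l+1) * delta_j^(l+1), vectorized over i.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_prime(s):
    return sigmoid(s) * (1.0 - sigmoid(s))

# Made-up example: layer l has 3 nodes, layer l+1 has 2 nodes.
s_l        = np.array([0.2, -0.7, 1.1])      # pre-activations s_i^(l)
W_next     = np.array([[ 0.5, -0.3],         # w_ij^(l+1), shape 3 x 2
                       [ 0.8,  0.1],
                       [-0.4,  0.9]])
delta_next = np.array([0.05, -0.2])          # delta_j^(l+1) from the layer above

delta_l = sigmoid_prime(s_l) * (W_next @ delta_next)   # delta_i^(l), shape (3,)
print(delta_l)
```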
Framework for training Neural Network
1. Initialize all weights $w_{ij}^{(l)}$ at random
2. Repeat until convergence:
3.   Pick a random example to feed into layer 0
4.   Forward: compute all $x_j^{(l)}$
5.   Backward: compute all gradients $\frac{\partial J(\mathbf{W})}{\partial w_{ij}^{(l)}}$, layer by layer from the output back towards layer 0
6.   Update each weight $w_{ij}^{(l)}$ using its gradient $\frac{\partial J(\mathbf{W})}{\partial w_{ij}^{(l)}}$
20
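As a concrete illustration, here is a runnable toy version of this framework for a single neuron (one weight, one bias) with a sigmoid activation and squared-error loss; the data set, learning rate, and iteration budget are made up for illustration, not taken from the slides.

```python
# Toy training loop following steps 1-6 of the framework, for a single neuron.
import random, math

def g(s):        # sigmoid activation
    return 1.0 / (1.0 + math.exp(-s))

# Made-up training set: x -> y.
data = [(0.0, 0.0), (0.5, 0.0), (1.5, 1.0), (2.0, 1.0)]

w, b = random.uniform(-1, 1), 0.0     # step 1: random initialization
alpha = 0.5                           # learning rate

for step in range(5000):              # step 2: repeat (fixed budget here)
    x, y = random.choice(data)        # step 3: pick a random example
    s = w * x + b                     # step 4: forward pass
    h = g(s)
    delta = h * (1 - h) * (h - y)     # step 5: backward pass, delta = g'(s)(h - y) = dJ/ds
    dJ_dw, dJ_db = x * delta, delta   #         gradients dJ/dw, dJ/db
    w, b = w - alpha * dJ_dw, b - alpha * dJ_db   # step 6: gradient-descent update

print(w, b, [round(g(w * x + b), 2) for x, _ in data])
```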
Training NN with backpropagation
21
Key equations for backpropagation: $\delta_i^{(l)}$
For the output layer L: $\delta_i^{(L)} = g'(s_i^{(L)})\,\frac{\partial J}{\partial h(x)}$
Note: J is the loss function between h(x) and y, where $h(x) = g(s^{(L)})$.
$\delta_i^{(L)} = \frac{\partial J}{\partial s_i^{(L)}}$   ( as $\frac{\partial J}{\partial s_i^{(L)}} = \frac{\partial h(x)}{\partial s_i^{(L)}}\,\frac{\partial J}{\partial h(x)}$ )
22
Training NN with backpropagation
23
GOAL: Want to show that $\frac{\partial J(\mathbf{w})}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} \times x_i^{(l-1)}$
(by induction, shown on the next slide).

$\frac{\partial J(\mathbf{w})}{\partial w_{ij}^{(l)}} = \frac{\partial J(\mathbf{w})}{\partial s_j^{(l)}} \times \frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}} = \frac{\partial J(\mathbf{w})}{\partial s_j^{(l)}} \times x_i^{(l-1)}$

since $\frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}} = x_i^{(l-1)}$.

GOAL now becomes: want to show that $\delta_j^{(l)} = \frac{\partial J(\mathbf{w})}{\partial s_j^{(l)}}$
24
NEW GOAL: want to show that $\delta_j^{(l)} = \frac{\partial J(\mathbf{w})}{\partial s_j^{(l)}}$

$\frac{\partial J}{\partial s_i^{(l)}} = \sum_j \frac{\partial J}{\partial s_j^{(l+1)}} \times \frac{\partial s_j^{(l+1)}}{\partial x_i^{(l)}} \times \frac{\partial x_i^{(l)}}{\partial s_i^{(l)}}$
$= \sum_j \delta_j^{(l+1)} \times \frac{\partial s_j^{(l+1)}}{\partial x_i^{(l)}} \times \frac{\partial x_i^{(l)}}{\partial s_i^{(l)}}$
$= g'(s_i^{(l)}) \sum_j w_{ij}^{(l+1)} \delta_j^{(l+1)} = \delta_i^{(l)}$,
since $\frac{\partial s_j^{(l+1)}}{\partial x_i^{(l)}} = w_{ij}^{(l+1)}$ and $\frac{\partial x_i^{(l)}}{\partial s_i^{(l)}} = g'(s_i^{(l)})$.

By induction (backward), with $\delta_j^{(k)} = \frac{\partial J(\mathbf{w})}{\partial s_j^{(k)}}$ for k = L, L-1, …, l+1.
25
𝒈 : Common Activation Functions
Sigmoid: $g(s) = \sigma(s) = \frac{1}{1 + e^{-s}}$

$\to g'(s) = \frac{dg(s)}{ds} = \frac{e^{-s}}{(1 + e^{-s})^2}$
$= \frac{(1 + e^{-s}) - 1}{1 + e^{-s}} \cdot \frac{1}{1 + e^{-s}}$
$= (1 - g(s))\,g(s)$
𝒈 : Common Activation Functions
Tanh: $g(s) = \frac{2}{1 + e^{-2s}} - 1 = \frac{1 - e^{-2s}}{1 + e^{-2s}} = \frac{1 - e^{-s}/e^{s}}{1 + e^{-s}/e^{s}} = \frac{e^{s} - e^{-s}}{e^{s} + e^{-s}}$

Quotient rule: $\left(\frac{u}{v}\right)' = \frac{u'v - uv'}{v^2}$

$g'(s) = \frac{dg(s)}{ds} = \frac{(e^{s} + e^{-s})(e^{s} + e^{-s}) - (e^{s} - e^{-s})(e^{s} - e^{-s})}{(e^{s} + e^{-s})^2}$
$= 1 - \frac{(e^{s} - e^{-s})^2}{(e^{s} + e^{-s})^2} = 1 - g(s)^2$

ReLU: $g(s) = \begin{cases} 0 & \text{if } s < 0 \\ s & \text{if } s \ge 0 \end{cases}$  $\to$  $g'(s) = \begin{cases} 0 & \text{if } s < 0 \\ 1 & \text{if } s \ge 0 \end{cases}$
31
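These activation functions and their derivative formulas can be checked numerically; the sketch below (an assumed NumPy implementation, not from the slides) compares each g'(s) against a central finite difference.

```python
# Activation functions from the slides and a finite-difference check of g'(s).
import numpy as np

def sigmoid(s):        return 1.0 / (1.0 + np.exp(-s))
def sigmoid_prime(s):  return sigmoid(s) * (1.0 - sigmoid(s))

def tanh(s):           return (np.exp(s) - np.exp(-s)) / (np.exp(s) + np.exp(-s))
def tanh_prime(s):     return 1.0 - tanh(s) ** 2

def relu(s):           return np.where(s < 0, 0.0, s)
def relu_prime(s):     return np.where(s < 0, 0.0, 1.0)

s = np.array([-2.0, -0.5, 0.3, 1.5])     # test points (avoiding the ReLU kink at 0)
eps = 1e-6
for g, g_prime in [(sigmoid, sigmoid_prime), (tanh, tanh_prime), (relu, relu_prime)]:
    numeric = (g(s + eps) - g(s - eps)) / (2 * eps)   # central difference
    print(g.__name__, np.allclose(numeric, g_prime(s), atol=1e-4))
```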
For the special case of the output layer
32
Output Layer
                        Size   Loss Function J         Activation Function g   δ^(L)
Regression               1     Mean square error       g                       g'(s)(h(x) − y)
Binary Classification    1     Binary cross entropy    Sigmoid                 h(x) − y
Multi Classification     C     Cross entropy           Softmax                 h(x) − y

Mean square error:  $J = \frac{1}{2} (h(x) - y)^2$
33
(")
Output Layer 𝛿 - Mean Square Error
Loss Function Activation
Size d(L)
J Function 𝑔
Regression 1 Mean square error 𝑔 𝑔′(𝑠)(ℎ 𝑥 − 𝑦)
Binary Classification 1 Binary cross entropy Sigmoid ℎ 𝑥 −𝑦
Multi Classification C Cross entropy Softmax 𝒉 𝒙 −𝒚
1 L
𝐽 = ℎ 𝑥 −𝑦
2
For output layer L (with one output):
(") $ "
𝜕𝐽
𝛿 =𝑔 𝑠
𝜕ℎ 𝑥
= 𝑔$ 𝑠 "
(ℎ 𝑥 − 𝑦)
34
(")
Output Layer 𝛿 - Binary Classification
Loss Function Activation
Size d(L)
J Function 𝑔
Regression 1 Mean square error 𝑔 𝑔′(𝑠)(ℎ 𝑥 − 𝑦)
Binary Classification 1 Binary cross entropy Sigmoid ℎ 𝑥 −𝑦
Multi Classification C Cross entropy Softmax 𝒉 𝒙 −𝒚
Note that ℎ 𝑥 = 𝑔 𝑠 = 𝜎 𝑠
𝐽 = −[𝑦 log ℎ 𝑥 + 1 − 𝑦 log 1 − ℎ(𝑥 )]
For output layer L (with one output):
&'
𝛿 (") = 𝑔2 𝑠 "
&3 ,
" 4 $#4
= 𝜎′ 𝑠 −3 , + $#3 ,
4#3 ,
= 1−𝜎 𝑠 𝜎(𝑠) − 3 , ($#3 , ) =ℎ 𝑥 −𝑦
35
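A quick numeric sanity check of this simplification, using an arbitrary pre-activation s and label y (assumed values): the finite-difference derivative dJ/ds should equal h(x) − y.

```python
# Check that for a sigmoid output with binary cross-entropy loss, dJ/ds = h(x) - y.
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

s, y, eps = 0.7, 1.0, 1e-6           # arbitrary pre-activation and label

def J(s):                            # binary cross entropy as a function of s
    h = sigmoid(s)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

numeric_delta = (J(s + eps) - J(s - eps)) / (2 * eps)   # dJ/ds numerically
closed_form   = sigmoid(s) - y                          # h(x) - y

print(numeric_delta, closed_form)    # the two values should agree closely
```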
Output Layer δ^(L) - Multi Classification
(Multi classification row of the table: size C, cross entropy loss, softmax activation, δ^(L) = h(x) − y.)

Softmax maps $(s_1, \ldots, s_C)$ to probabilities $(p_1, \ldots, p_C)$:
$p_i = g(s_i)$   ( note that g is softmax )

$\boldsymbol{\delta}^{(L)} = g'(\mathbf{s}^{(L)})\,\frac{\partial J}{\partial \mathbf{h}(x)}$
$\frac{\partial J}{\partial \mathbf{h}(x)} = \left( \frac{\partial J}{\partial p_1}, \frac{\partial J}{\partial p_2}, \ldots, \frac{\partial J}{\partial p_C} \right)$
$g'(\mathbf{s}^{(L)}) = \left( \frac{\partial p_i}{\partial s_j} \right)$, for $1 \le i, j \le C$, is a matrix.
36
Output Layer δ^(L) - Multi Classification
As $\boldsymbol{\delta}^{(L)} = g'(\mathbf{s}^{(L)})\,\frac{\partial J}{\partial \mathbf{h}(x)}$; let $s_i = s_i^{(L)}$ and $p_i = x_i^{(L)}$, so $\mathbf{h}(x) = g(\mathbf{s}^{(L)}) = \mathbf{x}^{(L)}$.

$\mathbf{h}(x) = (p_1, \ldots, p_C)$ where $g(s_i) = p_i = \frac{e^{s_i}}{\sum_k e^{s_k}}$   (softmax)

$g'(\mathbf{s}^{(L)}) = \left( \frac{\partial p_i}{\partial s_j} \right)$, for $1 \le i, j \le C$:

If $i = j$:  $\frac{\partial p_i}{\partial s_i} = \frac{e^{s_i} \sum_k e^{s_k} - e^{s_i} e^{s_i}}{(\sum_k e^{s_k})^2} = p_i (1 - p_i)$

If $i \ne j$:  $\frac{\partial p_i}{\partial s_j} = -\frac{e^{s_j} e^{s_i}}{(\sum_k e^{s_k})^2} = -p_j\, p_i$
BP Summary
$w_{ij}^{(l)} \to w_{ij}^{(l)} - \alpha\, x_i^{(l-1)} \delta_j^{(l)}$,  where $\frac{\partial J(\mathbf{w})}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} \times x_i^{(l-1)}$

Tanh:  $g(s) = \frac{e^{s} - e^{-s}}{e^{s} + e^{-s}}$,   $g'(s) = 1 - g(s)^2$
ReLU:  $g(s) = \begin{cases} 0 & \text{if } s < 0 \\ s & \text{if } s \ge 0 \end{cases}$,   $g'(s) = \begin{cases} 0 & \text{if } s < 0 \\ 1 & \text{if } s \ge 0 \end{cases}$

Output layer:
  g:        $\delta^{(L)} = g'(s)(h(x) - y)$
  Sigmoid:  $\delta^{(L)} = h(x) - y$
  Softmax:  $\boldsymbol{\delta}^{(L)} = \mathbf{h}(x) - \mathbf{y}$
39
Training NN with backpropagation
40
NN Classifier with one hidden layer
Layer 0 → Layer 1 (W1) → Layer 2 (W2)

Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Dimensions of matrices:
x: 1 × 2
W1: 2 × 5,  b1: 1 × 5,  s1, x1: 1 × 5
W2: 5 × 2,  b2: 1 × 2,  s2, x2: 1 × 2

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
41
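A minimal NumPy sketch of this forward pass for the 2-input, 5-hidden-unit, 2-class network; the weight values and the input example are random placeholders, not from the slides.

```python
# Forward pass of the one-hidden-layer classifier: tanh hidden layer, softmax output.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 5)), np.zeros((1, 5))   # layer 1 parameters
W2, b2 = rng.normal(size=(5, 2)), np.zeros((1, 2))   # layer 2 parameters

def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))     # row-wise, numerically stable
    return e / e.sum(axis=1, keepdims=True)

x = np.array([[0.3, -1.2]])          # one input example, shape 1 x 2

s1 = x @ W1 + b1                     # 1 x 5
x1 = np.tanh(s1)                     # 1 x 5
s2 = x1 @ W2 + b2                    # 1 x 2
h = softmax(s2)                      # 1 x 2, class probabilities

print(h, h.sum())                    # probabilities sum to 1
```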
NN Classifier with one hidden layer
Layer 0 → Layer 1 (W1) → Layer 2 (W2), now with M input examples fed at once.

Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Dimensions of matrices (M = the number of input data):
x: M × 2
W1: 2 × 5,  b1: 1 × 5,  s1, x1: M × 5
W2: 5 × 2,  b2: 1 × 2,  s2, x2: M × 2

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
42
NN Classifier with one hidden layer
Layer 0 → Layer 1 (W1) → Layer 2 (W2)

Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Backward pass equations (general form: $\delta_i^{(l)} = g'(s_i^{(l)}) \sum_j w_{ij}^{(l+1)} \delta_j^{(l+1)}$):
δ2 = h − y
δ1 = g′(s1) δ2 W2ᵀ = (1 − tanh²(s1)) ∘ (δ2 W2ᵀ)     (∘ is the element-wise product)

Dimensions of matrices:
δ2, x2: M × 2
W2: 5 × 2
δ1, s1: M × 5

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
44
NN Classifier with one hidden layer
Forward pass equations:
s1 = x W1 + b1
x1 = tanh(s1)
s2 = x1 W2 + b2
h = x2 = softmax(s2)

Backward pass equations:
δ2 = h − y
δ1 = g′(s1) δ2 W2ᵀ = (1 − tanh²(s1)) ∘ (δ2 W2ᵀ)

Relevant gradients:
∂J/∂W2 = x1ᵀ δ2
∂J/∂b2 = δ2
∂J/∂W1 = xᵀ δ1
∂J/∂b1 = δ1

Dimensions of matrices:
δ2, h: M × 2
W2: 5 × 2
δ1, s1: M × 5

Let x_i be the forward output of layer i (here x1, x2).
Let δ_i be the backward output of layer i (here δ1, δ2).
45
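Putting the forward pass, backward pass, and these gradients together, here is a runnable NumPy sketch of one gradient-descent training loop for the same 2-5-2 classifier. The data set, learning rate, and iteration count are made-up placeholders, and the bias gradients are summed over the M examples (the slides list them per example).

```python
# Full training loop for the one-hidden-layer classifier (tanh + softmax).
import numpy as np

rng = np.random.default_rng(0)
M = 8                                            # number of input examples
x = rng.normal(size=(M, 2))                      # M x 2 inputs (toy data)
y = np.zeros((M, 2))                             # M x 2 one-hot labels:
y[np.arange(M), (x[:, 0] > 0).astype(int)] = 1   #   class = sign of the first feature

W1, b1 = rng.normal(scale=0.5, size=(2, 5)), np.zeros((1, 5))
W2, b2 = rng.normal(scale=0.5, size=(5, 2)), np.zeros((1, 2))
alpha = 0.1

def softmax(s):
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for step in range(2000):
    # Forward pass
    s1 = x @ W1 + b1
    x1 = np.tanh(s1)
    s2 = x1 @ W2 + b2
    h = softmax(s2)

    # Backward pass
    d2 = h - y                                   # M x 2
    d1 = (1 - np.tanh(s1) ** 2) * (d2 @ W2.T)    # M x 5

    # Gradients and gradient-descent update (bias gradients summed over examples)
    W2 -= alpha * (x1.T @ d2)
    b2 -= alpha * d2.sum(axis=0, keepdims=True)
    W1 -= alpha * (x.T @ d1)
    b1 -= alpha * d1.sum(axis=0, keepdims=True)

print((h.argmax(axis=1) == y.argmax(axis=1)).mean())   # training accuracy
```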