RNN
● In sequential data, the input cannot be assumed to have a fixed length.
● So we need a new type of model, the Recurrent Neural Network (RNN), to handle this kind of problem.
The hidden state is updated recursively from the previous state and the current input,
h_t = A(h_{t-1}, x_t)
so the state at time t depends on the entire input history:
h_t = g_t(x_t, x_{t-1}, x_{t-2}, ..., x_2, x_1)
h_3 = A(A(A(h_0, x_1), x_2), x_3)
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
Recurrent Neural Networks
● Let x_t be the value of the input at time t, and let h_t be the state (the hidden-layer output) of the network at time t.
● Many recurrent neural networks use
  h_t = f(h_{t-1}, x_t; θ)
  to define the values of their hidden units; the variable h is used to indicate that the state consists of the hidden units of the network.
[Figure: a folded RNN cell f, with input weights U, recurrent weights W, and output weights V.]
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
Recurrent Neural Networks
● The unfolding process thus introduces two major advantages:
– 1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of the transition from one state to another, rather than in terms of a variable-length history of states.
– 2. It is possible to use the same transition function f with the same parameters at every time step.
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
Recurrent Neural Networks
[Figure: two copies of an RNN unrolled in time; the same cell f and the same weights U (input), W (recurrent), and V (output) are reused at every time step.]
– 3. The unrolled network maintains information about the order of the inputs.
– 4. It can track long-term dependencies (although, in practice, not very long ones).
RNN - Forward Pass
● The forward pass of an RNN is the same as that of a multilayer perceptron with a single hidden layer,
● except that activations arrive at the hidden layer from both the current external input and the hidden-layer activations of the previous timestep.
[Figure: an RNN and its unfolding in time. At each step t, the input x_t feeds the hidden state h_t through U, h_{t-1} feeds h_t through W, h_t feeds the output o_t through V, ŷ_t = softmax(o_t), and the loss L_t compares ŷ_t with the target.]
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
z_t = b + W h_{t-1} + U x_t
h_t = tanh(z_t)
o_t = c + V h_t
ŷ_t = softmax(o_t)
[Figure: the same unfolded computational graph, annotated with these equations.]
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
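To make the forward-pass equations concrete, here is a minimal NumPy sketch of one unrolled pass; the sizes, random weights, and function names below are illustrative assumptions, not part of the reference text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 4 input features, 3 hidden units, 2 output classes.
n_in, n_hid, n_out = 4, 3, 2
U = rng.normal(size=(n_hid, n_in))    # input-to-hidden weights
W = rng.normal(size=(n_hid, n_hid))   # hidden-to-hidden (recurrent) weights
V = rng.normal(size=(n_out, n_hid))   # hidden-to-output weights
b = np.zeros(n_hid)                   # hidden bias
c = np.zeros(n_out)                   # output bias

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward(xs, h0=None):
    """Unrolled forward pass over a sequence xs of input vectors."""
    h = np.zeros(n_hid) if h0 is None else h0
    hs, yhats = [], []
    for x in xs:
        z = b + W @ h + U @ x          # z_t = b + W h_{t-1} + U x_t
        h = np.tanh(z)                 # h_t = tanh(z_t)
        o = c + V @ h                  # o_t = c + V h_t
        hs.append(h)
        yhats.append(softmax(o))       # yhat_t = softmax(o_t)
    return hs, yhats

# Example: run the RNN over a sequence of 5 random input vectors.
hs, yhats = forward([rng.normal(size=n_in) for _ in range(5)])
print(yhats[-1])                       # predicted distribution at the last time step
```

Note how the same U, W, and V are applied at every time step, which is exactly the parameter sharing described above.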
Types of Recurrent Neural Networks
● Recurrent neural nets allow us to operate over sequences of vectors: sequences in the input, in the output, or, in the most general case, in both.
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
Backpropagation Through Time (BPTT)
Each weight matrix is updated by gradient descent on the total loss L:
V ← V − α ∂L/∂V
W ← W − α ∂L/∂W
U ← U − α ∂L/∂U
Backpropagation Through Time (BPTT)
The gradient of the total loss decomposes over time steps:
∂L/∂V = ∂L_1/∂V + ∂L_2/∂V + ... + ∂L_n/∂V,   i.e.   ∂L/∂V = Σ_{t=1}^T ∂L_t/∂V
∂L/∂W = Σ_{t=1}^T ∂L_t/∂W
∂L/∂U = Σ_{t=1}^T ∂L_t/∂U
and each matrix is updated with its accumulated gradient:
V ← V − α ∂L/∂V,   W ← W − α ∂L/∂W,   U ← U − α ∂L/∂U
Backpropagation Through Time (BPTT)
● The total loss is simply the sum of the losses over all time steps:
  L(θ) = Σ_{t=1}^T L_t(θ)
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
BPTT - Gradient Calculations
∂L/∂V = ∂L_1/∂V + ∂L_2/∂V + ... + ∂L_n/∂V
∂L/∂V = Σ_{t=1}^T ∂L_t/∂V
∂L_t/∂V = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂V)
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
BPTT - Gradient Calculations
∂L_t/∂W = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂h_t) (∂h_t/∂W)
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
h_t = tanh(b + W h_{t-1} + U x_t)
Because h_{t-1} itself depends on W, the derivative expands recursively (the first term on each right-hand side denotes only the explicit dependence of h_t on W, with h_{t-1} held fixed):
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W)
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) {∂h_{t-1}/∂W + (∂h_{t-1}/∂h_{t-2}) (∂h_{t-2}/∂W)}
[Figure: the chain h_0 → ... → h_{t-2} → h_{t-1} → h_t → h_{t+1}, with W applied at every transition.]
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W)
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) {∂h_{t-1}/∂W + (∂h_{t-1}/∂h_{t-2}) (∂h_{t-2}/∂W)}
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W) + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂h_{t-2}) (∂h_{t-2}/∂W)
[Figure: the same chain h_0 → ... → h_{t-2} → h_{t-1} → h_t → h_{t+1}, with W applied at every transition.]
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
BPTT - Gradient Calculations
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W) + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂h_{t-2}) (∂h_{t-2}/∂W)
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W) + (∂h_t/∂h_{t-2}) (∂h_{t-2}/∂W)
∂h_t/∂W = (∂h_t/∂h_t) (∂h_t/∂W) + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W) + (∂h_t/∂h_{t-2}) (∂h_{t-2}/∂W) + ... = Σ_{r=1}^t (∂h_t/∂h_r) (∂h_r/∂W)
∂L_t/∂W = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂h_t) Σ_{r=1}^t (∂h_t/∂h_r) (∂h_r/∂W)
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
BPTT - Gradient Calculations
● Taking the gradient with respect to U is similar to taking it for W, since both require sequential derivatives of the hidden-state vector h_t.
∂L_t/∂U = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂h_t) (∂h_t/∂U)
∂L_t/∂U = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂h_t) Σ_{r=1}^t (∂h_t/∂h_r) (∂h_r/∂U)
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
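The sums over r above are normally evaluated with a single backward recursion rather than term by term. The NumPy sketch below implements that recursion for the tanh/softmax RNN defined earlier; the cross-entropy loss with one integer target per time step, the sizes, and the random data are all illustrative assumptions.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def bptt(xs, targets, U, W, V, b, c):
    """Forward pass, then accumulate dL/dU, dL/dW, dL/dV over all time steps."""
    T = len(xs)
    hs = [np.zeros(W.shape[0])]                 # h_0 = 0
    yhats = []
    for x in xs:                                # forward pass
        hs.append(np.tanh(b + W @ hs[-1] + U @ x))
        yhats.append(softmax(c + V @ hs[-1]))

    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(W.shape[0])              # gradient flowing back from step t+1
    for t in reversed(range(T)):
        do = yhats[t].copy()
        do[targets[t]] -= 1.0                   # dL_t/do_t for softmax + cross-entropy
        dV += np.outer(do, hs[t + 1])           # (dL_t/dyhat_t)(dyhat_t/do_t)(do_t/dV)
        dh = V.T @ do + dh_next                 # direct path plus paths through later steps
        dz = (1.0 - hs[t + 1] ** 2) * dh        # back through tanh
        dW += np.outer(dz, hs[t])               # accumulates the sum over r <= t implicitly
        dU += np.outer(dz, xs[t])
        dh_next = W.T @ dz                      # pass the gradient on to h_{t-1}
    return dU, dW, dV

# Illustrative usage with random data (all sizes and values are assumptions).
rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 4, 3, 2, 5
U = rng.normal(size=(n_hid, n_in))
W = rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)
xs = [rng.normal(size=n_in) for _ in range(T)]
targets = rng.integers(0, n_out, size=T)
dU, dW, dV = bptt(xs, targets, U, W, V, b, c)
print(dU.shape, dW.shape, dV.shape)             # gradients match the parameter shapes
```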
The Problem of Long Term Dependencies
● One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task.
● For example, previous video frames might inform the understanding of the present frame.
● But can they?
Reference: http://proceedings.mlr.press/v28/pascanu13.pdf
The Problem of Long Term Dependencies
● In practice, many people truncate the backpropagation to a few steps (truncated BPTT), as sketched below.
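A minimal sketch of one common form of truncation (segment-wise truncated BPTT), reusing the backward recursion from the earlier BPTT sketch: the hidden state is carried forward between chunks, but no gradient flows back across a chunk boundary. The chunk length k, the layer sizes, and the random data are illustrative assumptions.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def truncated_bptt_chunk(xs, targets, h0, U, W, V, b, c):
    """BPTT restricted to one chunk: h0 is treated as a constant, so no gradient
    flows back past the chunk boundary (this is the truncation)."""
    hs, yhats = [h0], []
    for x in xs:
        hs.append(np.tanh(b + W @ hs[-1] + U @ x))
        yhats.append(softmax(c + V @ hs[-1]))
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros_like(h0)
    for t in reversed(range(len(xs))):
        do = yhats[t].copy()
        do[targets[t]] -= 1.0                       # softmax + cross-entropy
        dV += np.outer(do, hs[t + 1])
        dz = (1.0 - hs[t + 1] ** 2) * (V.T @ do + dh_next)
        dW += np.outer(dz, hs[t])
        dU += np.outer(dz, xs[t])
        dh_next = W.T @ dz
    return (dU, dW, dV), hs[-1]

# Illustrative usage: a length-20 sequence processed in chunks of k = 5 steps.
rng = np.random.default_rng(0)
n_in, n_hid, n_out, T, k, lr = 4, 3, 2, 20, 5, 0.1
U = rng.normal(size=(n_hid, n_in))
W = rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)
xs = [rng.normal(size=n_in) for _ in range(T)]
targets = rng.integers(0, n_out, size=T)

h = np.zeros(n_hid)                                 # hidden state carried across chunks
for s in range(0, T, k):
    (dU, dW, dV), h = truncated_bptt_chunk(xs[s:s + k], targets[s:s + k], h, U, W, V, b, c)
    U, W, V = U - lr * dU, W - lr * dW, V - lr * dV  # update after every chunk
print(np.round(h, 3))                               # final carried hidden state
```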
Difficulty of training RNNs
● The network computes a single output ŷ, and the initial state h_0 is initialised with zeros.
● The state of the network at time t is given by
  h_t = σ(b + W h_{t-1} + U x_t)
● When the network computes a categorical distribution, its output is given by
  ŷ = softmax(c + V h_t)
[Figure: an SRN "unrolled" for four time steps (t ∈ [1, 4]): inputs x_1 ... x_4 enter through U, states h_0 ... h_4 are chained through W, and the single output ŷ is read out through V.]
Reference: https://jmlr.org/papers/volume21/18-141/18-141.pdf
Backpropagation Through Time (BPTT)
V ← V − α ∂L/∂V
W ← W − α ∂L/∂W
U ← U − α ∂L/∂U
Difficulty of training RNNs
∂L/∂W = (∂L/∂ŷ) (∂ŷ/∂h_t) (∂h_t/∂W)
∂h_4/∂W = ∂h_4/∂W + (∂h_4/∂h_3)(∂h_3/∂W) + (∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂W) + (∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W)
∂L/∂W = (∂L/∂ŷ) (∂ŷ/∂h_t) Σ_{k=1}^t (∂h_t/∂h_k) (∂h_k/∂W)
‖∂h_i/∂h_{i-1}‖ ≤ ‖diag(σ'(z_i))‖ ‖W‖ ≤ γ λ      (where γ bounds σ' and λ bounds ‖W‖)
σ'(z_i) ≤ 1/4 = γ   [if σ is sigmoid]
σ'(z_i) ≤ 1 = γ   [if σ is tanh]
For tanh, an input in [−1, 1] gives a derivative in [0.42, 1]; for the sigmoid, an input in [0, 1] gives a derivative in [0.20, 0.25].
Reference: https://stats.stackexchange.com/questions/101560/tanh-activation-function-vs-sigmoid-activation-function
Difficulty of training RNNs
‖∂h_t/∂h_k‖ = ‖ ∏_{i=k+1}^t (∂h_i/∂h_{i-1}) ‖ ≤ ∏_{i=k+1}^t γ λ ≤ (γ λ)^(t−k)
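A quick numerical illustration of this bound; γ is the sigmoid derivative bound from above, and the two values of λ (the assumed bound on ‖W‖) are made up only to show the two regimes.

```python
# Evaluate the bound ||∂h_t/∂h_k|| <= (gamma*lam)^(t-k) for a few time gaps.
gamma = 0.25                       # bound on sigma'(z) for the sigmoid
for lam in (2.0, 8.0):             # assumed bounds on ||W||, giving gamma*lam = 0.5 and 2.0
    bounds = [(gamma * lam) ** gap for gap in (1, 5, 10, 20)]
    print(f"gamma*lam = {gamma * lam}:", [f"{bound:.2e}" for bound in bounds])
# gamma*lam < 1: the bound decays geometrically with the gap (vanishing gradients);
# gamma*lam > 1: the bound permits geometric growth (exploding gradients).
```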
Reference: https://arxiv.org/pdf/1412.3555.pdf
Reference: Christopher Olah, https://colah.github.io/posts/2015-08-Understanding-LSTMs/
The Core Idea Behind LSTMs
● The key to LSTMs is the cell state (C), the horizontal line running through the top of the diagram.
● It runs straight down the entire chain, with only some minor linear interactions, so it is very easy for information to just flow along it unchanged.
Worked example: let [h_{t-1}, x_t] = [1, 2, 3, 4, 5, 6]. If we have 3 units, we have 3 cell states, and w_f is 3 × 6 = 3 × (3 + 3):
w_f = [  0  0  0  0  0 −1
         5  6  7  8  9 10
         3  4  5  6  7  8 ]
b_f = [1, 2, 3]
Reference: https://datascience.stackexchange.com/questions/19196/forget-layer-in-a-recurrent-neural-network-rnn/19269#19269
Forget Gate
[Figure: forget-gate diagram. h_{t-1} = [1, 2, 3] and x_t = [4, 5, 6] are concatenated; multiplying by w_f gives [−6, 175, 133]; adding b_f = [1, 2, 3] gives [−5, 177, 136]; the sigmoid σ yields f_t ≈ [0, 1, 1], which will gate the previous cell state C_{t-1} = [5, 5, 5].]
Forget Gate
w_f · [h_{t-1}, x_t] = [−6, 175, 133]
w_f · [h_{t-1}, x_t] + b_f = [−6, 175, 133] + [1, 2, 3] = [−5, 177, 136]
f_t = σ(w_f · [h_{t-1}, x_t] + b_f) = σ([−5, 177, 136]) ≈ [0, 1, 1]
Reference: https://datascience.stackexchange.com/questions/19196/forget-layer-in-a-recurrent-neural-network-rnn/19269#19269
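The same forget-gate arithmetic in NumPy, as a direct transcription of the numbers above (the `sigmoid` helper is defined here, not something from the reference):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h_prev = np.array([1.0, 2.0, 3.0])
x_t = np.array([4.0, 5.0, 6.0])
hx = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t] = [1, 2, 3, 4, 5, 6]

w_f = np.array([[0, 0, 0, 0, 0, -1],
                [5, 6, 7, 8, 9, 10],
                [3, 4, 5, 6, 7, 8]], dtype=float)  # 3 x 6 forget-gate weights
b_f = np.array([1.0, 2.0, 3.0])

print(w_f @ hx)            # -> [-6, 175, 133]
print(w_f @ hx + b_f)      # -> [-5, 177, 136]
f_t = sigmoid(w_f @ hx + b_f)
print(f_t)                 # -> approx. [0.007, 1, 1], rounded to [0, 1, 1] on the slide
```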
[Figure: the forget gate applied to the old cell state.]
f_t ∗ C_{t-1} = [0, 1, 1] ∗ [5, 5, 5] = [0, 5, 5]
The new cell state will be C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t.
Input Gate
w_i = [ 1 1 1 1 1 1
        2 2 2 2 2 2
        3 3 3 3 3 3 ]
b_i = [1, 1, 1]
[Figure: input-gate (sigmoid) branch of the diagram, with h_{t-1} = [1, 2, 3] and x_t = [4, 5, 6]: w_i · [h_{t-1}, x_t] = [21, 42, 63]; adding b_i gives [22, 43, 64].]
Input Gate – Sigmoid Layer
w_i = [ 1 1 1 1 1 1
        2 2 2 2 2 2
        3 3 3 3 3 3 ],   b_i = [1, 1, 1]
w_i · [h_{t-1}, x_t] + b_i = [21, 42, 63] + [1, 1, 1] = [22, 43, 64]
i_t = σ(w_i · [h_{t-1}, x_t] + b_i) = σ([22, 43, 64]) ≈ [1, 1, 1]
Reference: https://statisticalinterference.wordpress.com/2017/05/27/lstms-in-excruciating-detail/
Input Gate – Tanh Layer
C̃_t = tanh(w_C · [h_{t-1}, x_t] + b_C)
w_C = [  1  1  1  1  1  1
         2  2  2  2  2  2
        −3 −3 −3 −3 −3 −3 ]
b_C = [1, 1, 1]
[Figure: candidate-value (tanh) branch of the diagram: w_C · [h_{t-1}, x_t] = [21, 42, −63]; adding b_C gives [22, 43, −62]; tanh gives C̃_t ≈ [1, 1, −1].]
Input Gate – Tanh Layer
w_C = [  1  1  1  1  1  1
         2  2  2  2  2  2
        −3 −3 −3 −3 −3 −3 ],   b_C = [1, 1, 1]
w_C · [h_{t-1}, x_t] + b_C = [21, 42, −63] + [1, 1, 1] = [22, 43, −62]
C̃_t = tanh(w_C · [h_{t-1}, x_t] + b_C) = tanh([22, 43, −62]) ≈ [1, 1, −1]
Reference: https://statisticalinterference.wordpress.com/2017/05/27/lstms-in-excruciating-detail/
Cell State Update
C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t
C_t = [0, 1, 1] ∗ [5, 5, 5] + [1, 1, 1] ∗ [1, 1, −1] = [0, 5, 5] + [1, 1, −1] = [1, 6, 4]
Reference: https://statisticalinterference.wordpress.com/2017/05/27/lstms-in-excruciating-detail/
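Putting the whole worked example together, a short NumPy sketch that reproduces C_t = [1, 6, 4] from the gate values computed on the previous slides (the `sigmoid` helper and variable names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hx = np.array([1.0, 2, 3, 4, 5, 6])                       # [h_{t-1}, x_t]
C_prev = np.array([5.0, 5, 5])                            # C_{t-1}

w_f = np.array([[0, 0, 0, 0, 0, -1],
                [5, 6, 7, 8, 9, 10],
                [3, 4, 5, 6, 7, 8]], dtype=float)
b_f = np.array([1.0, 2, 3])
w_i = np.array([[1] * 6, [2] * 6, [3] * 6], dtype=float)
b_i = np.array([1.0, 1, 1])
w_C = np.array([[1] * 6, [2] * 6, [-3] * 6], dtype=float)
b_C = np.array([1.0, 1, 1])

f_t = sigmoid(w_f @ hx + b_f)        # forget gate, approx. [0.007, 1, 1] ~ [0, 1, 1]
i_t = sigmoid(w_i @ hx + b_i)        # input gate, approx. [1, 1, 1]
C_tilde = np.tanh(w_C @ hx + b_C)    # candidate values, approx. [1, 1, -1]
C_t = f_t * C_prev + i_t * C_tilde   # C_t = f_t * C_{t-1} + i_t * C~_t
print(np.round(C_t))                 # -> [1, 6, 4]
```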
Output Gate
● Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version.
● First, we run a sigmoid layer which decides what parts of the cell state we're going to output.
● Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
C_t = [1, 6, 4]
Reference: https://statisticalinterference.wordpress.com/2017/05/27/lstms-in-excruciating-detail/
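The slides give no numbers for the output gate, so the sketch below uses hypothetical weights w_o and b_o purely to illustrate the mechanism: a sigmoid gate o_t filters tanh(C_t) to produce h_t.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hx = np.array([1.0, 2, 3, 4, 5, 6])        # [h_{t-1}, x_t] as in the worked example
C_t = np.array([1.0, 6, 4])                # cell state computed above

w_o = np.array([[1, 0, 0, 0, 0, 0],        # hypothetical output-gate weights
                [0, -1, 0, 0, 0, 0],
                [0, 0, 1, 0, 0, 0]], dtype=float)
b_o = np.zeros(3)                          # hypothetical output-gate bias

o_t = sigmoid(w_o @ hx + b_o)              # decide which parts of the state to expose
h_t = o_t * np.tanh(C_t)                   # h_t = o_t * tanh(C_t)
print(np.round(h_t, 3))
```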
Output Gate
● For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next.
● For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows next.