
Recurrent Neural Networks (RNN)

Why Recurrent Neural Networks

So far we have seen networks such as densely connected networks and convolutional neural networks.

Each input shown to them is processed independently.

Tasks such as language translation, natural language processing (NLP), speech recognition, and image captioning work on sequences: here the data is a sequence.

Traditional neural networks cannot process this kind of sequential data, which is a major shortcoming.
Limitations of Feed Forward Neural Networks

1. They consider only the present input
– In feed forward networks, a previously processed image (cat) has no effect on the next image (dog).

2. They take a fixed-sized vector as input
– They accept a fixed-sized vector as input (e.g. an image) and produce a fixed-sized vector as output (e.g. probabilities of different classes).

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Limitations of Feed Forward Neural Networks

3. They cannot process variable-length data
– In a CNN, all images are rescaled to a fixed size, such as 64x64, before processing.

With sequential data, the input cannot be restricted to a fixed length.

So we need a new type of model to handle these problems.

Andrej Karpathy http://karpathy.github.io/2015/05/21/rnn-effectiveness/


Why Recurrent Neural Networks

Recurrent neural networks address this issue.

Humans don’t start their thinking from scratch every second.

As you read an essay, you understand each word based on your understanding of previous words.

You don’t throw everything away and start thinking from scratch again.

Our thoughts have persistence: we keep memories of what came before.
Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Networks

Biological intelligence processes information incrementally, while maintaining an internal model of what it’s processing, built from past information and constantly updated as new information comes in.

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Recurrent Neural Networks

Can you predict where this ball will go next? From a single snapshot you cannot; given its previous positions, you can.

Ava Soleimany MIT 6.S191 Deep Sequence Modeling


Recurrent Neural Networks
Text

The same idea applies to text, e.g. predicting what comes next in the sequence:

"RNN is a class of artificial neural networks"

"R N N i s ................."

Ava Soleimany MIT 6.S191 Deep Sequence Modeling


Recurrent Neural Networks

A chunk of neural network, A, looks at some input x_t and outputs a value h_t.

A loop allows information to be passed from one step of the network to the next.

h_t = A(h_{t-1}, x_t)

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/
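In code, this recurrence is just a loop that reuses the same step function at every position; a minimal numpy sketch (the tanh step and the sizes are assumptions for illustration, not the slide's specific network):

```python
import numpy as np

def rnn_step(h_prev, x, W, U, b):
    # One application of A: combine the previous state and the current input.
    return np.tanh(b + W @ h_prev + U @ x)

# Hypothetical sizes: 4 hidden units, 3 input features, 5 time steps.
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
h = np.zeros(4)                      # initial state h_0
for x_t in rng.normal(size=(5, 3)):  # the same A (same W, U, b) at every step
    h = rnn_step(h, x_t, W, U, b)
```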


Recurrent Neural Networks

A recurrent neural network can be thought of as multiple
copies of the same network, each passing a message
to a successor.
An unrolled recurrent neural network.

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Recurrent Neural Networks

The unrolled state can be seen as a function of the whole input history:

h_t = g_t(x_t, x_{t-1}, x_{t-2}, ..., x_2, x_1)

h_3 = A(A(A(h_0, x_1), x_2), x_3)
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
Recurrent Neural Networks

Let x_t be the value of the input at time t, and let h_t be the network output at time t.

Many recurrent neural networks use

h_t = f(h_{t-1}, x_t, θ)

to define the values of their hidden units; the variable h indicates that the state consists of the hidden units of the network.

[Figure: a recurrent cell f with input weights U, recurrent weights W, and output weights V.]
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
Recurrent Neural Networks

The unfolding process thus introduces two major
advantages:
– 1. Regardless of the sequence length, the learned
model always has the same input size, because it is
specified in terms of transition from one state to
another state, rather than specified in terms of a
variable-length history of states.
– 2. It is possible to use the same transition function f
with the same parameters at every time step.

Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
Recurrent Neural Networks

[Figure: the recurrence unrolled in time; the same parameters U (input), W (recurrent), and V (output) are used by the cell f at every time step.]

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/

This design lets an RNN:
1. Handle variable-length sequences
2. Share parameters across the sequence
3. Maintain information about order
4. Track long-term dependencies (though not arbitrarily long ones)
RNN - Forward Pass

The forward pass of an RNN is the same as that of a
multilayer perceptron with a single hidden layer

Except that activations arrive at the hidden layer from
both the current external input and the hidden layer
activations from the previous timestep.

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


[Figure: the computational graph of an RNN, unfolded in time. Input x_t enters through U, the hidden state h_t is carried forward through W, the output o_t is produced through V, and the prediction ŷ_t is compared with the target y_t to give the loss L_t.]
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
The forward-pass equations (shown on the unfolded graph above):

z_t = b + W h_{t-1} + U x_t
h_t = tanh(z_t)
o_t = c + V h_t
ŷ_t = softmax(o_t)
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
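These four equations translate directly into a numpy forward pass; the following sketch assumes toy dimensions and random data purely for illustration:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def rnn_forward(x_seq, h0, U, W, V, b, c):
    """z_t = b + W h_{t-1} + U x_t, h_t = tanh(z_t),
    o_t = c + V h_t, yhat_t = softmax(o_t)."""
    h, hs, yhats = h0, [], []
    for x_t in x_seq:
        h = np.tanh(b + W @ h + U @ x_t)
        o = c + V @ h
        hs.append(h)
        yhats.append(softmax(o))
    return np.array(hs), np.array(yhats)

# Toy dimensions: 3 inputs, 4 hidden units, 2 output classes, 6 time steps.
rng = np.random.default_rng(1)
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
b, c = np.zeros(4), np.zeros(2)
hs, yhats = rnn_forward(rng.normal(size=(6, 3)), np.zeros(4), U, W, V, b, c)
```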
Types of Recurrent Neural Networks

The Recurrent Neural Nets allow us to operate over
sequences of vectors: Sequences in the input, the
output, or in the most general case both.

Andrej Karpathy http://karpathy.github.io/2015/05/21/rnn-effectiveness/


Types of Recurrent Neural Networks
Feed forward network

Andrej Karpathy http://karpathy.github.io/2015/05/21/rnn-effectiveness/


Types of Recurrent Neural Networks

Examples: image caption generation (one input, a sequence of outputs), sentiment classification (a sequence of inputs, one output), language translation (a sequence of inputs, a sequence of outputs), and music generation (a sequence of inputs, a sequence of outputs).

Andrej Karpathy http://karpathy.github.io/2015/05/21/rnn-effectiveness/


Recall the forward pass:

z_t = b + W h_{t-1} + U x_t
h_t = tanh(z_t)
o_t = c + V h_t
ŷ_t = softmax(o_t)
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
Back Propagation Through Time (BPTT)

The parameters are updated by gradient descent on the total loss L:

V = V − α ∂L/∂V

W = W − α ∂L/∂W

U = U − α ∂L/∂U
Back Propagation Through Time (BPTT)

The loss is summed over time, so each gradient is also a sum over time steps:

∂L/∂V = ∂L_1/∂V + ∂L_2/∂V + ... + ∂L_n/∂V = Σ_{t=1}^{T} ∂L_t/∂V

∂L/∂W = Σ_{t=1}^{T} ∂L_t/∂W

∂L/∂U = Σ_{t=1}^{T} ∂L_t/∂U
Back Propagation Through Time (BPTT)

The total loss is simply the sum of the loss over all time steps:

L(θ) = Σ_{t=1}^{T} L_t(θ)

Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
BPTT - Gradient Calculations

Gradient with respect to the output weights V:

∂L/∂V = ∂L_1/∂V + ∂L_2/∂V + ... + ∂L_n/∂V = Σ_{t=1}^{T} ∂L_t/∂V

For a single time step, by the chain rule:

∂L_t/∂V = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂V)

Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
BPTT - Gradient Calculations

Gradient with respect to the recurrent weights W, by the chain rule:

∂L_t/∂W = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂h_t) (∂h_t/∂W)

Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
h_t = tanh(b + W h_{t-1} + U x_t)

h_{t-1} = tanh(b + W h_{t-2} + U x_{t-1})

Since h_t depends on W both directly and through h_{t-1}:

∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W)

(the first term is the immediate dependence of h_t on W, with h_{t-1} held fixed). Expanding ∂h_{t-1}/∂W one step further back:

∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) { ∂h_{t-1}/∂W + (∂h_{t-1}/∂h_{t-2}) (∂h_{t-2}/∂W) }

[Figure: the chain h_0 → ... → h_{t-2} → h_{t-1} → h_t → h_{t+1}, every step sharing the same W.]
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
Multiplying out:

∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W) + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂h_{t-2}) (∂h_{t-2}/∂W) + ...
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
BPTT - Gradient Calculations

Using (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂h_{t-2}) = ∂h_t/∂h_{t-2}, the terms collapse into a sum over all earlier steps r:

∂h_t/∂W = (∂h_t/∂h_t) (∂h_t/∂W) + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W) + (∂h_t/∂h_{t-2}) (∂h_{t-2}/∂W) + ... = Σ_{r=1}^{t} (∂h_t/∂h_r) (∂h_r/∂W)

So the full gradient for one time step is

∂L_t/∂W = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂h_t) Σ_{r=1}^{t} (∂h_t/∂h_r) (∂h_r/∂W)
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
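A sketch of these gradient sums as code, assuming the tanh hidden units above and a softmax output with cross-entropy loss (so that ∂L_t/∂o_t = ŷ_t − y_t); `hs` and `yhats` are the arrays returned by the earlier forward-pass sketch, and biases are omitted for brevity:

```python
import numpy as np

def bptt_grads(x_seq, y_seq, hs, yhats, U, W, V):
    """Accumulate dL/dU, dL/dW, dL/dV over all time steps (biases omitted).
    Assumes tanh hidden units and a softmax + cross-entropy loss, so that
    dL_t/do_t = yhat_t - y_t, with y_t a one-hot target vector."""
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(W.shape[0])           # gradient arriving from step t+1
    for t in reversed(range(len(x_seq))):
        do = yhats[t] - y_seq[t]             # dL_t/do_t
        dV += np.outer(do, hs[t])            # dL_t/dV = (dL_t/do_t) h_t^T
        dh = V.T @ do + dh_next              # direct path plus paths through later steps
        dz = (1.0 - hs[t] ** 2) * dh         # through tanh: dh_t/dz_t = 1 - h_t^2
        h_prev = hs[t - 1] if t > 0 else np.zeros(W.shape[0])
        dW += np.outer(dz, h_prev)           # step t's contribution to dL/dW
        dU += np.outer(dz, x_seq[t])         # step t's contribution to dL/dU
        dh_next = W.T @ dz                   # pass gradient back to h_{t-1}
    return dU, dW, dV
```

With the forward sketch above, `bptt_grads(x_seq, y_onehots, hs, yhats, U, W, V)` returns the three summed gradients that the update rules on the earlier slide would use.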
BPTT - Gradient Calculations

Taking the gradient with respect to U is similar to doing it for W, since both require taking sequential derivatives of the h_t vector:

∂L_t/∂U = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂h_t) (∂h_t/∂U)

∂L_t/∂U = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂h_t) Σ_{r=1}^{t} (∂h_t/∂h_r) (∂h_r/∂U)

Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
The Problem of Long Term Dependencies

One of the appeals of RNNs is the idea that they might
be able to connect previous information to the present
task

previous video frames might inform the understanding
of the present frame.

But can they?

Reference: http://proceedings.mlr.press/v28/pascanu13.pdf
The Problem of Long Term Dependencies

I used to live in France and I learned how to speak............


Vanishing/Exploding Gradient Problem in RNN

I grew up in France…..............................I speak fluent ............


Difficulty of training RNNs

There are two widely known issues with properly
training recurrent neural networks with longer
sequences
1. Vanishing gradient problem
2. Exploding gradient problem


In practice many people truncate the backpropagation
to a few steps.
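One common form of truncation, sketched under the assumption that the `rnn_forward` and `bptt_grads` helpers from the earlier sketches exist: process the sequence in windows of k steps, carry the hidden state across windows, but never backpropagate past a window boundary. The window length k, learning rate `lr`, and data arrays are illustrative.

```python
# Truncated BPTT sketch: backpropagate through at most k steps at a time.
# Assumes the rnn_forward / bptt_grads sketches above, data x_seq / y_seq
# (one-hot targets), parameters U, W, V, b, c, and an initial state h0.
k, lr = 20, 0.01                        # window length and learning rate (illustrative)
for start in range(0, len(x_seq), k):
    x_win = x_seq[start:start + k]
    y_win = y_seq[start:start + k]
    hs, yhats = rnn_forward(x_win, h0, U, W, V, b, c)       # forward over the window
    dU, dW, dV = bptt_grads(x_win, y_win, hs, yhats, U, W, V)
    U -= lr * dU; W -= lr * dW; V -= lr * dV                # SGD update per window
    h0 = hs[-1]                         # carry the state forward, but not the gradient
```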
Difficulty of training RNNs

Consider a simple recurrent network (SRN) that computes a single output ŷ, with the initial state h_0 initialised with zeros.

The state of the network at time t is given by

h_t = σ(b + W h_{t-1} + U x_t)

When the network computes a categorical distribution, its output is given by

ŷ = softmax(c + V h_t)

[Figure: an SRN "unrolled" for four time steps (t ∈ [1, 4]), with shared U and W and the single output ŷ produced from the final state through V.]
Reference: https://jmlr.org/papers/volume21/18-141/18-141.pdf
Back Propagation Through Time (BPTT)

V = V − α ∂L/∂V

W = W − α ∂L/∂W

U = U − α ∂L/∂U
Difficulty of training RNNs

∂L/∂W = (∂L/∂ŷ) (∂ŷ/∂h_t) (∂h_t/∂W)

For example, for t = 4:

∂h_4/∂W = ∂h_4/∂W + (∂h_4/∂h_3) (∂h_3/∂W) + (∂h_4/∂h_3) (∂h_3/∂h_2) (∂h_2/∂W) + (∂h_4/∂h_3) (∂h_3/∂h_2) (∂h_2/∂h_1) (∂h_1/∂W)

∂L/∂W = (∂L/∂ŷ) (∂ŷ/∂h_t) Σ_{k=1}^{t} (∂h_t/∂h_k) (∂h_k/∂W)
Mitesh M. Khapra CS7015: Deep Learning NPTEL


Difficulty of training RNNs

∂L/∂W = (∂L/∂ŷ) (∂ŷ/∂h_t) Σ_{k=1}^{t} (∂h_t/∂h_k) (∂h_k/∂W), with e.g. ∂h_4/∂h_1 = (∂h_4/∂h_3) (∂h_3/∂h_2) (∂h_2/∂h_1)

In general, the factor connecting step k to step t is a product of Jacobians:

∂h_t/∂h_k = Π_{i=k+1}^{t} ∂h_i/∂h_{i-1}

With h_t = σ(b + W h_{t-1} + U x_t), i.e. h_i = σ(z_i) and z_i = b + W h_{i-1} + U x_i:

∂h_t/∂h_k = Π_{i=k+1}^{t} (∂h_i/∂z_i) (∂z_i/∂h_{i-1})

∂h_i/∂z_i = diag(σ'(z_i))        ∂h_i/∂h_{i-1} = diag(σ'(z_i)) W
Mitesh M. Khapra CS7015: Deep Learning NPTEL
Difficulty of training RNNs

‖∂h_i/∂h_{i-1}‖ = ‖diag(σ'(z_i)) W‖ ≤ ‖diag(σ'(z_i))‖ ‖W‖ ≤ γ ‖W‖ ≤ γ λ

where γ bounds the derivative of the activation and λ is a bound on ‖W‖:

σ'(z_i) ≤ 1/4 = γ   [if σ is the sigmoid]

σ'(z_i) ≤ 1 = γ     [if σ is tanh]
Mitesh M. Khapra CS7015: Deep Learning NPTEL


Tanh Activation Function

Since the tanh output is centered around 0, its derivatives are higher than the sigmoid's.

For tanh, inputs in [-1, 1] give derivatives in [0.42, 1]; for the sigmoid, inputs in [0, 1] give derivatives in [0.20, 0.25].

Reference: https://stats.stackexchange.com/questions/101560/tanh-activation-function-vs-sigmoid-activation-function
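A quick numerical check of those ranges (a small sketch; the derivatives are simply evaluated on a grid of inputs):

```python
import numpy as np

x_tanh = np.linspace(-1.0, 1.0, 1001)            # tanh inputs in [-1, 1]
tanh_grad = 1.0 - np.tanh(x_tanh) ** 2           # d/dx tanh(x)

x_sig = np.linspace(0.0, 1.0, 1001)              # sigmoid inputs in [0, 1]
sig = 1.0 / (1.0 + np.exp(-x_sig))
sig_grad = sig * (1.0 - sig)                     # d/dx sigmoid(x)

print(tanh_grad.min(), tanh_grad.max())          # ~0.42 and 1.00
print(sig_grad.min(), sig_grad.max())            # ~0.20 and 0.25
```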
Difficulty of training RNNs

‖∂h_t/∂h_k‖ = ‖ Π_{i=k+1}^{t} ∂h_i/∂h_{i-1} ‖ ≤ Π_{i=k+1}^{t} γ λ ≤ (γ λ)^{t−k}

If (γλ) < 1, the gradient will vanish.

If (γλ) > 1, the gradient can explode.
Mitesh M. Khapra CS7015: Deep Learning NPTEL
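A small numpy sketch of this bound: evaluate (γλ)^{t−k} for a recurrent weight matrix with small and with large norm (the sizes and scales are arbitrary illustrations):

```python
import numpy as np

# Evaluate the bound (gamma * lambda)^(t-k) for two recurrent weight matrices.
# gamma = 1 for tanh; lambda is taken as the largest singular value of W.
rng = np.random.default_rng(0)
n, gamma = 16, 1.0
for scale in (0.05, 0.5):                        # illustrative weight scales
    W = rng.normal(scale=scale, size=(n, n))
    lam = np.linalg.norm(W, 2)                   # spectral norm of W
    for span in (1, 10, 50):                     # span = t - k
        print(f"scale={scale} t-k={span} bound={(gamma * lam) ** span:.3e}")
# For the small W the bound collapses toward 0 (vanishing gradients);
# for the large W it grows without limit (exploding gradients are possible).
```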
Truncated BPTT
LSTM Networks
SimpleRNN

All recurrent neural networks have the form of a chain of repeating modules of neural network.

This lets them maintain information in 'memory' over time.

But it can be difficult to train standard RNNs to solve problems that require learning long-term temporal dependencies.

This is because of the vanishing/exploding gradient problem.
Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/
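For context, in Keras the simple recurrent cell and the LSTM cell discussed next are drop-in alternatives; a minimal, assumed setup for a binary sequence classifier might look like this (layer sizes, vocabulary size, and the task are illustrative, not from the slides):

```python
from tensorflow.keras import layers, models

def make_model(cell="simple_rnn", vocab_size=10000):
    """Same surrounding architecture; only the recurrent layer changes."""
    rnn = layers.SimpleRNN(32) if cell == "simple_rnn" else layers.LSTM(32)
    return models.Sequential([
        layers.Input(shape=(None,)),            # variable-length integer sequences
        layers.Embedding(vocab_size, 64),       # token ids -> 64-d vectors
        rnn,                                    # final hidden state only
        layers.Dense(1, activation="sigmoid"),  # e.g. a binary sentiment label
    ])

model = make_model("lstm")
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
```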
Simple RNN

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


LSTM Networks

Long Short Term Memory networks – usually just
called “LSTMs”

Special kind of RNN

LSTMs are explicitly designed to avoid the long-term
dependency problem.

Remembering information for long periods of time is
practically their default behavior

https://arxiv.org/pdf/1412.3555.pdf
Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/
The Core Idea Behind LSTMs

The key to LSTMs is the cell state(C), the horizontal
line running through the top of the diagram.

It runs straight down the entire chain, with only some
minor linear interactions. It’s very easy for information
to just flow along it unchanged.

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


The Core Idea Behind LSTMs

Unlike the traditional recurrent unit, which overwrites its content at each time-step,

the LSTM unit is able to decide whether to keep the existing memory via the introduced gates.

Intuitively, if the LSTM unit detects an important feature from an input sequence at an early stage,

it can easily carry this information (the existence of the feature) over a long distance,

hence capturing potential long-distance dependencies.
Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Step-by-Step LSTM Walk Through

Language model trying to predict the next word based
on all the previous ones.

The cell state(C) might include the gender of the
present subject, so that the correct pronouns can be
used.

When we see a new subject, we want to forget the
gender of the old subject.

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Forget Gate

The first step in our LSTM is to decide what information
we’re going to throw away from the cell state(C).

This decision is made by a sigmoid layer called the
“forget gate layer.”
● It looks at ht−1 and xt, and outputs a number between 0
and 1 for each number in the cell state Ct−1.

1 represents “completely keep this”

0 represents “completely get rid of this.”
Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Forget Gate

h_{t-1} = [1, 2, 3]
x_t = [4, 5, 6]

[h_{t-1}, x_t] = [1, 2, 3, 4, 5, 6]

If we have 3 units, we have 3 cell states, so W_f has shape 3 x 6 = 3 x (3+3):

W_f = [[ 0, 0, 0, 0, 0, -1],
       [ 5, 6, 7, 8, 9, 10],
       [ 3, 4, 5, 6, 7,  8]]

b_f = [1, 2, 3]

Reference: https://datascience.stackexchange.com/questions/19196/forget-layer-in-a-recurrent-neural-network-rnn/19269#19269
[Figure: the forget-gate computation. W_f · [h_{t-1}, x_t] = [-6, 175, 133]; adding b_f = [1, 2, 3] gives [-5, 177, 136]; applying σ gives f_t ≈ [0, 1, 1], which multiplies the old cell state C_{t-1} = [5, 5, 5].]
Forget Gate

W_f · [h_{t-1}, x_t] = [-6, 175, 133]

W_f · [h_{t-1}, x_t] + b_f = [-6, 175, 133] + [1, 2, 3] = [-5, 177, 136]

f_t = σ(W_f · [h_{t-1}, x_t] + b_f) = σ([-5, 177, 136]) ≈ [0, 1, 1]

Reference: https://datascience.stackexchange.com/questions/19196/forget-layer-in-a-recurrent-neural-network-rnn/19269#19269
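The same forget-gate arithmetic, reproduced in numpy with the toy values from this slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h_prev = np.array([1.0, 2, 3])
x_t = np.array([4.0, 5, 6])
concat = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t] = [1,2,3,4,5,6]

W_f = np.array([[0, 0, 0, 0, 0, -1],
                [5, 6, 7, 8, 9, 10],
                [3, 4, 5, 6, 7,  8]])
b_f = np.array([1.0, 2, 3])

f_t = sigmoid(W_f @ concat + b_f)               # sigma([-5, 177, 136]) ~= [0, 1, 1]
C_prev = np.array([5.0, 5, 5])
print(f_t * C_prev)                             # ~= [0, 5, 5]: the first slot is forgotten
```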
[Figure: the same forget-gate diagram, now showing the element-wise product C_{t-1} * f_t ≈ [0, 5, 5], which feeds into the cell-state update C_t = f_t * C_{t-1} + i_t * C̃_t.]
Input Gate

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Input Gate – Sigmoid Layer

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)

W_i = [[1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3]]

b_i = [1, 1, 1]

W_i · [h_{t-1}, x_t] = [21, 42, 63]; adding b_i gives [22, 43, 64].
Input Gate – Sigmoid Layer

W_i · [h_{t-1}, x_t] + b_i = [21, 42, 63] + [1, 1, 1] = [22, 43, 64]

i_t = σ(W_i · [h_{t-1}, x_t] + b_i) = σ([22, 43, 64]) ≈ [1, 1, 1]

Reference: https://statisticalinterference.wordpress.com/2017/05/27/lstms-in-excruciating-detail/
Input Gate – Tanh Layer

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

W_C = [[ 1,  1,  1,  1,  1,  1],
       [ 2,  2,  2,  2,  2,  2],
       [-3, -3, -3, -3, -3, -3]]

b_C = [1, 1, 1]
Input Gate – Tanh Layer

W_C · [h_{t-1}, x_t] + b_C = [21, 42, -63] + [1, 1, 1] = [22, 43, -62]

C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C) = tanh([22, 43, -62]) ≈ [1, 1, -1]

Reference: https://statisticalinterference.wordpress.com/2017/05/27/lstms-in-excruciating-detail/
Input Gate

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Input Gate – Update Cell State

C_t = f_t * C_{t-1} + i_t * C̃_t

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Input Gate – Update Cell State

C_{t-1} = [5, 5, 5], f_t ≈ [0, 1, 1], i_t ≈ [1, 1, 1], C̃_t ≈ [1, 1, -1]

C_t = f_t * C_{t-1} + i_t * C̃_t = [0, 1, 1] * [5, 5, 5] + [1, 1, 1] * [1, 1, -1] = [1, 6, 4]

Reference: https://statisticalinterference.wordpress.com/2017/05/27/lstms-in-excruciating-detail/
Output Gate

Finally, we need to decide what we’re going to output.
This output will be based on our cell state, but will be a
filtered version.

First, we run a sigmoid layer which decides what parts
of the cell state we’re going to output.

Then, we put the cell state through tanh (to push the
values to be between −1 and 1) and multiply it by the
output of the sigmoid gate, so that we only output the
parts we decided to.

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Output Gate

C_t = [1, 6, 4]

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Output Gate

C_t = [1, 6, 4]

tanh(C_t) ≈ [0.76, 0.9999, 0.9993]

Suppose o_t = [0, 0.5, 1]; then
h_t = o_t * tanh(C_t) = [0 * 0.76, 0.5 * 0.9999, 1 * 0.9993] ≈ [0, 0.5, 0.999]

Reference: https://statisticalinterference.wordpress.com/2017/05/27/lstms-in-excruciating-detail/
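Putting the whole worked example together as a single LSTM step in numpy; the weight matrices are the toy values from the previous slides, and the output-gate activation o_t = [0, 0.5, 1] is simply assumed, as on this slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h_prev = np.array([1.0, 2, 3])
x_t = np.array([4.0, 5, 6])
C_prev = np.array([5.0, 5, 5])
concat = np.concatenate([h_prev, x_t])

W_f = np.array([[0, 0, 0, 0, 0, -1], [5, 6, 7, 8, 9, 10], [3, 4, 5, 6, 7, 8]])
b_f = np.array([1.0, 2, 3])
W_i = np.array([[1] * 6, [2] * 6, [3] * 6]);  b_i = np.ones(3)
W_C = np.array([[1] * 6, [2] * 6, [-3] * 6]); b_C = np.ones(3)

f_t = sigmoid(W_f @ concat + b_f)        # ~= [0, 1, 1]    forget gate
i_t = sigmoid(W_i @ concat + b_i)        # ~= [1, 1, 1]    input gate
C_tilde = np.tanh(W_C @ concat + b_C)    # ~= [1, 1, -1]   candidate values
C_t = f_t * C_prev + i_t * C_tilde       # ~= [1, 6, 4]    new cell state
o_t = np.array([0.0, 0.5, 1.0])          # assumed output-gate activation (as on the slide)
h_t = o_t * np.tanh(C_t)                 # ~= [0, 0.5, 1.0] new hidden state
print(C_t, h_t)
```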
Output Gate

For the language model example, since it just saw a
subject, it might want to output information relevant to
a verb, in case that’s what is coming next.

For example, it might output whether the subject is
singular or plural, so that we know what form a verb
should be conjugated into if that’s what follows next.

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Step-by-Step LSTM Walk Through

The hidden state (h) is the overall state of what we have seen so far.

The cell state (C) is a selective memory of the past.

The parameters that compute both states are learned from the data.

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/


LSTM Summary

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

The most prominent feature of the LSTM is the additive component of its update from t-1 to t, which is lacking in the traditional recurrent unit.

The traditional recurrent unit always replaces the activation, i.e. the content of a unit, with a new value computed from the current input and the previous hidden state.

The LSTM unit, on the other hand, keeps the existing content and adds the new content on top of it.
Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM Summary

This additive nature has two advantages.

First, it is easy for each unit to remember the existence of a specific feature in the input stream for a long series of steps. Any important feature will not be overwritten but will be maintained as it is.

Second, and perhaps more importantly, this addition effectively creates shortcut paths that bypass multiple temporal steps.

These shortcuts allow the error to be back-propagated easily, without vanishing too quickly (if the gating unit is nearly saturated at 1) as a result of passing through multiple bounded nonlinearities, thus reducing the difficulty due to vanishing gradients.

Christopher Olah https://colah.github.io/posts/2015-08-Understanding-LSTMs/
C_t = f_t * C_{t-1} + i_t * C̃_t
h_t = o_t * tanh(C_t)
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

[Figure: the unfolded computational graph (x_t → h_t → o_t → ŷ_t → L_t with shared parameters U, W, V), now with LSTM cells in place of the simple recurrent units.]