RNN
● In sequential data, the input cannot be assumed to have a fixed length.
● So we need a new type of model, the Recurrent Neural Network (RNN), to handle this kind of problem.
The hidden state is updated recursively from the previous state and the current input,
h_t = A(h_{t-1}, x_t)
so the state at time t depends on the entire input history:
h_t = g_t(x_t, x_{t-1}, x_{t-2}, ..., x_2, x_1)
h_3 = A(A(A(h_0, x_1), x_2), x_3)
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
Recurrent Neural Networks
● Let x_t be the value of the input at time t, and let h_t be the state (the hidden-layer output) of the network at time t.
● Many recurrent neural networks use
  h_t = f(h_{t-1}, x_t; θ)
  to define the values of their hidden units; the variable h is used to indicate that the state consists of the hidden units of the network.
[Figure: a folded RNN cell f, with input weights U, recurrent weights W, and output weights V.]
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
Recurrent Neural Networks
● The unfolding process thus introduces two major advantages:
– 1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of the transition from one state to another, rather than in terms of a variable-length history of states.
– 2. It is possible to use the same transition function f with the same parameters at every time step.
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
Recurrent Neural Networks
[Figure: two copies of an RNN unrolled in time; the same cell f and the same weights U (input), W (recurrent), and V (output) are reused at every time step.]
– 3. The unrolled network maintains information about the order of the inputs.
– 4. It can track long-term dependencies (although, in practice, not very long ones).
RNN - Forward Pass
● The forward pass of an RNN is the same as that of a multilayer perceptron with a single hidden layer,
● except that activations arrive at the hidden layer from both the current external input and the hidden-layer activations of the previous timestep.
[Figure: an RNN and its unfolding in time. At each step t, the input x_t feeds the hidden state h_t through U, h_{t-1} feeds h_t through W, h_t feeds the output o_t through V, ŷ_t = softmax(o_t), and the loss L_t compares ŷ_t with the target.]
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
z_t = b + W h_{t-1} + U x_t
h_t = tanh(z_t)
o_t = c + V h_t
ŷ_t = softmax(o_t)
[Figure: the same unfolded computational graph, annotated with these equations.]
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
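To make the forward-pass equations concrete, here is a minimal NumPy sketch of one unrolled pass; the sizes, random weights, and function names below are illustrative assumptions, not part of the reference text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 4 input features, 3 hidden units, 2 output classes.
n_in, n_hid, n_out = 4, 3, 2
U = rng.normal(size=(n_hid, n_in))    # input-to-hidden weights
W = rng.normal(size=(n_hid, n_hid))   # hidden-to-hidden (recurrent) weights
V = rng.normal(size=(n_out, n_hid))   # hidden-to-output weights
b = np.zeros(n_hid)                   # hidden bias
c = np.zeros(n_out)                   # output bias

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward(xs, h0=None):
    """Unrolled forward pass over a sequence xs of input vectors."""
    h = np.zeros(n_hid) if h0 is None else h0
    hs, yhats = [], []
    for x in xs:
        z = b + W @ h + U @ x          # z_t = b + W h_{t-1} + U x_t
        h = np.tanh(z)                 # h_t = tanh(z_t)
        o = c + V @ h                  # o_t = c + V h_t
        hs.append(h)
        yhats.append(softmax(o))       # yhat_t = softmax(o_t)
    return hs, yhats

# Example: run the RNN over a sequence of 5 random input vectors.
hs, yhats = forward([rng.normal(size=n_in) for _ in range(5)])
print(yhats[-1])                       # predicted distribution at the last time step
```

Note how the same U, W, and V are applied at every time step, which is exactly the parameter sharing described above.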
Types of Recurrent Neural Networks
● Recurrent neural nets allow us to operate over sequences of vectors: sequences in the input, in the output, or, in the most general case, in both.
Reference: Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
Backpropagation Through Time (BPTT)
Each weight matrix is updated by gradient descent on the total loss L:
V ← V − α ∂L/∂V
W ← W − α ∂L/∂W
U ← U − α ∂L/∂U
Backpropagation Through Time (BPTT)
The gradient of the total loss decomposes over time steps:
∂L/∂V = ∂L_1/∂V + ∂L_2/∂V + ... + ∂L_n/∂V,   i.e.   ∂L/∂V = Σ_{t=1}^T ∂L_t/∂V
∂L/∂W = Σ_{t=1}^T ∂L_t/∂W
∂L/∂U = Σ_{t=1}^T ∂L_t/∂U
and each matrix is updated with its accumulated gradient:
V ← V − α ∂L/∂V,   W ← W − α ∂L/∂W,   U ← U − α ∂L/∂U
Backpropagation Through Time (BPTT)
● The total loss is simply the sum of the losses over all time steps:
  L(θ) = Σ_{t=1}^T L_t(θ)
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
BPTT - Gradient Calculations
∂L/∂V = ∂L_1/∂V + ∂L_2/∂V + ... + ∂L_n/∂V
∂L/∂V = Σ_{t=1}^T ∂L_t/∂V
∂L_t/∂V = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂V)
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
BPTT - Gradient Calculations
∂L_t/∂W = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂h_t) (∂h_t/∂W)
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
h_t = tanh(b + W h_{t-1} + U x_t)
Because h_{t-1} itself depends on W, the derivative expands recursively (the first term on each right-hand side denotes only the explicit dependence of h_t on W, with h_{t-1} held fixed):
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W)
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) {∂h_{t-1}/∂W + (∂h_{t-1}/∂h_{t-2}) (∂h_{t-2}/∂W)}
[Figure: the chain h_0 → ... → h_{t-2} → h_{t-1} → h_t → h_{t+1}, with W applied at every transition.]
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W)
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) {∂h_{t-1}/∂W + (∂h_{t-1}/∂h_{t-2}) (∂h_{t-2}/∂W)}
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W) + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂h_{t-2}) (∂h_{t-2}/∂W)
[Figure: the same chain h_0 → ... → h_{t-2} → h_{t-1} → h_t → h_{t+1}, with W applied at every transition.]
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
BPTT - Gradient Calculations
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W) + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂h_{t-2}) (∂h_{t-2}/∂W)
∂h_t/∂W = ∂h_t/∂W + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W) + (∂h_t/∂h_{t-2}) (∂h_{t-2}/∂W)
∂h_t/∂W = (∂h_t/∂h_t) (∂h_t/∂W) + (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂W) + (∂h_t/∂h_{t-2}) (∂h_{t-2}/∂W) + ... = Σ_{r=1}^t (∂h_t/∂h_r) (∂h_r/∂W)
∂L_t/∂W = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂h_t) Σ_{r=1}^t (∂h_t/∂h_r) (∂h_r/∂W)
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
BPTT - Gradient Calculations
● Taking the gradient with respect to U is similar to taking it for W, since both require sequential derivatives of the hidden-state vector h_t.
∂L_t/∂U = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂h_t) (∂h_t/∂U)
∂L_t/∂U = (∂L_t/∂ŷ_t) (∂ŷ_t/∂o_t) (∂o_t/∂h_t) Σ_{r=1}^t (∂h_t/∂h_r) (∂h_r/∂U)
Reference: https://github.com/go2carter/nn-learn/blob/master/grad-deriv-tex/rnn-grad-deriv.pdf
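The sums over r above are normally evaluated with a single backward recursion rather than term by term. The NumPy sketch below implements that recursion for the tanh/softmax RNN defined earlier; the cross-entropy loss with one integer target per time step, the sizes, and the random data are all illustrative assumptions.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def bptt(xs, targets, U, W, V, b, c):
    """Forward pass, then accumulate dL/dU, dL/dW, dL/dV over all time steps."""
    T = len(xs)
    hs = [np.zeros(W.shape[0])]                 # h_0 = 0
    yhats = []
    for x in xs:                                # forward pass
        hs.append(np.tanh(b + W @ hs[-1] + U @ x))
        yhats.append(softmax(c + V @ hs[-1]))

    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(W.shape[0])              # gradient flowing back from step t+1
    for t in reversed(range(T)):
        do = yhats[t].copy()
        do[targets[t]] -= 1.0                   # dL_t/do_t for softmax + cross-entropy
        dV += np.outer(do, hs[t + 1])           # (dL_t/dyhat_t)(dyhat_t/do_t)(do_t/dV)
        dh = V.T @ do + dh_next                 # direct path plus paths through later steps
        dz = (1.0 - hs[t + 1] ** 2) * dh        # back through tanh
        dW += np.outer(dz, hs[t])               # accumulates the sum over r <= t implicitly
        dU += np.outer(dz, xs[t])
        dh_next = W.T @ dz                      # pass the gradient on to h_{t-1}
    return dU, dW, dV

# Illustrative usage with random data (all sizes and values are assumptions).
rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 4, 3, 2, 5
U = rng.normal(size=(n_hid, n_in))
W = rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)
xs = [rng.normal(size=n_in) for _ in range(T)]
targets = rng.integers(0, n_out, size=T)
dU, dW, dV = bptt(xs, targets, U, W, V, b, c)
print(dU.shape, dW.shape, dV.shape)             # gradients match the parameter shapes
```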
The Problem of Long Term Dependencies
● One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task.
● For example, previous video frames might inform the understanding of the present frame.
● But can they?
Reference: http://proceedings.mlr.press/v28/pascanu13.pdf
The Problem of Long Term Dependencies
● In practice, many people truncate the backpropagation to a few steps (truncated BPTT), as sketched below.
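A minimal sketch of one common form of truncation (segment-wise truncated BPTT), reusing the backward recursion from the earlier BPTT sketch: the hidden state is carried forward between chunks, but no gradient flows back across a chunk boundary. The chunk length k, the layer sizes, and the random data are illustrative assumptions.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def truncated_bptt_chunk(xs, targets, h0, U, W, V, b, c):
    """BPTT restricted to one chunk: h0 is treated as a constant, so no gradient
    flows back past the chunk boundary (this is the truncation)."""
    hs, yhats = [h0], []
    for x in xs:
        hs.append(np.tanh(b + W @ hs[-1] + U @ x))
        yhats.append(softmax(c + V @ hs[-1]))
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros_like(h0)
    for t in reversed(range(len(xs))):
        do = yhats[t].copy()
        do[targets[t]] -= 1.0                       # softmax + cross-entropy
        dV += np.outer(do, hs[t + 1])
        dz = (1.0 - hs[t + 1] ** 2) * (V.T @ do + dh_next)
        dW += np.outer(dz, hs[t])
        dU += np.outer(dz, xs[t])
        dh_next = W.T @ dz
    return (dU, dW, dV), hs[-1]

# Illustrative usage: a length-20 sequence processed in chunks of k = 5 steps.
rng = np.random.default_rng(0)
n_in, n_hid, n_out, T, k, lr = 4, 3, 2, 20, 5, 0.1
U = rng.normal(size=(n_hid, n_in))
W = rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)
xs = [rng.normal(size=n_in) for _ in range(T)]
targets = rng.integers(0, n_out, size=T)

h = np.zeros(n_hid)                                 # hidden state carried across chunks
for s in range(0, T, k):
    (dU, dW, dV), h = truncated_bptt_chunk(xs[s:s + k], targets[s:s + k], h, U, W, V, b, c)
    U, W, V = U - lr * dU, W - lr * dW, V - lr * dV  # update after every chunk
print(np.round(h, 3))                               # final carried hidden state
```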
Difficulty of training RNNs
● The network computes a single output ŷ, and the initial state h_0 is initialised with zeros.
● The state of the network at time t is given by
  h_t = σ(b + W h_{t-1} + U x_t)
● When the network computes a categorical distribution, its output is given by
  ŷ = softmax(c + V h_t)
[Figure: an SRN "unrolled" for four time steps (t ∈ [1, 4]): inputs x_1 ... x_4 enter through U, states h_0 ... h_4 are chained through W, and the single output ŷ is read out through V.]
Reference: https://jmlr.org/papers/volume21/18-141/18-141.pdf
Backpropagation Through Time (BPTT)
V ← V − α ∂L/∂V
W ← W − α ∂L/∂W
U ← U − α ∂L/∂U
Difficulty of training RNNs
∂L/∂W = (∂L/∂ŷ) (∂ŷ/∂h_t) (∂h_t/∂W)
∂h_4/∂W = ∂h_4/∂W + (∂h_4/∂h_3)(∂h_3/∂W) + (∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂W) + (∂h_4/∂h_3)(∂h_3/∂h_2)(∂h_2/∂h_1)(∂h_1/∂W)
∂L/∂W = (∂L/∂ŷ) (∂ŷ/∂h_t) Σ_{k=1}^t (∂h_t/∂h_k) (∂h_k/∂W)
‖∂h_i/∂h_{i-1}‖ ≤ ‖diag(σ'(z_i))‖ ‖W‖ ≤ γ λ      (where γ bounds σ' and λ bounds ‖W‖)
σ'(z_i) ≤ 1/4 = γ   [if σ is sigmoid]
σ'(z_i) ≤ 1 = γ   [if σ is tanh]
For tanh, an input in [−1, 1] gives a derivative in [0.42, 1]; for the sigmoid, an input in [0, 1] gives a derivative in [0.20, 0.25].
Reference: https://stats.stackexchange.com/questions/101560/tanh-activation-function-vs-sigmoid-activation-function
Difficulty of training RNNs
‖∂h_t/∂h_k‖ = ‖ ∏_{i=k+1}^t (∂h_i/∂h_{i-1}) ‖ ≤ ∏_{i=k+1}^t γ λ ≤ (γ λ)^(t−k)
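A quick numerical illustration of this bound; γ is the sigmoid derivative bound from above, and the two values of λ (the assumed bound on ‖W‖) are made up only to show the two regimes.

```python
# Evaluate the bound ||∂h_t/∂h_k|| <= (gamma*lam)^(t-k) for a few time gaps.
gamma = 0.25                       # bound on sigma'(z) for the sigmoid
for lam in (2.0, 8.0):             # assumed bounds on ||W||, giving gamma*lam = 0.5 and 2.0
    bounds = [(gamma * lam) ** gap for gap in (1, 5, 10, 20)]
    print(f"gamma*lam = {gamma * lam}:", [f"{bound:.2e}" for bound in bounds])
# gamma*lam < 1: the bound decays geometrically with the gap (vanishing gradients);
# gamma*lam > 1: the bound permits geometric growth (exploding gradients).
```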
Reference: https://arxiv.org/pdf/1412.3555.pdf
Reference: Christopher Olah, https://colah.github.io/posts/2015-08-Understanding-LSTMs/
The Core Idea Behind LSTMs
● The key to LSTMs is the cell state (C), the horizontal line running through the top of the diagram.
● It runs straight down the entire chain, with only some minor linear interactions, so it is very easy for information to just flow along it unchanged.
Worked example: let [h_{t-1}, x_t] = [1, 2, 3, 4, 5, 6]. If we have 3 units, we have 3 cell states, and w_f is 3 × 6 = 3 × (3 + 3):
w_f = [  0  0  0  0  0 −1
         5  6  7  8  9 10
         3  4  5  6  7  8 ]
b_f = [1, 2, 3]
Reference: https://datascience.stackexchange.com/questions/19196/forget-layer-in-a-recurrent-neural-network-rnn/19269#19269
Forget Gate
[Figure: forget-gate diagram. h_{t-1} = [1, 2, 3] and x_t = [4, 5, 6] are concatenated; multiplying by w_f gives [−6, 175, 133]; adding b_f = [1, 2, 3] gives [−5, 177, 136]; the sigmoid σ yields f_t ≈ [0, 1, 1], which will gate the previous cell state C_{t-1} = [5, 5, 5].]
Forget Gate
w_f · [h_{t-1}, x_t] = [−6, 175, 133]
w_f · [h_{t-1}, x_t] + b_f = [−6, 175, 133] + [1, 2, 3] = [−5, 177, 136]
f_t = σ(w_f · [h_{t-1}, x_t] + b_f) = σ([−5, 177, 136]) ≈ [0, 1, 1]
Reference: https://datascience.stackexchange.com/questions/19196/forget-layer-in-a-recurrent-neural-network-rnn/19269#19269
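The same forget-gate arithmetic in NumPy, as a direct transcription of the numbers above (the `sigmoid` helper is defined here, not something from the reference):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h_prev = np.array([1.0, 2.0, 3.0])
x_t = np.array([4.0, 5.0, 6.0])
hx = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t] = [1, 2, 3, 4, 5, 6]

w_f = np.array([[0, 0, 0, 0, 0, -1],
                [5, 6, 7, 8, 9, 10],
                [3, 4, 5, 6, 7, 8]], dtype=float)  # 3 x 6 forget-gate weights
b_f = np.array([1.0, 2.0, 3.0])

print(w_f @ hx)            # -> [-6, 175, 133]
print(w_f @ hx + b_f)      # -> [-5, 177, 136]
f_t = sigmoid(w_f @ hx + b_f)
print(f_t)                 # -> approx. [0.007, 1, 1], rounded to [0, 1, 1] on the slide
```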
[Figure: the forget gate applied to the old cell state.]
f_t ∗ C_{t-1} = [0, 1, 1] ∗ [5, 5, 5] = [0, 5, 5]
The new cell state will be C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t.
Input Gate
w_i = [ 1 1 1 1 1 1
        2 2 2 2 2 2
        3 3 3 3 3 3 ]
b_i = [1, 1, 1]
[Figure: input-gate (sigmoid) branch of the diagram, with h_{t-1} = [1, 2, 3] and x_t = [4, 5, 6]: w_i · [h_{t-1}, x_t] = [21, 42, 63]; adding b_i gives [22, 43, 64].]
Input Gate – Sigmoid Layer
w_i = [ 1 1 1 1 1 1
        2 2 2 2 2 2
        3 3 3 3 3 3 ],   b_i = [1, 1, 1]
w_i · [h_{t-1}, x_t] + b_i = [21, 42, 63] + [1, 1, 1] = [22, 43, 64]
i_t = σ(w_i · [h_{t-1}, x_t] + b_i) = σ([22, 43, 64]) ≈ [1, 1, 1]
Reference: https://statisticalinterference.wordpress.com/2017/05/27/lstms-in-excruciating-detail/
Input Gate – Tanh Layer
C̃_t = tanh(w_C · [h_{t-1}, x_t] + b_C)
w_C = [  1  1  1  1  1  1
         2  2  2  2  2  2
        −3 −3 −3 −3 −3 −3 ]
b_C = [1, 1, 1]
[Figure: candidate-value (tanh) branch of the diagram: w_C · [h_{t-1}, x_t] = [21, 42, −63]; adding b_C gives [22, 43, −62]; tanh gives C̃_t ≈ [1, 1, −1].]
Input Gate – Tanh Layer
w_C = [  1  1  1  1  1  1
         2  2  2  2  2  2
        −3 −3 −3 −3 −3 −3 ],   b_C = [1, 1, 1]
w_C · [h_{t-1}, x_t] + b_C = [21, 42, −63] + [1, 1, 1] = [22, 43, −62]
C̃_t = tanh(w_C · [h_{t-1}, x_t] + b_C) = tanh([22, 43, −62]) ≈ [1, 1, −1]
Reference: https://statisticalinterference.wordpress.com/2017/05/27/lstms-in-excruciating-detail/
Cell State Update
C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t
C_t = [0, 1, 1] ∗ [5, 5, 5] + [1, 1, 1] ∗ [1, 1, −1] = [0, 5, 5] + [1, 1, −1] = [1, 6, 4]
Reference: https://statisticalinterference.wordpress.com/2017/05/27/lstms-in-excruciating-detail/
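Putting the whole worked example together, a short NumPy sketch that reproduces C_t = [1, 6, 4] from the gate values computed on the previous slides (the `sigmoid` helper and variable names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hx = np.array([1.0, 2, 3, 4, 5, 6])                       # [h_{t-1}, x_t]
C_prev = np.array([5.0, 5, 5])                            # C_{t-1}

w_f = np.array([[0, 0, 0, 0, 0, -1],
                [5, 6, 7, 8, 9, 10],
                [3, 4, 5, 6, 7, 8]], dtype=float)
b_f = np.array([1.0, 2, 3])
w_i = np.array([[1] * 6, [2] * 6, [3] * 6], dtype=float)
b_i = np.array([1.0, 1, 1])
w_C = np.array([[1] * 6, [2] * 6, [-3] * 6], dtype=float)
b_C = np.array([1.0, 1, 1])

f_t = sigmoid(w_f @ hx + b_f)        # forget gate, approx. [0.007, 1, 1] ~ [0, 1, 1]
i_t = sigmoid(w_i @ hx + b_i)        # input gate, approx. [1, 1, 1]
C_tilde = np.tanh(w_C @ hx + b_C)    # candidate values, approx. [1, 1, -1]
C_t = f_t * C_prev + i_t * C_tilde   # C_t = f_t * C_{t-1} + i_t * C~_t
print(np.round(C_t))                 # -> [1, 6, 4]
```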
Output Gate
● Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version.
● First, we run a sigmoid layer which decides what parts of the cell state we're going to output.
● Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
C_t = [1, 6, 4]
Reference: https://statisticalinterference.wordpress.com/2017/05/27/lstms-in-excruciating-detail/
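The slides give no numbers for the output gate, so the sketch below uses hypothetical weights w_o and b_o purely to illustrate the mechanism: a sigmoid gate o_t filters tanh(C_t) to produce h_t.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hx = np.array([1.0, 2, 3, 4, 5, 6])        # [h_{t-1}, x_t] as in the worked example
C_t = np.array([1.0, 6, 4])                # cell state computed above

w_o = np.array([[1, 0, 0, 0, 0, 0],        # hypothetical output-gate weights
                [0, -1, 0, 0, 0, 0],
                [0, 0, 1, 0, 0, 0]], dtype=float)
b_o = np.zeros(3)                          # hypothetical output-gate bias

o_t = sigmoid(w_o @ hx + b_o)              # decide which parts of the state to expose
h_t = o_t * np.tanh(C_t)                   # h_t = o_t * tanh(C_t)
print(np.round(h_t, 3))
```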
Output Gate
● For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next.
● For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows next.