Deep Learning with Keras and TensorFlow
Recurrent Neural Networks (RNN)
Learning Objectives
By the end of this lesson, you will be able to:
Implement RNNs for sequential data
Use LSTMs for memory operations within RNNs
Perform gated operations in LSTMs using GRUs
Improve the performance of LSTMs using the Attention
mechanism
Sequence Data
What Is Sequential Data?
A dataset is said to be sequential when its data points depend on other data points in the same dataset.
Example: Time Series Data
Sequential Data: Problems
Consider a sequential dataset that contains temperature and humidity values for every day.
Goal: To build a neural network that takes the temperature and humidity values of a given day as input and predicts whether the weather for that day is sunny or rainy.
Sequential Data: Problems
The data then flows into the hidden layers, where weights and biases are applied.
[Diagram: the inputs (e.g., 0.23 and 0.72) are multiplied by weights and a bias is added, producing an output of Cloudy or Sunny]
Sequential Data: Problems
A traditional neural network assumes that the data is non-sequential and each data point is
independent of the others.
Note: The network does not remember what it gives as an output. It just accepts the next
data point.
Sequential Data: Problems
In the weather data, there is a strong correlation between the weather from one day and the
weather in subsequent days. The former has influence over the latter.
If it was sunny on one day in the middle of summer, it is easy to presume that it will also be sunny on the following day.
Sequential Data: Solution
An RNN has a mechanism that can handle a sequential dataset.
[Diagram: Data in → RNN → Data out; the RNN recurs its new state to itself]
RNN Model
The RNN Model
The RNN remembers the analysis done up to a given point by maintaining a state.
Note: You can think of the state as the memory of the RNN, which recurs into the net with each new input.
RNN: Working
The first data point flows into the network as input data, denoted as x.
[Diagram: the input x enters the RNN through W_x, the weight matrix between the input and hidden units; the previous state h_prv, received by the hidden units, is multiplied by the recurrent weight matrix W_h; the new state recurs back into the network and the output Y is produced]
RNN: Working
Two values are calculated in the hidden layer:
The new or updated state, denoted as h_new, which is used for the next data point
The output of the network, denoted as y
RNN: Working
[Diagram: x enters through W_x, the previous state h_prv enters through W_h, and together they produce the new state h_new and the output Y]
h_new = tanh(W_h · h_prv + W_x · x)
The new state is a function of the previous
state and the input data
RNN: Working
The output of the hidden unit is obtained by multiplying the new hidden state by the output weight matrix.
After processing the first data point, a new context is generated that represents the most recent point.
Then, this context is fed back into the net with the next data point and we repeat these steps until all the
data is processed.
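A minimal NumPy sketch of the recurrence just described; the layer sizes and random weights below are illustrative assumptions, not values from the lesson.

import numpy as np

hidden_size, input_size, output_size = 4, 3, 2

Wx = np.random.randn(hidden_size, input_size) * 0.1   # weight matrix between input and hidden units
Wh = np.random.randn(hidden_size, hidden_size) * 0.1  # recurrent (hidden-to-hidden) weight matrix
Wy = np.random.randn(output_size, hidden_size) * 0.1  # hidden-to-output weight matrix

h_prv = np.zeros(hidden_size)                          # initial state

def rnn_step(x, h_prv):
    h_new = np.tanh(Wh @ h_prv + Wx @ x)               # h_new = tanh(W_h · h_prv + W_x · x)
    y = Wy @ h_new                                     # output computed from the new state
    return h_new, y

# Process a short sequence: the state recurs into the net with each new input.
for x in np.random.randn(5, input_size):
    h_prv, y = rnn_step(x, h_prv)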
A Typical RNN
[Diagram: an RNN unfolded in time. At each step t, the input x_t enters through weight matrix U, the state s_t recurs through W, and the output o_t is produced through V; equivalently, a single cell A maps x_t and the previous state into h_t]
Reduces Complexity
Given a function f with h′, y = f(h, x), where h and h′ are vectors of the same dimension:
[Diagram: h0 → f → h1 → f → h2 → f → h3 …, consuming x1, x2, x3 and emitting y1, y2, y3]
We only need one function f, irrespective of the length of the input and output sequences.
Applications of RNN
Speech Recognition
The goal is to consume a sequence of data and then produce another sequence.
Image Captioning
You can create a model that’s capable of understanding the elements in an image.
Note: There is just one input (the image) and the output is a sequence of words.
Therefore, it is also known as one-to-many.
Sentiment Analysis
RNNs can be used for sentiment analysis, where the model focuses only on the final output and not on the sentiment behind each individual word.
Note: The RNN here consumes a sequence of data and produces just one output.
Therefore, it is also known as many-to-one.
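As a rough illustration of the many-to-one setup, a Keras model like the sketch below returns only the final state of the recurrent layer; the vocabulary size and unit counts are assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=32),   # word index -> vector
    layers.SimpleRNN(32),                                # returns only the final state (many-to-one)
    layers.Dense(1, activation="sigmoid")                # single sentiment score
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])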
Deep RNNs
Problems with Smaller RNN Networks
If the input sequence x_1, …, x_n is very long and continues to grow, a fully connected network over the whole sequence becomes too big.
[Diagram: a fully connected network mapping inputs x1, x2, x3, x4 to outputs y1, y2, y3]
Bidirectional RNNs
Bidirectional RNNs are constructed by putting two RNNs (f1 and f2) together: one reads the sequence forward and the other backward. Mathematically, h′, y = f1(h, x) and g′, z = f2(g, x), and their outputs are combined by p = f3(y, z).
[Diagram: f1 runs left to right over x1, x2, x3 producing y1, y2, y3; f2 runs right to left producing z1, z2, z3; f3 combines each (y_t, z_t) pair into p_t]
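A hedged Keras sketch of the bidirectional idea using the built-in Bidirectional wrapper; the feature dimension and unit count are assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(None, 8)),                                  # (timesteps, features); 8 is an assumption
    layers.Bidirectional(layers.SimpleRNN(16, return_sequences=True)),  # forward and backward passes combined
    layers.Dense(1)                                                 # per-timestep output p_t
])
model.summary()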
Deep RNNs
Deep RNNs are constructed by adding more layers to simple RNNs. Mathematically, they can be defined as h′, y = f1(h, x) and g′, z = f2(g, y).
[Diagram: the first layer f1 maps x1, x2, x3, … to y1, y2, y3, …; the second layer f2 consumes y1, y2, y3, … and produces z1, z2, z3, …]
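A possible Keras sketch of a deep (stacked) RNN; the lower layer must return full sequences so the next layer receives y_t at every step. Sizes are assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(None, 8)),               # 8 input features per timestep (assumption)
    layers.SimpleRNN(16, return_sequences=True), # f1: emits y_t at every step
    layers.SimpleRNN(16),                        # f2: consumes y_t, returns its final state
    layers.Dense(1)
])
model.summary()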
Pyramid RNNs
Pyramid RNNs speed up the training process by reducing the number of timesteps.
[Diagram: stacked bidirectional RNN layers forming a pyramid, with fewer timesteps at each higher layer]
Problems with Deep RNNs
Deep RNNs are very hard to train and usually don't remember data beyond a certain number of timesteps.
The Problem of Vanishing Gradient with RNNs
The problem arises while updating weights in RNNs. These weights connect the hidden layers to
themselves in the unrolled temporal loop.
[Diagram: an unrolled RNN with input weights W_in, recurrent weights W_rec, and output weights W_out; the error ε at each timestep is propagated back through repeated multiplications by W_rec]
Note: When a value is repeatedly multiplied by a small number, it shrinks very quickly; the same happens to gradients propagated back through many timesteps.
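A quick back-of-the-envelope illustration of the note above, assuming the recurrent factor is about 0.5 per timestep (an illustrative number):

# If each backward step scales the gradient by ~0.5, after 20 steps almost nothing remains.
factor, steps = 0.5, 20
print(factor ** steps)   # ~9.5e-07, far too small to update early-timestep weights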
Long Short-Term Memory (LSTM)
LSTM Architecture
[Diagram: the LSTM cell with four callouts — decides what to forget; decides what to insert; bits of memory; combines with the transformed x_t]
LSTM Architecture
Forget gate: Decides which part of the memory to forget. The part to be forgotten is denoted with 0.
LSTM Architecture
Input gate: Decides which bits to insert into the next states.
Decides what content to store in the next states.
LSTM Architecture
Decides the content of the next memory cell, which is a mixture of the not-forgotten part of the previous cell and the new insertion.
LSTM Architecture
Output gate: Decides what part of the cell to output; tanh maps the output bits into the -1 to +1 range.
A Peephole LSTM
A peephole LSTM allows the gates to "peep" into the memory, i.e., the cell state is also fed into the gate computations.
Information Flow in LSTM
Information Flow in LSTM
Controls the forget gate:  z_f = σ(W_f · [x_t, h_{t-1}])
Controls the input gate:   z_i = σ(W_i · [x_t, h_{t-1}])
Updates the information:   z   = tanh(W · [x_t, h_{t-1}])
Controls the output gate:  z_o = σ(W_o · [x_t, h_{t-1}])
Note: The above four matrix computations are done concurrently.
Information Flow in LSTM
c_t = z_f ⊙ c_{t-1} + z_i ⊙ z
h_t = z_o ⊙ tanh(c_t)
y_t = σ(W′ · h_t)
[Diagram: c_{t-1} is gated by z_f, the new content z gated by z_i is added to form c_t, and z_o gates tanh(c_t) to produce h_t and y_t]
Note: ⊙ signifies element-wise multiplication.
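A minimal NumPy sketch of the four gate computations and the update equations above; bias terms are omitted and all sizes are illustrative assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_h, n_x = 4, 3
Wf, Wi, W, Wo = (np.random.randn(n_h, n_x + n_h) * 0.1 for _ in range(4))
W_out = np.random.randn(2, n_h) * 0.1                  # maps h_t to y_t

def lstm_step(x, h_prev, c_prev):
    v = np.concatenate([x, h_prev])                    # [x_t, h_{t-1}]
    zf = sigmoid(Wf @ v)                               # forget gate
    zi = sigmoid(Wi @ v)                               # input gate
    z  = np.tanh(W @ v)                                # candidate content
    zo = sigmoid(Wo @ v)                               # output gate
    c = zf * c_prev + zi * z                           # c_t = z_f ⊙ c_{t-1} + z_i ⊙ z
    h = zo * np.tanh(c)                                # h_t = z_o ⊙ tanh(c_t)
    y = sigmoid(W_out @ h)                             # y_t = σ(W′ · h_t)
    return h, c, y

h, c = np.zeros(n_h), np.zeros(n_h)
h, c, y = lstm_step(np.random.randn(n_x), h, c)        # one timestep; repeat for a sequence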
Information Flow in LSTM
The memory from one state is fed to another state along with the new input.
[Diagram: two LSTM cells chained in time. The cell at step t receives c_{t-1}, h_{t-1}, and x_t and produces c_t, h_t, and y_t; these are then fed into the cell at step t+1 along with x_{t+1}, producing c_{t+1}, h_{t+1}, and y_{t+1}]
Stock Price Prediction Using LSTM
Problem Statement: Forecasting stock prices has been a difficult task for many researchers and analysts. There are many complicated financial indicators, and the fluctuation of the stock market is highly volatile. Predicting market value is of great importance in maximizing the profit of stock option purchases while keeping the risk low.
Objective: Use an LSTM approach to predict stock market indices on the dataset [Link].
Note: The prices dataset is fetched from Yahoo Finance; the fundamentals are from Nasdaq Financials, extended by some fields from EDGAR SEC databases.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and
password that are generated. Click on the Launch Lab button. On the page that appears, enter
the username and password in the respective fields, and click Login.
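One way the core of this lab could be sketched: turn the closing-price series into sliding windows and fit a small LSTM regressor. The window length and the synthetic stand-in series below are assumptions; adapt them to the linked dataset.

import numpy as np
from tensorflow.keras import layers, models

def make_windows(series, window=60):
    X, y = [], []
    for i in range(window, len(series)):
        X.append(series[i - window:i])     # past `window` prices
        y.append(series[i])                # next price to predict
    return np.array(X)[..., np.newaxis], np.array(y)

prices = np.cumsum(np.random.randn(500)) + 100.0   # stand-in for the closing-price column
X, y = make_windows(prices)

model = models.Sequential([
    layers.Input(shape=(X.shape[1], 1)),
    layers.LSTM(50),
    layers.Dense(1)
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X, y, epochs=10, batch_size=32)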
Multiclass Classification Using LSTM
Problem Statement: You are given a news aggregator dataset which contains news headlines,
URLs, and categories for 422,937 news stories collected by a web aggregator. These news articles
have to be categorized into business, science and technology, entertainment, and health.
Objective: Perform multiclass classification using LSTM.
Note: Use [Link] for the above task.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and
password that are generated. Click on the Launch Lab button. On the page that appears, enter
the username and password in the respective fields, and click Login.
Load Libraries
Import the necessary libraries.
Load Data
Load the .csv file and check the data count in each class.
Note: The m class has far less data than the others, so the classes are unbalanced.
Balance the Data
Perform shuffling to balance the classes.
Encode the Data
Perform one-hot encoding on the labels data.
Tokenization
Perform tokenization and identify the number of unique tokens.
Create Train and Test Sets
Split the dataset into training and testing sets. Also, define epochs, batch size, and labels for the same.
Define the LSTM Model
Code the LSTM model and fit it to the processed data.
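A hedged sketch of what this step might look like for the four news categories; the vocabulary size, sequence length, and layer sizes are assumptions that should come from the tokenization step.

from tensorflow.keras import layers, models

vocab_size, n_classes = 20000, 4          # assumptions: tokens kept, number of categories

model = models.Sequential([
    layers.Embedding(vocab_size, 128),     # token index -> dense vector
    layers.LSTM(64),                       # many-to-one over the headline
    layers.Dense(n_classes, activation="softmax")   # matches one-hot encoded labels
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# history = model.fit(X_train, y_train, validation_data=(X_test, y_test),
#                     epochs=epochs, batch_size=batch_size)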
Check Performance
Evaluate the results on the training and testing sets and obtain the accuracy of the model.
Plot Metrics
Plot the model’s training accuracy versus validation accuracy and training loss versus validation loss.
Perform Predictions
Perform label predictions against random data.
Sentiment Analysis Using LSTM
Problem Statement: Sentiment analysis is one of the common problems that companies work on, and it is an important application of natural language processing. Your company's motive for building a sentiment analyzer is to determine employee concerns and to develop programs that help improve the likelihood of employees remaining in their jobs.
Objective: Use LSTM to perform sentiment analysis in Keras.
Note: Use the built-in imdb dataset from [Link] for this task.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and
password that are generated. Click on the Launch Lab button. On the page that appears, enter
the username and password in the respective fields, and click Login.
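A minimal sketch of the IMDB sentiment task using the built-in Keras dataset; num_words, maxlen, and the layer sizes are assumptions to tune in the lab.

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models

num_words, maxlen = 10000, 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_words)
x_train = pad_sequences(x_train, maxlen=maxlen)        # pad/truncate reviews to a fixed length
x_test = pad_sequences(x_test, maxlen=maxlen)

model = models.Sequential([
    layers.Embedding(num_words, 64),
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid")              # positive vs. negative review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=3, batch_size=64)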
Gated Recurrent Unit (GRU)
GRU Architecture
Update Gate
Reset Gate
GRU Architecture
Update Gate: Determines how much of the past information (from the previous time steps) needs to be passed along to the future.
Reset Gate
Current Memory
Current State
GRU Architecture
Update Gate
Reset Gate: Determines how much of the past information needs to be forgotten.
Current Memory
Current State
GRU Architecture
Update Gate
Reset Gate
Current Memory: The current memory content is computed using the reset gate to store the relevant information from the past.
Current State
Note: Here tanh is the nonlinear activation function.
GRU Architecture
Update Gate
Reset Gate
Current Memory
Current State: At the final stage, the h_t vector is calculated such that it holds the information for the current unit and passes it down to the network.
LSTM vs. GRU
LSTM
Tracks long-term dependencies while mitigating the vanishing or
exploding gradient problems. It does so via input, forget, and
output gates.
Controls the exposure of memory content
GRU
Tracks long-term dependencies using a reset gate and an
update gate
Exposes the entire cell state to other units in the network
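In Keras, a GRU layer is a drop-in replacement for an LSTM layer; the sketch below (with an assumed feature dimension of 8) compares their parameter counts.

from tensorflow.keras import layers, models

lstm_model = models.Sequential([layers.Input(shape=(None, 8)), layers.LSTM(32), layers.Dense(1)])
gru_model = models.Sequential([layers.Input(shape=(None, 8)), layers.GRU(32), layers.Dense(1)])

# The GRU has roughly three-quarters of the LSTM's weights, since it uses
# two gates instead of three and merges the cell state into the hidden state.
print(lstm_model.count_params(), gru_model.count_params())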
The Attention Model
Attention Model
Encoder-Decoder Framework
Encoder: From word sequence to sentence representation
Decoder: From representation to word sequence distribution
Universal Representation: Intermediate representation of meaning
[Diagram: an English decoder generates an English sentence from the intermediate representation. For bitext data, the representation comes from a French encoder reading a French sentence; for unilingual data, it comes from an English encoder reading an English sentence]
Motivation
The fixed-size intermediate representation is limited.
Performance is constrained over longer distances.
Improving Performance with LSTM
Example: Reversing the order of the input sentence
Instead of mapping the sentence a, b, c to the sentence α, β, γ, the LSTM is asked to map c, b, a to
α, β, γ, where α, β, γ is the translation of a, b, c. This way, a is in close proximity to α, b is fairly
close to β, and so on.
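A small sketch of the reversal trick; in Keras this can be done either with the go_backwards argument of a recurrent layer or by reversing the tensors along the time axis. Shapes are assumptions.

import numpy as np
from tensorflow.keras import layers

encoder = layers.LSTM(64, go_backwards=True)      # reads c, b, a instead of a, b, c
x = np.random.randn(2, 10, 8).astype("float32")   # (batch, timesteps, features) — assumed shapes
state = encoder(x)

x_reversed = x[:, ::-1, :]                        # equivalent manual reversal along the time axis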
Examples of Attention
Example 1
[Diagram: RNN input/output configurations — one-to-many, many-to-one, many-to-many, and many-to-many]
The way an LSTM chooses what to forget and what to insert into memory determines which inputs the network will attend to in the generation phase.
Example 2
Example 3
[Diagram: translated MNIST inputs and cluttered translated MNIST inputs]
Example 4
Similar to the way an LSTM chooses what to forget and what to insert into memory, attention allows a network to choose a path to focus on in the visual field.
The Attention Mechanism
Improving Performance with LSTM
Consider an input (or intermediate) sequence or image
Consider an upper-level representation that can choose where to look by assigning a weight or probability to each input position, applied at every position.
[Diagram: a higher-level layer attends over lower-level locations; a softmax over the lower locations is conditioned on context at both the lower and higher locations]
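A minimal NumPy sketch of the weighting step: a softmax over input positions conditioned on a higher-level query vector. The shapes and the dot-product scoring are assumptions for illustration.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

annotations = np.random.randn(5, 16)     # one lower-level vector per input position
query = np.random.randn(16)              # higher-level context choosing where to look

scores = annotations @ query             # one score per position
weights = softmax(scores)                # probabilities over the 5 positions
context = weights @ annotations          # weighted sum fed to the upper level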
NMT with Recurrent Nets and Attention Mechanism
Image-to-Text: Caption Generation with Attention
Neural Attention Models
Neural Attention Model for Sentence Summarization
The heatmap represents a soft alignment between the input and the generated summary.
[Figure: output of the attention-based summarization system]
Neural Attention Model for Sentence Summarization
[Diagram: an attention-based encoder (enc3) combined with an NNLM (neural network language model) decoder that has an additional encoder element]
Neural Attention Model for Sentence Summarization
Decoder: NNLM
Neural Attention Model for Sentence Summarization
Encoder: NNLM
Quora Insincere Questions Classification
Problem Statement: An existential problem for any major website today is how to handle toxic
and divisive content. Quora wants to tackle this problem head-on to keep their platform a place
where users can feel safe sharing their knowledge with the world. As an approach to the solution,
you must create models that identify and flag insincere questions (a question intended to make a statement rather than to look for helpful answers).
Objective: Predict whether a question asked on Quora is sincere or not.
Note: Use the word embeddings provided along with the datasets to accomplish your goal. Also, use TensorFlow 1.14 when accessing [Link].
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and
password that are generated. Click on the Launch Lab button. On the page that appears, enter
the username and password in the respective fields, and click Login.
Key Takeaways
RNNs have a mechanism that can handle a sequential dataset
The memory from one state is fed to another state along with the
new input in LSTMs
Attention models use an encoder-decoder framework
Knowledge Check
Knowledge
Check Why is an RNN (Recurrent Neural Network) used for machine translation, say
translating English to French?
1
a. It can be trained as an unsupervised learning problem
b. It is strictly more powerful than a Convolutional Neural Network (CNN)
c. It is applicable when the input/output is a sequence (e.g., a sequence of words)
d. RNNs represent the recurrent process of Idea->Code->Experiment->Idea->....
Knowledge
Check Why is an RNN (Recurrent Neural Network) used for machine translation, say
translating English to French?
1
a. It can be trained as an unsupervised learning problem
b. It is strictly more powerful than a Convolutional Neural Network (CNN)
c. It is applicable when the input/output is a sequence (e.g., a sequence of words)
d. RNNs represent the recurrent process of Idea->Code->Experiment->Idea->....
The correct answer is c
RNNs are effective on sequential data.
Knowledge
Check What is a probable approach for dealing with the "vanishing gradient" problem in RNNs?
2
a. Use modified architectures like LSTM and GRUs
b. Gradient Clipping
c. Dropout
d. All of the above
Knowledge
Check What is a probable approach for dealing with the "vanishing gradient" problem in RNNs?
2
a. Use modified architectures like LSTM and GRUs
b. Gradient Clipping
c. Dropout
d. All of the above
The correct answer is a
LSTMs and GRUs avoid the vanishing gradient problem by incorporating gates within RNNs so that only relevant information is passed forward.
Stock Price Forecasting
Problem Statement: It's hard not to think of the stock market as a person. It has moods that
can turn from irritable to euphoric. Stock price prediction is of great use for investors. They
constantly review past pricing history and use it to influence their future investment decisions.
Considering LSTMs as very powerful networks in sequence prediction, build a deep learning
model to predict the future behavior of stock prices.
Objective: Use LSTM for forecasting stock data.
Note: Use the [Link] to train your model and perform the testing on
[Link] file.
Access: Click on the Labs tab on the left side panel of the LMS. Copy or note the username and
password that are generated. Click on the Launch Lab button. On the page that appears, enter
the username and password in the respective fields, and click Login.
Thank You