
Unit 4: RECURRENT NEURAL NETWORKS

Key Limitation: Handling Branching – Automatic Differentiation – Motivation – Recurrent Neural Networks – Implementation of RNNs – RNN Nodes: Vanilla, GRU, LSTM.
Key Limitation: Handling Branching
• RNNs are a class of neural network architectures designed to handle sequences of data.
• Handling sequences involves "adding a new dimension" to the ndarrays we feed our neural networks.
• In an RNN, each observation is represented as a two-dimensional ndarray: one dimension represents the length of the sequence of data, and a second dimension represents the number of features present at each sequence element.
• The overall input to an RNN is thus a three-dimensional ndarray of shape [batch_size, sequence_length, num_features], i.e., a batch of sequences (see the NumPy sketch below).
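As a quick illustration of this layout, the NumPy sketch below builds a batch of sequences; the specific sizes are illustrative assumptions, not values from the source.

```python
import numpy as np

# Illustrative dimensions (assumed): 32 sequences per batch,
# each 10 time steps long, with 4 features per time step.
batch_size, sequence_length, num_features = 32, 10, 4

# One observation: a 2-D ndarray of shape [sequence_length, num_features]
observation = np.random.randn(sequence_length, num_features)

# The full RNN input: a 3-D ndarray [batch_size, sequence_length, num_features]
batch = np.random.randn(batch_size, sequence_length, num_features)

print(observation.shape)  # (10, 4)
print(batch.shape)        # (32, 10, 4)
```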
Manual Backpropagation
Automatic Differentiation

• Automatic differentiation allows us to compute these gradients via a completely different route:
• Rather than the Operations being the atomic units that make up the network, we define a class that wraps around the data itself and allows the data to keep track of the operations performed on it, so that the data can continually accumulate gradients as it is involved in different operations.
Coding Up Gradient Accumulation
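A minimal sketch of this idea for scalar values: the class name, supported operations, and structure below are illustrative assumptions, not necessarily the exact implementation the source has in mind.

```python
class NumberWithGrad:
    """Wraps a number and records the operations performed on it, so that
    gradients can be accumulated by walking back through those operations."""

    def __init__(self, num, depends_on=None, creation_op=''):
        self.num = num
        self.grad = None                    # accumulated gradient
        self.depends_on = depends_on or []  # inputs that produced this number
        self.creation_op = creation_op      # which operation produced it

    def __add__(self, other):
        other = ensure_number(other)
        return NumberWithGrad(self.num + other.num, [self, other], 'add')

    def __mul__(self, other):
        other = ensure_number(other)
        return NumberWithGrad(self.num * other.num, [self, other], 'mul')

    def backward(self, backward_grad=None):
        if backward_grad is None:           # starting point of backprop
            backward_grad = 1
        # Accumulate rather than overwrite, so data reused in several
        # operations sums the gradient contributions from each of them.
        self.grad = backward_grad if self.grad is None else self.grad + backward_grad
        if self.creation_op == 'add':
            for dep in self.depends_on:
                dep.backward(backward_grad)
        elif self.creation_op == 'mul':
            a, b = self.depends_on
            a.backward(backward_grad * b.num)
            b.backward(backward_grad * a.num)


def ensure_number(x):
    """Wrap a raw int or float in a NumberWithGrad if it is not already one."""
    return x if isinstance(x, NumberWithGrad) else NumberWithGrad(x)


# Usage: a participates in two operations, so its gradient accumulates.
a = NumberWithGrad(3)
b = a * 4 + a
b.backward()
print(a.grad)   # 5 = 4 (from the multiply) + 1 (from the add)
```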
RNN Motivation

1. Use features from time step t = 1 to make predictions for the corresponding target at t = 1.
2. Use features from time step t = 2 as well as the information from t = 1,
including the value of the target at t = 1, to make predictions for t = 2.
3. Use features from time step t = 3 as well as the accumulated
information from t = 1 and t = 2 to make predictions at t = 3.
Recurrent Neural Networks
• Recurrent networks are motivated by data where order matters:
1. Natural language data, where the order of words matters
2. Speech data
3. Time series data
4. Video and music sequence data
5. Stock market data
Architecture
Types of Recurrent Neural Networks
• One to One RNN
• One to Many RNN (e.g., image captioning)
• Many to One RNN (e.g., sentiment analysis)
• Many to Many RNN (e.g., machine translation)
Limitations of RNNs
• Vanishing and Exploding Gradients:
• RNNs suffer from the problem of vanishing and exploding gradients, particularly
when modeling long sequences.
• In backpropagation through time (BPTT), gradients can shrink or grow exponentially, leading to ineffective training for long-term dependencies (see the sketch after this list).
• Short-Term Memory:
• Standard RNNs struggle with capturing long-term dependencies in sequences due to
the vanishing gradient problem.
• This is why RNNs are often replaced by more advanced architectures like LSTMs
(Long Short-Term Memory) and GRUs (Gated Recurrent Units), which are designed
to mitigate this issue.
• Slow Training:
• The sequential nature of RNNs makes them slower to train compared to
non-recurrent models, as information must be passed through each time step
before the next step can be processed.
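To make the vanishing-gradient issue concrete, the sketch below is an illustrative NumPy experiment (not from the source): it repeatedly applies the kind of Jacobian factor that appears in BPTT and prints how quickly the gradient norm shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_steps = 16, 50

# A small recurrent weight matrix, typical after initialization;
# much larger weights would instead make the gradient explode.
W_hh = rng.normal(scale=0.2, size=(hidden_size, hidden_size))
grad = np.ones(hidden_size)

# In BPTT the gradient is repeatedly multiplied by each step's Jacobian,
# roughly W_hh^T @ diag(tanh'(h)); we apply that factor step by step.
for t in range(1, num_steps + 1):
    h = rng.normal(size=hidden_size)                    # stand-in hidden pre-activation
    grad = W_hh.T @ ((1 - np.tanh(h) ** 2) * grad)      # one backward step through time
    if t % 10 == 0:
        print(f"step {t:2d}: gradient norm = {np.linalg.norm(grad):.2e}")
```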
Implementation of RNNs
• Lab
RNN Nodes: Vanilla
• A Vanilla Recurrent Neural Network (RNN) refers to the simplest
form of a Recurrent Neural Network, one of the early models used for
sequence learning tasks.
• Vanilla RNNs are designed to process sequential data by maintaining
a hidden state that is updated at every time step, allowing them to
handle data where previous elements in the sequence influence later
ones.
Key Components of a Vanilla RNN
• Input Layer:
• Accepts sequential data as input. In sequence processing tasks, the input is
typically represented as a series of vectors. For example, in Natural Language
Processing (NLP), words in a sentence might be converted into word
embeddings (vector representations) to serve as the input at each time step.
• Hidden Layer:
• The hidden layer is the core of the RNN, where information from the previous
time steps is stored and updated. The hidden state at each time step stores a
summary of the previous inputs in the sequence.
• The hidden state at time step t is computed as: h_t = tanh(W_hh · h_{t−1} + W_xh · x_t + b_h)
• The recurrent nature of RNNs means that the hidden state at each time step
is influenced by the hidden states of the previous time steps, enabling the
network to "remember" earlier inputs in the sequence.
• Output Layer:
• The output at each time step is computed based on the current hidden state.
This output can be used for tasks such as classification (e.g., predicting the
next word in a sentence) or regression (e.g., forecasting future values in time
series data).
• The output at time step t is computed as: y_t = softmax(W_ho · h_t + b_o)
• Softmax is typically used for multi-class classification tasks to obtain a probability distribution over the classes (a NumPy sketch of one forward step through these equations follows).
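A minimal NumPy sketch of one forward pass through a vanilla RNN cell, using the two equations above; the weight shapes, sizes, and initialization are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Illustrative sizes: 4 input features, 8 hidden units, 3 output classes.
num_features, hidden_size, num_classes = 4, 8, 3
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_size, num_features))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h  = np.zeros(hidden_size)
W_ho = rng.normal(scale=0.1, size=(num_classes, hidden_size))
b_o  = np.zeros(num_classes)

def rnn_step(x_t, h_prev):
    """One vanilla RNN time step: update the hidden state, then compute the output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)  # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    y_t = softmax(W_ho @ h_t + b_o)                  # y_t = softmax(W_ho h_t + b_o)
    return h_t, y_t

# Process one sequence of 10 time steps, carrying the hidden state forward.
sequence = rng.normal(size=(10, num_features))
h = np.zeros(hidden_size)
for x_t in sequence:
    h, y = rnn_step(x_t, h)
print(y)   # probability distribution over the 3 classes at the final step
```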
Vanilla RNN Sentiment analysis model:
• Label Encoding: The target variable is first label-encoded using
Scikit-learn’s LabelEncoder so that the categorical labels can be
represented numerically.
• Splitting the Data: The data is split into training, testing, and
validation sets using Scikit-learn’s train_test_split function.
• Tokenization: The text data is then tokenized using Keras’ Tokenizer
class which converts the text into a sequence of integers.
• Padding: The sequences are then padded to ensure that all of them
have the same length. This is done using Keras’ pad_sequences
function.
• Defining the RNN Model: A sequential model is defined using Keras’
Sequential class. An embedding layer is added to the model, followed
by a SimpleRNN layer, a dropout layer to prevent overfitting, and a
dense layer with softmax activation.
• Compiling the Model: The model is then compiled using an optimizer,
a loss function, and a metric to measure performance.
• Early Stopping: Early stopping is set up using Keras’ EarlyStopping
callback to prevent overfitting and improve training efficiency.
• Training the Model: The model is trained on the training set using
Keras’ fit function.
• Evaluating the Model: Finally, the model is evaluated on the test set using Keras' evaluate function to measure its performance (a minimal Keras sketch of these steps follows this list).
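The sketch below strings these steps together. The dataset variables (texts, labels), vocabulary size, sequence length, and layer sizes are illustrative assumptions rather than values from the source.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Assumed to exist: texts (list of strings) and labels (list of class names).
# Label encoding
y = LabelEncoder().fit_transform(labels)

# Train / validation / test split
X_train, X_test, y_train, y_test = train_test_split(texts, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)

# Tokenization and padding (vocabulary size and max length are illustrative)
max_words, max_len = 10000, 100
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)
to_padded = lambda t: pad_sequences(tokenizer.texts_to_sequences(t), maxlen=max_len)
X_train, X_val, X_test = map(to_padded, (X_train, X_val, X_test))

# Define the vanilla RNN model: embedding -> SimpleRNN -> dropout -> softmax
num_classes = len(np.unique(y))
model = Sequential([
    Embedding(max_words, 64),
    SimpleRNN(64),
    Dropout(0.5),
    Dense(num_classes, activation='softmax'),
])

# Compile, train with early stopping, and evaluate
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=20, batch_size=32, callbacks=[early_stop])
loss, accuracy = model.evaluate(X_test, y_test)
```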
Limitations of Vanilla RNNs
• Short-Term Memory:
• Vanilla RNNs can only capture dependencies over a short time window. As
the time step grows, the model tends to "forget" the earlier parts of the
sequence due to the vanishing gradient problem.
• Difficulty in Learning Long-Term Dependencies:
• Vanilla RNNs struggle to model sequences where long-term dependencies are
important. For example, in language processing, if the model needs to
remember a subject introduced early in the sentence to predict a verb much
later, Vanilla RNNs may fail to capture this relationship effectively.
• Training Instability:
• Due to the vanishing and exploding gradient problems, training Vanilla RNNs
can be unstable, making it difficult to converge on a good solution.
Applications of Vanilla RNNs
• Despite their limitations, Vanilla RNNs are useful in tasks where
sequence lengths are relatively short or where long-term
dependencies are not crucial. Some applications include:
• Time Series Forecasting: Predicting future values in a time series
based on recent observations.
• Sequence Labeling: Assigning labels to each element in a sequence
(e.g., part-of-speech tagging in NLP).
• Simple Language Models: Predicting the next word or character in a
sequence for text generation.
LSTM
• LSTM (Long Short-Term Memory) is a type of Recurrent Neural
Network (RNN) architecture specifically designed to address the
shortcomings of traditional RNNs, particularly the vanishing gradient
problem that makes it difficult for RNNs to learn long-term
dependencies.
• LSTM introduces specialized gating mechanisms that control the flow
of information in and out of the network's memory (the cell state).
This allows LSTMs to maintain long-term dependencies by selectively
remembering or forgetting information across time steps, making
them highly effective in tasks involving sequential data.
Key Concepts in LSTM
• Cell State:
• The cell state is a key feature that differentiates LSTMs from RNNs. It acts as a
memory that runs through the entire sequence, allowing the LSTM to store
information over long time periods.
• The cell state can be thought of as a conveyor belt that runs through the LSTM, with
minor linear interactions, allowing information to flow unchanged unless the LSTM
chooses to modify it through its gating mechanisms.
• Gates:
• LSTMs use three main types of gates (forget, input, and output gates) to control the
flow of information:
• Forget Gate: Determines which information to discard from the cell state.
• Input Gate: Decides which new information to store in the cell state.
• Output Gate: Controls the output at each time step and what parts of the cell state will be
exposed.
• Each gate is controlled by a sigmoid function, which outputs a value
between 0 and 1, where 0 means "completely forget" or "ignore" and 1
means "completely retain" or "keep."
LSTM Architecture
• The structure of an LSTM unit consists of several components
working together to update the cell state and the hidden state at
each time step.
• Forget Gate:
• The forget gate controls which parts of the previous cell state C_{t−1} should be retained or forgotten. This gate takes the previous hidden state h_{t−1} and the current input x_t as inputs and passes them through a sigmoid activation function: f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
• Input Gate:
• The input gate controls how much new information will be added to the cell state. It consists of two parts:
• A sigmoid layer that decides which values to update: i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
• A tanh layer that creates new candidate values, which can be added to the cell state: C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
• Updating the Cell State:
• The cell state is updated by combining the outputs of the forget gate and the input gate. The previous cell state C_{t−1} is multiplied by the forget gate output f_t (to discard irrelevant information), and the new candidate values C̃_t are scaled by the input gate i_t and added to the cell state:
• C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
• This update allows the LSTM to both forget old information and add new, relevant information at each time step.
• Output Gate:
• The output gate determines what part of the current cell state should be output at time step t, which contributes to both the next hidden state h_t and the final output y_t: o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
• The final hidden state h_t is obtained by applying a tanh activation to the updated cell state and then multiplying it by the output gate o_t: h_t = o_t ⊙ tanh(C_t) (a NumPy sketch of these gate computations follows).
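A minimal NumPy sketch of one LSTM forward step built from the gate equations above; the weight shapes, sizes, and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 4 input features, 8 hidden units.
num_features, hidden_size = 4, 8
rng = np.random.default_rng(0)
concat_size = hidden_size + num_features

# One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t].
W_f, W_i, W_C, W_o = (rng.normal(scale=0.1, size=(hidden_size, concat_size)) for _ in range(4))
b_f, b_i, b_C, b_o = (np.zeros(hidden_size) for _ in range(4))

def lstm_step(x_t, h_prev, C_prev):
    """One LSTM time step: forget gate, input gate, cell update, output gate."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate cell values
    C_t = f_t * C_prev + i_t * C_tilde       # cell state update
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

# Run a sequence of 10 steps, carrying hidden state and cell state forward.
h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(10, num_features)):
    h, C = lstm_step(x_t, h, C)
print(h.shape, C.shape)   # (8,) (8,)
```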
Advantages of LSTMs
• Ability to Learn Long-Term Dependencies:
• LSTMs are specifically designed to remember information for long periods and
selectively forget irrelevant information. This makes them particularly effective for
tasks involving long-term dependencies, such as language modeling, machine
translation, and time-series forecasting.
• Solves Vanishing Gradient Problem:
• LSTMs mitigate the vanishing gradient problem because the cell state allows
gradients to flow unchanged through many time steps. By using the forget and
input gates, LSTMs can regulate the amount of information stored or forgotten at
each time step.
• Selective Memory:
• The gating mechanisms in LSTMs allow the model to decide when to remember or
forget information, making them more powerful than Vanilla RNNs.
• Flexibility:
• LSTMs can handle varying input lengths, making them ideal for tasks involving
sequences of variable length, like text, video, or speech.
Challenges of LSTMs
• Computational Complexity:
• LSTMs are computationally more expensive than RNNs because they involve
multiple gate mechanisms, resulting in a greater number of parameters to
train. This can slow down the training process, especially for large datasets or
long sequences.
• Training Time:
• LSTMs can take longer to train compared to simpler architectures like Vanilla
RNNs or even GRUs (a simpler variant of LSTMs). The complexity of the
LSTM’s structure can lead to longer convergence times.
• Memory Requirements:
• Due to the need to maintain a cell state and multiple gates, LSTMs require
more memory during training and inference.
Applications of LSTMs
• LSTMs are widely used in applications that involve sequence data:
• Natural Language Processing (NLP): language modeling, text generation, machine translation, named entity recognition (NER), speech recognition, chatbots and dialogue systems
• Time Series Prediction: forecasting stock prices, weather prediction, anomaly detection in sequential data (e.g., server logs, sensor data)
• Audio Processing: speech-to-text systems, music generation, sound classification
• Video Processing: action recognition in videos, video captioning, video summarization
• Healthcare: predicting patient outcomes based on time-series health data
GRU
• The Gated Recurrent Unit (GRU) is a type of Recurrent Neural
Network (RNN) architecture introduced as a simpler alternative to the
Long Short-Term Memory (LSTM) network.
• Both GRUs and LSTMs were designed to address the vanishing
gradient problem and enable learning long-term dependencies, but
GRUs are more computationally efficient and have fewer parameters
than LSTMs.
• The GRU architecture simplifies the LSTM by combining certain
elements, reducing the number of gates and directly integrating
memory with the hidden state.
Key Components of GRU
• Update Gate:
• Controls how much of the previous hidden state should be retained and how
much of the current input should be added. It plays a similar role to the
forget and input gates in LSTMs but combines their functions into one.
• Reset Gate:
• Determines how much of the previous hidden state to ignore or "reset" when
updating the hidden state based on new input. This is crucial for tasks where
some parts of the sequence might be irrelevant for certain predictions.
• Unlike LSTMs, GRUs do not have a separate cell state. Instead, they rely on the hidden state alone, which is modified directly based on the input and the previous hidden state.
GRU Architecture
Update gate
• We start with calculating the update gate z_t for time step t using the formula: z_t = σ(W_z · x_t + U_z · h_{t−1})
• The update gate helps the model to determine how much of the past information (from previous time steps) needs to be passed along to the future.
Reset gate
• This gate is used by the model to decide how much of the past information to forget. To calculate it, we use: r_t = σ(W_r · x_t + U_r · h_{t−1})
Current memory content
• The candidate hidden state uses the reset gate to store the relevant information from the past: h̃_t = tanh(W · x_t + r_t ⊙ U · h_{t−1})
Final memory at current time step
• The final hidden state blends the previous hidden state and the candidate, weighted by the update gate: h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t (a NumPy sketch of these equations follows).
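A minimal NumPy sketch of one GRU forward step using the gate equations above; the sizes and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 4 input features, 8 hidden units.
num_features, hidden_size = 4, 8
rng = np.random.default_rng(0)

# Separate input-to-hidden (W_*) and hidden-to-hidden (U_*) weights per gate.
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(hidden_size, num_features)) for _ in range(3))
U_z, U_r, U_h = (rng.normal(scale=0.1, size=(hidden_size, hidden_size)) for _ in range(3))

def gru_step(x_t, h_prev):
    """One GRU time step: update gate, reset gate, candidate state, final blend."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)               # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)               # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev))   # candidate hidden state
    h_t = z_t * h_prev + (1 - z_t) * h_tilde              # final hidden state
    return h_t

# Run a sequence of 10 steps; no separate cell state is needed.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(10, num_features)):
    h = gru_step(x_t, h)
print(h.shape)   # (8,)
```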
Key Differences Between GRU and LSTM
• Gates:
• GRUs have two gates (update and reset), while LSTMs have three gates (input,
forget, and output). GRUs simplify the architecture by combining the functions of
input and forget gates into a single update gate.
• Memory Cell:
• GRUs do not have a separate memory cell like LSTMs. Instead, the hidden state h_t directly carries forward the information, simplifying the computation.
• Computational Efficiency:
• GRUs are generally faster and require less memory than LSTMs because they have fewer parameters. This makes GRUs suitable for applications with limited resources or real-time processing requirements (a rough parameter count comparison follows this list).
• Performance:
• Empirically, GRUs can perform as well as LSTMs on many tasks, though their
performance can vary depending on the dataset and the nature of the
dependencies. Generally, GRUs work well with simpler sequences, while LSTMs are
preferred when handling long-term dependencies in more complex sequences.
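As a rough illustration of the parameter savings, the calculation below counts weights for a single LSTM layer versus a single GRU layer using the classic formulations above; the layer sizes are illustrative, and it ignores framework-specific details (such as extra bias vectors), so exact library counts may differ slightly.

```python
# Each gate acts on [h_{t-1}, x_t] plus a bias: units * (units + input_dim) + units.
input_dim, units = 128, 256
per_gate = units * (units + input_dim) + units

lstm_params = 4 * per_gate   # forget, input, candidate, output
gru_params  = 3 * per_gate   # update, reset, candidate

print(f"LSTM parameters: {lstm_params:,}")                  # 394,240
print(f"GRU parameters:  {gru_params:,}")                   # 295,680
print(f"GRU / LSTM ratio: {gru_params / lstm_params:.2f}")  # 0.75
```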
Advantages of GRUs
• Simplicity and Efficiency:
• GRUs have fewer parameters and a simpler architecture, which leads to
faster training and inference times, especially on large datasets or in
resource-constrained environments.
• Better Generalization in Some Tasks:
• In tasks where long-term dependencies aren’t as crucial, GRUs may perform
as well or better than LSTMs. They often generalize better on simpler, shorter
sequences due to their more streamlined structure.
• Alleviates Vanishing Gradient Problem:
• Similar to LSTMs, GRUs also help mitigate the vanishing gradient problem,
allowing for learning longer dependencies in sequence data.
Disadvantages of GRUs
• Less Control over Memory:
• LSTMs offer finer control over memory with their additional cell state and
output gate, which can be advantageous for more complex sequences
requiring long-term memory. GRUs may not perform as well in cases where
fine-grained control over memory is critical.
• Potentially Inferior for Long Sequences:
• In some cases, GRUs may struggle with very long-term dependencies because
they lack the explicit cell state. LSTMs, with their additional cell state, may
have a slight edge in such tasks.
Applications of GRUs
• GRUs are commonly used in applications that involve sequential or time-series data:
• Natural Language Processing (NLP): text classification, machine translation, question answering, sentiment analysis, language modeling
• Speech Recognition: GRUs are used in end-to-end speech-to-text systems and other audio-related tasks where computational efficiency is crucial.
• Time-Series Prediction: used in stock price prediction, weather forecasting, and anomaly detection in various time-series data.
• Healthcare: GRUs can be applied to predict patient outcomes using time-series health data, such as heart rate and blood pressure measurements.
• Music and Audio Generation: GRUs have been used for generating music sequences, audio synthesis, and sound classification.
