
Unit 4: RECURRENT NEURAL NETWORKS

Key Limitation: Handling Branching – Automatic Differentiation – Motivation – Recurrent Neural Networks – Implementation of RNNs – RNN Nodes: Vanilla, GRU, LSTM.
Key Limitation: Handling Branching
• RNNs are a class of neural network architectures designed to handle sequences of data.
• Handling sequences involves "adding a new dimension" to the ndarrays we feed our neural networks.
• In an RNN, each observation is represented as a two-dimensional ndarray: one dimension represents the length of the sequence of data, and a second dimension represents the number of features present at each sequence element.
• The overall input to an RNN is thus a three-dimensional ndarray of shape [batch_size, sequence_length, num_features], i.e., a batch of sequences (see the NumPy sketch below).
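As a quick illustration of this layout, the NumPy sketch below builds a batch of sequences; the specific sizes are illustrative assumptions, not values from the source.

```python
import numpy as np

# Illustrative dimensions (assumed): 32 sequences per batch,
# each 10 time steps long, with 4 features per time step.
batch_size, sequence_length, num_features = 32, 10, 4

# One observation: a 2-D ndarray of shape [sequence_length, num_features]
observation = np.random.randn(sequence_length, num_features)

# The full RNN input: a 3-D ndarray [batch_size, sequence_length, num_features]
batch = np.random.randn(batch_size, sequence_length, num_features)

print(observation.shape)  # (10, 4)
print(batch.shape)        # (32, 10, 4)
```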
Manual Backpropagation
Automatic Differentiation

• Automatic differentiation allows us to compute these gradients via a completely different route:
• Rather than the Operations being the atomic units that make up the network, we define a class that wraps around the data itself and allows the data to keep track of the operations performed on it, so that the data can continually accumulate gradients as it is involved in different operations.
Coding Up Gradient Accumulation
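A minimal sketch of this idea for scalar values: the class name, supported operations, and structure below are illustrative assumptions, not necessarily the exact implementation the source has in mind.

```python
class NumberWithGrad:
    """Wraps a number and records the operations performed on it, so that
    gradients can be accumulated by walking back through those operations."""

    def __init__(self, num, depends_on=None, creation_op=''):
        self.num = num
        self.grad = None                    # accumulated gradient
        self.depends_on = depends_on or []  # inputs that produced this number
        self.creation_op = creation_op      # which operation produced it

    def __add__(self, other):
        other = ensure_number(other)
        return NumberWithGrad(self.num + other.num, [self, other], 'add')

    def __mul__(self, other):
        other = ensure_number(other)
        return NumberWithGrad(self.num * other.num, [self, other], 'mul')

    def backward(self, backward_grad=None):
        if backward_grad is None:           # starting point of backprop
            backward_grad = 1
        # Accumulate rather than overwrite, so data reused in several
        # operations sums the gradient contributions from each of them.
        self.grad = backward_grad if self.grad is None else self.grad + backward_grad
        if self.creation_op == 'add':
            for dep in self.depends_on:
                dep.backward(backward_grad)
        elif self.creation_op == 'mul':
            a, b = self.depends_on
            a.backward(backward_grad * b.num)
            b.backward(backward_grad * a.num)


def ensure_number(x):
    """Wrap a raw int or float in a NumberWithGrad if it is not already one."""
    return x if isinstance(x, NumberWithGrad) else NumberWithGrad(x)


# Usage: a participates in two operations, so its gradient accumulates.
a = NumberWithGrad(3)
b = a * 4 + a
b.backward()
print(a.grad)   # 5 = 4 (from the multiply) + 1 (from the add)
```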
RNN Motivation

1. Use features from time step t = 1 to make predictions for the corresponding target at t = 1.
2. Use features from time step t = 2 as well as the information from t = 1,
including the value of the target at t = 1, to make predictions for t = 2.
3. Use features from time step t = 3 as well as the accumulated
information from t = 1 and t = 2 to make predictions at t = 3.
Recurrent Neural Networks
• Recurrent networks are motivated by data where order matters:
1. Natural language data, where the order of words matters
2. Speech data
3. Time series data
4. Video and music sequence data
5. Stock market data
Architecture
Types of Recurrent Neural Networks
• One to One RNN
• One to Many RNN (e.g., image captioning)
• Many to One RNN (e.g., sentiment analysis)
• Many to Many RNN (e.g., machine translation)
Limitations of RNNs
• Vanishing and Exploding Gradients:
• RNNs suffer from the problem of vanishing and exploding gradients, particularly
when modeling long sequences.
• In backpropagation through time (BPTT), gradients can shrink or grow exponentially, leading to ineffective training for long-term dependencies (see the sketch after this list).
• Short-Term Memory:
• Standard RNNs struggle with capturing long-term dependencies in sequences due to
the vanishing gradient problem.
• This is why RNNs are often replaced by more advanced architectures like LSTMs
(Long Short-Term Memory) and GRUs (Gated Recurrent Units), which are designed
to mitigate this issue.
• Slow Training:
• The sequential nature of RNNs makes them slower to train compared to
non-recurrent models, as information must be passed through each time step
before the next step can be processed.
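To make the vanishing-gradient issue concrete, the sketch below is an illustrative NumPy experiment (not from the source): it repeatedly applies the kind of Jacobian factor that appears in BPTT and prints how quickly the gradient norm shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_steps = 16, 50

# A small recurrent weight matrix, typical after initialization;
# much larger weights would instead make the gradient explode.
W_hh = rng.normal(scale=0.2, size=(hidden_size, hidden_size))
grad = np.ones(hidden_size)

# In BPTT the gradient is repeatedly multiplied by each step's Jacobian,
# roughly W_hh^T @ diag(tanh'(h)); we apply that factor step by step.
for t in range(1, num_steps + 1):
    h = rng.normal(size=hidden_size)                    # stand-in hidden pre-activation
    grad = W_hh.T @ ((1 - np.tanh(h) ** 2) * grad)      # one backward step through time
    if t % 10 == 0:
        print(f"step {t:2d}: gradient norm = {np.linalg.norm(grad):.2e}")
```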
Implementation of RNNs
• Lab
RNN Nodes: Vanilla
• A Vanilla Recurrent Neural Network (RNN) refers to the simplest
form of a Recurrent Neural Network, one of the early models used for
sequence learning tasks.
• Vanilla RNNs are designed to process sequential data by maintaining
a hidden state that is updated at every time step, allowing them to
handle data where previous elements in the sequence influence later
ones.
Key Components of a Vanilla RNN
• Input Layer:
• Accepts sequential data as input. In sequence processing tasks, the input is
typically represented as a series of vectors. For example, in Natural Language
Processing (NLP), words in a sentence might be converted into word
embeddings (vector representations) to serve as the input at each time step.
• Hidden Layer:
• The hidden layer is the core of the RNN, where information from the previous
time steps is stored and updated. The hidden state at each time step stores a
summary of the previous inputs in the sequence.
• The hidden state at time step t is computed as: h_t = tanh(W_hh · h_{t−1} + W_xh · x_t + b_h)
• The recurrent nature of RNNs means that the hidden state at each time step
is influenced by the hidden states of the previous time steps, enabling the
network to "remember" earlier inputs in the sequence.
• Output Layer:
• The output at each time step is computed based on the current hidden state.
This output can be used for tasks such as classification (e.g., predicting the
next word in a sentence) or regression (e.g., forecasting future values in time
series data).
• The output at time step t is computed as: y_t = softmax(W_ho · h_t + b_o)
• Softmax is typically used for multi-class classification tasks to obtain a probability distribution over the classes (a NumPy sketch of one forward step through these equations follows).
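A minimal NumPy sketch of one forward pass through a vanilla RNN cell, using the two equations above; the weight shapes, sizes, and initialization are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Illustrative sizes: 4 input features, 8 hidden units, 3 output classes.
num_features, hidden_size, num_classes = 4, 8, 3
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_size, num_features))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h  = np.zeros(hidden_size)
W_ho = rng.normal(scale=0.1, size=(num_classes, hidden_size))
b_o  = np.zeros(num_classes)

def rnn_step(x_t, h_prev):
    """One vanilla RNN time step: update the hidden state, then compute the output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)  # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    y_t = softmax(W_ho @ h_t + b_o)                  # y_t = softmax(W_ho h_t + b_o)
    return h_t, y_t

# Process one sequence of 10 time steps, carrying the hidden state forward.
sequence = rng.normal(size=(10, num_features))
h = np.zeros(hidden_size)
for x_t in sequence:
    h, y = rnn_step(x_t, h)
print(y)   # probability distribution over the 3 classes at the final step
```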
Vanilla RNN Sentiment analysis model:
• Label Encoding: The target variable is first label-encoded using
Scikit-learn’s LabelEncoder so that the categorical labels can be
represented numerically.
• Splitting the Data: The data is split into training, testing, and
validation sets using Scikit-learn’s train_test_split function.
• Tokenization: The text data is then tokenized using Keras’ Tokenizer
class which converts the text into a sequence of integers.
• Padding: The sequences are then padded to ensure that all of them
have the same length. This is done using Keras’ pad_sequences
function.
• Defining the RNN Model: A sequential model is defined using Keras’
Sequential class. An embedding layer is added to the model, followed
by a SimpleRNN layer, a dropout layer to prevent overfitting, and a
dense layer with softmax activation.
• Compiling the Model: The model is then compiled using an optimizer,
a loss function, and a metric to measure performance.
• Early Stopping: Early stopping is set up using Keras’ EarlyStopping
callback to prevent overfitting and improve training efficiency.
• Training the Model: The model is trained on the training set using
Keras’ fit function.
• Evaluating the Model: Finally, the model is evaluated on the test set using Keras' evaluate function to measure its performance (a minimal Keras sketch of these steps follows this list).
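The sketch below strings these steps together. The dataset variables (texts, labels), vocabulary size, sequence length, and layer sizes are illustrative assumptions rather than values from the source.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dropout, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Assumed to exist: texts (list of strings) and labels (list of class names).
# Label encoding
y = LabelEncoder().fit_transform(labels)

# Train / validation / test split
X_train, X_test, y_train, y_test = train_test_split(texts, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)

# Tokenization and padding (vocabulary size and max length are illustrative)
max_words, max_len = 10000, 100
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)
to_padded = lambda t: pad_sequences(tokenizer.texts_to_sequences(t), maxlen=max_len)
X_train, X_val, X_test = map(to_padded, (X_train, X_val, X_test))

# Define the vanilla RNN model: embedding -> SimpleRNN -> dropout -> softmax
num_classes = len(np.unique(y))
model = Sequential([
    Embedding(max_words, 64),
    SimpleRNN(64),
    Dropout(0.5),
    Dense(num_classes, activation='softmax'),
])

# Compile, train with early stopping, and evaluate
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=20, batch_size=32, callbacks=[early_stop])
loss, accuracy = model.evaluate(X_test, y_test)
```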
Limitations of Vanilla RNNs
• Short-Term Memory:
• Vanilla RNNs can only capture dependencies over a short time window. As
the time step grows, the model tends to "forget" the earlier parts of the
sequence due to the vanishing gradient problem.
• Difficulty in Learning Long-Term Dependencies:
• Vanilla RNNs struggle to model sequences where long-term dependencies are
important. For example, in language processing, if the model needs to
remember a subject introduced early in the sentence to predict a verb much
later, Vanilla RNNs may fail to capture this relationship effectively.
• Training Instability:
• Due to the vanishing and exploding gradient problems, training Vanilla RNNs
can be unstable, making it difficult to converge on a good solution.
Applications of Vanilla RNNs
• Despite their limitations, Vanilla RNNs are useful in tasks where
sequence lengths are relatively short or where long-term
dependencies are not crucial. Some applications include:
• Time Series Forecasting: Predicting future values in a time series
based on recent observations.
• Sequence Labeling: Assigning labels to each element in a sequence
(e.g., part-of-speech tagging in NLP).
• Simple Language Models: Predicting the next word or character in a
sequence for text generation.
LSTM
• LSTM (Long Short-Term Memory) is a type of Recurrent Neural
Network (RNN) architecture specifically designed to address the
shortcomings of traditional RNNs, particularly the vanishing gradient
problem that makes it difficult for RNNs to learn long-term
dependencies.
• LSTM introduces specialized gating mechanisms that control the flow
of information in and out of the network's memory (the cell state).
This allows LSTMs to maintain long-term dependencies by selectively
remembering or forgetting information across time steps, making
them highly effective in tasks involving sequential data.
Key Concepts in LSTM
• Cell State:
• The cell state is a key feature that differentiates LSTMs from RNNs. It acts as a
memory that runs through the entire sequence, allowing the LSTM to store
information over long time periods.
• The cell state can be thought of as a conveyor belt that runs through the LSTM, with
minor linear interactions, allowing information to flow unchanged unless the LSTM
chooses to modify it through its gating mechanisms.
• Gates:
• LSTMs use three main types of gates (forget, input, and output gates) to control the
flow of information:
• Forget Gate: Determines which information to discard from the cell state.
• Input Gate: Decides which new information to store in the cell state.
• Output Gate: Controls the output at each time step and what parts of the cell state will be
exposed.
• Each gate is controlled by a sigmoid function, which outputs a value
between 0 and 1, where 0 means "completely forget" or "ignore" and 1
means "completely retain" or "keep."
LSTM Architecture
• The structure of an LSTM unit consists of several components
working together to update the cell state and the hidden state at
each time step.
• Forget Gate:
• The forget gate controls which parts of the previous cell state C_{t−1} should be retained or forgotten. This gate takes the previous hidden state h_{t−1} and the current input x_t as inputs and passes them through a sigmoid activation function: f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
• Input Gate:
• The input gate controls how much new information will be added to the cell state. It consists of two parts:
• A sigmoid layer that decides which values to update: i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
• A tanh layer that creates new candidate values, which can be added to the cell state: C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
• Updating the Cell State:
• The cell state is updated by combining the outputs of the forget gate and the input gate. The previous cell state C_{t−1} is multiplied by the forget gate output f_t (to discard irrelevant information), and the new candidate values C̃_t are scaled by the input gate i_t and added to the cell state:
• C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
• This update allows the LSTM to both forget old information and add new, relevant information at each time step.
• Output Gate:
• The output gate determines what part of the current cell state should be output at time step t, which contributes to both the next hidden state h_t and the final output y_t: o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
• The final hidden state h_t is obtained by applying a tanh activation to the updated cell state and then multiplying it by the output gate o_t: h_t = o_t ⊙ tanh(C_t) (a NumPy sketch of these gate computations follows).
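A minimal NumPy sketch of one LSTM forward step built from the gate equations above; the weight shapes, sizes, and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 4 input features, 8 hidden units.
num_features, hidden_size = 4, 8
rng = np.random.default_rng(0)
concat_size = hidden_size + num_features

# One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t].
W_f, W_i, W_C, W_o = (rng.normal(scale=0.1, size=(hidden_size, concat_size)) for _ in range(4))
b_f, b_i, b_C, b_o = (np.zeros(hidden_size) for _ in range(4))

def lstm_step(x_t, h_prev, C_prev):
    """One LSTM time step: forget gate, input gate, cell update, output gate."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate cell values
    C_t = f_t * C_prev + i_t * C_tilde       # cell state update
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

# Run a sequence of 10 steps, carrying hidden state and cell state forward.
h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(10, num_features)):
    h, C = lstm_step(x_t, h, C)
print(h.shape, C.shape)   # (8,) (8,)
```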
Advantages of LSTMs
• Ability to Learn Long-Term Dependencies:
• LSTMs are specifically designed to remember information for long periods and
selectively forget irrelevant information. This makes them particularly effective for
tasks involving long-term dependencies, such as language modeling, machine
translation, and time-series forecasting.
• Solves Vanishing Gradient Problem:
• LSTMs mitigate the vanishing gradient problem because the cell state allows
gradients to flow unchanged through many time steps. By using the forget and
input gates, LSTMs can regulate the amount of information stored or forgotten at
each time step.
• Selective Memory:
• The gating mechanisms in LSTMs allow the model to decide when to remember or
forget information, making them more powerful than Vanilla RNNs.
• Flexibility:
• LSTMs can handle varying input lengths, making them ideal for tasks involving
sequences of variable length, like text, video, or speech.
Challenges of LSTMs
• Computational Complexity:
• LSTMs are computationally more expensive than RNNs because they involve
multiple gate mechanisms, resulting in a greater number of parameters to
train. This can slow down the training process, especially for large datasets or
long sequences.
• Training Time:
• LSTMs can take longer to train compared to simpler architectures like Vanilla
RNNs or even GRUs (a simpler variant of LSTMs). The complexity of the
LSTM’s structure can lead to longer convergence times.
• Memory Requirements:
• Due to the need to maintain a cell state and multiple gates, LSTMs require
more memory during training and inference.
Applications of LSTMs
• LSTMs are widely used in applications that involve sequence data:
• Natural Language Processing (NLP): language modeling, text generation, machine translation, named entity recognition (NER), speech recognition, chatbots and dialogue systems
• Time Series Prediction: forecasting stock prices, weather prediction, anomaly detection in sequential data (e.g., server logs, sensor data)
• Audio Processing: speech-to-text systems, music generation, sound classification
• Video Processing: action recognition in videos, video captioning, video summarization
• Healthcare: predicting patient outcomes based on time-series health data
GRU
• The Gated Recurrent Unit (GRU) is a type of Recurrent Neural
Network (RNN) architecture introduced as a simpler alternative to the
Long Short-Term Memory (LSTM) network.
• Both GRUs and LSTMs were designed to address the vanishing
gradient problem and enable learning long-term dependencies, but
GRUs are more computationally efficient and have fewer parameters
than LSTMs.
• The GRU architecture simplifies the LSTM by combining certain
elements, reducing the number of gates and directly integrating
memory with the hidden state.
Key Components of GRU
• Update Gate:
• Controls how much of the previous hidden state should be retained and how
much of the current input should be added. It plays a similar role to the
forget and input gates in LSTMs but combines their functions into one.
• Reset Gate:
• Determines how much of the previous hidden state to ignore or "reset" when
updating the hidden state based on new input. This is crucial for tasks where
some parts of the sequence might be irrelevant for certain predictions.
• Unlike LSTMs, GRUs do not have a separate cell state. Instead, they rely on the hidden state alone, which is modified directly based on the input and the previous hidden state.
GRU Architecture
Update gate
• We start with calculating the update gate z_t for time step t using the formula: z_t = σ(W_z · x_t + U_z · h_{t−1})
• The update gate helps the model to determine how much of the past information (from previous time steps) needs to be passed along to the future.
Reset gate
• This gate is used by the model to decide how much of the past information to forget. To calculate it, we use: r_t = σ(W_r · x_t + U_r · h_{t−1})
Current memory content
• The candidate hidden state uses the reset gate to store the relevant information from the past: h̃_t = tanh(W · x_t + r_t ⊙ U · h_{t−1})
Final memory at current time step
• The final hidden state blends the previous hidden state and the candidate, weighted by the update gate: h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t (a NumPy sketch of these equations follows).
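A minimal NumPy sketch of one GRU forward step using the gate equations above; the sizes and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 4 input features, 8 hidden units.
num_features, hidden_size = 4, 8
rng = np.random.default_rng(0)

# Separate input-to-hidden (W_*) and hidden-to-hidden (U_*) weights per gate.
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(hidden_size, num_features)) for _ in range(3))
U_z, U_r, U_h = (rng.normal(scale=0.1, size=(hidden_size, hidden_size)) for _ in range(3))

def gru_step(x_t, h_prev):
    """One GRU time step: update gate, reset gate, candidate state, final blend."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)               # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)               # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev))   # candidate hidden state
    h_t = z_t * h_prev + (1 - z_t) * h_tilde              # final hidden state
    return h_t

# Run a sequence of 10 steps; no separate cell state is needed.
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(10, num_features)):
    h = gru_step(x_t, h)
print(h.shape)   # (8,)
```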
Key Differences Between GRU and LSTM
• Gates:
• GRUs have two gates (update and reset), while LSTMs have three gates (input,
forget, and output). GRUs simplify the architecture by combining the functions of
input and forget gates into a single update gate.
• Memory Cell:
• GRUs do not have a separate memory cell like LSTMs. Instead, the hidden state h_t directly carries forward the information, simplifying the computation.
• Computational Efficiency:
• GRUs are generally faster and require less memory than LSTMs because they have fewer parameters. This makes GRUs suitable for applications with limited resources or real-time processing requirements (a rough parameter count comparison follows this list).
• Performance:
• Empirically, GRUs can perform as well as LSTMs on many tasks, though their
performance can vary depending on the dataset and the nature of the
dependencies. Generally, GRUs work well with simpler sequences, while LSTMs are
preferred when handling long-term dependencies in more complex sequences.
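As a rough illustration of the parameter savings, the calculation below counts weights for a single LSTM layer versus a single GRU layer using the classic formulations above; the layer sizes are illustrative, and it ignores framework-specific details (such as extra bias vectors), so exact library counts may differ slightly.

```python
# Each gate acts on [h_{t-1}, x_t] plus a bias: units * (units + input_dim) + units.
input_dim, units = 128, 256
per_gate = units * (units + input_dim) + units

lstm_params = 4 * per_gate   # forget, input, candidate, output
gru_params  = 3 * per_gate   # update, reset, candidate

print(f"LSTM parameters: {lstm_params:,}")                  # 394,240
print(f"GRU parameters:  {gru_params:,}")                   # 295,680
print(f"GRU / LSTM ratio: {gru_params / lstm_params:.2f}")  # 0.75
```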
Advantages of GRUs
• Simplicity and Efficiency:
• GRUs have fewer parameters and a simpler architecture, which leads to
faster training and inference times, especially on large datasets or in
resource-constrained environments.
• Better Generalization in Some Tasks:
• In tasks where long-term dependencies aren’t as crucial, GRUs may perform
as well or better than LSTMs. They often generalize better on simpler, shorter
sequences due to their more streamlined structure.
• Alleviates Vanishing Gradient Problem:
• Similar to LSTMs, GRUs also help mitigate the vanishing gradient problem,
allowing for learning longer dependencies in sequence data.
Disadvantages of GRUs
• Less Control over Memory:
• LSTMs offer finer control over memory with their additional cell state and
output gate, which can be advantageous for more complex sequences
requiring long-term memory. GRUs may not perform as well in cases where
fine-grained control over memory is critical.
• Potentially Inferior for Long Sequences:
• In some cases, GRUs may struggle with very long-term dependencies because
they lack the explicit cell state. LSTMs, with their additional cell state, may
have a slight edge in such tasks.
Applications of GRUs
• GRUs are commonly used in applications that involve sequential or time-series data:
• Natural Language Processing (NLP): text classification, machine translation, question answering, sentiment analysis, language modeling
• Speech Recognition: GRUs are used in end-to-end speech-to-text systems and other audio-related tasks where computational efficiency is crucial.
• Time-Series Prediction: used in stock price prediction, weather forecasting, and anomaly detection in various time-series data.
• Healthcare: GRUs can be applied to predict patient outcomes using time-series health data, such as heart rate and blood pressure measurements.
• Music and Audio Generation: GRUs have been used for generating music sequences, audio synthesis, and sound classification.
