Unit-4 (2)
Topics:
Introduction to RNNs, basic building blocks of RNNs and other architectural
details, GRU, LSTMs, Encoder-Decoder models, Seq2Seq models
NLP application: Language Translation (Machine Translation) - Attention
mechanism.
Introduction to Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs) are a class of neural networks
designed to handle sequential data, where the order of inputs matters.
Unlike traditional feedforward neural networks, which treat each input
independently, RNNs are designed to use information from previous time steps
to inform the current decision, making them highly suitable for tasks involving
sequences such as time series forecasting, natural language processing (NLP),
speech recognition, and more.
Key Characteristics of RNNs:
1. Sequential Processing: RNNs have a built-in mechanism to process
sequential data. At each time step in the sequence, the network receives
an input and passes the output to the next step. The key feature is the
"feedback loop," which allows the network to remember previous states
in the sequence, capturing temporal dependencies.
2. Hidden States: At each time step, an RNN maintains a hidden state,
which is a representation of information learned from previous inputs.
The hidden state is updated as the network processes each new input,
capturing context from earlier parts of the sequence.
3. Parameter Sharing: Unlike traditional neural networks, RNNs share
parameters (weights) across all time steps, making them efficient for
processing sequences of varying lengths. This means the same set of
weights is applied at each step in the sequence, allowing the network to
generalize across different inputs.
4. Memory of Past Events: RNNs retain a form of memory through the
hidden state, which theoretically allows them to remember information
from previous time steps. This memory is crucial for capturing long-
range dependencies in the data.
Basic Structure of an RNN:
An RNN can be represented as:
At each time step t, the input xt is processed, and the hidden state ht is
updated based on the previous hidden state ht-1 and the current input xt.
The output yt can be generated from the hidden state ht, depending on the
specific task.
The general equations for an RNN are:
Hidden state update:
ht = f(Wh ⋅ ht-1 + Wx ⋅ xt + b)
where:
o Wh is the weight matrix for the previous hidden state.
o Wx is the weight matrix for the input.
o b is the bias term.
o f is an activation function (usually a non-linear function like tanh
or ReLU).
Output generation:
yt = g(Wo ⋅ ht + c)
where:
o Wo is the weight matrix for the output.
o c is the output bias.
o g is an activation function (e.g., Softmax for classification tasks).
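To make these two equations concrete, here is a minimal NumPy sketch of a single RNN time step; the function name, dimensions, and random weights are illustrative assumptions, not part of the original notes.

import numpy as np

def rnn_step(x_t, h_prev, W_h, W_x, b, W_o, c):
    """One RNN time step: update the hidden state, then produce an output."""
    # Hidden state update: h_t = tanh(W_h . h_{t-1} + W_x . x_t + b)
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b)
    # Output generation: y_t = softmax(W_o . h_t + c)
    logits = W_o @ h_t + c
    y_t = np.exp(logits - logits.max())
    y_t /= y_t.sum()
    return h_t, y_t

# Toy dimensions (illustrative only)
input_dim, hidden_dim, output_dim = 4, 8, 3
rng = np.random.default_rng(0)
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_x = rng.normal(size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)
W_o = rng.normal(size=(output_dim, hidden_dim))
c = np.zeros(output_dim)

h = np.zeros(hidden_dim)            # initial hidden state h_0
x = rng.normal(size=input_dim)      # one input x_t
h, y = rnn_step(x, h, W_h, W_x, b, W_o, c)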
Challenges with Traditional RNNs:
While RNNs are effective at processing sequences, they suffer from two main
problems:
1. Vanishing Gradient Problem: During backpropagation, gradients can
become very small, which leads to slow or ineffective learning, especially
for long sequences. This makes it hard for traditional RNNs to capture
long-range dependencies in data.
2. Exploding Gradient Problem: On the flip side, gradients can sometimes
grow exponentially, causing instability during training.
Solutions to Challenges:
To address these challenges, specialized architectures have been developed:
1. Long Short-Term Memory (LSTM): LSTMs are a type of RNN
designed to overcome the vanishing gradient problem. They include
memory cells that can store information for long periods and gates that
control the flow of information, allowing LSTMs to better capture long-
term dependencies.
2. Gated Recurrent Units (GRUs): GRUs are similar to LSTMs but with a
simpler structure. They combine the cell state and hidden state into a
single vector and use fewer gates, making them computationally more
efficient while still addressing the vanishing gradient issue.
Applications of RNNs:
1. Natural Language Processing (NLP):
o Language modelling and text generation.
o Machine translation (e.g., English to French).
o Sentiment analysis and text classification.
2. Speech Recognition:
o Converting spoken language into text.
o Recognizing patterns in voice signals.
3. Time Series Forecasting:
o Predicting future values based on past data, e.g., stock prices or
weather patterns.
4. Video Processing:
o Recognizing patterns in video frames where temporal relationships
between frames matter.
5. Music Generation:
o Composing new music sequences based on learned patterns.
Building blocks of RNN:
The basic building blocks of a Recurrent Neural Network (RNN) involve
several components that enable it to process sequential data. The key elements
are:
1. Input Layer:
o The input layer receives the sequence of data at each time step. For
example, in natural language processing, it could be words or
characters. Each element in the sequence is fed into the network
one at a time.
2. Hidden State:
o The hidden state is a key feature of RNNs. It stores information
about the previous time steps, allowing the network to maintain
memory of the past inputs. This hidden state is updated after each
time step.
o The hidden state at time step t is denoted as ht, and it is computed
based on the input at the current time step xt and the previous
hidden state ht-1.
3. Recurrent Connection:
o This is the crucial feature of RNNs: the recurrent connection
allows information to persist over time. The hidden state ht is
passed as input to the network at the next time step, linking the
current state with the past state.
o Mathematically, this can be represented as:
ht = f(Wh⋅ht-1+Wx⋅xt+b)
where,
ht is the hidden state at time t,
xt is the input at time t,
Wh is the weight matrix for the hidden state,
Wx is the weight matrix for the input,
b is the bias term, and
f is the activation function (like tanh or ReLU).
4. Output Layer:
After processing the input and updating the hidden state, the RNN
generates an output at each time step or at the end of the sequence.
This output could be a prediction or a class label, depending on the
task.
The output yt at each time step t is calculated from the hidden state ht:
yt = Wy ⋅ ht + by
where:
Wy is the weight matrix for the output layer, and by is the
output bias.
5. Activation Function:
An activation function, such as tanh or ReLU, is applied to the
hidden state to introduce non-linearity. This allows the network to
model complex patterns.
These components allow RNNs to model sequential data by retaining
information about previous time steps and using that to influence the current
prediction.
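As an illustration of how these building blocks fit together, the following NumPy sketch unrolls a simple RNN over a toy sequence; note that the same W_h, W_x, and b are reused at every time step (parameter sharing). The dimensions and random data are illustrative assumptions.

import numpy as np

def rnn_forward(x_seq, W_h, W_x, b):
    """Unroll an RNN over a whole sequence; the same weights are reused at every step."""
    h = np.zeros(W_h.shape[0])                     # h_0
    hidden_states = []
    for x_t in x_seq:                              # one element of the sequence per time step
        h = np.tanh(W_h @ h + W_x @ x_t + b)       # parameter sharing: same W_h, W_x, b each step
        hidden_states.append(h)
    return np.stack(hidden_states)                 # shape: (T, hidden_dim)

# Toy sequence of length T = 5 with 4-dimensional inputs (illustrative only)
rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 4, 8
x_seq = rng.normal(size=(T, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_x = rng.normal(size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)
H = rnn_forward(x_seq, W_h, W_x, b)                # H[t] is the hidden state after step t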
RNN Architecture:
The architecture of an RNN is designed to process sequential data by
maintaining a form of "memory" through its hidden states. The network
processes inputs one at a time, and the hidden states are updated at each time
step to store information about previous inputs in the sequence. Let's go through
the architecture of an RNN in more detail:
Input Layer
Input Sequence: The input layer of an RNN takes a sequence of inputs.
At each time step t, the RNN processes a single input xt, which could
represent any element in the sequence (e.g., a word in a sentence or a
time step in a time series).
Shape: If the input sequence has a length of T, the network will process T
time steps.
Input Representation: Each input xt could be a vector that represents the
current element, such as an embedding vector for words in NLP tasks.
Recurrent Hidden Layer
The heart of an RNN is the recurrent hidden layer. At each time step, this
layer receives the current input xt as well as the hidden state from the previous
time step ht-1. The purpose of the hidden layer is to maintain memory over the
sequence and carry information from one step to the next.
Hidden State Update:
o The hidden state ht at time step t is computed using the current
input xt and the previous hidden state ht-1.
o The update rule is as follows:
ht = f(Wh⋅ht-1+Wx⋅xt+b)
where,
ht is the hidden state at time t,
xt is the input at time t,
Wh is the weight matrix for the hidden state,
Wx is the weight matrix for the input,
b is the bias term, and
f is the activation function (like tanh or ReLU).
Activation Function: The activation function f introduces non-linearity
into the model, enabling the RNN to learn more complex
relationships. Commonly used activation functions are tanh (which squashes
values to the range [-1, 1]) and ReLU.
Key Takeaways:
Sequential Processing: RNNs process data in a sequential manner, using
a recurrent connection to retain information across time steps.
Hidden States: The hidden state stores information about previous inputs
and is updated at each time step.
Output Generation: The RNN generates outputs either at each time step
or after processing the entire sequence.
Training with BPTT: RNNs are trained with Backpropagation Through
Time, adjusting weights based on the gradient of the loss function.
This architecture makes RNNs powerful for tasks involving sequential
data, but they can struggle with long-term dependencies, which is where
LSTMs and GRUs provide improvements.
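To make the "Training with BPTT" point concrete, below is a minimal PyTorch sketch of one training step for a many-to-one RNN classifier; calling loss.backward() propagates gradients backwards through every time step of the unrolled sequence. The dimensions, random data, and classification setup are illustrative assumptions.

import torch
import torch.nn as nn

# Minimal many-to-one RNN classifier; dimensions and data are illustrative.
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 3)                       # maps the last hidden state to 3 classes
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(16, 5, 4)                    # batch of 16 sequences, length 5, input dim 4
y = torch.randint(0, 3, (16,))               # random class labels

outputs, h_n = rnn(x)                        # outputs: (16, 5, 8); h_n: final hidden state
logits = head(outputs[:, -1, :])             # use the hidden state at the last time step
loss = loss_fn(logits, y)
loss.backward()                              # Backpropagation Through Time: gradients flow
                                             # backwards across all 5 time steps
optimizer.step()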
GRU:
A GRU (Gated Recurrent Unit) is a type of Recurrent Neural Network
(RNN) architecture designed to address the issues faced by traditional RNNs,
particularly the vanishing gradient problem.
GRUs are often used in sequence modelling tasks such as time series
prediction, language modelling, and machine translation. Below is a detailed
breakdown of GRU in the context of RNNs.
Recurrent Neural Networks (RNNs) Overview:
RNNs are a class of neural networks that are designed to handle
sequential data by maintaining a hidden state that is updated at each timestep
based on both the current input and the previous hidden state.
While RNNs can capture temporal dependencies in sequences, they suffer
from issues when learning long-range dependencies due to the vanishing
gradient problem. This makes it difficult for traditional RNNs to remember
information over long sequences.
The Gated Recurrent Unit (GRU):
The GRU is a modification of the traditional RNN that introduces gates
to control the flow of information. These gates help the network decide which
information should be retained and which should be forgotten, thereby
mitigating the vanishing gradient problem and enabling better learning of long-
range dependencies.
GRU Architecture:
A GRU cell consists of two main gates:
1. Update Gate (z_t): Controls the amount of previous memory to retain in
the current state. It decides how much of the past information should be
passed to the next timestep.
2. Reset Gate (r_t): Controls how much of the previous memory should be
ignored when computing the current state.
Let's break down the steps involved in computing the GRU:
1. Input and Hidden State: At each timestep, the GRU takes an input
vector xt and the previous hidden state ht-1.
2. Update Gate Calculation: The update gate determines how much of the
previous memory should be carried over to the next timestep. It is
computed as follows:
zt = σ (Wz . xt + Uz . ht-1 + bz)
where:
o Wz and Uz are weight matrices for the input and previous hidden
state, respectively.
o bz is a bias term.
o σ is the sigmoid activation function that squashes the values to the
range [0, 1].
3. Reset Gate Calculation: The reset gate determines how much of the
previous memory should be used in computing the candidate hidden state.
It is calculated as:
rt = σ (Wr . xt + Ur . ht-1 + br)
where:
o Wr, Ur, and br are weight matrices and bias term for the reset gate.
4. Candidate Hidden State: The candidate hidden state h̃t is computed
based on the reset gate. This candidate state represents the new potential
memory at the current timestep, not yet combined with the previous
memory:
h̃t = tanh (Wh . xt + Uh (rt ⊙ ht-1) + bh)
where:
o ⊙ denotes element-wise multiplication (Hadamard product).
o tanh is the hyperbolic tangent activation function.
5. Final Hidden State: Finally, the GRU cell computes the final hidden
state ht by blending the previous hidden state and the candidate hidden
state using the update gate:
ht = (1−zt) ⊙ ht-1 + zt ⊙ h̃t
Here:
o (1−zt) controls how much of the previous hidden state ht-1 is retained.
o zt controls how much of the new candidate state h̃t is taken into account.
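The five steps above can be sketched directly in NumPy. The following single-step GRU cell is an illustrative implementation of those equations; variable names and dimensions are assumptions, not part of the notes.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One GRU time step following the update/reset/candidate/blend equations above."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)                 # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)                 # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev) + bh)     # candidate hidden state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                 # blend old and new memory
    return h_t

# Toy dimensions (illustrative only)
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
mk = lambda *shape: rng.normal(size=shape)
Wz, Uz, bz = mk(hidden_dim, input_dim), mk(hidden_dim, hidden_dim), np.zeros(hidden_dim)
Wr, Ur, br = mk(hidden_dim, input_dim), mk(hidden_dim, hidden_dim), np.zeros(hidden_dim)
Wh, Uh, bh = mk(hidden_dim, input_dim), mk(hidden_dim, hidden_dim), np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
x = rng.normal(size=input_dim)
h = gru_step(x, h, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh)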
Key Properties of GRUs:
1. Gating Mechanism: The two gates—update and reset—allow the GRU
to selectively remember or forget information, addressing the vanishing
gradient problem more effectively than vanilla RNNs.
2. Fewer Parameters: GRUs are simpler than Long Short-Term Memory
(LSTM) networks, another popular RNN variant, because they have
fewer gates (just two, compared to LSTMs' three). This makes GRUs
computationally more efficient while still achieving strong performance
in many tasks.
3. Memory Retention: The update gate allows the GRU to keep track of
long-term dependencies, while the reset gate enables it to forget irrelevant
parts of the previous hidden state.
Advantages of GRUs:
Better at Capturing Long-Term Dependencies: The gating mechanism
allows the GRU to capture both short-term and long-term dependencies,
unlike vanilla RNNs, which struggle with longer sequences.
Reduced Computational Complexity: GRUs tend to require fewer
parameters than LSTMs due to having only two gates instead of three,
which can lead to faster training and fewer resources required.
Flexibility: GRUs perform comparably to LSTMs in many cases but with
less complexity, making them a popular choice for sequence modelling
tasks.
LSTM (Long Short-Term Memory):
LSTMs extend the basic RNN with a cell state Ct and three gates (forget,
input, and output) that control the flow of information, allowing the network
to retain information over long sequences and counteract the vanishing
gradient problem. The gates are computed from the previous hidden state ht-1
and the current input xt as follows.
1) Forget Gate:
Decides how much of the previous cell state Ct-1 should be kept
and how much should be discarded.
Formula:
ft = σ (Wf ⋅ [ht-1 , xt] + bf)
where:
o ft is the forget gate’s output.
o σ is the sigmoid activation function.
o Wf is the weight matrix for the forget gate.
o bf is the bias term.
o ht-1 is the previous hidden state.
o xt is the current input.
2) Input Gate:
Determines how much of the new information should be
stored in the cell state.
Two parts: First, a sigmoid layer decides which values will be
updated. Second, a tanh layer creates new candidate values to be
added to the cell state.
Formula:
it = σ (Wi ⋅ [ht-1 , xt] + bi)
C̃t = tanh (WC ⋅ [ht-1 , xt] + bC)
where:
it is the input gate’s output.
C̃t is the candidate cell state, a potential update to the
cell state.
tanh is the tanh activation function.
The cell state is then updated by combining the retained old
memory and the new candidate: Ct = ft ⊙ Ct-1 + it ⊙ C̃t.
3) Output Gate:
Controls the output of the LSTM unit. It decides what the
next hidden state should be, based on the current input, previous
hidden state, and current cell state.
Formula:
ot = σ (Wo ⋅ [ht-1 , xt] + bo)
ht = ot ⋅ tanh (Ct)
where:
ot is the output gate’s output.
ht is the new hidden state.
Ct is the cell state at time t.
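Putting the three gates together, here is an illustrative NumPy sketch of one LSTM time step. It includes the standard cell-state update Ct = ft ⊙ Ct-1 + it ⊙ C̃t described above; variable names and dimensions are assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, WC, bC, Wo, bo):
    """One LSTM time step: forget, input, and output gates plus the cell-state update."""
    concat = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ concat + bf)                     # forget gate
    i_t = sigmoid(Wi @ concat + bi)                     # input gate
    C_tilde = np.tanh(WC @ concat + bC)                 # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde                  # cell-state update
    o_t = sigmoid(Wo @ concat + bo)                     # output gate
    h_t = o_t * np.tanh(C_t)                            # new hidden state
    return h_t, C_t

# Toy dimensions (illustrative only)
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8
mk = lambda: rng.normal(size=(hidden_dim, hidden_dim + input_dim))
Wf, Wi, WC, Wo = mk(), mk(), mk(), mk()
bf = bi = bC = bo = np.zeros(hidden_dim)

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
x = rng.normal(size=input_dim)
h, C = lstm_step(x, h, C, Wf, bf, Wi, bi, WC, bC, Wo, bo)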
Advantages of LSTM
LSTMs offer several key benefits over traditional RNNs:
Long-term Memory: Due to the cell state and gates, LSTMs
can capture long-term dependencies in sequential data and
avoid the vanishing gradient problem.
Better Performance on Sequential Tasks: LSTMs excel in
tasks like time-series forecasting, machine translation, and
speech recognition, where understanding context over long
sequences is crucial.
Flexibility: LSTMs can be combined with other architectures
like Convolutional Neural Networks (CNNs) and used for a
variety of complex tasks such as video analysis, music
composition, and text generation.
Variants of LSTM
There are several variants of LSTM that improve its performance or adapt
it for specific tasks:
1. Bidirectional LSTMs:
These networks process sequences in both forward and backward
directions. They have two LSTM layers: one processing the sequence
from left to right, and the other from right to left. The final output is a
combination of the two.
2. GRU (Gated Recurrent Unit):
GRUs are similar to LSTMs but use a simpler architecture with only two
gates: an update gate and a reset gate. They are computationally more
efficient but still effective in many tasks.
3. Peephole Connections:
These LSTMs have connections from the cell state to the gates, which
provide additional control over the gating mechanism.
Applications of LSTMs
LSTMs are widely used in many areas, including:
Natural Language Processing (NLP): LSTMs are used for tasks such as
machine translation, speech-to-text, and language modelling.
Time Series Prediction: LSTMs can be used to predict future values in a
sequence, such as stock prices or weather forecasts.
Anomaly Detection: In applications like fraud detection or network
security, LSTMs can help identify unusual patterns in sequential data.
Generative Models: LSTMs are used in generating sequences such as
text, music, or even art.
Conclusion
LSTM networks are a powerful extension of traditional RNNs, designed
to overcome issues such as the vanishing gradient problem. They allow for the
modelling of long-term dependencies in sequential data by using memory cells
and gates to control the flow of information. While more complex than regular
RNNs, LSTMs have become a standard tool in a wide range of applications,
from natural language processing to time series prediction.
Encoder-decoder models:
Encoder-decoder models in Recurrent Neural Networks (RNNs) are a
type of neural network architecture widely used for sequence-to-sequence tasks,
such as machine translation, speech recognition, and text summarization. The
encoder-decoder structure is particularly useful for tasks where the input and
output are sequences of varying lengths, and it consists of two main parts: the
encoder and the decoder.
1. Overview of Encoder-Decoder Architecture in RNNs
The encoder-decoder model has two RNN components:
Encoder: Encodes the input sequence into a fixed-size context vector
(also called a "thought vector").
Decoder: Decodes the context vector into an output sequence.
This architecture is designed to handle tasks where the input and output are
sequences, such as translating a sentence from one language to another, or
generating a summary of a document.
Key Steps in an Encoder-Decoder Model:
1. Encoding Phase (Encoder RNN):
o The input sequence is fed into the encoder RNN one element
(word, character, etc.) at a time.
o The encoder processes each element of the input sequence and
updates its internal hidden state.
o At the end of the input sequence, the final hidden state of the
encoder represents the context vector which contains the
important information about the entire input sequence.
2. Decoding Phase (Decoder RNN):
o The context vector generated by the encoder is then used as the
initial hidden state of the decoder RNN.
o The decoder generates the output sequence, one element at a time,
based on the context vector and the previously generated tokens (or
the ground truth tokens in training).
o The decoder continues producing output until a special end token
(often <EOS>) is predicted or a predefined length is reached.
The Role of RNNs in Encoder-Decoder Models
RNNs are well-suited for encoder-decoder models because of their ability
to handle sequential data. RNNs process one element of the input sequence at a
time while maintaining a hidden state that is updated at each time step.
The idea is that the hidden state at the final time step of the encoder will
contain all the information about the input sequence, which will then be passed
to the decoder.
However, regular RNNs suffer from limitations such as difficulty in
learning long-term dependencies due to the vanishing gradient problem.
This is why more advanced architectures like Long Short-Term
Memory (LSTM) and Gated Recurrent Unit (GRU) are often used in
practice, as they are designed to handle long-range dependencies more
effectively.
Details of the Encoder in RNN Models
In the encoder part, the input sequence is fed step-by-step into the RNN.
Let’s assume the input sequence is X = [x1, x2, ..., xT], where each xt is an
element of the sequence (for example, a word or a character).
The RNN updates its hidden state at each time step t based on the current
input xt and the previous hidden state ht-1:
ht = f (W⋅xt + U⋅ht-1+b)
Where:
ht is the hidden state at time t,
f is the activation function (typically a tanh or ReLU),
W and U are weight matrices,
b is a bias term.
At the final time step t=T, the encoder produces a context vector c, which
is often taken as the last hidden state hT of the encoder. This context vector is
intended to contain all the relevant information about the input sequence.
Details of the Decoder in RNN Models
The decoder generates the output sequence Y = [y1, y2, ..., yT′], which can
be of a different length T′ from the input sequence. The decoder is conditioned on
the context vector c produced by the encoder.
At each time step, the decoder uses its hidden state h′t and the context
vector c to predict the next token in the output sequence. The decoder RNN
takes in the previous hidden state and the previous output token yt-1 to produce
the next token in the sequence.
The decoder updates its hidden state and computes the next output as
follows:
h′t = f (W′ ⋅ yt-1 + U′ ⋅ h′t-1 + b′)
Where:
yt-1 is the previous output token (for the first token, this is usually a
special token like <START>),
h′t is the decoder’s hidden state at time t,
W′, U′, and b′ are the decoder’s weight matrices and bias term.
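The following PyTorch sketch shows a minimal GRU-based encoder-decoder of the kind described above; the vocabulary sizes, dimensions, and start-of-sequence token id are illustrative assumptions, and greedy decoding is used for simplicity.

import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 100, 120, 32, 64
SOS = 1                                          # assumed id of the <START> token

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src):                      # src: (batch, T_src) token ids
        _, h_n = self.rnn(self.embed(src))
        return h_n                               # context vector = final hidden state

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, prev_token, hidden):       # one decoding step
        emb = self.embed(prev_token).unsqueeze(1)        # (batch, 1, EMB)
        output, hidden = self.rnn(emb, hidden)
        return self.out(output.squeeze(1)), hidden       # logits over the target vocabulary

encoder, decoder = Encoder(), Decoder()
src = torch.randint(3, SRC_VOCAB, (4, 7))        # a batch of 4 source sequences of length 7
hidden = encoder(src)                            # context vector initialises the decoder
token = torch.full((4,), SOS, dtype=torch.long)  # start decoding from <START>
for _ in range(10):                              # generate up to 10 tokens (greedy decoding)
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1)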
Seq2Seq models
Seq2Seq (Sequence-to-Sequence) models are a type of deep learning
architecture designed for tasks where the input and output are sequences. They
are widely used in natural language processing (NLP) tasks such as machine
translation, text summarization, speech recognition, and more.
Seq2Seq models typically rely on Recurrent Neural Networks (RNNs),
Long Short-Term Memory networks (LSTMs), or Gated Recurrent Units
(GRUs), which are designed to process sequences of data by maintaining hidden
states that capture temporal dependencies.
Key Components of Seq2Seq Models
1. Encoder: The encoder is an RNN (or any of its variants like LSTM or
GRU) that processes the input sequence one element at a time. It encodes
the entire input sequence into a single fixed-length context vector (the
final hidden state of the RNN), which contains information about the
input sequence.
2. Decoder: The decoder is another RNN (or its variants) that takes the
context vector produced by the encoder and generates the output
sequence one element at a time. The output of each timestep is used as
input for the next timestep, which allows the decoder to generate a
sequence as output.
3. Context Vector: The context vector is the final hidden state of the
encoder, which serves as a summary of the input sequence. This vector is
passed to the decoder to help generate the output sequence. The idea is
that the encoder summarizes the input sequence into a compact form that
the decoder uses to produce the output sequence.
4. Attention Mechanism (optional but common in modern Seq2Seq
models): In vanilla Seq2Seq models, the decoder uses a single fixed
context vector for generating the output sequence. However, this can be a
limitation, especially when the input sequence is long. The attention
mechanism allows the model to focus on different parts of the input
sequence at each timestep of the decoding process. It dynamically
calculates a weight distribution over the input sequence and uses these
weights to "attend" to specific parts of the sequence.
How Seq2Seq Works
1. Encoder
The encoder processes the input sequence one step at a time (word by
word, for example) and updates its hidden state. At each timestep t, the encoder
receives an input x_t and produces a hidden state h_t. In the simplest form, the
hidden state at timestep t can be computed as:
h_t = RNN (x_t, ht-1)
Where:
x_t is the input at timestep t,
h_t is the hidden state at timestep t,
ht-1 is the previous hidden state.
The final hidden state hT (where T is the length of the input sequence)
serves as the context vector, which summarizes the input sequence.
2. Decoder
The decoder takes the context vector hT and generates the output
sequence. The decoder works similarly to the encoder, but it uses the context
vector and previous decoder outputs to predict the next token in the sequence.
The decoder generates the sequence step by step. For each timestep t, the
decoder outputs a token (e.g., a word in the case of machine translation). The
probability of each possible output token at timestep t is computed using:
yt = Softmax (Wh . ht + b)
Where:
yt is the output of the decoder (the predicted token),
Wh is a weight matrix,
ht is the hidden state at timestep t,
b is a bias term.
3. Sequence Generation
The decoder produces the sequence token by token. At each step, the
decoder receives the context vector (in the simplest case), and at each timestep,
it predicts a token based on the previous output and the hidden states.
4. Training the Seq2Seq Model
During training, the Seq2Seq model is typically trained using teacher
forcing. In teacher forcing, the true output token from the training data is used
as the next input to the decoder at each timestep, instead of using the model's
predicted output. This accelerates the learning process by providing the correct
context at each step.
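A minimal sketch of teacher forcing during training is shown below; the toy decoder, dimensions, and random target batch are illustrative assumptions. The key line is the one that feeds the true target token, rather than the model's own prediction, into the next decoder step.

import torch
import torch.nn as nn

TGT_VOCAB, EMB, HID, BATCH, T_TGT = 120, 32, 64, 4, 6
embed = nn.Embedding(TGT_VOCAB, EMB)
cell = nn.GRUCell(EMB, HID)
out = nn.Linear(HID, TGT_VOCAB)
loss_fn = nn.CrossEntropyLoss()

h = torch.zeros(BATCH, HID)                         # stand-in for the encoder context vector
tgt = torch.randint(0, TGT_VOCAB, (BATCH, T_TGT))   # ground-truth target tokens

loss = 0.0
prev = tgt[:, 0]                                    # first input (a <START> token in practice)
for t in range(1, T_TGT):
    h = cell(embed(prev), h)                        # one decoder step
    logits = out(h)
    loss = loss + loss_fn(logits, tgt[:, t])        # compare against the true next token
    prev = tgt[:, t]                                # teacher forcing: feed the TRUE token,
                                                    # not the model's own prediction
loss.backward()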
Attention Mechanism
The attention mechanism was introduced to overcome the limitation of
using a single context vector for long sequences.
The attention mechanism allows the model to focus on different parts of
the input sequence when generating each element of the output sequence.
This is done by computing attention weights for each input token at each
timestep of the decoder.
For each timestep t, the attention mechanism computes an alignment
score between the current decoder hidden state ht and all the encoder
hidden states h1, h2, ..., hT.
These alignment scores are then converted into attention weights (usually
through a Softmax operation), which represent how much the model
should "attend" to each part of the input sequence at the current decoding
step.
The context vector for the decoder at each timestep is a weighted sum of
all encoder hidden states, where the weights are the attention scores.
ct = ∑i αt,i ⋅ hi (summing over i = 1, ..., T)
where αt,i represents the attention weight for the i-th input token at
timestep t and hi is the i-th encoder hidden state.
This allows the model to focus on relevant parts of the input sequence
dynamically, improving the handling of long-range dependencies and producing
more accurate outputs.
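The following NumPy sketch computes the attention weights and context vector for a single decoding step, using simple dot-product alignment scores (other scoring functions, such as additive attention, follow the same pattern). All names and dimensions are illustrative.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention_context(decoder_h, encoder_hs):
    """Dot-product attention: score each encoder state, softmax the scores,
    and return the weighted sum as the context vector for this decoding step."""
    scores = encoder_hs @ decoder_h            # alignment score per input position
    alphas = softmax(scores)                   # attention weights, sum to 1
    context = alphas @ encoder_hs              # weighted sum of encoder hidden states
    return context, alphas

# Toy example: 5 encoder hidden states of dimension 8 (illustrative only)
rng = np.random.default_rng(0)
encoder_hs = rng.normal(size=(5, 8))           # h_1 ... h_T
decoder_h = rng.normal(size=8)                 # current decoder hidden state h_t
context, alphas = attention_context(decoder_h, encoder_hs)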
Advantages of Seq2Seq Models
1. Modelling Sequential Data: Seq2Seq models are specifically designed
to handle sequential data, making them highly suitable for NLP tasks like
translation, summarization, and speech recognition.
2. Handling Variable-Length Sequences: Seq2Seq models can handle
inputs and outputs of varying lengths, unlike traditional machine learning
models that require fixed-size inputs and outputs.
3. Flexibility with Encoder-Decoder Design: By using different
architectures for the encoder and decoder (e.g., RNN, LSTM, GRU), the
model can be fine-tuned for specific applications.
4. Attention Mechanism: With the attention mechanism, the model can
focus on relevant parts of the input sequence, significantly improving the
performance on long sequences.
Challenges of Seq2Seq Models
1. Vanishing Gradient Problem: RNNs suffer from vanishing gradient
problems, especially when dealing with long sequences. LSTMs and
GRUs help alleviate this issue, but it can still be a challenge.
2. Computationally Expensive: Training Seq2Seq models can be
computationally expensive, especially for long sequences and large
datasets.
3. Fixed-Length Context Vector: In vanilla Seq2Seq models (without
attention), the context vector can struggle to capture information from
long input sequences, leading to poorer performance on long-range
dependencies.
Modern Variations
Transformer Models: The transformer architecture, which uses self-
attention mechanisms, has largely replaced Seq2Seq models in many
NLP tasks due to its superior performance, especially on long sequences.
Pretrained Models: Models like BERT, GPT, and T5 (based on
transformers) have surpassed traditional Seq2Seq models in many tasks
by leveraging large-scale pretraining and fine-tuning techniques.
Applications of Seq2Seq Models
Machine Translation: Translating text from one language to another.
Speech Recognition: Converting spoken words into text.
Text Summarization: Generating summaries of long text documents.
Chatbots: Generating human-like responses in conversation.
Conclusion
Seq2Seq models, particularly those based on RNNs, are powerful tools
for sequence modelling tasks in NLP. While they have been largely supplanted
by transformer-based architectures in some applications, they remain
foundational in the development of sequence models, especially when attention
mechanisms are integrated to improve performance.
When attention is formulated in terms of queries, keys, and values, the attention
weights αi are applied to the value vectors vi to form the context vector:
context = ∑i αi ⋅ vi
This context vector is then used by the decoder to generate the next token
in the output sequence.
Types of Attention Mechanisms
There are several variations of the attention mechanism, depending on the
specific task or model architecture. Some common types include:
a. Scaled Dot-Product Attention
This is the most widely used form of attention, particularly in
Transformer models. In this case, the attention scores are scaled by the square
root of the dimension of the key vectors to prevent the dot products from
becoming too large.
score (q, ki) = (q ⋅ ki) / √dk
where dk is the dimension of the key vector.
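An illustrative NumPy implementation of scaled dot-product attention over a set of queries, keys, and values (shapes and data are assumptions):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (n_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # score(q, k_i) = (q . k_i) / sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of the values

# Toy query/key/value matrices (illustrative only)
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 16)), rng.normal(size=(5, 16)), rng.normal(size=(5, 32))
out = scaled_dot_product_attention(Q, K, V)          # shape: (3, 32)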
b. Multi-Head Attention
Rather than computing a single attention mechanism, multi-head
attention splits the query, key, and value matrices into multiple "heads," each of
which performs attention independently.
The results of these attention heads are then concatenated and
linearly transformed. This allows the model to focus on different aspects of the
input sequence at the same time, improving the model's capacity to capture
various patterns in the data.
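As a sketch, PyTorch's built-in nn.MultiheadAttention module can be used to see multi-head attention in action; here the queries, keys, and values all come from the same toy sequence, so it doubles as an example of the self-attention described next. Dimensions and data are illustrative assumptions.

import torch
import torch.nn as nn

embed_dim, num_heads = 32, 4                    # 4 heads, each of dimension 32 / 4 = 8
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 10, embed_dim)               # (batch, sequence length, embed_dim)
# Queries, keys, and values all come from the same sequence (self-attention);
# each head attends to the sequence independently, and the results are
# concatenated and linearly projected inside the module.
attn_output, attn_weights = mha(x, x, x)
print(attn_output.shape)                        # torch.Size([2, 10, 32])
print(attn_weights.shape)                       # torch.Size([2, 10, 10]), averaged over heads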
c. Self-Attention
Self-attention, often referred to as intra-attention, is a specific
form of attention where the queries, keys, and values all come from the same
input sequence.
This allows the model to capture relationships between words
within the same sequence. Self-attention is a key component of the Transformer
architecture, allowing it to model long-range dependencies efficiently.
d. Additive Attention
Instead of using dot products, additive attention computes a score
using a learned function, such as a feed-forward neural network. This function
typically computes a score based on both the query and key.
Transformers and Attention
The Transformer model, introduced in the paper "Attention is All You
Need" (Vaswani et al., 2017), revolutionized NLP by completely relying on
attention mechanisms and discarding recurrent layers (like LSTMs and GRUs)
for sequence modelling. The Transformer consists of two parts:
Encoder: The encoder processes the input sequence using self-attention
layers to generate a sequence of context-aware embeddings for each
token.
Decoder: The decoder uses self-attention and cross-attention (attention
between encoder output and decoder input) to generate the output
sequence.
The success of Transformers and models like BERT, GPT, and T5 can be
attributed to their reliance on attention mechanisms, allowing them to efficiently
process long sequences and capture complex dependencies in data.
Advantages of Attention Mechanism
Parallelization: Unlike RNN-based models, which process sequences
step by step, attention mechanisms allow for parallel processing of input
data. This leads to faster training times.
Long-Range Dependencies: Attention models can capture long-range
dependencies in sequences, which is difficult for traditional RNNs.
Flexibility: Attention can be applied in various contexts, such as machine
translation, text summarization, question answering, and more. The
flexibility of the mechanism makes it suitable for a wide range of tasks.
Limitations of Attention
Computational Complexity: The time and space complexity of
computing attention scales quadratically with the sequence length. For
very long sequences, this can be computationally expensive.
Optimizations like sparse attention and memory-efficient methods are
being explored to address this.
Interpretability: While attention provides some insights into which
words are being focused on, it's not always a perfect reflection of what
the model "understands" from the data. Attention weights can be
misleading, as they don't always correlate with the importance of a word
in a human sense.
Conclusion
The attention mechanism has become the cornerstone of modern NLP,
providing models with the ability to selectively focus on relevant parts of an
input sequence. Its flexibility, efficiency, and ability to handle long-range
dependencies have made it the foundation of powerful architectures like the
Transformer, which has set new benchmarks for a variety of NLP tasks.