UNIT-3 part2
Bidirectional Recurrent Neural Networks (BiRNNs) are a type of neural network architecture
designed to capture information from both past and future contexts in sequential data. This is
particularly useful in tasks where the output at a given time step depends on the entire input
sequence, rather than just the past inputs. Below is a summary of the key points discussed in
the text:
1. Limitation of Standard RNNs:
o A standard RNN processes its input causally, so the state at time t summarizes information only from the past inputs x(1), …, x(t) and the present input x(t).
o This structure is limiting in applications where future context is also important for making predictions at time t.
2. Structure of Bidirectional RNNs:
o Forward RNN: Processes the sequence from the start to the end (left to right).
o Backward RNN: Processes the sequence from the end to the start (right to
left).
3. Combining Both Directions:
o At each time step t, the output o(t) is computed from the hidden states of both the forward RNN, h(t), and the backward RNN, g(t) (see the code sketch at the end of this section).
o This allows the network to capture information from both the past and the future, making it more effective for tasks requiring full-sequence context.
4. Advantages of Bidirectional RNNs:
o The output at each time step can depend on the whole input sequence, which is valuable when the relevant context may appear either before or after position t (for example, disambiguating a phoneme in speech often requires hearing the following sounds or words).
5. Extension to 2D Inputs:
In this case, four RNNs can be used, each processing the input in one of the four
directions: up, down, left, and right.
At each point (i, j) on a 2D grid, the output O(i,j) can capture both local and long-range dependencies, similar to how BiRNNs work in 1D sequences.
6. Comparison with Convolutional Networks:
Cost: RNNs applied to images are typically more computationally expensive than
convolutional networks.
Convolutional Form: The forward propagation equations for RNNs on images can
be written in a form that resembles a convolution, where the bottom-up input is
computed first, followed by recurrent propagation across the feature map to
incorporate lateral interactions.
7. Applications:
o Bidirectional RNNs have been very successful in applications such as handwriting recognition, speech recognition, and bioinformatics, where the whole input sequence is available before the output is produced.
In summary, Bidirectional RNNs are a powerful extension of traditional RNNs that allow for
the incorporation of both past and future context in sequential data, making them highly
effective for tasks where the entire input sequence is relevant to the output.
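As a rough illustration of the forward/backward computation described above, here is a minimal NumPy sketch of one bidirectional pass. The weight names (W_f, U_f, W_b, U_b, V, b_o) and the tanh recurrence are illustrative assumptions, not taken from the text.

```python
import numpy as np

def birnn_forward(xs, W_f, U_f, W_b, U_b, V, b_o):
    """Minimal bidirectional RNN sketch over a list of input vectors xs.

    W_f, U_f : input-to-hidden and hidden-to-hidden weights of the forward RNN
    W_b, U_b : the same for the backward RNN
    V, b_o   : output weights applied to the concatenated states [h(t); g(t)]
    """
    T = len(xs)
    h = np.zeros(W_f.shape[0])              # forward state h(t)
    g = np.zeros(W_b.shape[0])              # backward state g(t)
    hs, gs = [None] * T, [None] * T

    # Forward RNN: processes the sequence from start to end (left to right).
    for t in range(T):
        h = np.tanh(W_f @ xs[t] + U_f @ h)
        hs[t] = h

    # Backward RNN: processes the sequence from end to start (right to left).
    for t in reversed(range(T)):
        g = np.tanh(W_b @ xs[t] + U_b @ g)
        gs[t] = g

    # The output o(t) depends on both past context h(t) and future context g(t).
    return [V @ np.concatenate([hs[t], gs[t]]) + b_o for t in range(T)]
```

Because the backward pass needs the whole sequence before any output can be emitted, this style of model suits offline tasks where the full input is available in advance.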
ENCODER-DECODER SEQUENCE-TO-SEQUENCE ARCHITECTURES
Encoder-decoder (sequence-to-sequence) architectures use one RNN to read an input sequence into a context C and a second RNN to generate an output sequence conditioned on that context, allowing the input and output sequences to have different lengths.
Key Concepts:
1. Context Representation:
o Encoder (Input RNN): Processes the input sequence X = (x(1), …, x(nx)) and emits the context C, typically as a function of its final hidden state.
o Decoder (Output RNN): Generates the output sequence Y = (y(1), …, y(ny)) conditioned on the context C.
o The lengths of the input sequence, nx, and the output sequence, ny, can vary, unlike previous architectures, which required nx = ny = t.
2. Training:
o The encoder and decoder RNNs are trained jointly to maximize the average log probability of the output sequence given the input sequence, i.e., log P(y(1), …, y(ny) | x(1), …, x(nx)).
3. Flexibility in Architecture:
o There is no constraint that the encoder and decoder must have the same size of hidden layers.
o A limitation arises when the context C output by the encoder is too small to summarize a long sequence effectively.
Applications:
Speech Recognition: Effective in tasks where the input and output sequences differ in length (a minimal code sketch of the encoder-decoder computation appears below).
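The sketch below illustrates the encoder-decoder computation under simple assumptions: plain tanh RNN cells, a context C taken from the encoder's final state, and an output length n_y supplied by the caller. The weight names are hypothetical; real systems typically add an embedding layer, a softmax output, and a learned stopping criterion.

```python
import numpy as np

def encode(xs, W_in, U_enc):
    """Encoder RNN: read x(1)...x(nx) and emit the context C (its final state)."""
    h = np.zeros(U_enc.shape[0])
    for x in xs:
        h = np.tanh(W_in @ x + U_enc @ h)
    return h                                    # context C

def decode(C, n_y, U_dec, W_ctx, V_out):
    """Decoder RNN: generate n_y output vectors conditioned on the context C.

    During training, both RNNs would be optimized jointly to maximize
    log P(y(1), ..., y(ny) | x(1), ..., x(nx)).
    """
    s = np.tanh(W_ctx @ C)                      # initialize decoder state from C
    ys = []
    for _ in range(n_y):
        s = np.tanh(U_dec @ s + W_ctx @ C)      # every decoder step sees C
        ys.append(V_out @ s)                    # unnormalized output scores
    return ys
```

Nothing ties the number of encoder steps to the number of decoder steps, which is what allows the input and output sequences to differ in length.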
Summary:
Encoder-decoder architectures map an input sequence to an output sequence of a possibly different length by first compressing the input into a context C and then generating the output conditioned on that context; their main limitation is that a fixed-size context may be too small to summarize a long input sequence.
DEEP RECURRENT NETWORKS
Deep Recurrent Networks (DRNs) extend traditional Recurrent Neural Networks (RNNs) by introducing depth into the transformations involved in the computation. Here's a detailed explanation of the key points:
1. Decomposition of RNN Computation:
The computation in most RNNs can be decomposed into three main blocks:
1. Input to Hidden State: Transformation from the input to the hidden state.
2. Previous Hidden State to Next Hidden State: Transformation from the
previous hidden state to the next hidden state.
3. Hidden State to Output: Transformation from the hidden state to the output.
In a standard RNN (like the one in Figure 10.3), each of these blocks is associated with a single weight matrix, i.e., a shallow transformation equivalent to a single layer within a deep Multi-Layer Perceptron (MLP).
2. Introducing Depth:
Experimental Evidence: Research (e.g., Graves et al., 2013; Pascanu et al., 2014a)
suggests that introducing depth into these transformations can be beneficial. This is in
line with the idea that deeper networks can perform more complex mappings.
Historical Context: Earlier work (e.g., Schmidhuber, 1992; El Hihi and Bengio,
1996; Jaeger, 2007a) also explored deep RNNs, indicating a long-standing interest in
enhancing RNNs with depth.
3. Separate MLPs for Each Block:
o Pascanu et al. (2014a) proposed using separate MLPs (possibly deep) for each of the three transformation blocks (input-to-hidden, hidden-to-hidden, and hidden-to-output). This approach is illustrated in Figure 10.13 and sketched in code at the end of this list.
4. Challenges and Solutions:
Optimization Difficulty: While adding depth increases the network's capacity, it can
also make optimization more challenging. Deeper networks are generally harder to
train due to longer paths between variables in different time steps.
Skip Connections: To mitigate this issue, skip connections can be introduced in the
hidden-to-hidden path (as shown in Figure 10.13c). These connections create shorter
paths between variables in different time steps, facilitating easier optimization.
5. Impact of Depth:
Increased Capacity: Adding depth to each transformation block allows the network
to capture more complex patterns and dependencies in the data.
Trade-offs: While deeper networks can model more complex functions, they require
careful design to ensure that they remain trainable. Skip connections are one way to
balance depth and trainability.
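The following sketch illustrates, under assumed names and sizes, the design discussed in item 3: each of the three blocks becomes a small MLP, and a skip connection keeps a short path on the hidden-to-hidden route. It is a conceptual sketch, not the exact architecture of Pascanu et al. (2014a).

```python
import numpy as np

def mlp(x, W1, W2):
    """A tiny two-layer MLP used in place of a single weight matrix."""
    return W2 @ np.tanh(W1 @ x)

def deep_rnn_step(x, h_prev, p):
    """One time step of a deep RNN; p is a dict of (hypothetical) weight matrices."""
    a_in  = mlp(x, p["Wx1"], p["Wx2"])          # deep input-to-hidden block
    a_rec = mlp(h_prev, p["Wh1"], p["Wh2"])     # deep hidden-to-hidden block
    # Skip connection: h_prev also reaches the new state directly, keeping a
    # short path between variables at different time steps (easier optimization).
    h = np.tanh(a_in + a_rec + p["Wskip"] @ h_prev)
    o = mlp(h, p["Wo1"], p["Wo2"])              # deep hidden-to-output block
    return h, o
```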
Summary:
Deep Recurrent Networks enhance traditional RNNs by introducing depth into the
transformations involved in input-to-hidden, hidden-to-hidden, and hidden-to-output
computations. This depth allows the network to perform more complex mappings, but it also
introduces challenges in optimization. Techniques like skip connections can help mitigate
these challenges, making deep RNNs more effective and easier to train. The experimental
evidence supports the idea that deeper architectures are beneficial for tasks requiring complex
sequence modeling.
RECURSIVE NEURAL NETWORKS
Recursive neural networks generalize recurrent networks from the chain-like structure of a sequence to a tree-structured computational graph, applying the same set of weights at every node of the tree.
Key Concepts:
1. Tree-Structured Computation:
o Each node's representation is computed from the representations of its children using shared weights, so the depth of the computation grows with the depth of the tree rather than with the length of the sequence (see the code sketch at the end of this section).
2. Tree Structure:
o The structure of the tree is crucial for the performance of recursive networks.
In some cases, the tree structure is fixed (e.g., a balanced binary tree), while in
others, it is determined by external methods (e.g., a parse tree from a natural
language parser).
3. Applications:
o Natural Language Processing: Recursive networks have been applied to tasks such as sentiment analysis, where the tree is given by a parse of the sentence.
o Computer Vision: In tasks like scene parsing, recursive networks can model
the hierarchical structure of objects and their relationships within an image.
4. Challenges:
o Tree Structure Design: Choosing or learning the appropriate tree structure for
a given task remains a challenge. While fixed structures like balanced trees are
simple, they may not always capture the optimal hierarchy for the data.
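As a rough sketch of the tree-structured computation described above, the function below composes node representations bottom-up over a binary tree, reusing the same weights at every internal node. The tuple-based tree encoding, the weight names, and the tanh composition are illustrative assumptions.

```python
import numpy as np

def recursive_net(node, W_left, W_right, b, embed):
    """Compute a vector for a tree node by recursively composing its children.

    node  : a leaf token (str) or a pair (left_subtree, right_subtree)
    embed : dict mapping leaf tokens to their input vectors
    The same weights (W_left, W_right, b) are shared across all internal nodes.
    """
    if isinstance(node, str):                   # leaf: look up its embedding
        return embed[node]
    left, right = node
    h_l = recursive_net(left, W_left, W_right, b, embed)
    h_r = recursive_net(right, W_left, W_right, b, embed)
    return np.tanh(W_left @ h_l + W_right @ h_r + b)

# Example tree, e.g. produced by a parser: (("the", "cat"), ("sat", "down"))
```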
LEAKY UNITS
Leaky Units are a strategy used in Recurrent Neural Networks (RNNs) to handle long-term
dependencies by allowing information to persist over time. They achieve this by
incorporating linear self-connections with weights close to one, enabling the network to
retain information from the past for extended periods. Here’s a detailed explanation of leaky
units:
Key Concepts:
1. Linear Self-Connections:
o Leaky units have linear self-connections with weights near one, allowing
them to accumulate and retain information over time.
2. Time Constants:
o A leaky unit maintains a running average of the form μ(t) = α μ(t−1) + (1 − α) v(t), where the self-connection weight α determines the unit's time constant.
o When α is close to 1, the unit retains information from the past for a long time, acting like a long-term memory; when α is close to 0, information about the past is rapidly discarded.
o This flexibility allows leaky units to operate at different time scales, making them useful for capturing both short-term and long-term dependencies (a minimal code sketch appears after the summary below).
3. Advantages:
o The time constants can either be fixed by hand or learned as parameters, and the linear self-connection provides a smooth, gradual way to retain information across time steps.
4. Applications:
o Useful for sequential data whose dependencies span several different time scales.
Summary:
Leaky Units are a powerful mechanism for enabling RNNs to capture long-term
dependencies by incorporating linear self-connections with weights near one. They allow the
network to retain information over extended periods, with the flexibility to operate at
different time scales. By adjusting the time constants (either fixed or learned), leaky units
provide a smooth and adaptive way to manage information flow across time steps, making
them a valuable tool for tasks involving sequential data.
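A minimal sketch of the leaky-unit update described above, accumulating a running average with a self-connection weight alpha near one; the function and variable names are illustrative.

```python
import numpy as np

def leaky_unit(vs, alpha):
    """Accumulate mu(t) = alpha * mu(t-1) + (1 - alpha) * v(t) over a sequence vs.

    alpha close to 1 -> the past persists for a long time (long time scale)
    alpha close to 0 -> the state tracks the most recent input (short time scale)
    """
    mu = np.zeros_like(vs[0])
    states = []
    for v in vs:
        mu = alpha * mu + (1.0 - alpha) * v
        states.append(mu)
    return states
```

In practice alpha can be a fixed hyperparameter or a learned parameter (for example, the output of a sigmoid), giving each unit its own time scale.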
Long Short-Term Memory (LSTM) networks are a specialized type of Recurrent Neural
Network (RNN) designed to address the problem of learning long-term dependencies. LSTMs
introduce self-loops and gating mechanisms to control the flow of information, allowing
gradients to flow for long durations without vanishing or exploding. Here’s a detailed
explanation of LSTMs:
Key Concepts:
1. Core Idea:
o LSTMs introduce self-loops to create paths where gradients can flow for long
durations, enabling the network to learn long-term dependencies.
o The self-loop weight is conditioned on the context rather than being fixed,
allowing the time scale of integration to change dynamically based on the input
sequence.
2. LSTM Cell:
o An LSTM cell has an internal state s(t) that is updated over time and is controlled by three gates (see the code sketch after the numbered list):
Forget Gate (f(t)): Controls how much of the previous state s(t−1) is retained, i.e., the weight of the self-loop.
Input Gate (g(t)): Controls how much of the new input is added to the state.
Output Gate (q(t)): Controls how much of the cell state is output to the hidden state.
3. Gating Mechanisms:
o Each gate is a sigmoid unit whose value is computed from the current input and the previous hidden state, so the flow of information into, within, and out of the cell is conditioned on the context.
4. Advantages:
o Long-Term Dependencies: LSTMs are designed to capture long-term
dependencies, making them effective for tasks where the context from earlier
time steps is crucial.
o Dynamic Time Scales: The gating mechanisms allow the network to
dynamically adjust the time scale of integration based on the input sequence.
5. Applications:
o LSTMs have been successfully applied in various domains, including:
Unconstrained Handwriting Recognition (Graves et al., 2009)
Speech Recognition (Graves et al., 2013; Graves and Jaitly, 2014)
Machine Translation (Sutskever et al., 2014)
Image Captioning (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et
al., 2015)
Parsing (Vinyals et al., 2014a)
6. Variants and Alternatives:
o Several variants of LSTMs have been proposed to improve performance or
adapt to specific tasks. These include:
Gated Recurrent Units (GRUs): A simplified version of LSTMs with
fewer parameters.
Peephole Connections: Additional connections that allow gates to
inspect the cell state directly.
Depth Gated LSTMs: Incorporating deeper architectures for more
complex tasks.
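The sketch below shows one LSTM time step using the gate naming from the list above (input gate g, output gate q) plus the forget gate f that scales the self-loop. The concatenated-input weight layout and the parameter dictionary are illustrative assumptions, not a particular library's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM step: returns the new hidden state h(t) and cell state s(t)."""
    z = np.concatenate([x, h_prev])             # gates see the input and h(t-1)
    f = sigmoid(p["Wf"] @ z + p["bf"])          # forget gate: self-loop weight
    g = sigmoid(p["Wg"] @ z + p["bg"])          # input gate
    q = sigmoid(p["Wq"] @ z + p["bq"])          # output gate
    s_tilde = np.tanh(p["Ws"] @ z + p["bs"])    # candidate state update
    s = f * s_prev + g * s_tilde                # gated self-loop: long-term path
    h = q * np.tanh(s)                          # gated output to the hidden state
    return h, s
```

Because f, g, and q are computed from the current input and the previous hidden state, the effective time scale of the self-loop changes dynamically with the context, as described above.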
Summary:
LSTM networks are a powerful extension of traditional RNNs, designed to handle long-term
dependencies through the use of self-loops and gating mechanisms. By dynamically adjusting
the time scale of integration and controlling the flow of information, LSTMs have achieved
state-of-the-art performance in various sequence processing tasks. Their ability to capture
long-term dependencies makes them particularly effective in applications like speech
recognition, machine translation, and image captioning. Variants and alternatives continue to
be explored to further enhance their capabilities.