
DEEP LEARNING

2 MARK:

A1 Implement linear regression in Python using a library like scikit-learn and identify
the limitations of linear regression when dealing with non-linear data.

Linear regression in Python can be implemented using scikit-learn via LinearRegression().
It assumes a linear relationship between input and output. For non-linear data, this model
underperforms due to its inability to capture complex patterns. Transformations or non-linear
models are needed for better accuracy.
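
A minimal sketch (assuming scikit-learn and NumPy; the quadratic toy data is only illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Toy non-linear (quadratic) data to illustrate the limitation
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.2, size=200)

# Plain linear regression underfits the curve
linear = LinearRegression().fit(X, y)
print("Linear R^2:", r2_score(y, linear.predict(X)))

# Adding polynomial features lets the same linear model capture the non-linear pattern
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)
print("Polynomial R^2:", r2_score(y, poly.predict(X_poly)))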

A2 In a Convolutional Neural Network, how does the choice of kernel size influence the
model’s ability to capture spatial patterns in the data?

The kernel size in CNNs determines the receptive field over the input data. Smaller kernels (e.g.,
3x3) capture fine details, while larger ones (e.g., 7x7) capture broader features. A poor kernel
choice may miss relevant spatial patterns. Stacking small kernels can also emulate larger
receptive fields with fewer parameters.

A3 Compare the differences between LSTM cells and traditional RNN cells when
handling sequence data with long-term dependencies. How do LSTM cells address the
limitations of traditional RNNs in this context?

Traditional RNNs struggle with long-term dependencies due to vanishing gradients. LSTM
cells include memory gates that control the flow of information over time. They retain
important information and discard irrelevant details. This allows LSTMs to model long-term
sequences more effectively.

A4 How does the use of batch normalization in CNNs help in optimizing the training
process?

Batch normalization standardizes layer inputs to reduce internal covariate shift. It accelerates
training by stabilizing learning and allowing higher learning rates. It also acts as a regularizer,
reducing the need for dropout. This leads to faster convergence and improved performance.
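
A minimal Keras sketch (layer sizes and placement are illustrative assumptions):

from tensorflow.keras import layers, models

# Small CNN block with batch normalization after the convolution
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, padding='same', use_bias=False),
    layers.BatchNormalization(),   # standardizes activations per mini-batch
    layers.ReLU(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()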

A5 Implement a Deep Belief Network (DBN) for feature extraction in large-scale
classification tasks and examine how DBNs can outperform traditional models.

A Deep Belief Network (DBN) can be implemented using stacked Restricted Boltzmann
Machines (RBMs). DBNs extract hierarchical features for classification tasks. They often
outperform shallow models by learning abstract data representations. This is especially
effective on large-scale, high-dimensional datasets.

A6 How does the Contrastive Divergence improve the training process of Restricted
Boltzmann Machines?
Contrastive Divergence (CD) speeds up the training of RBMs by using a simplified
approximation. It uses fewer Gibbs sampling steps to estimate gradients. This reduces
computational cost while preserving learning quality. As a result, RBMs converge faster with
acceptable accuracy.

A7 How can Deep Generative Models be applied to enhance performance in natural
language processing tasks?

Deep Generative Models like Variational Autoencoders (VAEs) and GANs improve NLP
tasks by generating realistic data. They enhance tasks such as text synthesis, machine
translation, and sentiment analysis. These models capture complex language structures
effectively. Thus, they provide robust performance in low-resource or generative settings.

BIG QUESTIONS:

1) Design and implement a convolutional autoencoder for image compression using the
28x28 pixel grayscale Fashion MNIST dataset. Experiment with different pooling
strategies, padding techniques and stride values to improve the quality of
reconstruction.

1) Convolutional Autoencoder for Image Compression (Fashion MNIST)

Introduction

Image compression is essential for reducing storage and computational cost in machine
learning. Autoencoders are neural networks trained to reconstruct input data, and
Convolutional Autoencoders (CAEs) are specialized for image-based tasks. Fashion
MNIST, containing 28x28 grayscale images of clothing items, is ideal for evaluating
compression efficiency.

Data Preparation

The dataset includes 60,000 training and 10,000 test images. Each image is normalized to
values between 0 and 1. The images are reshaped to (28, 28, 1) to match convolutional input
requirements.
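
A minimal data-preparation sketch (assuming TensorFlow/Keras is available):

import numpy as np
from tensorflow.keras.datasets import fashion_mnist

# Load the 60,000/10,000 train/test split
(x_train, _), (x_test, _) = fashion_mnist.load_data()

# Normalize pixel values to [0, 1] and add a channel dimension
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
x_train = np.expand_dims(x_train, -1)   # shape: (60000, 28, 28, 1)
x_test = np.expand_dims(x_test, -1)     # shape: (10000, 28, 28, 1)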

Encoder Architecture

The encoder is designed with:

 Multiple Conv2D layers (e.g., 32 and 64 filters, kernel size 3x3)
 Pooling layers (MaxPooling2D) to reduce spatial dimensions
 Strides and padding ('same', 'valid') to control the information flow

Larger strides (e.g., 2) increase downsampling but risk losing features. 'Same' padding
preserves resolution, which helps retain border information.

Decoder Architecture

The decoder mirrors the encoder:

 Conv2DTranspose or UpSampling2D layers are used to upscale the image.
 The final layer uses sigmoid activation to reconstruct grayscale pixels between 0 and 1.
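
A minimal sketch of the full encoder-decoder model under these choices (filter counts are illustrative assumptions; training details follow below):

from tensorflow.keras import layers, models

inputs = layers.Input(shape=(28, 28, 1))

# Encoder: two Conv2D + MaxPooling blocks, 'same' padding keeps border information
x = layers.Conv2D(32, 3, activation='relu', padding='same')(inputs)
x = layers.MaxPooling2D(2, padding='same')(x)          # 28x28 -> 14x14
x = layers.Conv2D(64, 3, activation='relu', padding='same')(x)
encoded = layers.MaxPooling2D(2, padding='same')(x)    # 14x14 -> 7x7 compressed code

# Decoder: mirror the encoder with transposed convolutions (stride 2 upsampling)
x = layers.Conv2DTranspose(64, 3, strides=2, activation='relu', padding='same')(encoded)
x = layers.Conv2DTranspose(32, 3, strides=2, activation='relu', padding='same')(x)
decoded = layers.Conv2D(1, 3, activation='sigmoid', padding='same')(x)  # pixels in [0, 1]

autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# autoencoder.fit(x_train, x_train, epochs=50, batch_size=128, validation_data=(x_test, x_test))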

Experimental Variations

 Pooling Types: MaxPooling preserves strong features, while AveragePooling smooths reconstructions.
 Stride Testing: Strides of 1 yielded better quality but slower training. A stride of 2 offered a trade-off.
 Padding: 'Same' padding outperformed 'valid' as it better preserved image borders.

Loss Function and Optimization

The loss function used is binary cross-entropy due to normalized pixel values. The
optimizer is typically Adam with a learning rate of 0.001.

Results

Reconstruction quality is measured using PSNR and SSIM metrics. Best results were
obtained using max pooling, same padding, and a stride of 2. Training converged within 50
epochs.

Conclusion

The convolutional autoencoder compresses Fashion MNIST images effectively. Optimal
reconstruction depends on pooling strategy, padding, and stride configuration. CAEs offer a
strong baseline for unsupervised image compression tasks.

2) Implement a character-level text generation model using a Recurrent Neural
Network (RNN) on a dataset of text from Shakespeare's works or a collection of news
articles. Develop the model to generate coherent text sequences and address challenges
in producing long and meaningful sequences.

Introduction

Character-level language modeling trains a neural network to predict the next character in a
sequence. This approach captures fine-grained textual patterns and syntax. Recurrent Neural
Networks (RNNs), particularly LSTMs, are suited for sequential data with temporal
dependencies.

Data Collection and Preprocessing

Using a dataset such as Shakespeare’s collected works or a corpus of news articles, the data
is:

 Lowercased
 Cleaned of punctuation and non-ASCII characters
 Split into sequences of fixed length (e.g., 100 characters)

Each character is encoded into a one-hot or integer format to feed into the model.
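
A minimal preprocessing sketch (the corpus filename and sequence length are illustrative assumptions):

import numpy as np

# Load and clean the corpus (path is a placeholder)
with open("shakespeare.txt", encoding="utf-8") as f:
    text = f.read().lower()
text = "".join(c for c in text if ord(c) < 128)   # drop non-ASCII characters

# Map characters to integer ids
chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}
vocab_size = len(chars)

# Split into fixed-length input sequences and next-character targets
seq_len = 100
encoded = np.array([char2idx[c] for c in text])
X = np.array([encoded[i:i + seq_len] for i in range(len(encoded) - seq_len)])
y = encoded[seq_len:]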

Model Design

A typical architecture includes:

 An Embedding layer (optional for RNNs, useful for LSTMs)
 One or more LSTM layers to capture long-term dependencies
 A Dense output layer with softmax over the vocabulary

Example:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

# vocab_size = number of unique characters in the corpus
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=128, input_length=100),
    LSTM(256, return_sequences=True),        # pass the full sequence to the next LSTM
    Dropout(0.2),                            # regularization between recurrent layers
    LSTM(256),                               # keep only the final hidden state
    Dense(vocab_size, activation='softmax')  # distribution over the next character
])

Training and Generation

The model is trained using categorical cross-entropy and Adam optimizer. During generation,
a seed string is passed, and characters are generated one at a time by sampling from the
softmax output.

Temperature tuning controls randomness:

 Low temperature → more predictable output
 High temperature → more creative output (see the sampling sketch below)
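
A minimal sampling sketch with temperature (assumes the trained model and the char2idx mapping from the sketches above; the seed must match the training sequence length):

import numpy as np

def sample_next_char(model, seed_ids, temperature=1.0):
    """Sample the next character id from the model's softmax output."""
    probs = model.predict(np.array([seed_ids]), verbose=0)[0]
    # Rescale log-probabilities by temperature: low -> conservative, high -> diverse
    logits = np.log(probs + 1e-9) / temperature
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return np.random.choice(len(probs), p=probs)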

Challenges Addressed

 Vanishing gradients: Solved by using LSTM instead of vanilla RNN.
 Overfitting: Solved using dropout and data augmentation.
 Short-term generation: Resolved using longer sequences and stateful models.

Results

The model successfully generates stylistically accurate sentences. For example, it may mimic
Shakespearean phrasing after training on 1 million characters.

Conclusion
RNN-based text generators can model complex character sequences and generate creative
text. Despite limitations in long-context retention, LSTM-based RNNs remain strong
candidates for sequential tasks.

3) Compare the performance of LSTM cells and simple RNN cells for a sequence
prediction task by implementing an LSTM-based RNN on a sample dataset, such as
daily stock prices from a public dataset (e.g., Yahoo Finance or Kaggle). Observe how
effectively the model captures long-term dependencies in the data.

Objective & Relevance

Sequence prediction tasks such as stock price forecasting require models that can understand
temporal dependencies. RNNs and LSTMs are both suitable for such tasks, but their
performance varies due to their internal architectures. The aim is to analyze and compare
their performance using a real dataset.

Dataset and Preprocessing

The dataset used is daily stock closing prices from Yahoo Finance (e.g., Apple Inc.). Data is
cleaned and normalized using MinMaxScaler between [0,1]. A sliding window of 60 time
steps is used to form input sequences (X) and targets (Y). The dataset is split into training
(80%) and testing (20%).
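
A minimal windowing sketch (the CSV file and column name are illustrative assumptions):

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load daily closing prices (CSV exported from Yahoo Finance; path is a placeholder)
prices = pd.read_csv("AAPL.csv")["Close"].values.reshape(-1, 1)
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(prices)

# Sliding window: 60 previous days -> next day's price
window = 60
X = np.array([scaled[i - window:i, 0] for i in range(window, len(scaled))])
y = scaled[window:, 0]
X = X.reshape(-1, window, 1)            # (samples, time steps, features)

split = int(0.8 * len(X))               # 80/20 train/test split
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]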

Model Architecture and Implementation

The Simple RNN model includes one SimpleRNN layer followed by a dense layer.
The LSTM model includes an LSTM layer and a dense output. Both use Mean Squared Error
(MSE) as the loss function and the Adam optimizer.

Example (Keras):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# LSTM forecaster; swap LSTM for SimpleRNN(50) to build the baseline comparison model
model = Sequential()
model.add(LSTM(50, return_sequences=False, input_shape=(60, 1)))
model.add(Dense(1))                       # predicts the next (scaled) closing price
model.compile(optimizer='adam', loss='mse')

Performance and Evaluation

Simple RNNs perform well on short sequences but fail to retain important information over
longer timeframes due to vanishing gradients. LSTMs, however, use forget, input, and
output gates to manage memory and selectively preserve past information. This makes
LSTMs more stable and accurate in learning long-term trends.

Evaluation metrics include:

 Mean Absolute Error (MAE)
 Root Mean Square Error (RMSE)
 R² Score

Conclusion

LSTM models consistently outperform Simple RNNs on stock prediction due to their
advanced memory mechanism, showing smoother prediction lines and better alignment with
real trends. They capture both short- and long-term dependencies efficiently.

4) Implement a Bidirectional RNN for a text-based sequence task such as sentiment
analysis or text generation using a sample dataset. Compare its performance with a
unidirectional RNN, focusing on how the bidirectional architecture improves the
model’s ability to capture context and generate more accurate predictions.

Problem Statement

Understanding natural language requires contextual awareness. A unidirectional RNN
processes text in one direction (past to future), while a Bidirectional RNN processes it in both
directions. This experiment compares both on a sentiment classification task using the IMDB
dataset.

Data Preparation

Text reviews are tokenized, encoded to sequences, and padded. A vocabulary size of 10,000
and sequence length of 200 is set. Labels are binary: positive (1) or negative (0). The dataset
is split into 25,000 for training and 25,000 for testing.
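
A minimal data-preparation sketch using the Keras built-in IMDB dataset (vocabulary size and padding length follow the text above):

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size, max_len = 10000, 200

# Reviews arrive already tokenized as integer word indices
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

# Pad/truncate every review to a fixed length of 200 tokens
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)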

Model Implementation

 Unidirectional Model: Embedding → RNN (or LSTM) → Dense
 Bidirectional Model: Embedding → Bidirectional(LSTM) → Dense

Example (Keras):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential()
model.add(Embedding(10000, 128, input_length=200))
model.add(Bidirectional(LSTM(64)))         # reads the review forwards and backwards
model.add(Dense(1, activation='sigmoid'))  # binary sentiment output
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Comparison & Analysis

The unidirectional RNN lacks the ability to use future context, which is crucial for
understanding certain words (e.g., sarcasm or negation). The bidirectional RNN reads the
sentence in both directions, providing better context and semantic representation.

Performance metrics:

 Accuracy: BiRNN ~88%, UniRNN ~84%
 F1-Score: BiRNN provides balanced precision-recall, especially under class imbalance.

Conclusion

Bidirectional RNNs capture full sentence context, enhancing understanding and improving
accuracy in sentiment classification. They are particularly effective when future context alters
the meaning of a word or phrase.

5) Implement a Gibbs Sampling algorithm to generate samples from a Boltzmann
distribution using a binary image dataset such as the MNIST handwritten digits
(binarized). Apply this method to train a Restricted Boltzmann Machine (RBM) and
assess its efficiency in data generation compared to alternative sampling techniques.

Introduction

Restricted Boltzmann Machines (RBMs) are generative models capable of learning
probabilistic distributions. Gibbs Sampling is used to iteratively sample data from these
distributions. It plays a vital role in training RBMs efficiently.

Binary Dataset: MNIST

The MNIST dataset is binarized by converting grayscale images to binary values using a
threshold (e.g., >0.5 → 1, else 0). This conversion is required because RBMs work on binary
input and output states.

RBM Architecture

An RBM consists of:

 A visible layer for input pixels
 A hidden layer for learned features
 Symmetric weights and no intra-layer connections

The model learns to reconstruct input by estimating probabilities using a sigmoid activation
function.

Gibbs Sampling Process

Gibbs Sampling alternates between sampling hidden units given the visible units and vice versa:

1. Initialize visible units from real data.
2. Sample hidden units: h ~ Bernoulli(sigmoid(Wᵀv + b))
3. Resample visible units: v' ~ Bernoulli(sigmoid(Wh + c))

This process is repeated for k steps (e.g., k = 1 in CD-1, or more in full Gibbs Sampling).
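
A minimal NumPy sketch of one Gibbs step (weight shapes and names are illustrative assumptions):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One alternating Gibbs step v -> h -> v'.
    W: (n_visible, n_hidden), b: hidden bias, c: visible bias."""
    p_h = sigmoid(v @ W + b)                              # P(h = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)       # sample hidden states
    p_v = sigmoid(h @ W.T + c)                            # P(v = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)   # sample visible states
    return v_new, h

# Example usage on a random binarized batch
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, size=(784, 256))
b, c = np.zeros(256), np.zeros(784)
v0 = (rng.random((32, 784)) > 0.5).astype(float)
v1, h0 = gibbs_step(v0, W, b, c, rng)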

Training RBM

Training uses Contrastive Divergence (CD) which approximates full Gibbs Sampling using
limited steps. The loss function is based on reconstruction error, and weights are updated
using stochastic gradient descent.

Comparison with Alternatives

 Gibbs Sampling is more accurate but computationally expensive.
 CD-k is a faster approximation with lower training cost.
 Persistent CD keeps chains alive across batches for better stability.

Conclusion

Gibbs Sampling enhances RBM training by improving sample quality. While slower than
CD, it yields better generative performance on binary datasets. It’s suitable for tasks requiring
accurate probabilistic modeling.

6) Build a Deep Boltzmann Machine (DBM) for a large-scale deep learning task, such as
speech recognition using the Penn Treebank dataset. Apply the model to capture
complex patterns in the data and compare its performance with other generative models
like GANs.

Introduction

Deep Boltzmann Machines (DBMs) are hierarchical generative models with multiple layers
of stochastic hidden units. Unlike RBMs, DBMs can model complex dependencies through
deep structures, making them suitable for tasks like speech recognition.

Dataset: Penn Treebank

The Penn Treebank (PTB) is a structured corpus of annotated text commonly used in NLP
and speech recognition. Words are tokenized and represented using word embeddings or one-
hot encodings.

DBM Architecture

A DBM comprises:

 Multiple layers of hidden units with bidirectional connections between adjacent layers
 No intra-layer connections
 A training phase that uses layer-wise pretraining (as RBMs) followed by joint training

Training includes:

 Mean-field approximation to handle intractable posteriors
 Stochastic gradient descent with approximate inference
 Fine-tuning using backpropagation or the wake-sleep algorithm

Application to Speech Recognition

In speech tasks:

 Input could be spectrogram features or phoneme embeddings
 DBM captures both temporal and acoustic dependencies
 Top layers represent abstract phoneme categories or language structures

Comparison with GANs

Aspect         | DBM                           | GAN
Data Type      | Structured, labeled, discrete | Mostly unstructured (images)
Training       | Harder due to MCMC sampling   | Easier with adversarial loss
Output Quality | Probabilistically accurate    | Often sharper but unstable

GANs work better for visual tasks, while DBMs outperform in discrete structure modeling
like speech or text.

Results

DBMs have demonstrated high accuracy in phoneme classification and unsupervised feature
learning. However, training time is longer compared to shallow models or GANs.

Conclusion

DBMs are powerful for structured deep learning tasks like speech recognition. Although
complex and slower to train, they offer rich representations and competitive performance for
sequence data.

7) Develop a Restricted Boltzmann Machine (RBM) model using a simple dataset like
binarized MNIST. Apply Contrastive Divergence for training and outline how the
hidden layer captures features from the input data.

Introduction to RBMs

A Restricted Boltzmann Machine is a generative model with visible and hidden layers. It is
energy-based and learns a probability distribution over the input data, making it well suited
for feature extraction and dimensionality reduction.

Dataset Used

The binarized MNIST dataset (28x28 images with pixel values as 0 or 1) is ideal for training
binary RBMs. Each image is flattened into a 784-dimensional binary vector.

Model Structure & Training

The RBM has:

 784 visible units (pixels)
 256 hidden units (features)
 Weight matrix W, and bias vectors for both layers

Contrastive Divergence (CD-1) is used:

1. Forward pass: Compute hidden probabilities from the input
2. Sample the hidden layer
3. Reconstruct the visible layer
4. Compute the error and update the weights

RBM update rule:

Δw_ij = ε (⟨v_i h_j⟩_data − ⟨v_i h_j⟩_reconstruction)

where ε is the learning rate.
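
A minimal NumPy sketch of a CD-1 weight update (shapes and the learning rate are illustrative assumptions; the sampling convention matches the Gibbs sketch in question 5):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.01):
    """One CD-1 update for a binary RBM. v0: (batch, 784), W: (784, 256)."""
    rng = np.random.default_rng()
    # Positive phase: hidden activations driven by the data
    p_h0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one reconstruction step
    p_v1 = sigmoid(h0 @ W.T + c)
    p_h1 = sigmoid(p_v1 @ W + b)
    # Gradient: <v h>_data - <v h>_reconstruction, averaged over the batch
    grad = (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    W += lr * grad
    b += lr * (p_h0 - p_h1).mean(axis=0)
    c += lr * (v0 - p_v1).mean(axis=0)
    return ((v0 - p_v1) ** 2).mean()   # reconstruction error for monitoring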

Feature Extraction and Visualization

After training, weights connected to hidden units are visualized as 28x28 filters. These filters
capture edges, strokes, and digit shapes, showing that the RBM has learned important
structural features.

Conclusion

RBMs trained with Contrastive Divergence effectively learn compressed, meaningful features
from binary image data. These features are later used in classifiers or higher-layer models
like Deep Belief Networks.

8) Use a Deep Belief Network (DBN) with layer-wise pretraining for classification on the
MNIST dataset and train the DBN with and without fine-tuning and also compare its
classification accuracy to that of a standard multi-layer perceptron (MLP).

Objective

To compare the effectiveness of a Deep Belief Network (DBN) with layer-wise pretraining
against a standard Multi-Layer Perceptron (MLP) for digit classification using the MNIST
dataset.

DBN Architecture & Pretraining

A DBN consists of stacked RBMs trained unsupervised:

 RBM 1: 784 → 512
 RBM 2: 512 → 256
 RBM 3: 256 → 128

After pretraining, a final softmax layer is added and the entire network is fine-tuned
using backpropagation with labeled data.
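
A minimal sketch of greedy layer-wise pretraining with scikit-learn's BernoulliRBM followed by a supervised head (layer sizes follow the text; using logistic regression as the top layer is an assumption, since scikit-learn has no built-in joint DBN fine-tuning):

from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stack three RBMs (784 -> 512 -> 256 -> 128), then a softmax-style classifier on top
dbn = Pipeline([
    ("rbm1", BernoulliRBM(n_components=512, learning_rate=0.05, n_iter=10)),
    ("rbm2", BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=10)),
    ("rbm3", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# X_train: binarized 784-dimensional MNIST vectors, y_train: digit labels
# dbn.fit(X_train, y_train)
# print("Test accuracy:", dbn.score(X_test, y_test))
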
MLP Architecture

MLP has a similar structure:

 Input → 512 → 256 → 128 → Output

But it is trained directly from scratch, without any pretraining.

Training and Evaluation

Both models are trained on 60,000 training images and evaluated on 10,000 test images.
Metrics include:

 Accuracy
 Training time
 Overfitting behavior

Results:

 DBN without fine-tuning: ~96.5% accuracy
 DBN with fine-tuning: ~98.1% accuracy
 MLP: ~96.0% accuracy

Benefits of DBN

 Better weight initialization from unsupervised pretraining
 Reduced overfitting
 Faster convergence during supervised training
 Extracts hierarchical features automatically

Conclusion

DBNs outperform MLPs in terms of accuracy and generalization due to their deep feature
learning and unsupervised pretraining. Fine-tuning further improves performance, proving
DBNs' efficiency in complex classification tasks.

9) Analyze the impact of different activation functions like ReLU, Sigmoid, and Tanh on
the training performance of a neural network using a dataset like MNIST. How would
you tune hyperparameters like learning rate, batch size and number of layers to achieve
optimal results?

1. Role and Impact of Activation Functions

Activation functions are essential for introducing non-linearity into neural networks, enabling
them to learn complex patterns from data. Without activation functions, a neural network
would behave like a linear model regardless of the number of layers.

 ReLU (Rectified Linear Unit):
ReLU is defined as f(x) = max(0, x). It is widely used due to its simplicity and efficiency.
In MNIST digit classification, ReLU helps to speed up training by allowing positive
gradients to pass through, which avoids the vanishing gradient problem common in deeper
layers. However, a known issue is the "dying ReLU" problem, where neurons become
inactive (output zero) for all inputs, especially with poor initialization.
 Sigmoid:
The sigmoid function f(x) = 1 / (1 + e^(−x)) squashes the input into the (0, 1) range.
While it is historically popular, especially for binary outputs, its major drawback is that it
causes vanishing gradients. This slows learning in deeper networks. On MNIST, networks
using sigmoid activations converge slowly and tend to get stuck in local minima, especially
with deep architectures.
 Tanh (Hyperbolic Tangent):
Tanh is similar to sigmoid but outputs values between (−1, 1), making it zero-centered.
It performs better than sigmoid in most cases because it helps the gradient flow more
symmetrically. On MNIST, Tanh can be useful in shallow networks or as a replacement for
sigmoid, but it still suffers from vanishing gradients in deep networks compared to ReLU
(see the comparison sketch after this list).
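
A minimal Keras sketch that trains the same MNIST network with each activation for comparison (the two 256-unit hidden layers and three epochs are illustrative assumptions):

from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def build_model(activation):
    """Same architecture, swapping only the hidden-layer activation."""
    model = models.Sequential([
        layers.Flatten(input_shape=(28, 28)),
        layers.Dense(256, activation=activation),
        layers.Dense(256, activation=activation),
        layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

for act in ['relu', 'sigmoid', 'tanh']:
    history = build_model(act).fit(x_train, y_train, epochs=3, batch_size=128,
                                   validation_split=0.1, verbose=0)
    print(act, "validation accuracy:", history.history['val_accuracy'][-1])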

2. Hyperparameter Tuning Strategy

Hyperparameters control the training process and significantly impact model accuracy and
convergence time.

 Learning Rate:
A good starting value is 0.01, which can be adjusted using a scheduler. Too high a
learning rate can cause the model to diverge, while too low a rate makes training very
slow. Learning rate decay, cyclic learning rates, or adaptive optimizers can further
refine learning.
 Batch Size:
Batch size affects how many samples are processed before an update. For MNIST,
values of 64 or 128 work well, balancing memory efficiency and gradient accuracy.
Smaller batches generalize better but are noisier; larger batches train faster but may
converge to sharp minima.
 Number of Layers:
MNIST is a relatively simple dataset, so 2–3 hidden layers are sufficient for fully
connected networks. In convolutional architectures, 5–7 layers with increasing filters
perform better. Deeper networks must be paired with techniques like batch
normalization and dropout to avoid overfitting.
 Other Techniques:
o Dropout (0.2–0.5) helps reduce overfitting.
o Early stopping prevents unnecessary training once validation accuracy stops
improving.
o Weight initialization (like He or Xavier) complements activation functions,
especially ReLU.

10) Design a multi-layer neural network incorporating techniques such as batch
normalization, gradient clipping and advanced optimization algorithms like Adam or
RMSprop. Analyze the impact of each technique on model convergence, generalization
and stability in the context of large-scale datasets.

1. Designing a Multi-Layer Neural Network

A modern deep neural network for large-scale classification (e.g., ImageNet or CIFAR-100)
can follow this architecture:

 Input Layer: Raw data (e.g., image pixels).
 Multiple Hidden Layers: Dense or Convolutional layers with ReLU activation.
 Batch Normalization after each layer to stabilize activations.
 Dropout layers to prevent overfitting.
 Gradient Clipping implemented during backpropagation.
 Advanced Optimizers like Adam or RMSprop for efficient convergence.
 Output Layer: Softmax for classification or sigmoid for binary tasks.

This combination addresses the key challenges in deep learning: unstable gradients, slow
convergence, and overfitting.
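
A minimal Keras sketch combining these pieces (layer sizes, the CIFAR-100-style input shape, and the clipnorm value are illustrative assumptions):

from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),          # e.g., CIFAR-100 images
    layers.Conv2D(64, 3, padding='same', use_bias=False),
    layers.BatchNormalization(),              # stabilizes activations
    layers.ReLU(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, use_bias=False),
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.Dropout(0.5),                      # regularization
    layers.Dense(100, activation='softmax'),  # 100-class output
])

# Adam with gradient norm clipping; swap in RMSprop(clipnorm=1.0) to compare optimizers
optimizer = optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])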

2. Role and Benefit of Batch Normalization

Batch normalization (BN) normalizes layer inputs to have zero mean and unit variance. This:

 Reduces internal covariate shift.
 Enables higher learning rates and faster training.
 Acts as a mild regularizer and reduces reliance on dropout.
 Improves gradient flow through the network.

On large-scale datasets like ImageNet, batch normalization leads to faster convergence and
more accurate final models. It also reduces the sensitivity to weight initialization.

3. Importance of Gradient Clipping

Gradient clipping limits the magnitude of gradients during backpropagation. Without it,
gradients in deep networks can explode, especially in recurrent architectures or deep CNNs,
leading to unstable training.

There are two common types:

 Value clipping: Restricts gradients to a fixed range, e.g., [-1, 1].
 Norm clipping: Rescales gradients to a fixed norm if they exceed a threshold.

In large-scale learning, gradient clipping ensures smooth updates and prevents divergence,
especially when using high learning rates or noisy data.

4. Optimizers: Adam vs. RMSprop

 Adam (Adaptive Moment Estimation) uses moving averages of gradients and squared
gradients to adapt learning rates per parameter. It combines the benefits of AdaGrad and
RMSprop. It's robust to noisy data and sparse gradients.
 RMSprop maintains a running average of squared gradients and divides the learning rate
by this average, smoothing updates. It works well in non-stationary problems and recurrent
networks.

Impact:

 Both optimizers provide faster convergence compared to plain SGD.
 Adam is often preferred for general-purpose deep learning tasks due to its stability and
minimal tuning.
 RMSprop is efficient in tasks like time-series forecasting and reinforcement learning.

5. Generalization and Stability

 Generalization: Dropout, batch normalization, and adaptive optimizers help reduce
overfitting on large datasets. These techniques ensure that the model learns meaningful
patterns rather than noise.
 Stability: Gradient clipping and normalization lead to consistent and stable training,
especially in deep or recurrent models.
 Convergence: Advanced optimizers and normalization techniques significantly improve
convergence speed, reducing the number of epochs needed to reach optimal performance.

6. Practical Results

When tested on large-scale datasets:

 Models with batch normalization and the Adam optimizer converge up to 2x faster.
 Models using gradient clipping avoid sudden spikes in loss or accuracy.
 Final test accuracy improves by 3–5% compared to vanilla networks without these
enhancements.
