Jntuk R20 Unit-V Deep Learning Techniques
4.2. Essential Theory of Deep Q-Learning Networks
a) Value Functions
b) Q-Value Functions
c) Estimating an Optimal Q-Value
4.3. Defining a DQN Agent
a) Initialization Parameters
b) Building the Agent's Neural Network Model
c) Remembering Gameplay
d) Training via Memory Replay
e) Selecting an Action to Take
f) Saving and Loading Model Parameters
4.4. Interacting with an OpenAI Gym Environment
4.5. Hyperparameter Optimization with SLM Lab
4.6. Agents Beyond DQN
a) Policy Gradients and the REINFORCE Algorithm
b) The Actor-Critic Algorithm
Deep Learning Research
Chapter-I: Autoencoders
1.1 Undercomplete Autoencoders
1.2 Regularized Autoencoders
1.3 Representational Power, Layer Size and Depth
1.4 Stochastic Encoders and Decoders
1.5 Denoising Autoencoders
1.6 Learning Manifolds with Autoencoders
1.7 Contractive Autoencoders
1.8 Predictive Sparse Decomposition
1.9 Applications of Autoencoders
Chapter-II: Deep Generative Models
2.1. Boltzmann Machines
2.2. Restricted Boltzmann Machines
2.3. Deep Belief Networks
Deep Learning Research
Chapter-I: Autoencoders
An autoencoder is a neural network that is trained to attempt to copy its input to its
output.
Internally, it has a hidden layer h that describes a code used to represent the input.
The network may be viewed as consisting of two parts: an encoder function h = f(x)
and a decoder that produces a reconstruction r = g(h). This architecture is shown in
the figure below.
An autoencoder is a type of artificial neural network that is trained to reconstruct its input. It
does this by learning a hidden representation of the input, and then using this hidden
representation to generate a reconstructed output.
Autoencoders are typically used for unsupervised learning tasks, such as dimensionality
reduction, anomaly detection, and feature extraction. They can also be used for supervised
learning tasks, such as classification and regression, by pre-training the autoencoder on an
unlabeled dataset and then using the learned hidden representations as input to a supervised
learning model.
Autoencoders consist of two main parts: an encoder and a decoder. The encoder takes the
input data and compresses it into a hidden representation. The decoder then takes the hidden
representation and reconstructs the output data.
Autoencoders are trained by minimizing the difference between the input data and the
reconstructed output data. This is typically done using a loss function, such as the mean
squared error.
Once an autoencoder is trained, it can be used to generate new data that is similar to the
training data. This can be useful for tasks such as data augmentation and image generation.
Here is a simple example of how to use an autoencoder to reduce the dimensionality of a
dataset:
1. Train an autoencoder on the dataset.
2. Use the encoder to generate a hidden representation of the dataset.
3. Discard the original dataset and only keep the hidden representations.
4. Use the hidden representations as input to a supervised learning model, such as a
logistic regression classifier.
This approach can be used to reduce the dimensionality of the dataset while preserving the
most important information. This can improve the performance of the supervised learning
model, especially if the dataset is large or noisy.
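As an illustration of these steps, here is a minimal Keras sketch (the library choice, the 784-dimensional input, and the 64-unit bottleneck are assumptions, not prescribed by the notes) that trains an undercomplete autoencoder and then keeps only the encoder's hidden representations:

from keras.models import Model
from keras.layers import Input, Dense

# Hypothetical setup: 784-dimensional inputs compressed to a 64-dimensional code
inputs = Input(shape=(784,))
code = Dense(64, activation='relu')(inputs)          # encoder
outputs = Dense(784, activation='sigmoid')(code)     # decoder
autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')    # minimize reconstruction error
autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)  # x_train: your dataset (assumed)

# Steps 2-3: keep only the encoder and its hidden representations
encoder = Model(inputs, code)
hidden = encoder.predict(x_train)   # step 4: use as features for a downstream classifier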
1.1. Undercomplete Autoencoder
The objective of an undercomplete autoencoder is to capture the most important features
present in the data. Undercomplete autoencoders have a hidden layer of smaller dimension
than the input layer, which forces the network to extract the most salient features from the
data. Training minimizes a loss function that penalizes the reconstruction g(f(x)) for
differing from the input x.
Advantages-
Undercomplete autoencoders do not need any regularization as they maximize the
probability of data rather than copying the input to the output.
Drawbacks-
Using an overparameterized model when there is insufficient training data can lead to
overfitting.
1.2. Regularized autoencoders
Regularized autoencoders are a type of autoencoder that incorporates regularization
techniques to prevent overfitting and improve generalization. Overfitting occurs when a
model learns the training data too well and is unable to generalize to new data. Regularization
techniques help to prevent this by constraining the model to learn simpler representations of
the data.
There are three main types of regularized autoencoders:
1.2.1) Denoising Autoencoder
Denoising autoencoders create a corrupted copy of the input by introducing some noise. This
prevents the autoencoder from simply copying the input to the output without learning
meaningful features of the data.
These autoencoders take a partially corrupted input while training to recover the original
undistorted input. The model learns a vector field for mapping the input data towards a lower
dimensional manifold which describes the natural data to cancel out the added noise.
Advantages-
It was introduced to achieve good representation. Such a representation is one that can
be obtained robustly from a corrupted input and that will be useful for recovering the
corresponding clean input.
Corruption of the input can be done randomly by setting some of the inputs to zero; the
remaining nodes pass their values through unchanged.
The loss is minimized between the network's output and the original, uncorrupted input.
Setting up a simple denoising autoencoder is easy.
Drawbacks-
To train an autoencoder to denoise data, it is necessary to perform a preliminary
stochastic mapping to corrupt the data and use the corrupted version as input.
This model cannot develop a mapping that simply memorizes the training data, because
the input and the target output are no longer the same.
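A minimal sketch of the corruption step, assuming NumPy and the Keras autoencoder pattern shown earlier (the 0.3 masking fraction is a hypothetical choice):

import numpy as np

def corrupt(x, fraction=0.3):
    # Randomly zero out a fraction of each input (masking noise)
    mask = np.random.rand(*x.shape) > fraction
    return x * mask

# Train to reconstruct the clean input from the corrupted one
# (autoencoder and x_train as in the earlier sketch):
# autoencoder.fit(corrupt(x_train), x_train, epochs=10, batch_size=256)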
1.2.2) Sparse Autoencoder
Sparse autoencoders have hidden nodes greater than input nodes. They can still discover
important features from the data. A generic sparse autoencoder is visualized where the
obscurity of a node corresponds with the level of activation. Sparsity constraint is introduced
on the hidden layer. This is to prevent output layer copy input data. Sparsity may be obtained
by additional terms in the loss function during the training process, either by comparing the
probability distribution of the hidden unit activations with some low desired value,or by
manually zeroing all but the strongest hidden unit activations. Some of the most powerful AIs
in the 2010s involved sparse autoencoders stacked inside of deep neural networks.
Advantages-
Sparse autoencoders apply a sparsity penalty that pushes hidden activations toward a
value close to zero (but not exactly zero). The penalty is applied on the hidden layer in
addition to the reconstruction error, which helps to prevent overfitting.
They take the highest activation values in the hidden layer and zero out the rest of the
hidden nodes. This prevents the autoencoder from using all of the hidden nodes at a time,
forcing only a reduced number of hidden nodes to be used.
Drawbacks-
For this to work, it is essential that the individual nodes of a trained model that
activate are data-dependent, i.e., that different inputs result in activations of
different nodes through the network.
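A minimal Keras sketch of the first approach, using an L1 activity penalty on an overcomplete hidden layer (the library choice, layer sizes, and the 1e-5 penalty weight are assumptions):

from keras.models import Model
from keras.layers import Input, Dense
from keras import regularizers

inputs = Input(shape=(784,))
# Overcomplete hidden layer with an L1 sparsity penalty on its activations
code = Dense(1024, activation='relu',
             activity_regularizer=regularizers.l1(1e-5))(inputs)
outputs = Dense(784, activation='sigmoid')(code)
sparse_ae = Model(inputs, outputs)
sparse_ae.compile(optimizer='adam', loss='mse')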
1.2.3) Contractive Autoencoder
The objective of a contractive autoencoder is to have a robust learned representation which is
less sensitive to small variation in the data. Robustness of the representation for the data is
done by applying a penalty term to the loss function. Contractive autoencoder is another
regularization technique just like sparse and denoising autoencoders. However, this regularizer
corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with
respect to the input. The squared Frobenius norm of the Jacobian of the hidden layer with
respect to the input, i.e., the sum of squares of all its elements, is used as the penalty term.
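A sketch of this penalty using TensorFlow's GradientTape (the framework is an assumption; encoder and decoder are hypothetical Keras models mapping x to h and h to the reconstruction, and lam is a hypothetical penalty weight):

import tensorflow as tf

def contractive_loss(encoder, decoder, x, lam=1e-4):
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(x)
        h = encoder(x)                       # code h = f(x)
    recon = decoder(h)                       # reconstruction g(f(x))
    mse = tf.reduce_mean(tf.square(x - recon))
    jac = tape.batch_jacobian(h, x)          # Jacobian dh/dx, per example
    frob = tf.reduce_sum(tf.square(jac))     # squared Frobenius norm (sum of squares)
    return mse + lam * frob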
Advantages-
Contractive autoencoder is a better choice than denoising autoencoder to learn
useful feature extraction.
This model learns an encoding in which similar inputs have similar encodings. Hence,
we're forcing the model to learn how to contract a neighborhood of inputs into a
smaller neighborhood of outputs.
1.4) Stochastic Encoders and Decoders
Stochastic Encoders and Decoders are a type of autoencoder that uses stochasticity
(randomness) in the encoding and decoding process. This can be achieved by using dropout,
variational inference, or other techniques. Stochastic encoders and decoders have several
advantages over deterministic autoencoders, including:
Improved generalization: Stochasticity helps to prevent overfitting by regularizing the
model and making it more robust to noise in the training data.
More informative representations: Stochastic encoders and decoders can learn more
informative representations of the data by capturing the uncertainty in the data.
More efficient training: Stochastic encoders and decoders can be trained more
efficiently than deterministic autoencoders, as they do not need to compute the full
posterior distribution of the latent variables.
Stochastic encoders and decoders have been used for a variety of tasks, including:
Image denoising: Stochastic encoders and decoders can be used to denoise images
by learning to reconstruct noisy versions of the input data.
Inpainting: Stochastic encoders and decoders can be used to inpaint missing pixels in
images by learning to reconstruct the images from partial information.
Super-resolution: Stochastic encoders and decoders can be used to super-resolve images
by learning to upsample low-resolution images to high-resolution images.
Generative adversarial networks (GANs): Stochastic encoders and decoders are often
used as the generator component in GANs.
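One common way to make an encoder stochastic is the reparameterization trick used in variational autoencoders; below is a minimal TensorFlow/Keras sketch (the framework and layer sizes are assumptions):

import tensorflow as tf
from tensorflow.keras import layers

class Sampling(layers.Layer):
    # Draws z from N(z_mean, exp(z_log_var)) via the reparameterization trick
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

x = layers.Input(shape=(784,))
h = layers.Dense(256, activation='relu')(x)
z_mean = layers.Dense(32)(h)                 # mean of the code distribution
z_log_var = layers.Dense(32)(h)              # log-variance of the code distribution
z = Sampling()([z_mean, z_log_var])          # stochastic code
out = layers.Dense(784, activation='sigmoid')(layers.Dense(256, activation='relu')(z))
stochastic_ae = tf.keras.Model(x, out)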
1.6) Learning Manifolds with Autoencoders
Manifold learning is an approach in machine learning that assumes data lies on a
manifold of much lower dimension than the input space. These manifolds can be linear or
non-linear, and the field accordingly tries to project data from the high-dimensional space
down to a low dimension. For example, principal component analysis (PCA) is a linear
manifold learning method, whereas an autoencoder performs non-linear dimensionality
reduction (NDR), with the ability to learn non-linear manifolds in low dimensions. A
comparison of linear and non-linear manifold learning is shown in the following figure:
As the figure shows, in graph a) the data resides on a linear manifold, whereas in
graph b) it resides on a second-order non-linear manifold.
All autoencoder training procedures involve a compromise between two forces:
1. Learning a representation h of a training example x such that x can be approximately
recovered from h through a decoder.
2. Satisfying the constraint or regularization penalty, which limits the model's capacity
or encourages properties such as sparsity or robustness.
Together, these two forces compel the hidden representation to capture only the most
salient variations of the data-generating distribution.
Chapter-II: Deep Generative Models
2.1) Boltzmann Machines
Boltzmann Machine definition:
A Boltzmann machine is an energy-based model defined over a d-dimensional binary random
vector x ∈ {0, 1}^d, with the joint distribution given by
P(x) = exp(-E(x)) / Z,
where Z is the partition function and the energy function is
E(x) = -x^T U x - b^T x,
with U the weight matrix and b the bias vector.
While this scenario is certainly viable, it does limit the kinds of interactions between the
observed variables to those described by the weight matrix. Specifically, it means that the
probability of one unit being on is given by a linear model (logistic regression) from the
values of the other units.
A Boltzmann machine with hidden units is no longer limited to modeling linear relationships
between variables. Instead, the Boltzmann machine becomes a universal approximator of
probability mass functions over discrete variables.
Formally, we decompose the units x into two subsets:
Visible units v
Latent / hidden units h
The energy function then becomes:
E(v, h) = -v^T R v - v^T W h - h^T S h - b^T v - c^T h.
Learning algorithms for Boltzmann machines are usually based on maximum likelihood. All
Boltzmann machines have an intractable partition function, so the maximum likelihood
gradient must be approximated.
An interesting property of Boltzmann machines when trained with learning rules based on
maximum likelihood is that the update for a particular weight connecting two units depends
only on the statistics of those two units, collected under different distributions:
p_model(v) and p_data(v) p_model(h|v). The rest of the network participates in shaping
those statistics, but the weight can be updated without knowing anything about the rest of
the network or how those statistics were produced. This means that the learning rule is local.
2.2)Restricted Boltzmann Machines (RBMs)
The term "restricted" means that units within the same layer may not be connected to one
another. In other words, neurons within the hidden layer, or within the input layer, cannot
form connections among themselves. However, there may be connections between the
visible and hidden layers.
Since there is no output layer in our machine, it is unclear how we will detect, modify the
weights, and determine whether or not our prediction was right. One response fits all the
questions: Restricted Boltzmann Machine.
Features of Restricted Boltzmann Machine
Some key characteristics of the restricted Boltzmann machine are:
There are no connections between units within the same layer.
They employ symmetric, recurrent structures.
It is an algorithm for unsupervised learning, meaning that it draws conclusions from
the input data without labeled replies.
In their learning process, RBMs attempt to link low energy states with high probability
ones and vice versa.
Working of RBM
Each visible node takes one low-level feature from an item in the dataset. Node 1 of the
hidden layer multiplies the input x by a weight and adds a bias. The outcomes of these two
operations are fed into an activation function, which, given the input x, produces the
output of the node, i.e., the strength of the signal passing through it.
Let's now examine how many inputs would mix at a single hidden node. The output of the
node is created by multiplying each x by a distinct weight, summing the products, adding the
sum to a bias, and then passing the final result once again via an activation function.
Each input x is multiplied by its corresponding weight w at each hidden node. In other words,
a single input x would have three weights in this situation, giving 12 weights in total
(4 input nodes x 3 hidden nodes). The weights between the two layers always form a matrix
with the input nodes as rows and the hidden nodes as columns.
The four inputs are sent to each hidden node, multiplied by each weight. Each hidden node
receives one output as a result of the activation algorithm after the sum of these products is
once more added to a bias (which compels at least some activations to occur).
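A NumPy sketch of this visible-to-hidden pass (the 4x3 shapes follow the example above; the input values and random weights are placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v = np.array([0.2, 0.7, 0.1, 0.9])   # 4 visible nodes (low-level features)
W = np.random.randn(4, 3)            # weight matrix: input rows x hidden columns
b = np.zeros(3)                      # hidden biases

h_prob = sigmoid(v @ W + b)          # p(h|v): each hidden node's activation
h = (np.random.rand(3) < h_prob).astype(float)   # sample binary hidden states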
Now that you have a basic understanding of how the Restricted Boltzmann Machine operates,
let's move on and examine the procedure for training an RBM.
Training of RBM
Two methods, Gibbs Sampling and Contrastive Divergence, are used to train an RBM.
When direct sampling is challenging, the Markov chain Monte Carlo method known as Gibbs
sampling is used to get a series of observations that are roughly drawn from a given
multivariate probability distribution.
If the input is represented by v, the hidden values h are predicted via p(h|v). When the
hidden values are known, p(v|h) is used to predict the regenerated input values. If this
process is performed k times, the value v_k is obtained from the initial input v_0.
In Contrastive Divergence, the gradient is a slope describing the relationship between a
network's errors and its weights, and the goal is to approximate it rather than compute it
exactly. Contrastive Divergence is an approximate maximum-likelihood learning approach: it
is employed when we cannot directly evaluate a set of probabilities or a function and need
to approximate the learning gradient of the algorithm and choose which direction to move in.
In CD, the weights are updated as follows: the gradient is first determined from the
reconstructed input, and the old weights are updated by adding the delta.
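A sketch of one CD-1 step, continuing the previous snippet (v, W, b, numpy as np, and sigmoid as defined above; the learning rate eps is a hypothetical choice, and visible biases are omitted for brevity):

# One step of Contrastive Divergence (CD-1)
h0 = sigmoid(v @ W + b)                      # p(h|v_0)
v1 = sigmoid(W @ h0)                         # reconstruction p(v|h), using mean activations
h1 = sigmoid(v1 @ W + b)                     # p(h|v_1)
eps = 0.1                                    # learning rate
delta_W = eps * (np.outer(v, h0) - np.outer(v1, h1))
W += delta_W                                 # positive minus negative statistics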
2.3)Deep Belief Networks (DBNs)
Consider stacking numerous RBMs so that the outputs of the first RBM serve as the input for
the second RBM, and so forth. Deep Belief Networks are the name given to these networks.
Each layer's connections are undirected (as each layer is an RBM), while the connections
between layers are directed (except for the top two layers, whose connections are
undirected). DBNs can be trained in two different ways:
Greedy Layer-wise Training Algorithm: RBMs are trained using a greedy layer-by-layer
training algorithm. The orientation between the DBN layers is established as soon as
the individual RBMs have been trained (i.e., the parameters, weights, and biases, have
been defined).
Wake-sleep Algorithm: The DBN is trained from the bottom up in a wake phase
(connections going up indicate wake), and then from the top down in a sleep phase
(connections going down indicate sleep).
In other words, we stack the RBMs, train them, and then orient the connections to ensure
that they only work downwards (except for the top two layers).
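A minimal sketch of greedy layer-wise pretraining with scikit-learn's BernoulliRBM (the library choice is an assumption; X is a hypothetical data matrix with values in [0, 1]):

from sklearn.neural_network import BernoulliRBM

# Each RBM is trained on the hidden representations of the one below it
rbm1 = BernoulliRBM(n_components=256, learning_rate=0.01, n_iter=10)
h1 = rbm1.fit_transform(X)    # X: data matrix, assumed available
rbm2 = BernoulliRBM(n_components=128, learning_rate=0.01, n_iter=10)
h2 = rbm2.fit_transform(h1)   # outputs of RBM 1 feed RBM 2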
Chapter-I: Machine Vision
1.1. Convolutional Neural Networks
a) The Two-Dimensional Structure of Visual Imagery
b) Computational Complexity
c) Convolutional Layers
d) Multiple Filters
e) A Convolutional Example
f) Convolutional Filter Hyperparameters
Answer:
A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks
that specializes in processing data that has a grid-like topology, such as an image. A digital
image is a binary representation of visual data. It contains a series of pixels arranged in a
grid-like fashion that contains pixel values to denote how bright and what color each pixel
should be.
a) The Two-Dimensional Structure of Visual Imagery
Visual imagery is inherently two-dimensional, with each pixel representing a specific location
and intensity value. This two-dimensional structure plays a crucial role in how we perceive
and interpret visual information. Convolutional neural networks (CNNs) are specifically
designed to exploit this structure, allowing them to effectively process and analyze visual
data.
b) Computational Complexity
Traditional neural networks, also known as fully connected networks, connect every neuron in
one layer to every neuron in the next layer. This leads to a significant increase in
computational complexity as the network grows larger. In contrast, CNNs utilize a
convolutional operation that involves sliding a small filter or kernel over the input data. This
operation significantly reduces the number of connections, making CNNs more
computationally efficient for processing visual data.
c) Convolutional Layers
Convolutional layers are the core building blocks of CNNs. They consist of a set of learnable
filters or kernels that are applied to the input data. As the filters slide across the input, they
extract relevant features, such as edges, shapes, and colors. These extracted features are then
passed to subsequent layers for further processing and analysis.
d) Multiple Filters
CNNs can employ multiple filters in each convolutional layer, each capturing different aspects
of the input data. For instance, one filter might detect horizontal edges, while another might
detect vertical edges. By using multiple filters, CNNs can learn a more comprehensive
representation of the input data.
e) A Convolutional Example
Consider a simple image of a cat. A CNN might use a filter that detects horizontal edges to
identify the cat's whiskers. Another filter might detect vertical edges to identify the cat's eyes
and mouth. By combining the outputs of these filters, the CNN can build a more complete
understanding of the cat's features and ultimately classify the image as a cat.
f) Convolutional Filter Hyperparameters
Convolutional filters are characterized by several hyperparameters that influence their behavior.
These include:
Filter size: The size of the filter determines the scope of the features it detects. Larger
filters capture larger-scale features, while smaller filters capture finer details.
Stride: The stride determines how far the filter moves across the input data with each
application. Larger strides lead to more efficient computation but may overlook smaller
details.
Padding: Padding adds zeros to the edges of the input data, allowing the filter to
extract features that extend beyond the original boundaries.
Activation function: The activation function introduces non-linearity into the network,
allowing it to learn more complex patterns. Common activation functions for CNNs
include ReLU (Rectified Linear Unit) and Leaky ReLU.
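These hyperparameters map directly onto the arguments of a Keras convolutional layer; a small sketch (the specific values are illustrative, not prescribed by the notes):

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(filters=32,            # number of filters (feature maps)
                 kernel_size=(3, 3),    # filter size
                 strides=(1, 1),        # stride
                 padding='same',        # zero-padding preserves spatial size
                 activation='relu',     # non-linearity
                 input_shape=(28, 28, 1)))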
1.2. Pooling Layers
1.3. LeNet-5 in Keras
1.4. AlexNet and VGGNet in Keras
Answer:
1.2. Pooling Layers
Pooling layers are another essential component of CNNs. Their primary purpose is to reduce
the spatial dimensions of the feature maps produced by convolutional layers. This reduction in
dimensionality helps to control the computational complexity of the network and prevent
overfitting.
There are two main types of pooling layers: max pooling and average pooling.
Max pooling:
Max pooling selects the maximum value within each pooling region, effectively reducing the
spatial dimensions by a factor of the pooling size. For instance, a 2x2 max pooling layer would
reduce the height and width of the feature map by half.
Average pooling:
Average pooling computes the average value within each pooling region, also reducing the
spatial dimensions by a factor of the pooling size. This type of pooling tends to be more robust
to outliers and noise compared to max pooling.
Pooling layers are typically applied after convolutional layers to reduce the feature map size
before passing it on to subsequent layers. The choice of pooling type (max pooling or
average pooling) and pooling size depends on the specific task and dataset.
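In Keras, the two pooling types correspond to MaxPooling2D and AveragePooling2D; a brief sketch:

from keras.layers import MaxPooling2D, AveragePooling2D

# A 2x2 pool halves the height and width of the feature map
max_pool = MaxPooling2D(pool_size=(2, 2))      # keeps the maximum per region
avg_pool = AveragePooling2D(pool_size=(2, 2))  # keeps the average per region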
1.3. LeNet-5 in Keras
LeNet-5 is a seminal convolutional neural network architecture developed by Yann LeCun in
1998. It was designed for handwritten digit recognition and achieved remarkable performance
on the MNIST dataset. LeNet-5 is considered a groundbreaking architecture that set the
foundation for modern CNNs.
Here's a simplified overview of the LeNet-5 architecture:
Input layer: Receives a 28x28 grayscale image representing a handwritten digit.
Convolutional layer 1: Applies 6 5x5 filters to the input image, producing 6 feature maps.
Pooling layer 1: Applies a 2x2 max pooling operation to each feature map, reducing the
size by half.
Convolutional layer 2: Applies 16 5x5 filters to the pooled feature maps, producing 16
feature maps.
Pooling layer 2: Applies a 2x2 max pooling operation to each feature map, reducing the
size by half.
Fully connected layer 1: Flattens the pooled feature maps and maps them to a 120-dimensional vector.
Fully connected layer 2: Reduces the 120-dimensional vector to 84 dimensions.
Output layer: Produces a 10-dimensional vector representing the probabilities of the
input image being each digit (0-9).
LeNet-5 is not a built-in Keras model, but the architecture can be implemented in a few
lines of Keras code and trained for handwritten digit recognition.
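A minimal Keras sketch following the layer-by-layer description above (the activation choices are assumptions; the original 1998 network used tanh-like nonlinearities and subsampling rather than max pooling):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(6, (5, 5), activation='tanh', padding='same',
                 input_shape=(28, 28, 1)))        # 6 feature maps
model.add(MaxPooling2D((2, 2)))                   # halve height and width
model.add(Conv2D(16, (5, 5), activation='tanh'))  # 16 feature maps
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(120, activation='tanh'))
model.add(Dense(84, activation='tanh'))
model.add(Dense(10, activation='softmax'))        # probabilities for digits 0-9
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])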
1.4. AlexNet and VGGNet in Keras
AlexNet and VGGNet are two more advanced CNN architectures that gained popularity in the
early 2010s. These architectures achieved state-of-the-art performance on image
classification tasks, demonstrating the power of CNNs in computer vision.
AlexNet:
AlexNet was the winner of the 2012 ImageNet Large Scale Visual Recognition Challenge
(ILSVRC), achieving a significant improvement in image classification accuracy compared to
previous methods. It introduced deeper architectures with more convolutional and pooling
layers, demonstrating the potential of deeper CNNs.
VGGNet:
VGGNet is a family of CNN architectures that explore the effect of increasing the depth of the
network. It introduced the concept of using very small 3x3 filters stacked in deeper layers,
achieving competitive performance with AlexNet while using fewer parameters.
Keras provides built-in, pre-trained implementations of VGGNet (VGG16 and VGG19, via
keras.applications). AlexNet is not included in Keras, but it can be constructed directly
from standard layers. Either way, both architectures can be trained and evaluated for
image classification tasks.
1.5. Residual Networks
a) Vanishing Gradients: The Bête Noire of Deep CNNs
b) Residual Connections
c) ResNet
Answer:
Residual Network: To address the problem of vanishing/exploding gradients, this
architecture introduced the concept of residual blocks. The network uses a technique
called skip connections: a skip connection links the activations of a layer to later
layers by skipping over some layers in between, forming a residual block. ResNets are
made by stacking these residual blocks together. The idea behind this network is that
instead of having the layers learn the underlying mapping directly, we let the network
fit the residual mapping. So, instead of the initial mapping H(x), the network fits
F(x) := H(x) - x, which gives H(x) := F(x) + x.
Fig: Skip (Shortcut) connection
The advantage of adding this type of skip connection is that if any layer hurts the
performance of the architecture, it can effectively be skipped by regularization. This
makes it possible to train very deep neural networks without the problems caused by
vanishing/exploding gradients. The authors of the paper experimented with networks of
100 to 1,000 layers on the CIFAR-10 dataset.
a) Vanishing Gradients: The Bête Noire of Deep CNNs
As CNNs become deeper, a phenomenon known as vanishing gradients arises. This means
that the gradients of the loss function with respect to the weights in lower layers become very
small, making it difficult for the optimization algorithm to update these weights effectively.
This problem is particularly acute in very deep CNNs, where the gradients can become so
small that they are effectively zero, preventing any meaningful update to the weights. This
makes it difficult for these deep networks to learn and improve their performance.
b) Residual Connections
Residual networks (ResNets) were introduced in 2015 by Kaiming He et al. to address the
problem of vanishing gradients in deep CNNs. ResNets introduce a novel architectural design
that allows for efficient information flow through the network, even when it is very deep.
The key idea behind ResNets is to introduce shortcut connections that bypass one or more
layers in the network. These shortcut connections allow the gradients to flow directly from the
input of a block to its output, without having to pass through all of the intermediate layers.
This helps to prevent the gradients from vanishing, allowing the network to learn more
effectively.
c) ResNet
The ResNet architecture consists of a series of residual blocks, each containing a
convolutional layer followed by a shortcut connection. The shortcut connection typically adds
the input of the block to the output of the convolutional layer, before applying an activation
function. This design allows for efficient information flow through the network, even when it is
very deep.
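A sketch of a basic residual block in the Keras functional API (the layer sizes and the placement of batch normalization are assumptions; many variants exist):

from keras.layers import Conv2D, BatchNormalization, Activation, Add

def residual_block(x, filters):
    # Main path: two 3x3 convolutions
    y = Conv2D(filters, (3, 3), padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, (3, 3), padding='same')(y)
    y = BatchNormalization()(y)
    # Shortcut: add the block's input to its output before the final activation
    y = Add()([x, y])
    return Activation('relu')(y)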
ResNets have achieved remarkable performance in a wide range of image recognition tasks,
including image classification, object detection, and image segmentation. They have become
the de-facto standard for deep CNN architectures and have been widely adopted in the field of
computer vision.
1.6. Applications of Machine Vision
a) Object Detection
b) Image Segmentation
c) Transfer Learning
d) Capsule Networks
Answer:
Machine vision has revolutionized various industries by automating tasks that were previously
performed manually. Its applications span a wide range, from manufacturing and robotics to
healthcare and agriculture. Here are some notable examples of machine vision applications:
a) Object Detection
Object detection involves identifying and locating objects within images or videos. This
capability is crucial for tasks such as:
Traffic monitoring: Machine vision systems can detect and track vehicles in traffic
footage, enabling real-time traffic analysis and accident prevention.
Security and surveillance: Object detection algorithms can identify and track people
or objects in surveillance footage, enhancing security measures and preventing
unauthorized access.
Retail inventory management: Machine vision systems can automate product counting
and tracking on shelves, optimizing inventory management and preventing stockouts.
b) Image Segmentation
Image segmentation involves dividing an image into meaningful regions or segments. This
technique is essential for tasks such as:
Medical image analysis: Machine vision algorithms can segment tissues and organs in
medical images, aiding in diagnosing diseases and planning treatments.
Satellite image analysis: Image segmentation can identify and classify land cover types
from satellite imagery, supporting environmental monitoring and resource
management.
Autonomous driving: Image segmentation helps autonomous vehicles distinguish
between roads, pedestrians, and other objects, enabling safe and reliable navigation.
c) Transfer Learning
Transfer learning involves utilizing a pre-trained machine learning model for a new task. This
approach is particularly valuable for machine vision applications, as training large-scale deep
learning models can be computationally expensive and time-consuming.
Transfer learning allows researchers to leverage existing models, such as those trained on large
image datasets like ImageNet, and adapt them to specific applications. This can significantly
reduce the training time and computational resources required for developing new machine
vision systems.
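A minimal Keras sketch of this idea, adapting an ImageNet-pre-trained VGG16 to a hypothetical 5-class task (the class count and new head layers are assumptions):

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D

base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pre-trained features
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(5, activation='softmax')(x)  # new head for the 5-class task
model = Model(base.input, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')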
d) Capsule Networks
Capsule networks, introduced by Geoffrey Hinton in 2017, represent a recent advancement in
machine vision. They aim to address some of the limitations of traditional convolutional neural
networks (CNNs), particularly in capturing spatial relationships and object hierarchies.
Capsule networks group neurons into capsules, where each capsule represents a higher-level
concept or feature in the image. This hierarchical structure allows capsule networks to learn more
robust representations of objects and their relationships.
While capsule networks are still under development, they hold promise for improving the
performance of machine vision tasks in areas such as object detection, image segmentation,
and image understanding.
Application | Description | Example Tasks
Object detection | Identifying and locating objects in images or videos | Traffic monitoring, security surveillance, retail inventory management
Image segmentation | Dividing an image into meaningful regions or segments | Medical image analysis, satellite image analysis, autonomous driving
Transfer learning | Utilizing a pre-trained machine learning model for a new task | Adapting existing image classification models to specific applications, such as product recognition or defect detection
Capsule networks | A novel neural network architecture that aims to address limitations of traditional CNNs | Potential improvements in object detection, image segmentation, and image understanding
Chapter-II: Natural Language Processing
2.1. Preprocessing Natural Language Data
a) Tokenization
b) Converting All Characters to Lowercase
c) Removing Stop Words and Punctuation
d) Stemming
e) Handling n-grams
f) Preprocessing the Full Corpus
Answer:
Natural language data, consisting of text and speech, is often unstructured and noisy, making
it directly unsuitable for machine learning algorithms. Preprocessing natural language data
involves cleaning, transforming, and structuring the data to make it suitable for analysis and
modeling.
d) Stemming
Stemming reduces each word to a root form (stem) by stripping affixes, while lemmatization
maps each word to its dictionary form (lemma). For example, stemming "cats" yields the stem
"cat", while "catches" and "catching" both yield "catch". Lemmatization, on the other hand,
would identify "cat" as the lemma of "cats" and "catch" as the lemma of "catches" and
"catching".
e) Handling n-grams
N-grams are sequences of n consecutive tokens. For example, the bigrams (two-word sequences)
for the phrase "natural language processing" would be "natural language" and
"language processing".
N-grams can be useful for capturing local patterns in the data, such as collocations (words
that frequently appear together). However, it's important to choose the appropriate value of
n, as too large of a value can lead to a large number of n-grams and increase the
dimensionality of the data.
f) Preprocessing the Full Corpus
Preprocessing the full corpus, or the entire collection of text data, involves applying the above
steps to all the text documents in the corpus. This ensures that the data is consistent and ready
for further analysis or modeling.
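A short NLTK sketch covering these preprocessing steps end to end (the library choice and sample sentence are assumptions; the notes do not prescribe a toolkit):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import ngrams

nltk.download('punkt')
nltk.download('stopwords')

text = "Natural language processing helps machines understand text."
tokens = nltk.word_tokenize(text.lower())              # tokenize + lowercase
stops = set(stopwords.words('english'))
tokens = [t for t in tokens if t.isalpha() and t not in stops]  # drop stop words/punctuation
stems = [PorterStemmer().stem(t) for t in tokens]      # stemming
bigrams = list(ngrams(tokens, 2))                      # n-gram handling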
2.2. Creating Word Embeddings with word2vec
a) The Essential Theory Behind word2vec
b) Evaluating Word Vectors
c) Running word2vec
d) Plotting Word Vectors
Answer:
a) The Essential Theory Behind word2vec
word2vec is a family of algorithms and models for producing word embeddings, which are
vector representations of words. These vector representations capture the semantic and
syntactic relationships between words, allowing machines to understand the meaning and
context of words in natural language.
There are two main word2vec algorithms:
Continuous Bag-of-Words (CBOW): CBOW predicts a target word from its surrounding
context words. It does this by training a neural network that takes a sequence of
context words as input and outputs a vector representation of the target word.
Skip-gram: Skip-gram predicts the surrounding context words from a target word. It
does this by training a neural network that takes a vector representation of a target
word as input and outputs a probability distribution over all possible context words.
Both CBOW and skip-gram effectively learn word embeddings that capture the relationships
between words in a corpus of text. These word embeddings can then be used for various
natural language processing tasks, such as:
Word similarity: Measuring the similarity between words based on their vector
representations.
Semantic analogy: Solving analogies like "King is to Queen as Man is to ?" using word
embeddings.
Sentiment analysis: Classifying the sentiment of text (positive, negative, neutral) based
on word embeddings.
b) Evaluating Word Vectors
Evaluating the quality of word embeddings is crucial for ensuring their effectiveness in
downstream tasks. Several methods can be used to evaluate word embeddings:
Lexical similarity tasks: Measuring how well word embeddings capture word similarity,
such as using the WordSim353 dataset.
Analogy tasks: Measuring how well word embeddings can solve word analogies, such as
using the Stanford Analogy Test.
Intrinsic evaluation metrics: Measuring the internal consistency of word embeddings, such
as using the cosine similarity between words with similar meanings.
c) Running word2vec
There are various tools and libraries available for implementing word2vec algorithms, such as
Gensim and TensorFlow. These tools provide user-friendly interfaces for training and using
word2vec models.
The general process of running word2vec involves:
1. Preparing the corpus: Cleaning and preprocessing the text data to remove noise and
inconsistencies.
2. Training the word2vec model: Selecting the appropriate algorithm (CBOW or skip-
gram), setting hyperparameters (embedding size, window size), and training the model on
the preprocessed text data.
3. Evaluating the word embeddings: Using evaluation metrics to assess the quality of the
learned word vectors.
4. Saving and using the word embeddings: Saving the word embeddings for later use in
downstream tasks.
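A minimal Gensim sketch of these four steps (Gensim is one common choice; the toy corpus and hyperparameter values are illustrative only):

from gensim.models import Word2Vec

sentences = [["natural", "language", "processing"],
             ["word", "embeddings", "capture", "word", "meaning"]]  # toy corpus
model = Word2Vec(sentences, vector_size=64, window=3,
                 sg=1, min_count=1)              # sg=1 selects skip-gram
model.save("word2vec.model")                     # step 4: save for later use
vector = model.wv["language"]                    # a 64-dimensional word vector
similar = model.wv.most_similar("language", topn=3)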
d) Plotting Word Vectors
Visualizing word embeddings can provide insights into their relationships and semantic
properties. Two common techniques for plotting word vectors are:
2D projection: Reducing the dimensionality of word embeddings to two dimensions and
plotting them in a scatter plot. This allows for visualizing the relative positions of words
in the vector space.
TSNE: Using t-distributed stochastic neighbor embedding (TSNE) to project high-
dimensional word embeddings into a lower-dimensional space while preserving local
and global relationships. This produces a more nuanced visualization of word
relationships.
By plotting word vectors, we can observe how words with similar meanings cluster together,
identify outliers, and gain insights into the semantic structure of the language.
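A sketch of the t-SNE approach with scikit-learn and matplotlib, assuming the Gensim model from the previous snippet (the word count and perplexity are illustrative):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

words = model.wv.index_to_key[:100]            # most frequent words
vectors = model.wv[words]
coords = TSNE(n_components=2, random_state=0,
              perplexity=5).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=8)     # label each point with its word
plt.show()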
2.3. The Area under the ROC Curve
a) The Confusion Matrix
b) Calculating the ROC AUC Metric
Answer:
a) The Confusion Matrix
A confusion matrix is a table that summarizes the performance of a binary classifier over a set of
test data. It is a tabular representation of the actual and predicted classifications for each
class in the dataset.
Actual \ Predicted | Positive | Negative
Positive | True Positives (TP) | False Negatives (FN)
Negative | False Positives (FP) | True Negatives (TN)

True Positives (TP): Positive instances correctly classified as positive.
False Negatives (FN): Positive instances incorrectly classified as negative.
False Positives (FP): Negative instances incorrectly classified as positive.
True Negatives (TN): Instances correctly classified as negative.
b) Calculating the ROC AUC Metric
The receiver operating characteristic (ROC) curve is a graphical plot that illustrates the
performance of a binary classifier at varying threshold settings. The ROC curve plots the true
positive rate (TPR) against the false positive rate (FPR) at each threshold setting.
The area under the ROC curve (AUC) is a single scalar value that measures the overall
performance of a binary classifier. It is a measure of how well the classifier can distinguish
between positive and negative cases.
A higher AUC indicates a better classifier, as it means that the classifier is better at
distinguishing between positive and negative cases. A perfect classifier would have an AUC
of 1.0, while a random classifier would have an AUC of 0.5.
Here's the formula for calculating the AUC:
AUC = ∫₀¹ TPR(FPR) dFPR
This formula represents the integral of the TPR curve over the range of FPR values from 0 to
1. In practice, the AUC is often calculated using numerical methods, such as the trapezoidal
rule.
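A scikit-learn sketch of both ideas (the labels and scores are toy values chosen for illustration):

from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]        # predicted probabilities
y_pred = [int(s >= 0.5) for s in y_score]        # threshold at 0.5

print(confusion_matrix(y_true, y_pred))          # rows: actual, columns: predicted
print(roc_auc_score(y_true, y_score))            # AUC computed via the trapezoidal rule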
2.4.Natural Language Classification with Familiar Networks
Here's an example of a review and its corresponding word indices:
Review: "A very good movie with an excellent performance by the lead actor."
Word Indices: [123, 456, 789, 1011, 2345, 6789, ...]
The task of sentiment classification involves predicting the sentiment (positive or negative) of
a given review based on its word indices.
c) Standardizing the Length of the Reviews
Movie reviews can vary significantly in length, which can affect the performance of machine
learning models. To standardize the length of the reviews, we can pad shorter reviews with
zeros and truncate longer reviews to a fixed length.
Here's an example of padding shorter reviews:
from keras.preprocessing.sequence import pad_sequences

max_length = 100  # maximum length of a review
x_train = pad_sequences(x_train, maxlen=max_length)
x_test = pad_sequences(x_test, maxlen=max_length)
This code snippet pads all reviews in the training and testing sets to a maximum length of 100.
d) Dense Network
A dense network, also known as a fully connected network, is a simple and effective
architecture for natural language classification. It consists of a series of fully connected layers,
where each layer is connected to every neuron in the previous layer.
Here's an example of a dense network for sentiment classification:

from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten

model = Sequential()
model.add(Embedding(10000, 128, input_length=max_length))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

This model consists of an embedding layer, a flattening step, and two dense layers. The
embedding layer converts word indices into vector representations, which are flattened and
fed to a fully connected hidden layer. The dense output layer produces a probability between
0 and 1, indicating the sentiment of the review.
e) Convolutional Networks
Convolutional neural networks (CNNs) are another powerful architecture for natural language
processing tasks. CNNs are particularly well-suited for extracting local patterns and features
from text data.
Here's an example of a convolutional network for sentiment classification:

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Embedding

model = Sequential()
model.add(Embedding(10000, 128, input_length=max_length))
model.add(Conv1D(64, kernel_size=3, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

This model consists of an embedding layer, a convolutional layer, a pooling layer, and a
dense output layer. The convolutional layer applies convolutional filters to extract local
patterns from the text sequences. The max pooling layer reduces the dimensionality of the
feature maps, which are then flattened. The dense output layer produces a probability
between 0 and 1, indicating the sentiment of the review.
Both dense networks and convolutional networks can be effective for natural language
classification. The choice of architecture depends on the specific task and dataset.
2.5.Networks Designed for Sequential Data
e) Seq2seq and Attention
Seq2seq models are a type of RNN that is specifically designed for tasks that involve translating
sequences from one language to another. Seq2seq models typically consist of an encoder-
decoder architecture, where the encoder encodes the input sequence into a vector
representation and the decoder generates the output sequence based on the vector
representation.
Attention is a mechanism that allows the decoder to focus on the most relevant parts of the
input sequence when generating the output sequence. This can improve the accuracy of the
model, especially for long input sequences.
f) Transfer Learning in NLP
Transfer learning is a technique that involves using a pre-trained model on a new task. This can be
particularly useful in natural language processing, where training large neural networks from
scratch can be computationally expensive and time-consuming.
Chapter-III: Generative Adversarial Networks
3.1. Essential GAN Theory
3.3. The Quick, Draw! Dataset
3.4. The Discriminator Network
3.5. The Generator Network
3.6. The Adversarial Network
3.7. GAN Training
Answer:
A Generative Adversarial Network (GAN) is a deep learning architecture that consists of two
neural networks competing against each other in a zero-sum game framework. The goal of
GANs is to generate new, synthetic data that resembles some known data distribution.
3.1. Essential GAN Theory
Generative adversarial networks (GANs) are a class of machine learning models that are able
to generate new data that is similar to real data. GANs are composed of two main
components: a generator and a discriminator.
The Generator:
The generator is responsible for generating new data. It takes in a random noise vector as
input and produces a synthetic data sample as output. The goal of the generator is to produce
data that is so realistic that it can fool the discriminator into believing it is real.
The Discriminator:
The discriminator is responsible for distinguishing between real and fake data. It takes in a data
sample as input and outputs a probability that the sample is real. The goal of the discriminator
is to correctly classify real data as real and fake data as fake.
The generator and discriminator are trained in an adversarial manner. The generator tries to
improve its ability to generate realistic data, while the discriminator tries to improve its ability
to distinguish between real and fake data. This adversarial training process forces both the
generator and discriminator to improve their performance, leading to the generation of
increasingly realistic data.
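A compact Keras sketch of this two-network setup on hypothetical 1-dimensional data (the sizes and freezing pattern are illustrative; in practice training alternates between updating the discriminator on real and fake samples and updating the generator through the combined model):

from keras.models import Sequential
from keras.layers import Dense

# Generator: maps 8-dimensional noise to a fake 1-D sample
generator = Sequential([Dense(16, activation='relu', input_dim=8),
                        Dense(1)])
# Discriminator: outputs the probability that a sample is real
discriminator = Sequential([Dense(16, activation='relu', input_dim=1),
                            Dense(1, activation='sigmoid')])
discriminator.compile(optimizer='adam', loss='binary_crossentropy')

# Combined model used to train the generator: freeze the discriminator
discriminator.trainable = False
gan = Sequential([generator, discriminator])
gan.compile(optimizer='adam', loss='binary_crossentropy')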
3.3. The Quick, Draw! Dataset
The Quick, Draw! dataset is a large collection of hand-drawn sketches paired with
corresponding labels. The dataset was created by Google and is freely available for research
purposes.
The Quick, Draw! dataset is a valuable resource for training GANs because it provides a large
amount of real data that can be used to train the discriminator. Additionally, the dataset is
labeled, which allows the training process to be supervised.
3.4. The Discriminator Network
The discriminator network is a convolutional neural network (CNN) that takes in a sketch as
input and outputs a probability that the sketch is real. The CNN architecture is well-suited for this
task because it is able to extract local features from the sketch, which can be used to
distinguish between real and fake sketches.
The discriminator network consists of several convolutional layers followed by a fully
connected layer. The convolutional layers extract features from the sketch, while the fully
connected layer outputs the probability that the sketch is real.
3.5. The Generator Network
The generator network is a recurrent neural network (RNN) that takes in a random noise
vector as input and produces a sketch as output. The RNN architecture is well-suited for this
task because it is able to generate sequences of data, which is necessary for generating
sketches.
The generator network consists of an embedding layer, an LSTM layer, and a decoder layer.
The embedding layer converts the random noise vector into a vector representation. The
LSTM layer processes the vector representation and generates a sequence of hidden states.
The decoder layer takes the hidden states as input and produces a sketch as output.
3.6. The Adversarial Network
The adversarial network is the combination of the generator and discriminator networks. The
adversarial network is trained in an adversarial manner, where the generator tries to fool the
discriminator into believing its sketches are real, while the discriminator tries to correctly classify
real sketches as real and fake sketches as fake.
The adversarial training process forces both the generator and discriminator to improve their
performance. The generator learns to generate more realistic sketches, while the
discriminator learns to better distinguish between real and fake sketches.
3.7. GAN Training
GAN training is a complex process that requires careful tuning of hyperparameters. The
following are some of the key hyperparameters that need to be tuned:
Learning rate: The learning rate determines how quickly the generator and
discriminator networks are updated during training. A high learning rate can lead to
unstable training, while a low learning rate can lead to slow training.
Batch size: The batch size determines the number of data samples that are used to update
the generator and discriminator networks at each training step. A large batch size can
lead to more stable training, while a small batch size can lead to faster training.
Regularization: Regularization techniques can be used to prevent the generator and
discriminator networks from overfitting the training data. Common regularization
techniques include dropout and L2 regularization.
Loss function: The loss function determines how the performance of the generator and
discriminator networks is measured. The most common loss function for GANs is the
binary cross-entropy loss.
GAN training can be a challenging process, but it can be very rewarding when successful.
GANs have been used to generate a variety of realistic data, including images, videos, and
music.
Chapter-IV: Deep Reinforcement Learning
4.1. Essential Theory of Reinforcement Learning
a) The Cart-Pole Game
b) Markov Decision Processes
c) The Optimal Policy
Answer:
a) The Cart-Pole Game
The Cart-Pole game is a classic reinforcement learning task that involves balancing a pole on
a cart. The cart can move left or right, and the goal is to keep the pole upright as long as possible.
The game is typically played in a simulated environment, where the state of the cart and pole
is represented by a vector of numbers.
The Cart-Pole game is a good example of a reinforcement learning problem because it has the
following characteristics:
The environment is dynamic: The state of the cart and pole changes over time.
The environment is stochastic: The outcome of an action is not always deterministic.
There is a delayed reward: The agent does not receive a reward immediately after taking
an action.
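A minimal interaction loop with the Cart-Pole environment via OpenAI Gym (this follows the classic pre-0.26 Gym API; newer Gym/Gymnasium versions return extra values from reset and step):

import gym

env = gym.make('CartPole-v1')
state = env.reset()
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()            # random action, for illustration
    state, reward, done, info = env.step(action)  # environment transitions
    total_reward += reward                        # +1 for each step the pole stays up
print(total_reward)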
b) Markov Decision Processes
A Markov decision process (MDP) is a mathematical framework for modeling reinforcement
learning problems. An MDP is defined by the following elements:
A set of states: The states represent the possible configurations of the environment.
A set of actions: The actions are the choices that the agent can make.
A transition function: The transition function defines how the environment transitions
from one state to another given an action.
A reward function: The reward function defines the immediate reward that the agent
receives for taking an action.
The goal of reinforcement learning is to find an optimal policy, which is a function that maps
states to actions and maximizes the expected cumulative reward.
c) The Optimal Policy
The optimal policy is the policy that maximizes the expected cumulative reward. The
expected cumulative reward is the sum of the immediate rewards that the agent expects to
receive over an infinite time horizon.
There are two main approaches to finding the optimal policy:
Value-based methods: Value-based methods estimate the value of each state and use
this information to select the action that will lead to the highest expected cumulative
reward.
Policy-based methods: Policy-based methods directly update the policy using gradient
descent or other optimization techniques.
Both value-based and policy-based methods have their own strengths and weaknesses. Value-
based methods are often more stable and less prone to overfitting, while policy-based
methods can be more efficient and can sometimes find policies that are closer to the optimal
policy.
4.2. Essential Theory of Deep Q-Learning Networks
a) Value Functions
b) Q-Value Functions
c) Estimating an Optimal Q-Value
Answer:
a) Value Functions
A value function is a function that maps states to their expected cumulative reward. In other
words, the value function estimates how much reward an agent can expect to receive if it
starts in a given state and follows an optimal policy.
The value function for a state s is denoted by V(s). For the optimal policy, it satisfies
the recursive Bellman equation:

V(s) = max_a Σ_s' P(s'|s, a) [ R(s, a, s') + γ V(s') ]

where:
R(s, a, s') is the immediate reward for moving from state s to state s' via action a.
γ is the discount factor, which determines how much the agent values future rewards.
P(s'|s, a) is the probability of transitioning to state s' after taking action a in state s.

b) Q-Value Functions
A Q-value function, Q(s, a), is the expected cumulative reward of taking action a in
state s and following the optimal policy thereafter. It satisfies the analogous recursion:

Q(s, a) = Σ_s' P(s'|s, a) [ R(s, a, s') + γ max_a' Q(s', a') ]

where max_a' is the maximum over all possible next actions a'.

c) Estimating an Optimal Q-Value
In a deep Q-learning network (DQN), a neural network approximates the optimal Q-value
function. Training proceeds in a loop:
1. The agent observes the current state and selects an action (for example, ε-greedily).
2. The resulting experience (state, action, reward, next state) is stored in a replay buffer.
3. The neural network is trained on a batch of experiences from the replay buffer.
4. The agent uses the updated neural network to select actions.
The goal of training the neural network is to minimize the following loss function:

L = ( r + γ max_a' Q(s', a'; θ⁻) − Q(s, a; θ) )²

where θ are the weights of the main network and θ⁻ are the weights of the periodically
updated target network.

4.3. Defining a DQN Agent
a) Initialization Parameters
Answer:
A DQN agent is configured by several key hyperparameters:
Discount factor (γ): This parameter determines the importance of future rewards
compared to immediate ones. A higher discount factor emphasizes long-term rewards,
while a lower discount factor prioritizes immediate gratification.
Exploration rate (ε): This parameter controls the balance between exploration (trying
new actions) and exploitation (sticking to known good actions). A higher exploration
rate encourages exploration of new strategies, while a lower exploration rate focuses
on exploiting existing knowledge.
Memory replay buffer size: This parameter determines the capacity of the agent's memory,
which stores past experiences for training. A larger memory allows for more diverse
training data, but it also increases computational cost.
Target network update frequency: This parameter specifies how often the target
network, used for generating Q-values during training, is updated from the main
network. A higher update frequency ensures the target network remains aligned with
the latest learning progress.
Batch size: This parameter defines the number of experiences sampled from the
memory replay buffer for each training update. A larger batch size improves training
efficiency but may lead to overfitting.
b) Building the Agent's Neural Network Model
The core of a DQN agent is its neural network model, responsible for estimating Q-values. The
network architecture typically consists of an input layer, hidden layers, and an output layer.
The input layer receives the current state representation of the environment, while the output
layer produces Q-values for each possible action.
The choice of activation functions for the hidden layers depends on the specific task and the
nature of the input data. Common activation functions include rectified linear units (ReLU)
and hyperbolic tangent (tanh).
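A minimal Keras sketch of such a network for Cart-Pole (a 4-dimensional state and two actions; the hidden layer sizes are assumptions):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=4))  # input: the 4-D state
model.add(Dense(32, activation='relu'))
model.add(Dense(2, activation='linear'))              # output: one Q-value per action
model.compile(optimizer='adam', loss='mse')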
c) Remembering Gameplay
The agent's memory replay buffer plays a crucial role in DQN's learning process. It stores past
experiences, also known as transitions, in the form of tuples containing the current state, the
action taken, the immediate reward received, and the next state after the action.
By storing these experiences, the agent can revisit and learn from past mistakes or successful
strategies, allowing it to improve its decision-making over time.
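A sketch of such a memory using Python's deque (the 50,000 capacity and batch size are hypothetical choices):

from collections import deque
import random

memory = deque(maxlen=50000)   # oldest experiences are discarded automatically

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def sample_batch(batch_size=32):
    return random.sample(memory, batch_size)   # random minibatch for training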
d) Training via Memory Replay
The training process of a DQN agent involves sampling batches of experiences from its
memory replay buffer. These experiences are used to update the neural network model by
minimizing the loss function, which measures the difference between the predicted Q-values
and the actual rewards obtained.
The loss function is typically calculated using the Bellman equation, which allows the agent to
propagate the value information from future states back to the current state.
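A sketch of one replay update, assuming the Keras model and replay memory from the previous snippets (states shaped (1, 4) for Cart-Pole; gamma is the discount factor):

import numpy as np

gamma = 0.95

def replay(batch):
    for state, action, reward, next_state, done in batch:
        target = reward
        if not done:
            # Bellman update: propagate value from the next state
            target = reward + gamma * np.max(model.predict(next_state)[0])
        q_values = model.predict(state)   # current Q-value estimates
        q_values[0][action] = target      # replace only the taken action's value
        model.fit(state, q_values, epochs=1, verbose=0)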
e) Selecting an Action to Take
When interacting with the environment, the agent faces the decision of choosing an action to
take based on the current state. The agent's policy, guided by the Q-values estimated by the
neural network, determines how it balances exploration and exploitation.
With a high exploration rate, the agent may select random actions to explore new
possibilities. As learning progresses and the agent gains more knowledge, the exploration rate
decreases, and the agent becomes more likely to exploit known good actions.
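An ε-greedy selection sketch following this scheme (the decay values are hypothetical; env and model are assumed from the earlier snippets):

import random
import numpy as np

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

def select_action(state):
    global epsilon
    if random.random() < epsilon:
        action = env.action_space.sample()                 # explore
    else:
        action = int(np.argmax(model.predict(state)[0]))   # exploit
    epsilon = max(epsilon_min, epsilon * epsilon_decay)    # decay exploration over time
    return action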
f) Saving and Loading Model Parameters
During training, the DQN agent's neural network model improves its ability to estimate Q-
values and make better decisions. To preserve this progress, it's important to save the model
parameters periodically.
Saving the model parameters allows the agent to resume training from a specific point in
time, avoiding the need to restart the entire learning process from scratch. Additionally,
saved models can be used to evaluate the agent's performance on different tasks or
environments.
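In Keras this amounts to two calls (the filename is illustrative):

model.save_weights('dqn_cartpole.h5')   # periodically checkpoint the agent's progress
# ... later, to resume training or to evaluate:
model.load_weights('dqn_cartpole.h5')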