Jntuk R20 Unit-V Deep Learning Techniques
4.2. Essential Theory of Deep Q-Learning Networks
a) Value Functions
b) Q-Value Functions
c) Estimating an Optimal Q-Value
4.3. Defining a DQN Agent
a) Initialization Parameters
b) Building the Agent's Neural Network Model
c) Remembering Gameplay
d) Training via Memory Replay
e) Selecting an Action to Take
f) Saving and Loading Model Parameters
4.4. Interacting with an OpenAI Gym Environment
4.5. Hyperparameter Optimization with SLM Lab
4.6. Agents Beyond DQN
a) Policy Gradients and the REINFORCE Algorithm
b) The Actor-Critic Algorithm
Deep Learning Research
Chapter-I: Autoencoders
1.1 Undercomplete Autoencoders
1.2 Regularized Autoencoders
1.3 Representational Power, Layer Size and Depth
1.4 Stochastic Encoders and Decoders
1.5 Denoising Autoencoders
1.6 Learning Manifolds with Autoencoders
1.7 Contractive Autoencoders
1.8 Predictive Sparse Decomposition
1.9 Applications of Autoencoders
Chapter-II: Deep Generative Models
2.1. Boltzmann Machines
2.2. Restricted Boltzmann Machines
2.3. Deep Belief Networks
Deep Learning Research
Chapter-I: Autoencoders
An autoencoder is a neural network that is trained to attempt to copy its input to its
output.
Internally, it has a hidden layer h that describes a code used to represent the input.
The network may be viewed as consisting of two parts: an encoder function h = f(x)
and a decoder that produces a reconstruction r = g(h). This architecture is shown in
the figure below.
An autoencoder is a type of artificial neural network that is trained to reconstruct its input. It
does this by learning a hidden representation of the input, and then using this hidden
representation to generate a reconstructed output.
Autoencoders are typically used for unsupervised learning tasks, such as dimensionality
reduction, anomaly detection, and feature extraction. They can also be used for supervised
learning tasks, such as classification and regression, by pre-training the autoencoder on an
unlabeled dataset and then using the learned hidden representations as input to a supervised
learning model.
Autoencoders consist of two main parts: an encoder and a decoder. The encoder takes the
input data and compresses it into a hidden representation. The decoder then takes the hidden
representation and reconstructs the output data.
Autoencoders are trained by minimizing the difference between the input data and the
reconstructed output data. This is typically done using a loss function, such as the mean
squared error.
Once an autoencoder is trained, it can be used to generate new data that is similar to the
training data. This can be useful for tasks such as data augmentation and image generation.
Here is a simple example of how to use an autoencoder to reduce the dimensionality of a
dataset:
1. Train an autoencoder on the dataset.
2. Use the encoder to generate a hidden representation of the dataset.
3. Discard the original dataset and only keep the hidden representations.
4. Use the hidden representations as input to a supervised learning model, such as a
logistic regression classifier.
This approach can be used to reduce the dimensionality of the dataset while preserving the
most important information. This can improve the performance of the supervised learning
model, especially if the dataset is large or noisy.
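As an illustration of these steps, here is a minimal Keras sketch (the library choice, the 784-dimensional input, and the 64-unit bottleneck are assumptions, not prescribed by the notes) that trains an undercomplete autoencoder and then keeps only the encoder's hidden representations:

from keras.models import Model
from keras.layers import Input, Dense

# Hypothetical setup: 784-dimensional inputs compressed to a 64-dimensional code
inputs = Input(shape=(784,))
code = Dense(64, activation='relu')(inputs)          # encoder
outputs = Dense(784, activation='sigmoid')(code)     # decoder
autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')    # minimize reconstruction error
autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)  # x_train: your dataset (assumed)

# Steps 2-3: keep only the encoder and its hidden representations
encoder = Model(inputs, code)
hidden = encoder.predict(x_train)   # step 4: use as features for a downstream classifier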
1.1. Undercomplete Autoencoder
The objective of an undercomplete autoencoder is to capture the most important features
present in the data. Undercomplete autoencoders have a hidden layer of smaller dimension
than the input layer, which forces the network to extract the most salient features from the
data. Training minimizes a loss function that penalizes the reconstruction g(f(x)) for
differing from the input x.
Advantages-
Undercomplete autoencoders do not need any regularization as they maximize the
probability of data rather than copying the input to the output.
Drawbacks-
Using an overparameterized model when there is insufficient training data can lead to
overfitting.
1.2. Regularized autoencoders
Regularized autoencoders are a type of autoencoder that incorporates regularization
techniques to prevent overfitting and improve generalization. Overfitting occurs when a
model learns the training data too well and is unable to generalize to new data. Regularization
techniques help to prevent this by constraining the model to learn simpler representations of
the data.
There are three main types of regularized autoencoders:
1.2.1) Denoising Autoencoder
Denoising autoencoders create a corrupted copy of the input by introducing some noise. This
prevents the autoencoder from simply copying the input to the output without learning
meaningful features of the data.
These autoencoders take a partially corrupted input while training to recover the original
undistorted input. The model learns a vector field for mapping the input data towards a lower
dimensional manifold which describes the natural data to cancel out the added noise.
Advantages-
It was introduced to achieve good representation. Such a representation is one that can
be obtained robustly from a corrupted input and that will be useful for recovering the
corresponding clean input.
Corruption of the input can be done randomly by setting some of the inputs to zero; the
remaining nodes pass their values through unchanged.
The loss is minimized between the network's output and the original, uncorrupted input.
Setting up a simple denoising autoencoder is easy.
Drawbacks-
To train an autoencoder to denoise data, it is necessary to perform a preliminary
stochastic mapping to corrupt the data and use the corrupted version as input.
This model cannot develop a mapping that simply memorizes the training data, because
the input and the target output are no longer the same.
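A minimal sketch of the corruption step, assuming NumPy and the Keras autoencoder pattern shown earlier (the 0.3 masking fraction is a hypothetical choice):

import numpy as np

def corrupt(x, fraction=0.3):
    # Randomly zero out a fraction of each input (masking noise)
    mask = np.random.rand(*x.shape) > fraction
    return x * mask

# Train to reconstruct the clean input from the corrupted one
# (autoencoder and x_train as in the earlier sketch):
# autoencoder.fit(corrupt(x_train), x_train, epochs=10, batch_size=256)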
1.2.2) Sparse Autoencoder
Sparse autoencoders have hidden nodes greater than input nodes. They can still discover
important features from the data. A generic sparse autoencoder is visualized where the
obscurity of a node corresponds with the level of activation. Sparsity constraint is introduced
on the hidden layer. This is to prevent output layer copy input data. Sparsity may be obtained
by additional terms in the loss function during the training process, either by comparing the
probability distribution of the hidden unit activations with some low desired value,or by
manually zeroing all but the strongest hidden unit activations. Some of the most powerful AIs
in the 2010s involved sparse autoencoders stacked inside of deep neural networks.
Advantages-
Sparse autoencoders apply a sparsity penalty that pushes hidden activations toward a
value close to zero (but not exactly zero). The penalty is applied on the hidden layer in
addition to the reconstruction error, which helps to prevent overfitting.
They take the highest activation values in the hidden layer and zero out the rest of the
hidden nodes. This prevents the autoencoder from using all of the hidden nodes at a time,
forcing only a reduced number of hidden nodes to be used.
Drawbacks-
For this to work, it is essential that the individual nodes of a trained model that
activate are data-dependent, i.e., that different inputs result in activations of
different nodes through the network.
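A minimal Keras sketch of the first approach, using an L1 activity penalty on an overcomplete hidden layer (the library choice, layer sizes, and the 1e-5 penalty weight are assumptions):

from keras.models import Model
from keras.layers import Input, Dense
from keras import regularizers

inputs = Input(shape=(784,))
# Overcomplete hidden layer with an L1 sparsity penalty on its activations
code = Dense(1024, activation='relu',
             activity_regularizer=regularizers.l1(1e-5))(inputs)
outputs = Dense(784, activation='sigmoid')(code)
sparse_ae = Model(inputs, outputs)
sparse_ae.compile(optimizer='adam', loss='mse')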
1.2.3) Contractive Autoencoder
The objective of a contractive autoencoder is to have a robust learned representation which is
less sensitive to small variation in the data. Robustness of the representation for the data is
done by applying a penalty term to the loss function. Contractive autoencoder is another
regularization technique just like sparse and denoising autoencoders. However, this regularizer
corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with
respect to the input. The squared Frobenius norm of the Jacobian of the hidden layer with
respect to the input, i.e., the sum of squares of all its elements, is used as the penalty term.
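A sketch of this penalty using TensorFlow's GradientTape (the framework is an assumption; encoder and decoder are hypothetical Keras models mapping x to h and h to the reconstruction, and lam is a hypothetical penalty weight):

import tensorflow as tf

def contractive_loss(encoder, decoder, x, lam=1e-4):
    with tf.GradientTape(persistent=True) as tape:
        tape.watch(x)
        h = encoder(x)                       # code h = f(x)
    recon = decoder(h)                       # reconstruction g(f(x))
    mse = tf.reduce_mean(tf.square(x - recon))
    jac = tape.batch_jacobian(h, x)          # Jacobian dh/dx, per example
    frob = tf.reduce_sum(tf.square(jac))     # squared Frobenius norm (sum of squares)
    return mse + lam * frob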
Advantages-
Contractive autoencoder is a better choice than denoising autoencoder to learn
useful feature extraction.
This model learns an encoding in which similar inputs have similar encodings. Hence,
we're forcing the model to learn how to contract a neighborhood of inputs into a
smaller neighborhood of outputs.
1.4) Stochastic Encoders and Decoders
Stochastic Encoders and Decoders are a type of autoencoder that uses stochasticity
(randomness) in the encoding and decoding process. This can be achieved by using dropout,
variational inference, or other techniques. Stochastic encoders and decoders have several
advantages over deterministic autoencoders, including:
Improved generalization: Stochasticity helps to prevent overfitting by regularizing the
model and making it more robust to noise in the training data.
More informative representations: Stochastic encoders and decoders can learn more
informative representations of the data by capturing the uncertainty in the data.
More efficient training: Stochastic encoders and decoders can be trained more
efficiently than deterministic autoencoders, as they do not need to compute the full
posterior distribution of the latent variables.
Stochastic encoders and decoders have been used for a variety of tasks, including:
Image denoising: Stochastic encoders and decoders can be used to denoise images
by learning to reconstruct noisy versions of the input data.
Inpainting: Stochastic encoders and decoders can be used to inpaint missing pixels in
images by learning to reconstruct the images from partial information.
Super-resolution: Stochastic encoders and decoders can be used to super-resolve images
by learning to upsample low-resolution images to high-resolution images.
Generative adversarial networks (GANs): Stochastic encoders and decoders are often
used as the generator component in GANs.
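One common way to make an encoder stochastic is the reparameterization trick used in variational autoencoders; below is a minimal TensorFlow/Keras sketch (the framework and layer sizes are assumptions):

import tensorflow as tf
from tensorflow.keras import layers

class Sampling(layers.Layer):
    # Draws z from N(z_mean, exp(z_log_var)) via the reparameterization trick
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

x = layers.Input(shape=(784,))
h = layers.Dense(256, activation='relu')(x)
z_mean = layers.Dense(32)(h)                 # mean of the code distribution
z_log_var = layers.Dense(32)(h)              # log-variance of the code distribution
z = Sampling()([z_mean, z_log_var])          # stochastic code
out = layers.Dense(784, activation='sigmoid')(layers.Dense(256, activation='relu')(z))
stochastic_ae = tf.keras.Model(x, out)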
1.6) Learning Manifolds with Autoencoders
Manifold learning is an approach in machine learning that assumes data lies on a
manifold of much lower dimension than the input space. These manifolds can be linear or
non-linear, and the field accordingly tries to project data from the high-dimensional space
down to a low dimension. For example, principal component analysis (PCA) is a linear
manifold learning method, whereas an autoencoder performs non-linear dimensionality
reduction (NDR), with the ability to learn non-linear manifolds in low dimensions. A
comparison of linear and non-linear manifold learning is shown in the following figure:
As the figure shows, in graph a) the data resides on a linear manifold, whereas in
graph b) it resides on a second-order non-linear manifold.
All autoencoder training procedures involve a compromise between two forces:
1. Learning a representation h of a training example x such that x can be approximately
recovered from h through a decoder.
2. Satisfying the constraint or regularization penalty, which limits the model's capacity
or encourages properties such as sparsity or robustness.
Together, these two forces compel the hidden representation to capture only the most
salient variations of the data-generating distribution.
Chapter-II: Deep Generative Models
2.1) Boltzmann Machines
Boltzmann Machine definition:
A Boltzmann machine is an energy-based model defined over a d-dimensional binary random
vector x ∈ {0, 1}^d, with the joint distribution given by
P(x) = exp(-E(x)) / Z,
where Z is the partition function and the energy function is
E(x) = -x^T U x - b^T x,
with U the weight matrix and b the bias vector.
While this scenario is certainly viable, it does limit the kinds of interactions between the
observed variables to those described by the weight matrix. Specifically, it means that the
probability of one unit being on is given by a linear model (logistic regression) from the
values of the other units.
A Boltzmann machine with hidden units is no longer limited to modeling linear relationships
between variables. Instead, the Boltzmann machine becomes a universal approximator of
probability mass functions over discrete variables.
Formally, we decompose the units x into two subsets:
Visible units v
Latent / hidden units h
The energy function then becomes:
E(v, h) = -v^T R v - v^T W h - h^T S h - b^T v - c^T h.
Learning algorithms for Boltzmann machines are usually based on maximum likelihood. All
Boltzmann machines have an intractable partition function, so the maximum likelihood
gradient must be approximated.
An interesting property of Boltzmann machines when trained with learning rules based on
maximum likelihood is that the update for a particular weight connecting two units depends
only on the statistics of those two units, collected under different distributions:
p_model(v) and p_data(v) p_model(h|v). The rest of the network participates in shaping
those statistics, but the weight can be updated without knowing anything about the rest of
the network or how those statistics were produced. This means that the learning rule is local.
2.2)Restricted Boltzmann Machines (RBMs)
The term "restricted" means that units within the same layer may not be connected to one
another. In other words, neurons within the hidden layer, or within the input layer, cannot
form connections among themselves. However, there may be connections between the
visible and hidden layers.
Since there is no output layer in our machine, it is unclear how we will detect, modify the
weights, and determine whether or not our prediction was right. One response fits all the
questions: Restricted Boltzmann Machine.
Features of Restricted Boltzmann Machine
Some key characteristics of the restricted Boltzmann machine are:
There are no connections between units within the same layer.
They employ symmetric, recurrent structures.
It is an algorithm for unsupervised learning, meaning that it draws conclusions from
the input data without labeled replies.
In their learning process, RBMs attempt to link low energy states with high probability
ones and vice versa.
Working of RBM
Each visible node takes one low-level feature from an item in the dataset. Node 1 of the
hidden layer multiplies the input x by a weight and adds a bias. The outcomes of these two
operations are fed into an activation function, which, given the input x, produces the
output of the node, i.e., the strength of the signal passing through it.
Let's now examine how many inputs would mix at a single hidden node. The output of the
node is created by multiplying each x by a distinct weight, summing the products, adding the
sum to a bias, and then passing the final result once again via an activation function.
Each input x is multiplied by its corresponding weight w at each hidden node. In other words,
a single input x would have three weights in this situation, giving 12 weights in total
(4 input nodes x 3 hidden nodes). The weights between the two layers always form a matrix
with the input nodes as rows and the hidden nodes as columns.
The four inputs are sent to each hidden node, multiplied by each weight. Each hidden node
receives one output as a result of the activation algorithm after the sum of these products is
once more added to a bias (which compels at least some activations to occur).
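A NumPy sketch of this visible-to-hidden pass (the 4x3 shapes follow the example above; the input values and random weights are placeholders):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

v = np.array([0.2, 0.7, 0.1, 0.9])   # 4 visible nodes (low-level features)
W = np.random.randn(4, 3)            # weight matrix: input rows x hidden columns
b = np.zeros(3)                      # hidden biases

h_prob = sigmoid(v @ W + b)          # p(h|v): each hidden node's activation
h = (np.random.rand(3) < h_prob).astype(float)   # sample binary hidden states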
Now that you have a basic understanding of how the Restricted Boltzmann Machine operates,
let's move on and examine the procedure for training an RBM.
Training of RBM
Two methods, Gibbs Sampling and Contrastive Divergence, are used to train an RBM.
When direct sampling is challenging, the Markov chain Monte Carlo method known as Gibbs
sampling is used to get a series of observations that are roughly drawn from a given
multivariate probability distribution.
If the input is represented by v, the hidden values h are predicted via p(h|v). When the
hidden values are known, p(v|h) is used to predict the regenerated input values. If this
process is performed k times, the value v_k is obtained from the initial input v_0.
In Contrastive Divergence, the gradient is a slope describing the relationship between a
network's errors and its weights, and the goal is to approximate it rather than compute it
exactly. Contrastive Divergence is an approximate maximum-likelihood learning approach: it
is employed when we cannot directly evaluate a set of probabilities or a function and need
to approximate the learning gradient of the algorithm and choose which direction to move in.
In CD, the weights are updated as follows: the gradient is first determined from the
reconstructed input, and the old weights are updated by adding the delta.
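A sketch of one CD-1 step, continuing the previous snippet (v, W, b, numpy as np, and sigmoid as defined above; the learning rate eps is a hypothetical choice, and visible biases are omitted for brevity):

# One step of Contrastive Divergence (CD-1)
h0 = sigmoid(v @ W + b)                      # p(h|v_0)
v1 = sigmoid(W @ h0)                         # reconstruction p(v|h), using mean activations
h1 = sigmoid(v1 @ W + b)                     # p(h|v_1)
eps = 0.1                                    # learning rate
delta_W = eps * (np.outer(v, h0) - np.outer(v1, h1))
W += delta_W                                 # positive minus negative statistics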
2.3)Deep Belief Networks (DBNs)
Consider stacking numerous RBMs so that the outputs of the first RBM serve as the input for
the second RBM, and so forth. Deep Belief Networks are the name given to these networks.
Each layer's connections are undirected (as each layer is an RBM), while the connections
between layers are directed (except for the top two layers, whose connections are
undirected). DBNs can be trained in two different ways:
Greedy Layer-wise Training Algorithm: RBMs are trained using a greedy layer-by-layer
training algorithm. The orientation between the DBN layers is established as soon as
the individual RBMs have been trained (i.e., the parameters, weights, and biases, have
been defined).
Wake-sleep Algorithm: The DBN is trained from the bottom up in a wake phase
(connections going up indicate wake), and then from the top down in a sleep phase
(connections going down indicate sleep).
In other words, we stack the RBMs, train them, and then orient the connections to ensure
that they only work downwards (except for the top two layers).
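A minimal sketch of greedy layer-wise pretraining with scikit-learn's BernoulliRBM (the library choice is an assumption; X is a hypothetical data matrix with values in [0, 1]):

from sklearn.neural_network import BernoulliRBM

# Each RBM is trained on the hidden representations of the one below it
rbm1 = BernoulliRBM(n_components=256, learning_rate=0.01, n_iter=10)
h1 = rbm1.fit_transform(X)    # X: data matrix, assumed available
rbm2 = BernoulliRBM(n_components=128, learning_rate=0.01, n_iter=10)
h2 = rbm2.fit_transform(h1)   # outputs of RBM 1 feed RBM 2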
Chapter-I: Machine Vision
1.1. Convolutional Neural Networks
a) The Two-Dimensional Structure of Visual Imagery
b) Computational Complexity
c) Convolutional Layers
d) Multiple Filters
e) A Convolutional Example
f) Convolutional Filter Hyperparameters
Answer:
A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks
that specializes in processing data that has a grid-like topology, such as an image. A digital
image is a binary representation of visual data. It contains a series of pixels arranged in a
grid-like fashion that contains pixel values to denote how bright and what color each pixel
should be.
a) The Two-Dimensional Structure of Visual Imagery
Visual imagery is inherently two-dimensional, with each pixel representing a specific location
and intensity value. This two-dimensional structure plays a crucial role in how we perceive
and interpret visual information. Convolutional neural networks (CNNs) are specifically
designed to exploit this structure, allowing them to effectively process and analyze visual
data.
b) Computational Complexity
Traditional neural networks, also known as fully connected networks, connect every neuron in
one layer to every neuron in the next layer. This leads to a significant increase in
computational complexity as the network grows larger. In contrast, CNNs utilize a
convolutional operation that involves sliding a small filter or kernel over the input data. This
operation significantly reduces the number of connections, making CNNs more
computationally efficient for processing visual data.
c) Convolutional Layers
Convolutional layers are the core building blocks of CNNs. They consist of a set of learnable
filters or kernels that are applied to the input data. As the filters slide across the input, they
extract relevant features, such as edges, shapes, and colors. These extracted features are then
passed to subsequent layers for further processing and analysis.
d) Multiple Filters
CNNs can employ multiple filters in each convolutional layer, each capturing different aspects
of the input data. For instance, one filter might detect horizontal edges, while another might
detect vertical edges. By using multiple filters, CNNs can learn a more comprehensive
representation of the input data.
e) A Convolutional Example
Consider a simple image of a cat. A CNN might use a filter that detects horizontal edges to
identify the cat's whiskers. Another filter might detect vertical edges to identify the cat's eyes
and mouth. By combining the outputs of these filters, the CNN can build a more complete
understanding of the cat's features and ultimately classify the image as a cat.
f) Convolutional Filter Hyperparameters
Convolutional filters are characterized by several hyperparameters that influence their behavior.
These include:
Filter size: The size of the filter determines the scope of the features it detects. Larger
filters capture larger-scale features, while smaller filters capture finer details.
Stride: The stride determines how far the filter moves across the input data with each
application. Larger strides lead to more efficient computation but may overlook smaller
details.
Padding: Padding adds zeros to the edges of the input data, allowing the filter to
extract features that extend beyond the original boundaries.
Activation function: The activation function introduces non-linearity into the network,
allowing it to learn more complex patterns. Common activation functions for CNNs
include ReLU (Rectified Linear Unit) and Leaky ReLU.
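These hyperparameters map directly onto the arguments of a Keras convolutional layer; a small sketch (the specific values are illustrative, not prescribed by the notes):

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(filters=32,            # number of filters (feature maps)
                 kernel_size=(3, 3),    # filter size
                 strides=(1, 1),        # stride
                 padding='same',        # zero-padding preserves spatial size
                 activation='relu',     # non-linearity
                 input_shape=(28, 28, 1)))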
1.2. Pooling Layers
1.3. LeNet-5 in Keras
1.4. AlexNet and VGGNet in Keras
Answer:
1.2. Pooling Layers
Pooling layers are another essential component of CNNs. Their primary purpose is to reduce
the spatial dimensions of the feature maps produced by convolutional layers. This reduction in
dimensionality helps to control the computational complexity of the network and prevent
overfitting.
There are two main types of pooling layers: max pooling and average pooling.
Max pooling:
Max pooling selects the maximum value within each pooling region, effectively reducing the
spatial dimensions by a factor of the pooling size. For instance, a 2x2 max pooling layer would
reduce the height and width of the feature map by half.
Average pooling:
Average pooling computes the average value within each pooling region, also reducing the
spatial dimensions by a factor of the pooling size. This type of pooling tends to be more robust
to outliers and noise compared to max pooling.
Pooling layers are typically applied after convolutional layers to reduce the feature map size
before passing it on to subsequent layers. The choice of pooling type (max pooling or
average pooling) and pooling size depends on the specific task and dataset.
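In Keras, the two pooling types correspond to MaxPooling2D and AveragePooling2D; a brief sketch:

from keras.layers import MaxPooling2D, AveragePooling2D

# A 2x2 pool halves the height and width of the feature map
max_pool = MaxPooling2D(pool_size=(2, 2))      # keeps the maximum per region
avg_pool = AveragePooling2D(pool_size=(2, 2))  # keeps the average per region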
1.3. LeNet-5 in Keras
LeNet-5 is a seminal convolutional neural network architecture developed by Yann LeCun in
1998. It was designed for handwritten digit recognition and achieved remarkable performance
on the MNIST dataset. LeNet-5 is considered a groundbreaking architecture that set the
foundation for modern CNNs.
Here's a simplified overview of the LeNet-5 architecture:
Input layer: Receives a 28x28 grayscale image representing a handwritten digit.
Convolutional layer 1: Applies 6 5x5 filters to the input image, producing 6 feature maps.
Pooling layer 1: Applies a 2x2 max pooling operation to each feature map, reducing the
size by half.
Convolutional layer 2: Applies 16 5x5 filters to the pooled feature maps, producing 16
feature maps.
Pooling layer 2: Applies a 2x2 max pooling operation to each feature map, reducing the
size by half.
Fully connected layer 1: Flattens the pooled feature maps and maps them to a 120-dimensional vector.
Fully connected layer 2: Reduces the 120-dimensional vector to 84 dimensions.
Output layer: Produces a 10-dimensional vector representing the probabilities of the
input image being each digit (0-9).
LeNet-5 is not a built-in Keras model, but the architecture can be implemented in a few
lines of Keras code and trained for handwritten digit recognition.
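A minimal Keras sketch following the layer-by-layer description above (the activation choices are assumptions; the original 1998 network used tanh-like nonlinearities and subsampling rather than max pooling):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(6, (5, 5), activation='tanh', padding='same',
                 input_shape=(28, 28, 1)))        # 6 feature maps
model.add(MaxPooling2D((2, 2)))                   # halve height and width
model.add(Conv2D(16, (5, 5), activation='tanh'))  # 16 feature maps
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(120, activation='tanh'))
model.add(Dense(84, activation='tanh'))
model.add(Dense(10, activation='softmax'))        # probabilities for digits 0-9
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])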
1.4. AlexNet and VGGNet in Keras
AlexNet and VGGNet are two more advanced CNN architectures that gained popularity in the
early 2010s. These architectures achieved state-of-the-art performance on image
classification tasks, demonstrating the power of CNNs in computer vision.
AlexNet:
AlexNet was the winner of the 2012 ImageNet Large Scale Visual Recognition Challenge
(ILSVRC), achieving a significant improvement in image classification accuracy compared to
previous methods. It introduced deeper architectures with more convolutional and pooling
layers, demonstrating the potential of deeper CNNs.
VGGNet:
VGGNet is a family of CNN architectures that explore the effect of increasing the depth of the
network. It introduced the concept of using very small 3x3 filters stacked in deeper layers,
achieving competitive performance with AlexNet while using fewer parameters.
Keras provides built-in, pre-trained implementations of VGGNet (VGG16 and VGG19, via
keras.applications). AlexNet is not included in Keras, but it can be constructed directly
from standard layers. Either way, both architectures can be trained and evaluated for
image classification tasks.
1.5. Residual Networks
a) Vanishing Gradients: The Bête Noire of Deep CNNs
b) Residual Connections
c) ResNet
Answer:
Residual Network: To address the problem of vanishing/exploding gradients, this
architecture introduced the concept of residual blocks. The network uses a technique
called skip connections: a skip connection links the activations of a layer to later
layers by skipping over some layers in between, forming a residual block. ResNets are
made by stacking these residual blocks together. The idea behind this network is that
instead of having the layers learn the underlying mapping directly, we let the network
fit the residual mapping. So, instead of the initial mapping H(x), the network fits
F(x) := H(x) - x, which gives H(x) := F(x) + x.
Fig: Skip (Shortcut) connection
The advantage of adding this type of skip connection is that if any layer hurts the
performance of the architecture, it can effectively be skipped by regularization. This
makes it possible to train very deep neural networks without the problems caused by
vanishing/exploding gradients. The authors of the paper experimented with networks of
100 to 1,000 layers on the CIFAR-10 dataset.
a) Vanishing Gradients: The Bête Noire of Deep CNNs
As CNNs become deeper, a phenomenon known as vanishing gradients arises. This means
that the gradients of the loss function with respect to the weights in lower layers become very
small, making it difficult for the optimization algorithm to update these weights effectively.
This problem is particularly acute in very deep CNNs, where the gradients can become so
small that they are effectively zero, preventing any meaningful update to the weights. This
makes it difficult for these deep networks to learn and improve their performance.
b) Residual Connections
Residual networks (ResNets) were introduced in 2015 by Kaiming He et al. to address the
problem of vanishing gradients in deep CNNs. ResNets introduce a novel architectural design
that allows for efficient information flow through the network, even when it is very deep.
The key idea behind ResNets is to introduce shortcut connections that bypass one or more
layers in the network. These shortcut connections allow the gradients to flow directly from the
input of a block to its output, without having to pass through all of the intermediate layers.
This helps to prevent the gradients from vanishing, allowing the network to learn more
effectively.
c) ResNet
The ResNet architecture consists of a series of residual blocks, each containing a
convolutional layer followed by a shortcut connection. The shortcut connection typically adds
the input of the block to the output of the convolutional layer, before applying an activation
function. This design allows for efficient information flow through the network, even when it is
very deep.
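A sketch of a basic residual block in the Keras functional API (the layer sizes and the placement of batch normalization are assumptions; many variants exist):

from keras.layers import Conv2D, BatchNormalization, Activation, Add

def residual_block(x, filters):
    # Main path: two 3x3 convolutions
    y = Conv2D(filters, (3, 3), padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, (3, 3), padding='same')(y)
    y = BatchNormalization()(y)
    # Shortcut: add the block's input to its output before the final activation
    y = Add()([x, y])
    return Activation('relu')(y)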
ResNets have achieved remarkable performance in a wide range of image recognition tasks,
including image classification, object detection, and image segmentation. They have become
the de-facto standard for deep CNN architectures and have been widely adopted in the field of
computer vision.
1.6. Applications of Machine Vision
a) Object Detection
b) Image Segmentation
c) Transfer Learning
d) Capsule Networks
Answer:
Machine vision has revolutionized various industries by automating tasks that were previously
performed manually. Its applications span a wide range, from manufacturing and robotics to
healthcare and agriculture. Here are some notable examples of machine vision applications:
a) Object Detection
Object detection involves identifying and locating objects within images or videos. This
capability is crucial for tasks such as:
Traffic monitoring: Machine vision systems can detect and track vehicles in traffic
footage, enabling real-time traffic analysis and accident prevention.
Security and surveillance: Object detection algorithms can identify and track people
or objects in surveillance footage, enhancing security measures and preventing
unauthorized access.
Retail inventory management: Machine vision systems can automate product counting
and tracking on shelves, optimizing inventory management and preventing stockouts.
b) Image Segmentation
Image segmentation involves dividing an image into meaningful regions or segments. This
technique is essential for tasks such as:
Medical image analysis: Machine vision algorithms can segment tissues and organs in
medical images, aiding in diagnosing diseases and planning treatments.
Satellite image analysis: Image segmentation can identify and classify land cover types
from satellite imagery, supporting environmental monitoring and resource
management.
Autonomous driving: Image segmentation helps autonomous vehicles distinguish
between roads, pedestrians, and other objects, enabling safe and reliable navigation.
c) Transfer Learning
Transfer learning involves utilizing a pre-trained machine learning model for a new task. This
approach is particularly valuable for machine vision applications, as training large-scale deep
learning models can be computationally expensive and time-consuming.
Transfer learning allows researchers to leverage existing models, such as those trained on large
image datasets like ImageNet, and adapt them to specific applications. This can significantly
reduce the training time and computational resources required for developing new machine
vision systems.
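A minimal Keras sketch of this idea, adapting an ImageNet-pre-trained VGG16 to a hypothetical 5-class task (the class count and new head layers are assumptions):

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D

base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pre-trained features
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(5, activation='softmax')(x)  # new head for the 5-class task
model = Model(base.input, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')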
d) Capsule Networks
Capsule networks, introduced by Geoffrey Hinton in 2017, represent a recent advancement in
machine vision. They aim to address some of the limitations of traditional convolutional neural
networks (CNNs), particularly in capturing spatial relationships and object hierarchies.
Capsule networks group neurons into capsules, where each capsule represents a higher-level
concept or feature in the image. This hierarchical structure allows capsule networks to learn more
robust representations of objects and their relationships.
While capsule networks are still under development, they hold promise for improving the
performance of machine vision tasks in areas such as object detection, image segmentation,
and image understanding.
Application | Description | Example Tasks
Object detection | Identifying and locating objects in images or videos | Traffic monitoring, security surveillance, retail inventory management
Image segmentation | Dividing an image into meaningful regions or segments | Medical image analysis, satellite image analysis, autonomous driving
Transfer learning | Utilizing a pre-trained machine learning model for a new task | Adapting existing image classification models to specific applications, such as product recognition or defect detection
Capsule networks | A novel neural network architecture that aims to address limitations of traditional CNNs | Potential improvements in object detection, image segmentation, and image understanding
Chapter-II: Natural Language Processing
2.1. Preprocessing Natural Language Data
a) Tokenization
b) Converting All Characters to Lowercase
c) Removing Stop Words and Punctuation
d) Stemming
e) Handling n-grams
f) Preprocessing the Full Corpus
Answer:
Natural language data, consisting of text and speech, is often unstructured and noisy, making
it directly unsuitable for machine learning algorithms. Preprocessing natural language data
involves cleaning, transforming, and structuring the data to make it suitable for analysis and
modeling.
d) Stemming
Stemming reduces each word to a root form (stem) by stripping affixes, while lemmatization
maps each word to its dictionary form (lemma). For example, stemming "cats" yields the stem
"cat", while "catches" and "catching" both yield "catch". Lemmatization, on the other hand,
would identify "cat" as the lemma of "cats" and "catch" as the lemma of "catches" and
"catching".
e) Handling n-grams
N-grams are sequences of n consecutive tokens. For example, the bigrams (two-word sequences)
for the phrase "natural language processing" would be "natural language" and
"language processing".
N-grams can be useful for capturing local patterns in the data, such as collocations (words
that frequently appear together). However, it's important to choose the appropriate value of
n, as too large of a value can lead to a large number of n-grams and increase the
dimensionality of the data.
f) Preprocessing the Full Corpus
Preprocessing the full corpus, or the entire collection of text data, involves applying the above
steps to all the text documents in the corpus. This ensures that the data is consistent and ready
for further analysis or modeling.
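A short NLTK sketch covering these preprocessing steps end to end (the library choice and sample sentence are assumptions; the notes do not prescribe a toolkit):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.util import ngrams

nltk.download('punkt')
nltk.download('stopwords')

text = "Natural language processing helps machines understand text."
tokens = nltk.word_tokenize(text.lower())              # tokenize + lowercase
stops = set(stopwords.words('english'))
tokens = [t for t in tokens if t.isalpha() and t not in stops]  # drop stop words/punctuation
stems = [PorterStemmer().stem(t) for t in tokens]      # stemming
bigrams = list(ngrams(tokens, 2))                      # n-gram handling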
2.2. Creating Word Embeddings with word2vec
a) The Essential Theory Behind word2vec
b) Evaluating Word Vectors
c) Running word2vec
d) Plotting Word Vectors
Answer:
a) The Essential Theory Behind word2vec
word2vec is a family of algorithms and models for producing word embeddings, which are
vector representations of words. These vector representations capture the semantic and
syntactic relationships between words, allowing machines to understand the meaning and
context of words in natural language.
There are two main word2vec algorithms:
Continuous Bag-of-Words (CBOW): CBOW predicts a target word from its surrounding
context words. It does this by training a neural network that takes a sequence of
context words as input and outputs a vector representation of the target word.
Skip-gram: Skip-gram predicts the surrounding context words from a target word. It
does this by training a neural network that takes a vector representation of a target
word as input and outputs a probability distribution over all possible context words.
Both CBOW and skip-gram effectively learn word embeddings that capture the relationships
between words in a corpus of text. These word embeddings can then be used for various
natural language processing tasks, such as:
Word similarity: Measuring the similarity between words based on their vector
representations.
Semantic analogy: Solving analogies like "King is to Queen as Man is to ?" using word
embeddings.
Sentiment analysis: Classifying the sentiment of text (positive, negative, neutral) based
on word embeddings.
b) Evaluating Word Vectors
Evaluating the quality of word embeddings is crucial for ensuring their effectiveness in
downstream tasks. Several methods can be used to evaluate word embeddings:
Lexical similarity tasks: Measuring how well word embeddings capture word similarity,
such as using the WordSim353 dataset.
Analogy tasks: Measuring how well word embeddings can solve word analogies, such as
using the Stanford Analogy Test.
Intrinsic evaluation metrics: Measuring the internal consistency of word embeddings, such
as using the cosine similarity between words with similar meanings.
c) Running word2vec
There are various tools and libraries available for implementing word2vec algorithms, such as
Gensim and TensorFlow. These tools provide user-friendly interfaces for training and using
word2vec models.
The general process of running word2vec involves:
1. Preparing the corpus: Cleaning and preprocessing the text data to remove noise and
inconsistencies.
2. Training the word2vec model: Selecting the appropriate algorithm (CBOW or skip-
gram), setting hyperparameters (embedding size, window size), and training the model on
the preprocessed text data.
3. Evaluating the word embeddings: Using evaluation metrics to assess the quality of the
learned word vectors.
4. Saving and using the word embeddings: Saving the word embeddings for later use in
downstream tasks.
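A minimal Gensim sketch of these four steps (Gensim is one common choice; the toy corpus and hyperparameter values are illustrative only):

from gensim.models import Word2Vec

sentences = [["natural", "language", "processing"],
             ["word", "embeddings", "capture", "word", "meaning"]]  # toy corpus
model = Word2Vec(sentences, vector_size=64, window=3,
                 sg=1, min_count=1)              # sg=1 selects skip-gram
model.save("word2vec.model")                     # step 4: save for later use
vector = model.wv["language"]                    # a 64-dimensional word vector
similar = model.wv.most_similar("language", topn=3)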
d) Plotting Word Vectors
Visualizing word embeddings can provide insights into their relationships and semantic
properties. Two common techniques for plotting word vectors are:
2D projection: Reducing the dimensionality of word embeddings to two dimensions and
plotting them in a scatter plot. This allows for visualizing the relative positions of words
in the vector space.
TSNE: Using t-distributed stochastic neighbor embedding (TSNE) to project high-
dimensional word embeddings into a lower-dimensional space while preserving local
and global relationships. This produces a more nuanced visualization of word
relationships.
By plotting word vectors, we can observe how words with similar meanings cluster together,
identify outliers, and gain insights into the semantic structure of the language.
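A sketch of the t-SNE approach with scikit-learn and matplotlib, assuming the Gensim model from the previous snippet (the word count and perplexity are illustrative):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

words = model.wv.index_to_key[:100]            # most frequent words
vectors = model.wv[words]
coords = TSNE(n_components=2, random_state=0,
              perplexity=5).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=8)     # label each point with its word
plt.show()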
2.3. The Area under the ROC Curve
a) The Confusion Matrix
b) Calculating the ROC AUC Metric
Answer:
a) The Confusion Matrix
A confusion matrix is a table that summarizes the performance of a binary classifier over a set of
test data. It is a tabular representation of the actual and predicted classifications for each
class in the dataset.
Actual \ Predicted | Positive | Negative
Positive | True Positives (TP) | False Negatives (FN)
Negative | False Positives (FP) | True Negatives (TN)

True Positives (TP): Positive instances correctly classified as positive.
False Negatives (FN): Positive instances incorrectly classified as negative.
False Positives (FP): Negative instances incorrectly classified as positive.
True Negatives (TN): Instances correctly classified as negative.
b) Calculating the ROC AUC Metric
The receiver operating characteristic (ROC) curve is a graphical plot that illustrates the
performance of a binary classifier at varying threshold settings. The ROC curve plots the true
positive rate (TPR) against the false positive rate (FPR) at each threshold setting.
The area under the ROC curve (AUC) is a single scalar value that measures the overall
performance of a binary classifier. It is a measure of how well the classifier can distinguish
between positive and negative cases.
A higher AUC indicates a better classifier, as it means that the classifier is better at
distinguishing between positive and negative cases. A perfect classifier would have an AUC
of 1.0, while a random classifier would have an AUC of 0.5.
Here's the formula for calculating the AUC:
AUC = ∫₀¹ TPR(FPR) dFPR
This formula represents the integral of the TPR curve over the range of FPR values from 0 to
1. In practice, the AUC is often calculated using numerical methods, such as the trapezoidal
rule.
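A scikit-learn sketch of both ideas (the labels and scores are toy values chosen for illustration):

from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]        # predicted probabilities
y_pred = [int(s >= 0.5) for s in y_score]        # threshold at 0.5

print(confusion_matrix(y_true, y_pred))          # rows: actual, columns: predicted
print(roc_auc_score(y_true, y_score))            # AUC computed via the trapezoidal rule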
2.4.Natural Language Classification with Familiar Networks
Here's an example of a review and its corresponding word indices:
Review: "A very good movie with an excellent performance by the lead actor."
Word Indices: [123, 456, 789, 1011, 2345, 6789, ...]
The task of sentiment classification involves predicting the sentiment (positive or negative) of
a given review based on its word indices.
c) Standardizing the Length of the Reviews
Movie reviews can vary significantly in length, which can affect the performance of machine
learning models. To standardize the length of the reviews, we can pad shorter reviews with
zeros and truncate longer reviews to a fixed length.
Here's an example of padding shorter reviews:
from keras.preprocessing.sequence import pad_sequences

max_length = 100  # maximum length of a review
x_train = pad_sequences(x_train, maxlen=max_length)
x_test = pad_sequences(x_test, maxlen=max_length)
This code snippet pads all reviews in the training and testing sets to a maximum length of 100.
d) Dense Network
A dense network, also known as a fully connected network, is a simple and effective
architecture for natural language classification. It consists of a series of fully connected layers,
where each layer is connected to every neuron in the previous layer.
Here's an example of a dense network for sentiment classification:

from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten

model = Sequential()
model.add(Embedding(10000, 128, input_length=max_length))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

This model consists of an embedding layer, a flattening step, and two dense layers. The
embedding layer converts word indices into vector representations, which are flattened and
fed to a fully connected hidden layer. The dense output layer produces a probability between
0 and 1, indicating the sentiment of the review.
e) Convolutional Networks
Convolutional neural networks (CNNs) are another powerful architecture for natural language
processing tasks. CNNs are particularly well-suited for extracting local patterns and features
from text data.
Here's an example of a convolutional network for sentiment classification:

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Embedding

model = Sequential()
model.add(Embedding(10000, 128, input_length=max_length))
model.add(Conv1D(64, kernel_size=3, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

This model consists of an embedding layer, a convolutional layer, a pooling layer, and a
dense output layer. The convolutional layer applies convolutional filters to extract local
patterns from the text sequences. The max pooling layer reduces the dimensionality of the
feature maps, which are then flattened. The dense output layer produces a probability
between 0 and 1, indicating the sentiment of the review.
Both dense networks and convolutional networks can be effective for natural language
classification. The choice of architecture depends on the specific task and dataset.
2.5.Networks Designed for Sequential Data
e) Seq2seq and Attention
Seq2seq models are a type of RNN that is specifically designed for tasks that involve translating
sequences from one language to another. Seq2seq models typically consist of an encoder-
decoder architecture, where the encoder encodes the input sequence into a vector
representation and the decoder generates the output sequence based on the vector
representation.
Attention is a mechanism that allows the decoder to focus on the most relevant parts of the
input sequence when generating the output sequence. This can improve the accuracy of the
model, especially for long input sequences.
f) Transfer Learning in NLP
Transfer learning is a technique that involves using a pre-trained model on a new task. This can be
particularly useful in natural language processing, where training large neural networks from
scratch can be computationally expensive and time-consuming.
Chapter-III: Generative Adversarial Networks
3.1. Essential GAN Theory
3.3. The Quick, Draw! Dataset
3.4. The Discriminator Network
3.5. The Generator Network
3.6. The Adversarial Network
3.7. GAN Training
Answer:
A Generative Adversarial Network (GAN) is a deep learning architecture that consists of two
neural networks competing against each other in a zero-sum game framework. The goal of
GANs is to generate new, synthetic data that resembles some known data distribution.
3.1. Essential GAN Theory
Generative adversarial networks (GANs) are a class of machine learning models that are able
to generate new data that is similar to real data. GANs are composed of two main
components: a generator and a discriminator.
The Generator:
The generator is responsible for generating new data. It takes in a random noise vector as
input and produces a synthetic data sample as output. The goal of the generator is to produce
data that is so realistic that it can fool the discriminator into believing it is real.
The Discriminator:
The discriminator is responsible for distinguishing between real and fake data. It takes in a data
sample as input and outputs a probability that the sample is real. The goal of the discriminator
is to correctly classify real data as real and fake data as fake.
The generator and discriminator are trained in an adversarial manner. The generator tries to
improve its ability to generate realistic data, while the discriminator tries to improve its ability
to distinguish between real and fake data. This adversarial training process forces both the
generator and discriminator to improve their performance, leading to the generation of
increasingly realistic data.
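A compact Keras sketch of this two-network setup on hypothetical 1-dimensional data (the sizes and freezing pattern are illustrative; in practice training alternates between updating the discriminator on real and fake samples and updating the generator through the combined model):

from keras.models import Sequential
from keras.layers import Dense

# Generator: maps 8-dimensional noise to a fake 1-D sample
generator = Sequential([Dense(16, activation='relu', input_dim=8),
                        Dense(1)])
# Discriminator: outputs the probability that a sample is real
discriminator = Sequential([Dense(16, activation='relu', input_dim=1),
                            Dense(1, activation='sigmoid')])
discriminator.compile(optimizer='adam', loss='binary_crossentropy')

# Combined model used to train the generator: freeze the discriminator
discriminator.trainable = False
gan = Sequential([generator, discriminator])
gan.compile(optimizer='adam', loss='binary_crossentropy')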
3.3. The Quick, Draw! Dataset
The Quick, Draw! dataset is a large collection of hand-drawn sketches paired with
corresponding labels. The dataset was created by Google and is freely available for research
purposes.
The Quick, Draw! dataset is a valuable resource for training GANs because it provides a large
amount of real data that can be used to train the discriminator. Additionally, the dataset is
labeled, which allows the training process to be supervised.
3.4. The Discriminator Network
The discriminator network is a convolutional neural network (CNN) that takes in a sketch as
input and outputs a probability that the sketch is real. The CNN architecture is well-suited for this
task because it is able to extract local features from the sketch, which can be used to
distinguish between real and fake sketches.
The discriminator network consists of several convolutional layers followed by a fully
connected layer. The convolutional layers extract features from the sketch, while the fully
connected layer outputs the probability that the sketch is real.
3.5. The Generator Network
The generator network is a recurrent neural network (RNN) that takes in a random noise
vector as input and produces a sketch as output. The RNN architecture is well-suited for this
task because it is able to generate sequences of data, which is necessary for generating
sketches.
The generator network consists of an embedding layer, an LSTM layer, and a decoder layer.
The embedding layer converts the random noise vector into a vector representation. The
LSTM layer processes the vector representation and generates a sequence of hidden states.
The decoder layer takes the hidden states as input and produces a sketch as output.
3.6. The Adversarial Network
The adversarial network is the combination of the generator and discriminator networks. The
adversarial network is trained in an adversarial manner, where the generator tries to fool the
discriminator into believing its sketches are real, while the discriminator tries to correctly classify
real sketches as real and fake sketches as fake.
The adversarial training process forces both the generator and discriminator to improve their
performance. The generator learns to generate more realistic sketches, while the
discriminator learns to better distinguish between real and fake sketches.
3.7. GAN Training
GAN training is a complex process that requires careful tuning of hyperparameters. The
following are some of the key hyperparameters that need to be tuned:
Learning rate: The learning rate determines how quickly the generator and
discriminator networks are updated during training. A high learning rate can lead to
unstable training, while a low learning rate can lead to slow training.
Batch size: The batch size determines the number of data samples that are used to update
the generator and discriminator networks at each training step. A large batch size can
lead to more stable training, while a small batch size can lead to faster training.
Regularization: Regularization techniques can be used to prevent the generator and
discriminator networks from overfitting the training data. Common regularization
techniques include dropout and L2 regularization.
Loss function: The loss function determines how the performance of the generator and
discriminator networks is measured. The most common loss function for GANs is the
binary cross-entropy loss.
GAN training can be a challenging process, but it can be very rewarding when successful.
GANs have been used to generate a variety of realistic data, including images, videos, and
music.
Chapter-IV: Deep Reinforcement Learning
4.1. Essential Theory of Reinforcement Learning
a) The Cart-Pole Game
b) Markov Decision Processes
c) The Optimal Policy
Answer:
a) The Cart-Pole Game
The Cart-Pole game is a classic reinforcement learning task that involves balancing a pole on
a cart. The cart can move left or right, and the goal is to keep the pole upright as long as possible.
The game is typically played in a simulated environment, where the state of the cart and pole
is represented by a vector of numbers.
The Cart-Pole game is a good example of a reinforcement learning problem because it has the
following characteristics:
The environment is dynamic: The state of the cart and pole changes over time.
The environment is stochastic: The outcome of an action is not always deterministic.
There is a delayed reward: The agent does not receive a reward immediately after taking
an action.
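A minimal interaction loop with the Cart-Pole environment via OpenAI Gym (this follows the classic pre-0.26 Gym API; newer Gym/Gymnasium versions return extra values from reset and step):

import gym

env = gym.make('CartPole-v1')
state = env.reset()
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()            # random action, for illustration
    state, reward, done, info = env.step(action)  # environment transitions
    total_reward += reward                        # +1 for each step the pole stays up
print(total_reward)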
b) Markov Decision Processes
A Markov decision process (MDP) is a mathematical framework for modeling reinforcement
learning problems. An MDP is defined by the following elements:
A set of states: The states represent the possible configurations of the environment.
A set of actions: The actions are the choices that the agent can make.
A transition function: The transition function defines how the environment transitions
from one state to another given an action.
A reward function: The reward function defines the immediate reward that the agent
receives for taking an action.
The goal of reinforcement learning is to find an optimal policy, which is a function that maps
states to actions and maximizes the expected cumulative reward.
c) The Optimal Policy
The optimal policy is the policy that maximizes the expected cumulative reward. The
expected cumulative reward is the sum of the immediate rewards that the agent expects to
receive over an infinite time horizon.
There are two main approaches to finding the optimal policy:
Value-based methods: Value-based methods estimate the value of each state and use
this information to select the action that will lead to the highest expected cumulative
reward.
Policy-based methods: Policy-based methods directly update the policy using gradient
descent or other optimization techniques.
Both value-based and policy-based methods have their own strengths and weaknesses. Value-
based methods are often more stable and less prone to overfitting, while policy-based
methods can be more efficient and can sometimes find policies that are closer to the optimal
policy.
4.2. Essential Theory of Deep Q-Learning Networks
a) Value Functions
b) Q-Value Functions
c) Estimating an Optimal Q-Value
Answer:
a) Value Functions
A value function is a function that maps states to their expected cumulative reward. In other
words, the value function estimates how much reward an agent can expect to receive if it
starts in a given state and follows an optimal policy.
The value function for a state s is denoted by V(s). For the optimal policy, it satisfies
the recursive Bellman equation:

V(s) = max_a Σ_s' P(s'|s, a) [ R(s, a, s') + γ V(s') ]

where:
R(s, a, s') is the immediate reward for moving from state s to state s' via action a.
γ is the discount factor, which determines how much the agent values future rewards.
P(s'|s, a) is the probability of transitioning to state s' after taking action a in state s.

b) Q-Value Functions
A Q-value function, Q(s, a), is the expected cumulative reward of taking action a in
state s and following the optimal policy thereafter. It satisfies the analogous recursion:

Q(s, a) = Σ_s' P(s'|s, a) [ R(s, a, s') + γ max_a' Q(s', a') ]

where max_a' is the maximum over all possible next actions a'.

c) Estimating an Optimal Q-Value
In a deep Q-learning network (DQN), a neural network approximates the optimal Q-value
function. Training proceeds in a loop:
1. The agent observes the current state and selects an action (for example, ε-greedily).
2. The resulting experience (state, action, reward, next state) is stored in a replay buffer.
3. The neural network is trained on a batch of experiences from the replay buffer.
4. The agent uses the updated neural network to select actions.
The goal of training the neural network is to minimize the following loss function:

L = ( r + γ max_a' Q(s', a'; θ⁻) − Q(s, a; θ) )²

where θ are the weights of the main network and θ⁻ are the weights of the periodically
updated target network.

4.3. Defining a DQN Agent
a) Initialization Parameters
Answer:
A DQN agent is configured by several key hyperparameters:
Discount factor (γ): This parameter determines the importance of future rewards
compared to immediate ones. A higher discount factor emphasizes long-term rewards,
while a lower discount factor prioritizes immediate gratification.
Exploration rate (ε): This parameter controls the balance between exploration (trying
new actions) and exploitation (sticking to known good actions). A higher exploration
rate encourages exploration of new strategies, while a lower exploration rate focuses
on exploiting existing knowledge.
Memory replay buffer size: This parameter determines the capacity of the agent's memory,
which stores past experiences for training. A larger memory allows for more diverse
training data, but it also increases computational cost.
Target network update frequency: This parameter specifies how often the target
network, used for generating Q-values during training, is updated from the main
network. A higher update frequency ensures the target network remains aligned with
the latest learning progress.
Batch size: This parameter defines the number of experiences sampled from the
memory replay buffer for each training update. A larger batch size improves training
efficiency but may lead to overfitting.
b) Building the Agent's Neural Network Model
The core of a DQN agent is its neural network model, responsible for estimating Q-values. The
network architecture typically consists of an input layer, hidden layers, and an output layer.
The input layer receives the current state representation of the environment, while the output
layer produces Q-values for each possible action.
The choice of activation functions for the hidden layers depends on the specific task and the
nature of the input data. Common activation functions include rectified linear units (ReLU)
and hyperbolic tangent (tanh).
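A minimal Keras sketch of such a network for Cart-Pole (a 4-dimensional state and two actions; the hidden layer sizes are assumptions):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=4))  # input: the 4-D state
model.add(Dense(32, activation='relu'))
model.add(Dense(2, activation='linear'))              # output: one Q-value per action
model.compile(optimizer='adam', loss='mse')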
c) Remembering Gameplay
The agent's memory replay buffer plays a crucial role in DQN's learning process. It stores past
experiences, also known as transitions, in the form of tuples containing the current state, the
action taken, the immediate reward received, and the next state after the action.
By storing these experiences, the agent can revisit and learn from past mistakes or successful
strategies, allowing it to improve its decision-making over time.
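A sketch of such a memory using Python's deque (the 50,000 capacity and batch size are hypothetical choices):

from collections import deque
import random

memory = deque(maxlen=50000)   # oldest experiences are discarded automatically

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def sample_batch(batch_size=32):
    return random.sample(memory, batch_size)   # random minibatch for training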
d) Training via Memory Replay
The training process of a DQN agent involves sampling batches of experiences from its
memory replay buffer. These experiences are used to update the neural network model by
minimizing the loss function, which measures the difference between the predicted Q-values
and the actual rewards obtained.
The loss function is typically calculated using the Bellman equation, which allows the agent to
propagate the value information from future states back to the current state.
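A sketch of one replay update, assuming the Keras model and replay memory from the previous snippets (states shaped (1, 4) for Cart-Pole; gamma is the discount factor):

import numpy as np

gamma = 0.95

def replay(batch):
    for state, action, reward, next_state, done in batch:
        target = reward
        if not done:
            # Bellman update: propagate value from the next state
            target = reward + gamma * np.max(model.predict(next_state)[0])
        q_values = model.predict(state)   # current Q-value estimates
        q_values[0][action] = target      # replace only the taken action's value
        model.fit(state, q_values, epochs=1, verbose=0)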
e) Selecting an Action to Take
When interacting with the environment, the agent faces the decision of choosing an action to
take based on the current state. The agent's policy, guided by the Q-values estimated by the
neural network, determines how it balances exploration and exploitation.
With a high exploration rate, the agent may select random actions to explore new
possibilities. As learning progresses and the agent gains more knowledge, the exploration rate
decreases, and the agent becomes more likely to exploit known good actions.
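An ε-greedy selection sketch following this scheme (the decay values are hypothetical; env and model are assumed from the earlier snippets):

import random
import numpy as np

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

def select_action(state):
    global epsilon
    if random.random() < epsilon:
        action = env.action_space.sample()                 # explore
    else:
        action = int(np.argmax(model.predict(state)[0]))   # exploit
    epsilon = max(epsilon_min, epsilon * epsilon_decay)    # decay exploration over time
    return action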
f) Saving and Loading Model Parameters
During training, the DQN agent's neural network model improves its ability to estimate Q-
values and make better decisions. To preserve this progress, it's important to save the model
parameters periodically.
Saving the model parameters allows the agent to resume training from a specific point in
time, avoiding the need to restart the entire learning process from scratch. Additionally,
saved models can be used to evaluate the agent's performance on different tasks or
environments.
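In Keras this amounts to two calls (the filename is illustrative):

model.save_weights('dqn_cartpole.h5')   # periodically checkpoint the agent's progress
# ... later, to resume training or to evaluate:
model.load_weights('dqn_cartpole.h5')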