
Unit - 4

Activation Maximization

A technique for understanding and optimizing neural networks' performance.

Activation Maximization is a method used in machine learning to interpret and optimize the
performance of neural networks. It helps researchers and developers gain insights into the inner
workings of these complex models, enabling them to improve their accuracy and efficiency.

In recent years, various studies have explored the concept of activation maximization in different
contexts. For instance, researchers have investigated its application in social networks, aiming to
maximize the coverage of information propagation by considering both active and informed nodes.
Another study focused on energy-efficient wireless communication, where a hybrid active-passive
intelligent reflecting surface was used to optimize the number of active and passive elements for
maximizing energy efficiency.

Moreover, activation maximization has been applied to influence maximization in online social
networks, where the goal is to select a subset of users that maximizes the expected total activity
benefit. This problem has been extended to continuous domains, leading to the development of
efficient algorithms for solving the continuous activity maximization problem.

Practical applications of activation maximization include:

1. Social media marketing: By identifying influential users in a network, businesses can target their
marketing efforts more effectively, leading to increased brand awareness and customer
engagement.

2. Epidemic control: Understanding the dynamics of information propagation in social networks can help public health officials design strategies to control the spread of infectious diseases.

3. Energy management: Optimizing the number of active and passive elements in wireless
communication systems can lead to more energy-efficient networks, reducing power consumption
and environmental impact.

A company case study that demonstrates the use of activation maximization is the development of
a 3-step system for estimating real-time energy expenditure of individuals using smartphone
sensors. By recognizing physical activities and daily routines, the system can estimate energy
expenditure with a mean error of 26% of the expected estimation, providing valuable insights for
health and fitness applications.

What are the limitations of Activation Maximization?

Activation Maximization has some limitations, including:

1. Sensitivity to initialization: The optimization process can be sensitive to the initial input values, potentially leading to different results depending on the starting point.
2. Local optima: The optimization process may get stuck in local optima, resulting in suboptimal solutions.
3. Interpretability: While activation maximization can provide insights into the features learned by a neuron, interpreting these features can still be challenging, especially in deep networks with many layers.

ML models such as deep neural networks (DNNs) are capable of producing complex real-world
predictions. In order to get insight into the workings of the model and verify that the model is not
overfitting the data, it is often desirable to explain its predictions. For linear and mildly nonlinear
models, simple techniques based on Taylor expansions can be used; however, for highly nonlinear
DNN models, the task of explanation becomes more difficult.

Optimization Rule in Deep Neural Networks

In machine learning, optimizers and loss functions are two components that help improve the
performance of the model. By calculating the difference between the expected and actual outputs
of a model, a loss function evaluates the effectiveness of a model. Among the loss functions are
log loss, hinge loss, and mean square loss. By modifying the model’s parameters to reduce the
loss function value, the optimizer contributes to its improvement. RMSProp, ADAM, and SGD
are a few examples of optimizers. The optimizer’s job is to determine which combination of the
neural network’s weights and biases will give it the best chance to generate accurate predictions.

There are various optimization techniques to change model weights and learning rates, like Gradient Descent, Stochastic Gradient Descent, Stochastic Gradient Descent with momentum, Mini-Batch Gradient Descent, AdaGrad, RMSProp, AdaDelta, and Adam. These optimization techniques play a critical role in the training of neural networks, as they help improve the model by adjusting its parameters to minimize the loss function value. Choosing the best optimizer depends on the application.
Before we proceed, it’s essential to acquaint yourself with a few terms:
1. Epoch: the number of times the algorithm iterates over the entire training dataset.
2. Batch size: the number of samples used for one update of the model parameters.
3. Sample: a single record of data in a dataset.
4. Learning rate: a parameter determining the scale of model weight updates.
5. Weights and bias: learnable parameters in a model that regulate the signal between two neurons.

Gradient Descent

A derivative or gradient indicates the direction of increase of a function; the negative derivative or gradient therefore indicates the direction of decrease. This fact is used to minimize the value of the function.
In gradient descent, we initialize the variables with random values, then:
1. We calculate the derivative/gradient for each variable.
2. We take steps in the direction of the negative derivative/gradient, scaled by a learning rate. The learning rate controls the descent: too large a learning rate may cause oscillations, while too small a learning rate results in slow convergence, so the optimal value of the learning rate is critical.
3. This is done iteratively until we reach a convergence criterion.

Formula: θ(k+1) = θk − α∇J(θk)

where,
• θ(k+1) is the updated parameter vector at the (k+1)th iteration.
• θk is the current parameter vector at the kth iteration.
• α is the learning rate, which is a positive scalar that determines the step size for each
iteration.
• ∇J(θk) is the gradient of the cost or loss function J with respect to the parameters θk

In Gradient Descent, a single step is taken by considering the entirety of the training data. The
process involves calculating the average of the gradients for all training examples, and this mean
gradient is employed to update the parameters. This constitutes a singular step in the Gradient
Descent process within a single epoch or iteration.

Advantages
• Easy to implement and compute
Disadvantages
• Chance of getting stuck in local minima.
• If the dataset is too large, it becomes computationally expensive and requires a large amount of memory.
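As a minimal sketch of this update rule (the quadratic objective, the function names, and all parameter values below are illustrative assumptions, not part of the notes):

import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.1, epochs=100, tol=1e-6):
    # Batch gradient descent: theta(k+1) = theta(k) - alpha * grad_J(theta(k))
    theta = theta0.astype(float)
    for _ in range(epochs):
        g = grad_J(theta)            # mean gradient over the full training set
        if np.linalg.norm(g) < tol:  # convergence criterion
            break
        theta = theta - alpha * g    # step in the negative gradient direction
    return theta

# Illustrative objective: J(theta) = ||theta - 3||^2, with gradient 2*(theta - 3)
print(gradient_descent(lambda t: 2 * (t - 3), np.zeros(2)))  # converges toward [3, 3]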

Gradient descent with Armijo Goldstein condition:

It's a variant of gradient descent in which we ensure that the step size taken is sufficient to reduce the objective function, thereby avoiding overly small steps. Here the step size is determined through a line search which must satisfy the Armijo condition. Below is the process:

1. Initialization: We set an initial guess x(0) for the function f(x).

2. Gradient: We compute the gradient of the objective function, ∇f(x).
3. Line Search: Here we start with a large step size α and check whether the reduction in function value (comparing the updated value against the old value) satisfies the condition below, known as the Armijo condition:

f(x(t)) ≤ f(x(t−1)) − cα‖∇f(x(t−1))‖²

Here,
• We are trying to find the value x(t) at time step t, and x(t−1) is the value at step t−1.
• α is the step size.
• c is a constant between 0 and 1.
• If we do not get the required reduction, we reduce the step size by a factor β ∈ (0, 1) iteratively until the above Armijo condition is satisfied.
• Why this value? A first-order Taylor series expansion shows that the best decrease in f(x) one can expect is about "step size × ‖∇f(x)‖²". This theoretical value is not practically achievable, which is why we multiply it by a fraction c.
4. Update : Update the solution parameters with the chosen step size.
5. Convergence Check: This can be done by examining the magnitude of the gradient,
the change in the objective function value, or other convergence criteria
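A minimal sketch of backtracking line search under the Armijo condition (the test function and the constants c and beta are illustrative assumptions):

import numpy as np

def armijo_gradient_descent(f, grad_f, x0, alpha0=1.0, c=1e-4, beta=0.5, iters=100):
    # Gradient descent where each step size is chosen by Armijo backtracking.
    x = x0.astype(float)
    for _ in range(iters):
        g = grad_f(x)
        alpha = alpha0
        # Shrink alpha by beta until the sufficient-decrease (Armijo) condition holds:
        # f(x - alpha*g) <= f(x) - c * alpha * ||g||^2
        while f(x - alpha * g) > f(x) - c * alpha * np.dot(g, g):
            alpha *= beta
        x = x - alpha * g
    return x

# Illustrative use on f(x) = ||x||^2, whose gradient is 2x
print(armijo_gradient_descent(lambda x: np.dot(x, x), lambda x: 2 * x, np.array([5.0, -3.0])))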

Gradient descent with Armijo Full Relaxation condition:

It is an optimization algorithm that combines the Armijo line search condition with a full Newton
step. It considers both the first derivative and second derivative (Hessian) information to find a
step size that ensures sufficient decrease in the objective function while incorporating information
about the curvature of the function.

1. Initialization: We set an initial guess x(0) for the function f(x).

2. Gradient: We compute the gradient of the objective function, ∇f(x).
3. Line Search: Here the step size α should satisfy the condition below (a sufficient-decrease condition augmented with a curvature term):

f(x(t)) ≤ f(x(t−1)) − cα‖∇f(x(t−1))‖² + (b/2)α²∇f(x(t−1))ᵀH(x(t−1))∇f(x(t−1))

Here,
• H(x) is the Hessian.
• 0 < c < b < 1 are constants that determine how much the function must decrease and how much the curvature of the function is taken into account.
• If we do not get the required reduction, we reduce the step size by a factor β ∈ (0, 1) iteratively until the above condition is satisfied.
4. Update : Update the solution parameters with the chosen step size.
5. Convergence Check: This can be done by examining the magnitude of the gradient,
the change in the objective function value, or other convergence criteria

Stochastic Gradient Descent (SGD):


It’s a variation of the Gradient Descent algorithm. In Gradient Descent, we analyze the entire
dataset in each step, which may not be efficient when dealing with very large datasets. To address
this issue, we have Stochastic Gradient Descent (SGD). In Stochastic Gradient Descent, we
process just one example at a time to perform a single step. So, if the dataset contains 10000 rows,
SGD will update the model parameters 10000 times in a single cycle through the dataset, as
opposed to just once in the case of Gradient Descent.
Here’s the process:
1. Select an example from the dataset.
2. Calculate its gradient.
3. Utilize the calculated gradient from step 2 to update the model weights.
4. Repeat steps 1 to 3 for all examples in the training dataset.
5. Completing a full pass through all the examples constitutes one epoch.
6. Repeat this entire process for several epochs as specified during training.
Advantages
• Requires less memory.
• May find new minima, since noisy updates can escape shallow local minima.
Disadvantages
• The SGD algorithm is noisier and takes more iterations to converge compared to gradient descent.
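A minimal sketch of SGD on a least-squares problem (the data, learning rate, and epoch count are illustrative assumptions):

import numpy as np

def sgd(X, y, alpha=0.01, epochs=10):
    # One parameter update per training example, as described above.
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):   # visit every example once per epoch
            grad = (X[i] @ theta - y[i]) * X[i]   # gradient of this single example's loss
            theta -= alpha * grad                 # immediate, noisy update
    return theta

X = np.random.randn(200, 3)
y = X @ np.array([1.0, -2.0, 0.5])
print(sgd(X, y))  # approaches [1, -2, 0.5]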

Mini Batch Stochastic Gradient Descent:

We utilize mini-batch stochastic gradient descent, in which each batch consists of a predetermined number of training examples, smaller than the full dataset. This approach combines the advantages of the previously mentioned variants. In one epoch, following the creation of fixed-size mini-batches, we execute the following steps:
1. Select a mini-batch.
2. Compute the mean gradient of the mini-batch.
3. Apply the mean gradient obtained in step 2 to update the model’s weights.
4. Repeat steps 1 to 3 for all the mini-batches that have been created.
Advantages
• Requires a medium amount of memory
• Less time required to converge when compared to SGD
Disadvantage
• May get stuck at local minima
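A minimal sketch of the mini-batch variant (the batch size and other values are illustrative assumptions); note that each update now uses the mean gradient of a batch rather than a single example:

import numpy as np

def minibatch_sgd(X, y, alpha=0.05, epochs=20, batch_size=32):
    theta = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)            # shuffle, then slice into mini-batches
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]     # indices of one mini-batch
            grad = (X[b] @ theta - y[b]) @ X[b] / len(b)   # mean gradient of the batch
            theta -= alpha * grad
    return theta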

SGD with Momentum:

In Stochastic Gradient Descent, we don’t calculate the precise derivative of our loss function.
Instead, we estimate it using a small batch. This results in “noisy” derivatives, which implies that
we don’t always move in the optimal direction. To address this issue, Momentum was introduced
to mitigate the noise in SGD. It speeds up convergence towards the relevant direction and
diminishes fluctuations in irrelevant directions.
The concept behind Momentum involves denoising the derivatives by employing an exponentially weighted average, assigning more weight to recent updates compared to previous ones.
Update for the momentum term (often denoted as "v" or "m"):

v(t+1) = βvt + (1 − β)∇J(θt)

Here
• v(t+1) is the updated momentum at time t+1.
• vt is the momentum at time t.
• β is the momentum coefficient (typically a value between 0 and 1).
• ∇J(θt) is the gradient of the cost or loss function with respect to the parameters at time t.
Then, we update the parameters using the momentum term:

Formula: θ(t+1) = θt − αv(t+1)

where,
• θ(t+1) is the updated parameter vector at time t+1.
• θt is the current parameter vector at time t.
• α is the learning rate.
Advantages
• Mitigates parameter oscillations and reduces parameter variance.
• Achieves faster convergence compared to standard gradient descent.
Disadvantage
• Introduces an additional hyperparameter (the momentum coefficient β) that must be chosen manually and with precision.
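A minimal sketch of the two momentum equations above (the names and constants are illustrative assumptions):

import numpy as np

def sgd_momentum(grad_J, theta0, alpha=0.01, beta=0.9, steps=1000):
    theta = theta0.astype(float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_J(theta)                 # (possibly noisy) gradient estimate
        v = beta * v + (1 - beta) * g     # exponentially weighted average of gradients
        theta = theta - alpha * v         # parameter update using the momentum term
    return theta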
AdaGrad

ADAGRAD, short for adaptive gradient, signifies that the learning rates are adjusted or adapted over time based on previous gradients. A limitation of the previously discussed optimizers is the use of a fixed learning rate for all parameters throughout each cycle. This can hinder the training of features that exhibit small average gradients (such as sparse features), causing them to train at a slower pace. While one potential solution is to set different learning rates for each feature, this can become complex. AdaGrad addresses this issue by implementing the idea that the more a feature has been updated in the past, the less it will be updated in the future. This gives other features, such as sparse features, a chance to catch up. AdaGrad, as an optimizer, dynamically adjusts the learning rate for each parameter at every time step 't'.
For each parameter θ:
• Initialize a sum of squared gradients variable to zero:
• G0 = 0
• At each time step t:
• Compute the gradient of the cost or loss function with respect to the
parameter θ at time t: ∇J(θt).
• Update the sum of squared gradients:
Gt = G(t−1) + (∇J(θt))²
• Update the parameter θ using the following formula:

θ(t+1) = θt − (α / √(Gt + ε)) ∇J(θt)

Where
• Gt is the sum of squared gradients at time t.
• θt is the current parameter at time t.
• θ(t+1) is the updated parameter at time t+1.
• α (alpha) is the learning rate, which is a positive scalar.
• ∇J(θt) is the gradient of the cost or loss function with respect to
the parameter θt at time t.
• ε (epsilon) is a small constant added to the denominator to
prevent division by zero. It is typically a very small value, such
as 1e-8.
Advantages:
• Adaptive learning rates facilitate effective training of all features.
Disadvantages:
• With a large number of iterations, the learning rate diminishes to extremely small
values, causing slow convergence.
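A minimal sketch of the AdaGrad recursion above (the names and constants are illustrative assumptions):

import numpy as np

def adagrad(grad_J, theta0, alpha=0.1, eps=1e-8, steps=1000):
    theta = theta0.astype(float)
    G = np.zeros_like(theta)                      # sum of squared gradients, G0 = 0
    for _ in range(steps):
        g = grad_J(theta)
        G += g ** 2                               # Gt = G(t-1) + (grad)^2
        theta -= alpha * g / (np.sqrt(G) + eps)   # frequently updated parameters get smaller steps
    return theta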
RMSProp

The challenge with AdaGrad lies in its notably slow convergence. This is primarily due to the fact
that the sum of squared gradients only accumulates and never diminishes. To address this
limitation, RMSProp, short for Root Mean Square Propagation, introduces a decay factor. More
precisely, it transforms the sum of squared gradients into a decayed sum of squared gradients. The
decay rate indicates that only recent gradient squared values are relevant, while those from the
distant past are effectively disregarded. Instead of accumulating all previously squared gradients,
RMSProp restricts the window of accumulated past gradients to a fixed size ‘w’. It achieves this
by using an exponentially moving average instead of the sum of all gradients.
• Initialize a moving average of squared gradients variable:
• E[g²]₀ = 0
• Set a decay rate (typically close to 1), denoted as γ (gamma).
• At each time step t:
• Compute the gradient of the cost or loss function with respect to the
parameter θ at time t: ∇J(θt).
• Update the moving average of squared gradients:

E[g²]t = γE[g²](t−1) + (1 − γ)(∇J(θt))²

• Update the parameter θ using the following formula:

θ(t+1) = θt − (α / √(E[g²]t + ε)) ∇J(θt)
Where,
• E[g2]t is the moving average of squared gradients at time t.
• θt is the current parameter at time t.
• θ(t+1) is the updated parameter at time t+1.
• α (alpha) is the learning rate, which is a positive scalar.
• ∇J(θt) is the gradient of the cost or loss function with respect to
the parameter θt at time t.
• γ (gamma) is the decay rate, typically close to 1.
• ε (epsilon) is a small constant added to the denominator to prevent division by zero. It is typically a very small value, such as 1e-8.
Advantages:
• Prevents the learning rate from decaying, allowing continuous training without
premature stopping.
Disadvantages:
• Involves higher computational complexity due to the additional moving-average state it maintains, making it more computationally expensive.
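A minimal sketch of the RMSProp recursion above (the names and constants are illustrative assumptions):

import numpy as np

def rmsprop(grad_J, theta0, alpha=0.01, gamma=0.9, eps=1e-8, steps=1000):
    theta = theta0.astype(float)
    Eg2 = np.zeros_like(theta)                    # E[g^2]_0 = 0
    for _ in range(steps):
        g = grad_J(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2  # decayed sum of squared gradients
        theta -= alpha * g / (np.sqrt(Eg2) + eps)
    return theta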

Adam
Adam, which stands for Adaptive Moment Estimation, combines the strengths of both Momentum and RMSProp. Adam has been the preferred choice for many deep learning applications in recent years.
For each parameter θ:
• Initialize the first moment vector (mean of gradients) m0 to zeros:
• m0 = 0
• Initialize the second moment vector (uncentered variance of gradients) v0 to zeros:
• v0 = 0
• Set the exponential decay rates for the moments (typically close to 1), denoted as β₁
(beta_1) and β₂ (beta_2).
• Set the small constant ε (epsilon) to prevent division by zero, typically a small value
like 1e-8.
• At each time step t:
• Compute the gradient of the cost or loss function with respect to the
parameter θ at time t: ∇J(θ_t).
• Update the moving average of gradients (first moment):

mt = β₁m(t−1) + (1 − β₁)∇J(θt)

• Update the moving average of squared gradients (second moment):

vt = β₂v(t−1) + (1 − β₂)(∇J(θt))²

• Correct for bias in the moment estimates:

m̂t = mt / (1 − β₁ᵗ),  v̂t = vt / (1 − β₂ᵗ)

• Update the parameter θ using the following formula:

θ(t+1) = θt − α·m̂t / (√v̂t + ε)
• Where:
• θt is the current parameter at time t.
• θ(t+1) is the updated parameter at time t+1.
• α (alpha) is the learning rate, which is a positive scalar.
• ∇J(θt) is the gradient of the cost or loss function with respect to
the parameter θt at time t.
• β₁ (beta_1) and β₂ (beta_2) are the exponential decay rates for
the first and second moments, typically close to 1.
• ε (epsilon) is a small constant added to the denominator to
prevent division by zero, typically a very small value like 1e-8.
Advantages:
• The method is fast and converges rapidly.
Disadvantages:
• Takes a lot of memory, since two moment vectors are maintained per parameter, and is hence computationally costly.
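A minimal sketch of the Adam updates above, including the bias correction (the names and constants are illustrative assumptions):

import numpy as np

def adam(grad_J, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    theta = theta0.astype(float)
    m = np.zeros_like(theta)    # first moment (mean of gradients)
    v = np.zeros_like(theta)    # second moment (uncentered variance of gradients)
    for t in range(1, steps + 1):
        g = grad_J(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta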
Comparison with SGD Optimizer
Let us see how each of the subsequent optimizers tackled different issues of SGD, finally leading to Adam, which is now a widely used optimizer.
• Mini-batch SGD is less noisy when compared to SGD; however, this comes at an increase in computation cost/memory. It also suffers from the same problems of local minima and a fixed learning rate.

Updating processes during SGD and Mini-Batch Gradient Descent

• The use of a momentum term in SGD with momentum helps to denoise the gradients and converge faster compared to SGD without momentum. However, it still uses a fixed learning rate.

Updating process of SGD with momentum vs SGD without Momentum

• AdaGrad, an optimization of the SGD algorithm, uses an adaptive learning rate (LR) algorithm which can automatically adjust the learning rate and increase prediction accuracy. However, AdaGrad is slow in convergence due to the fact that it accumulates the gradient.
• RMSProp modifies AdaGrad so that it accumulates the gradient into an exponentially weighted average. RMSProp discards distant past gradients and preserves only recent knowledge of the gradient. This makes convergence faster.
• Adam is a blend of RMSProp and Momentum. The fixed learning rate issue is resolved using the adaptive learning rate of RMSProp, and the issue of local minima is addressed using Momentum. Due to its overall performance, Adam is often recommended as the default optimizer for various applications. However, Adam uses a lot of memory.
Each optimizer exhibits unique strengths and weaknesses, and the optimal choice depends on the
particular deep learning task and the characteristics of the dataset. The selection of an optimizer
can profoundly influence the speed and quality of convergence during training, ultimately
impacting the final performance of the deep learning model.
Hyperparameter tuning
A Machine Learning model is defined as a mathematical model with several parameters that need
to be learned from the data. By training a model with existing data, we can fit the model
parameters.
However, there is another kind of parameter, known as Hyperparameters, that cannot be directly
learned from the regular training process. They are usually fixed before the actual training process
begins. These parameters express important properties of the model such as its complexity or how
fast it should learn. This article aims to explore various strategies to tune hyperparameters for
Machine learning models.
Hyperparameter tuning is the process of selecting the optimal values for a machine
learning model’s hyperparameters. Hyperparameters are settings that control the learning process
of the model, such as the learning rate, the number of neurons in a neural network, or the kernel
size in a support vector machine. The goal of hyperparameter tuning is to find the values that lead
to the best performance on a given task.
What are Hyperparameters?
In the context of machine learning, hyperparameters are configuration variables that are set before
the training process of a model begins. They control the learning process itself, rather than being
learned from the data. Hyperparameters are often used to tune the performance of a model, and
they can have a significant impact on the model’s accuracy, generalization, and other metrics.
Different Ways of Hyperparameters Tuning
Hyperparameters are configuration variables that control the learning process of a machine
learning model. They are distinct from model parameters, which are the weights and biases that
are learned from the data. There are several different types of hyperparameters:
Hyperparameters in Neural Networks
Neural networks have several essential hyperparameters that need to be adjusted, including:
• Learning rate: This hyperparameter controls the step size taken by the optimizer
during each iteration of training. Too small a learning rate can result in slow
convergence, while too large a learning rate can lead to instability and divergence.
• Epochs: This hyperparameter represents the number of times the entire training
dataset is passed through the model during training. Increasing the number of epochs
can improve the model’s performance but may lead to overfitting if not done
carefully.
• Number of layers: This hyperparameter determines the depth of the model, which
can have a significant impact on its complexity and learning ability.
• Number of nodes per layer: This hyperparameter determines the width of the
model, influencing its capacity to represent complex relationships in the data.
• Architecture: This hyperparameter determines the overall structure of the neural
network, including the number of layers, the number of neurons per layer, and the
connections between layers. The optimal architecture depends on the complexity of
the task and the size of the dataset
• Activation function: This hyperparameter introduces non-linearity into the
model, allowing it to learn complex decision boundaries. Common activation
functions include sigmoid, tanh, and Rectified Linear Unit (ReLU).
Hyperparameters in Support Vector Machine
We take into account some essential hyperparameters for fine-tuning SVMs:
• C: The regularization parameter that controls the trade-off between the margin and
the number of training errors. A larger value of C penalizes training errors more
heavily, resulting in a smaller margin but potentially better generalization
performance. A smaller value of C allows for more training errors but may lead to
overfitting.
• Kernel: The kernel function that defines the similarity between data points.
Different kernels can capture different relationships between data points, and the
choice of kernel can significantly impact the performance of the SVM. Common
kernels include linear, polynomial, radial basis function (RBF), and sigmoid.
• Gamma: The parameter that controls the influence of support vectors on the
decision boundary. A larger value of gamma indicates that nearby support vectors
have a stronger influence, while a smaller value indicates that distant support vectors
have a weaker influence. The choice of gamma is particularly important for RBF
kernels.
Hyperparameters in XGBoost
The following essential XGBoost hyperparameters need to be adjusted:
• learning_rate: This hyperparameter determines the step size taken by the optimizer
during each iteration of training. A larger learning rate can lead to faster convergence,
but it may also increase the risk of overfitting. A smaller learning rate may result in
slower convergence but can help prevent overfitting.
• n_estimators: This hyperparameter determines the number of boosting trees to be
trained. A larger number of trees can improve the model’s accuracy, but it can also
increase the risk of overfitting. A smaller number of trees may result in lower
accuracy but can help prevent overfitting.
• max_depth: This hyperparameter determines the maximum depth of each tree in
the ensemble. A larger max_depth can allow the trees to capture more complex
relationships in the data, but it can also increase the risk of overfitting. A smaller
max_depth may result in less complex trees but can help prevent overfitting.
• min_child_weight: This hyperparameter determines the minimum sum of instance
weight (hessian) needed in a child node. A larger min_child_weight can help prevent
overfitting by requiring more data to influence the splitting of trees. A smaller
min_child_weight may allow for more aggressive tree splitting but can increase the
risk of overfitting.
• subsample: This hyperparameter determines the percentage of rows used for each
tree construction. A smaller subsample can improve the efficiency of training but may
reduce the model’s accuracy. A larger subsample can increase the accuracy but may
make training more computationally expensive.
Some other examples of model hyperparameters include:
1. The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
2. Number of Trees and Depth of Trees for Random Forests.
3. The learning rate for training a neural network.
4. Number of Clusters for Clustering Algorithms.
5. The k in k-nearest neighbors.
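As a hedged illustration of tuning in practice, here is a minimal scikit-learn grid search over the SVM hyperparameters discussed above (the candidate values in param_grid are arbitrary assumptions, and any standard dataset could replace iris):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate values for the SVM hyperparameters C, kernel, and gamma
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold cross-validation per combination
search.fit(X, y)
print(search.best_params_, search.best_score_)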

CNN- Convolutional Neural Network

What is a CNN?

A Convolutional Neural Network (CNN or ConvNet) is a deep learning algorithm specifically designed for tasks where object recognition is crucial, such as image classification, detection, and segmentation. Many real-life applications, such as self-driving cars, surveillance cameras, and more, use CNNs.

The importance of CNNs

There are several reasons why CNNs are important, as highlighted below:

• Unlike traditional machine learning models like SVMs and decision trees that require manual feature extraction, CNNs can perform automatic feature extraction at scale, making them efficient.
• The convolutional layers make CNNs translation invariant, meaning they can recognize patterns and extract features regardless of their position in the image (note that standard convolutions are not inherently invariant to rotation or scaling).
• Multiple pre-trained CNN models such as VGG-16, ResNet50, Inceptionv3, and EfficientNet have been shown to reach state-of-the-art results and can be fine-tuned on new tasks using a relatively small amount of data.
• CNNs are not limited to images: they can also be used for non-image problems such as natural language processing, time series analysis, and speech recognition.

Architecture of a CNN

CNNs’ architecture tries to mimic the structure of neurons in the human visual system composed
of multiple layers, where each one is responsible for detecting a specific feature in the data. As
illustrated in the image below, the typical CNN is made of a combination of four main layers:
1. Convolutional layers
2. Rectified Linear Unit (ReLU for short)
3. Pooling layers
4. Fully connected layers
Let’s understand how each of these layers works using the following example of classification of
the handwritten digit.

Convolution layers

This is the first building block of a CNN. As the name suggests, the main mathematical task performed is called convolution, which is the application of a sliding window function to a matrix of pixels representing an image. The sliding function applied to the matrix is called a kernel or filter; the two terms can be used interchangeably.

In the convolution layer, several filters of equal size are applied, and each filter is used to recognize
a specific pattern from the image, such as the curving of the digits, the edges, the whole shape of
the digits, and more.

Let's consider a 32x32 grayscale image of a handwritten digit. The values in the matrix are given for illustration purposes.
Also, let's consider the kernel used for the convolution. It is a matrix with a dimension of 3x3. The weights of each element of the kernel are represented in the grid: zero weights are shown in the black grids and ones in the white grids.

Do we have to manually find these weights?

In real life, the weights of the kernels are determined during the training process of the neural
network.

Using these two matrices, we can perform the convolution operation by applying the dot product, which works as follows:

1. Apply the kernel matrix starting from the top-left corner.
2. Perform element-wise multiplication.
3. Sum the values of the products.
4. The resulting value corresponds to the first value (top-left corner) in the convoluted matrix.
5. Move the kernel by the stride (the step size of the sliding window), first across and then down.
6. Repeat steps 1 to 5 until the image matrix is fully covered.
The dimension of the convoluted matrix depends on the size of the sliding window: the larger the sliding window, the smaller the resulting dimension. A sketch of this procedure is given below.
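A minimal NumPy sketch of steps 1 to 6 (the image values and the zero/one kernel are illustrative assumptions):

import numpy as np

def convolve2d(image, kernel, stride=1):
    # Valid convolution: slide the kernel, multiply element-wise, and sum.
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum
    return out

image = np.random.rand(32, 32)           # illustrative 32x32 grayscale image
kernel = np.array([[0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]])           # illustrative 3x3 kernel of zeros and ones
print(convolve2d(image, kernel).shape)   # (30, 30): smaller than the input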
Another name associated with the kernel in the literature is feature detector because the weights
can be fine-tuned to detect specific features in the input image.

For instance:

• An averaging kernel (which averages neighboring pixels) can be used to blur the input image.
• A differencing kernel (which subtracts neighboring pixels) is used to perform edge detection.
The more convolution layers the network has, the better it is at detecting more abstract features.

Activation function

A ReLU activation function is applied after each convolution operation. This function helps the network learn non-linear relationships between the features in the image, making the network more robust at identifying different patterns. It also helps to mitigate the vanishing gradient problem.

Pooling layer

The goal of the pooling layer is to pull the most significant features from the convoluted matrix.
This is done by applying some aggregation operations, which reduces the dimension of the feature
map (convoluted matrix), hence reducing the memory used while training the network. Pooling is
also relevant for mitigating overfitting.

The most common aggregation functions that can be applied are:

• Max pooling, which takes the maximum value in each window of the feature map
• Sum pooling, which corresponds to the sum of all the values in each window
• Average pooling, which is the average of all the values in each window
Below is an illustration of each of the previous examples:

Also, the dimension of the feature map becomes smaller as the pooling function is applied.

The last pooling layer flattens its feature map so that it can be processed by the fully connected
layer.
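Tying this together, here is a minimal sketch of the three aggregation functions above (the input values and window size are illustrative assumptions):

import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    # Apply max/sum/average pooling over non-overlapping size x size windows.
    agg = {"max": np.max, "sum": np.sum, "avg": np.mean}[mode]
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = agg(window)   # one value per window
    return out

fm = np.arange(16).reshape(4, 4)
print(pool2d(fm, 2, "max"))   # 2x2 map of window maxima; each dimension is halved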

Fully connected layers

These layers form the last part of the convolutional neural network, and their inputs correspond to the flattened one-dimensional matrix generated by the last pooling layer. ReLU activation functions are applied to them for non-linearity.

Finally, a softmax prediction layer is used to generate probability values for each of the possible
output labels, and the final label predicted is the one with the highest probability score.

Dropout

Dropout is a regularization technique applied to improve the generalization capability of neural networks with a large number of parameters. It consists of randomly dropping some neurons during the training process, which forces the remaining neurons to learn new features from the input data. A sketch is given below.
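A minimal sketch of (inverted) dropout as described above (the rate and array values are illustrative assumptions):

import numpy as np

def dropout(activations, rate=0.5, training=True):
    # Randomly zero a fraction `rate` of neurons during training.
    if not training:
        return activations                    # no dropout at inference time
    mask = np.random.rand(*activations.shape) >= rate
    return activations * mask / (1.0 - rate)  # rescale so the expected activation is unchanged

h = np.ones((2, 4))
print(dropout(h, rate=0.5))   # roughly half the entries zeroed, the rest scaled to 2.0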
RNN

Recurrent Neural Network (RNN)

• RNN intuition
• Vanishing Gradient Problem
• Tackling the Vanishing Gradient Problem
• Exploding Gradient Problem
• Tackling the Exploding Gradient Problem
• Long Short-Term Memory
• Applications of Recurrent Neural Networks

What is Recurrent Neural Network (RNN)?

Recurrent Neural Network(RNN) is a type of Neural Network where the output from the previous
step is fed as input to the current step.

In traditional neural networks, all the inputs and outputs are independent of each other.

In some cases when it is required to predict the next word of a sentence, the previous words are
required and hence there is a need to remember the previous words. Thus RNN came into
existence, which solved this issue with the help of a Hidden Layer.

The main and most important feature of RNN is its Hidden state, which remembers some
information about a sequence. The state is also referred to as Memory State since it remembers
the previous input to the network. It uses the same parameters for each input as it performs the
same task on all the inputs or hidden layers to produce the output.

This reduces the complexity of parameters, unlike other neural networks.


How RNN differs from Feedforward Neural Network?

Artificial neural networks that do not have looping nodes are called feed forward neural networks.
Because all information is only passed forward, this kind of neural network is also referred to as a
multi-layer neural network.

Information moves from the input layer to the output layer – if any hidden layers are present –
unidirectionally in a feedforward neural network. These networks are appropriate for image
classification tasks, for example, where input and output are independent. Nevertheless, their
inability to retain previous inputs automatically renders them less useful for sequential data
analysis.

Recurrent Neuron and RNN Unfolding

The fundamental processing unit in a Recurrent Neural Network (RNN) is a Recurrent Unit, which
is not explicitly called a “Recurrent Neuron.” This unit has the unique ability to maintain a hidden
state, allowing the network to capture sequential dependencies by remembering previous inputs
while processing. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) versions
improve the RNN’s ability to handle long-term dependencies.
Types Of RNN

There are four types of RNNs based on the number of inputs and outputs in the network.

1. One to One

2. One to Many

3. Many to One

4. Many to Many

One to One

This type of RNN behaves like any simple neural network; it is also known as a Vanilla Neural Network. In this network, there is only one input and one output.
One To Many

In this type of RNN, there is one input and many outputs associated with it. One of the most used
examples of this network is Image captioning where given an image we predict a sentence having
Multiple words.

Many to One

In this type of network, many inputs are fed to the network at several states of the network, generating only one output. This type of network is used for problems like sentiment analysis, where we give multiple words as input and predict only the sentiment of the sentence as output.
Many to Many

In this type of neural network, there are multiple inputs and multiple outputs corresponding to a
problem. One Example of this Problem will be language translation. In language translation, we
provide multiple words from one language as input and predict multiple words from the second
language as output.

Recurrent Neural Network Architecture

RNNs have the same input and output architecture as any other deep neural architecture. However, differences arise in the way information flows from input to output. Unlike deep neural networks, where we have different weight matrices for each dense layer, in an RNN the weights across the network remain the same. The RNN calculates the hidden state ht for every input xt using the following formulas:

ht = σ(U·xt + W·h(t−1) + B)

yt = O(V·ht + C)

Hence

yt = f(xt, h(t−1), W, U, V, B, C)

Here the state matrix S has element st as the state of the network at time step t.
The parameters in the network are W, U, V, B, C, which are shared across time steps. A NumPy sketch of this unrolled computation is given below.
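A minimal NumPy sketch of the unrolled computation above, using tanh as the activation σ (all shapes and values are illustrative assumptions):

import numpy as np

def rnn_forward(X, U, W, V, B, C):
    # ht = tanh(U @ xt + W @ h(t-1) + B), yt = V @ ht + C
    # The same U, W, V, B, C are shared across every time step.
    h = np.zeros(W.shape[0])
    outputs = []
    for x_t in X:                          # X: sequence of input vectors
        h = np.tanh(U @ x_t + W @ h + B)   # hidden state carries past information
        outputs.append(V @ h + C)          # output at this time step
    return np.array(outputs), h

# Illustrative shapes: input dim 3, hidden dim 5, output dim 2, sequence length 4
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
ys, h_last = rnn_forward(rng.normal(size=(4, 3)), U, W, V, np.zeros(5), np.zeros(2))
print(ys.shape)   # (4, 2): one output per time step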

How does RNN work?

The Recurrent Neural Network consists of multiple fixed activation function units, one for each
time step. Each unit has an internal state which is called the hidden state of the unit. This hidden
state signifies the past knowledge that the network currently holds at a given time step. This hidden
state is updated at every time step to signify the change in the knowledge of the network about the
past. The hidden state is updated using the following recurrence relation:

The formula for calculating the current state:

ht = f(ht−1, xt)

where,

• ht -> current state
• ht−1 -> previous state
• xt -> input state

Formula for applying the activation function (tanh):

ht = tanh(whh·ht−1 + wxh·xt)

where,

• whh -> weight at the recurrent neuron
• wxh -> weight at the input neuron

The formula for calculating output:

yt = why·ht

where,

• yt -> output
• why -> weight at the output layer

These parameters are updated using backpropagation. However, since an RNN works on sequential data, we use an updated form of backpropagation known as backpropagation through time.

Backpropagation Through Time (BPTT)

In an RNN, the computation proceeds in an ordered fashion: each variable is computed one at a time in a specified order, first h1, then h2, then h3, and so on. Hence we apply backpropagation through all these hidden time states sequentially.

The output of the neural network is used to calculate and collect the errors once the network has produced an output for a given time step. The network is then unrolled backward through time, and the weights are recalculated and adjusted to account for the errors.

Two issues of Standard RNNs

There are two key challenges that RNNs have had to overcome, but in order to comprehend them,
one must first grasp what a gradient is.
With regard to its inputs, a gradient is a partial derivative. If you’re not sure what that implies,
consider this: a gradient quantifies how much the output of a function varies when the inputs are
changed slightly.

A function's slope is also known as its gradient. The steeper the slope (the higher the gradient), the faster a model can learn. If the slope is zero, on the other hand, the model stops learning. A gradient measures the change in all weights in relation to the change in error.

• Exploding Gradients: Exploding gradients occur when the algorithm gives the weights
an absurdly high priority for no apparent reason. Fortunately, truncating or squashing the
gradients is a simple solution to this problem.
• Vanishing Gradients: Vanishing gradients occur when the gradient values are too small,
causing the model to stop learning or take far too long. This was a big issue in the 1990s,
and it was far more difficult to address than the exploding gradients. Fortunately, Sepp
Hochreiter and Juergen Schmidhuber’s LSTM concept solved the problem.

RNN Applications

Recurrent Neural Networks are used to tackle a variety of problems involving sequence data. There
are many different types of sequence data, but the following are the most common: Audio, Text,
Video, Biological sequences.

Using RNN models and sequence datasets, you may tackle a variety of problems, including :

• Speech recognition
• Generation of music
• Automated Translations
• Analysis of video action
• Sequence study of the genome and DNA
Basic Python Implementation (RNN with Keras)

Import the required libraries

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

Here’s a simple Sequential model that processes integer sequences, embeds each integer into a 64-
dimensional vector, and then uses an LSTM layer to handle the sequence of vectors.

model = keras.Sequential()
model.add(layers.Embedding(input_dim=1000, output_dim=64))
model.add(layers.LSTM(128))
model.add(layers.Dense(10))
model.summary()

Output:

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 64) 64000
_________________________________________________________________
lstm (LSTM) (None, 128) 98816
_________________________________________________________________
dense (Dense) (None, 10) 1290
=================================================================
Total params: 164,106
Trainable params: 164,106
Non-trainable params: 0

• Recurrent Neural Networks are a versatile tool that can be used in a variety of situations.
They’re employed in a variety of methods for language modelling and text generators.
They’re also employed in voice recognition.
• When paired with Convolutional Neural Networks, this type of neural network is used to create labels (captions) for images that aren't tagged. It's incredible how well this combination works.
• However, there is one flaw with recurrent neural networks. They have trouble learning
long-range dependencies, which means they don’t comprehend relationships between data
that are separated by several steps.
• When anticipating words, for example, we may require more context than simply one prior
word. This is known as the vanishing gradient problem, and it is solved using a special type
of Recurrent Neural Network called Long-Short Term Memory Networks (LSTM), which
is a larger topic that will be discussed in future articles.

Issues of Standard RNNs

1. Vanishing Gradient: Text generation, machine translation, and stock market prediction are just a few examples of the time-dependent and sequential data problems that can be modelled with recurrent neural networks. You will discover, though, that the gradient problem makes training RNNs difficult.

2. Exploding Gradient: An exploding gradient occurs when a neural network is being trained and the slope tends to grow exponentially rather than decay. Large error gradients that build up during training lead to very large updates to the neural network model weights, which is the source of this issue.

Training through RNN

1. A single-time step of the input is provided to the network.

2. Then calculate its current state using a set of current input and the previous state.

3. The current ht becomes ht-1 for the next time step.

4. One can go as many time steps according to the problem and join the information from all
the previous states.

5. Once all the time steps are completed the final current state is used to calculate the output.

6. The output is then compared to the actual output, i.e., the target output, and the error is generated.

7. The error is then back-propagated to the network to update the weights and hence the
network (RNN) is trained using Backpropagation through time.

Advantages and Disadvantages of Recurrent Neural Network


Advantages

1. An RNN remembers each and every piece of information through time. This ability to remember previous inputs is what makes it useful in time series prediction. (The variant designed specifically for long memories is called Long Short-Term Memory.)

2. Recurrent neural networks are even used with convolutional layers to extend the effective
pixel neighborhood.

Disadvantages
1. Gradient vanishing and exploding problems.

2. Training an RNN is a very difficult task.

3. It cannot process very long sequences if using tanh or relu as an activation function.

Applications of Recurrent Neural Network

1. Language Modelling and Generating Text

2. Speech Recognition

3. Machine Translation

4. Image Recognition, Face detection

5. Time series Forecasting

Variation Of Recurrent Neural Network (RNN)

To overcome problems like vanishing and exploding gradients, several new advanced versions of RNNs have been formed; some of these are:

1. Bidirectional Neural Network (BiNN)

2. Long Short-Term Memory (LSTM)

Bidirectional Neural Network (BiNN)

A BiNN is a variation of a Recurrent Neural Network in which the input information flows in both directions, and the outputs of both directions are then combined to produce the final output. BiNNs are useful in situations where the context of the input is important, such as NLP tasks and time-series analysis problems.

Long Short-Term Memory (LSTM)

Long Short-Term Memory works on the read-write-and-forget principle: given the input information, the network reads and writes the most useful information from the data and forgets the information that is not important for predicting the output. To do this, three new gates are introduced into the RNN. In this way, only the selected information is passed through the network.

Difference between RNN and Simple Neural Network

RNN is considered to be the better choice over a deep neural network when the data is sequential. The significant differences between RNNs and deep neural networks are listed below:

Recurrent Neural Network | Deep Neural Network
Weights are the same across all time steps of the network | Weights are different for each layer of the network
Used when the data is sequential and the number of inputs is not predefined | Has no special method for sequential data, and the number of inputs is fixed
The number of parameters is higher than in a simple DNN | The number of parameters is lower than in an RNN
Exploding and vanishing gradients are the major drawback | These problems also occur, but they are not the major problem

RNN Code Implementation

Imported libraries:

Imported some necessary libraries such as numpy and tensorflow for numerical calculation and model building.

import numpy as np

import tensorflow as tf

from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import SimpleRNN, Dense

Input Generation:

Generated some example data using text.

text = "This is GeeksforGeeks a software training institute"

chars = sorted(list(set(text)))

char_to_index = {char: i for i, char in enumerate(chars)}

index_to_char = {i: char for i, char in enumerate(chars)}

Created input sequences and corresponding labels for further implementation.


seq_length = 3

sequences = []

labels = []

for i in range(len(text) - seq_length):
    seq = text[i:i+seq_length]
    label = text[i+seq_length]
    sequences.append([char_to_index[char] for char in seq])
    labels.append(char_to_index[label])

Converted sequences and labels into numpy arrays and used one-hot encoding to convert text into
vector.

X = np.array(sequences)

y = np.array(labels)

X_one_hot = tf.one_hot(X, len(chars))

y_one_hot = tf.one_hot(y, len(chars))

Model Building:

Build RNN Model using ‘relu’ and ‘softmax‘ activation function.

model = Sequential()

model.add(SimpleRNN(50, input_shape=(seq_length, len(chars)), activation='relu'))

model.add(Dense(len(chars), activation='softmax'))

Model Compilation:

The model.compile line builds the neural network for training by specifying the optimizer (Adam), the loss function (categorical crossentropy), and the training metric (accuracy).
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Model Training:

Using the input sequences (X_one_hot) and corresponding labels (y_one_hot) for 100 epochs, the model is trained using the model.fit line, which optimises the model parameters to minimise the categorical crossentropy loss.

model.fit(X_one_hot, y_one_hot, epochs=100)

output:

Epoch 1/100
2/2 [==============================] - 2s 54ms/step - loss: 2.8327 - accuracy:
0.0000e+00
Epoch 2/100
2/2 [==============================] - 0s 16ms/step - loss: 2.8121 - accuracy:
0.0000e+00
Epoch 3/100
‘’’’
‘’’’’
Epoch 99/100
2/2 [==============================] - 0s 15ms/step - loss: 0.4588 - accuracy: 0.9583
Epoch 100/100
2/2 [==============================] - 0s 10ms/step - loss: 0.4469 - accuracy: 0.9583

Model Prediction:

Generated text using pre-trained model.

start_seq = "This is G"


generated_text = start_seq

for i in range(50):
    x = np.array([[char_to_index[char] for char in generated_text[-seq_length:]]])
    x_one_hot = tf.one_hot(x, len(chars))
    prediction = model.predict(x_one_hot)
    next_index = np.argmax(prediction)
    next_char = index_to_char[next_index]
    generated_text += next_char

print("Generated Text:")
print(generated_text)

output:
1/1 [==============================] - 1s 517ms/step
1/1 [==============================] - 0s 75ms/step
1/1 [==============================] - 0s 101ms/step
...
1/1 [==============================] - 0s 20ms/step
Generated Text:
This is Geeks a software training instituteais is is is is

LSTM

Long Short-Term Memory is an improved version of the recurrent neural network designed by Hochreiter & Schmidhuber. LSTM is well-suited for sequence prediction tasks and excels at capturing long-term dependencies. Its applications extend to tasks involving time series and sequences. LSTM's strength lies in its ability to grasp the order dependence crucial for solving intricate problems, such as machine translation and speech recognition. This section provides an in-depth introduction to LSTM, covering the LSTM model, architecture, working principles, and the critical role LSTMs play in various applications.
What is LSTM?
A traditional RNN has a single hidden state that is passed through time, which can make it difficult
for the network to learn long-term dependencies. LSTMs address this problem by introducing a
memory cell, which is a container that can hold information for an extended period. LSTM
networks are capable of learning long-term dependencies in sequential data, which makes them
well-suited for tasks such as language translation, speech recognition, and time series forecasting.
LSTMs can also be used in combination with other neural network architectures, such
as Convolutional Neural Networks (CNNs) for image and video analysis.
The memory cell is controlled by three gates: the input gate, the forget gate, and the output gate.
These gates decide what information to add to, remove from, and output from the memory cell.
The input gate controls what information is added to the memory cell. The forget gate controls
what information is removed from the memory cell. And the output gate controls what information
is output from the memory cell. This allows LSTM networks to selectively retain or discard
information as it flows through the network, which allows them to learn long-term dependencies.
Bidirectional LSTM
Bidirectional LSTM (Bi LSTM / BLSTM) is a recurrent neural network (RNN) that is able to process sequential data in both forward and backward directions. This allows Bi LSTMs to learn longer-range dependencies in sequential data than traditional LSTMs, which can only process sequential data in one direction.
• Bi LSTMs are made up of two LSTM networks, one that processes the input
sequence in the forward direction and one that processes the input sequence in the
backward direction. The outputs of the two LSTM networks are then combined to
produce the final output.
• Bi LSTMs have been shown to achieve state-of-the-art results on a wide variety of tasks, including machine translation, speech recognition, and text summarization.
LSTMs can be stacked to create deep LSTM networks, which can learn even more complex
patterns in sequential data. Each LSTM layer captures different levels of abstraction and temporal
dependencies in the input data.
Architecture and Working of LSTM
LSTM architecture has a chain structure that contains four neural networks and different memory
blocks called cells.

Information is retained by the cells and the memory manipulations are done by the gates. There
are three gates –
Forget Gate
The information that is no longer useful in the cell state is removed with the forget gate. Two inputs, xt (input at the particular time) and ht−1 (previous cell output), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function which gives an output between 0 and 1. If for a particular cell state the output is (close to) 0, the piece of information is forgotten, and for an output of (close to) 1, the information is retained for future use. The equation for the forget gate is:

f_t = σ(W_f · [h_t−1, x_t] + b_f)

where:
• W_f represents the weight matrix associated with the forget gate.
• [h_t-1, x_t] denotes the concatenation of the current input and the previous hidden state.
• b_f is the bias with the forget gate.
• σ is the sigmoid activation function.
Input gate
The addition of useful information to the cell state is done by the input gate. First, the information is regulated using the sigmoid function, which filters the values to be remembered, similar to the forget gate, using inputs ht−1 and xt. Then, a vector is created using the tanh function that gives an output from −1 to +1, which contains all the possible values from ht−1 and xt. At last, the values of the vector and the regulated values are multiplied to obtain the useful information. The equations for the input gate are:

i_t = σ(W_i · [h_t−1, x_t] + b_i)

Ĉ_t = tanh(W_c · [h_t−1, x_t] + b_c)

We multiply the previous cell state by ft, disregarding the information we had previously chosen to ignore. Next, we add i_t ⊙ Ĉ_t. This represents the updated candidate values, adjusted by the amount that we chose to update each state value:

C_t = f_t ⊙ C_t−1 + i_t ⊙ Ĉ_t

where
• ⊙ denotes element-wise multiplication
• tanh is tanh activation function
Output gate
The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function on the cell state. Then, the information is regulated using the sigmoid function, filtered by the values to be remembered using inputs ht−1 and xt. At last, the values of the vector and the regulated values are multiplied to be sent as an output and as input to the next cell. The equations for the output gate are:

o_t = σ(W_o · [h_t−1, x_t] + b_o)

h_t = o_t ⊙ tanh(C_t)
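A minimal NumPy sketch of one LSTM step implementing the gate equations above (all weight shapes and values are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    z = np.concatenate([h_prev, x_t])   # [h_(t-1), x_t]
    f_t = sigmoid(Wf @ z + bf)          # forget gate: what to discard from the cell state
    i_t = sigmoid(Wi @ z + bi)          # input gate: what to write
    C_hat = np.tanh(Wc @ z + bc)        # candidate values in (-1, 1)
    C_t = f_t * C_prev + i_t * C_hat    # updated cell state
    o_t = sigmoid(Wo @ z + bo)          # output gate: what to expose
    h_t = o_t * np.tanh(C_t)            # new hidden state
    return h_t, C_t

# Illustrative sizes: input dim 3, hidden dim 4
rng = np.random.default_rng(0)
d, n = 3, 4
Ws = [rng.normal(size=(n, n + d)) for _ in range(4)]
bs = [np.zeros(n) for _ in range(4)]
h, C = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n),
                 Ws[0], bs[0], Ws[1], bs[1], Ws[2], bs[2], Ws[3], bs[3])
print(h.shape, C.shape)   # (4,) (4,)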

LSTM vs RNN

Feature | LSTM (Long Short-Term Memory) | RNN (Recurrent Neural Network)
Memory | Has a special memory unit that allows it to learn long-term dependencies in sequential data | Does not have a memory unit
Directionality | Can be trained to process sequential data in both forward and backward directions | Can only be trained to process sequential data in one direction
Training | More difficult to train than RNN due to the complexity of the gates and memory unit | Easier to train than LSTM
Long-term dependency learning | Yes | Limited
Ability to learn sequential data | Yes | Yes
Applications | Machine translation, speech recognition, text summarization, natural language processing, time series forecasting | Natural language processing, machine translation, speech recognition, image processing, video processing

Advantages and Disadvantages of LSTM

The advantages of LSTM (Long-Short Term Memory) are as follows:


• Long-term dependencies can be captured by LSTM networks. They have a memory
cell that is capable of long-term information storage.
• In traditional RNNs, there is a problem of vanishing and exploding gradients when
models are trained over long sequences. By using a gating mechanism that selectively
recalls or forgets information, LSTM networks deal with this problem.
• LSTM enables the model to capture and remember important context, even when there is a significant time gap between relevant events in the sequence. So LSTMs are used where understanding context is important, e.g., machine translation.
The disadvantages of LSTM (Long Short-Term Memory) are as follows:
• Compared to simpler architectures like feed-forward neural networks, LSTM
networks are computationally more expensive. This can limit their scalability for large-
scale datasets or resource-constrained environments.
• Training LSTM networks can be more time-consuming compared to simpler models
due to their computational complexity. So training LSTMs often requires more data
and longer training times to achieve high performance.
• Since a sequence is processed word by word in a sequential manner, it is hard to parallelize the
work of processing sentences.
Applications of LSTM
Some of the famous applications of LSTM include the following (a minimal Keras forecasting sketch follows this list):
• Language Modeling: LSTMs have been used for natural language processing tasks
such as language modeling, machine translation, and text summarization. They can be
trained to generate coherent and grammatically correct sentences by learning the
dependencies between words in a sentence.
• Speech Recognition: LSTMs have been used for speech recognition tasks such as
transcribing speech to text and recognizing spoken commands. They can be trained to
recognize patterns in speech and match them to the corresponding text.
• Time Series Forecasting: LSTMs have been used for time series forecasting tasks
such as predicting stock prices, weather, and energy consumption. They can learn
patterns in time series data and use them to make predictions about future events.
• Anomaly Detection: LSTMs have been used for anomaly detection tasks such as
detecting fraud and network intrusion. They can be trained to identify patterns in data
that deviate from the norm and flag them as potential anomalies.
• Recommender Systems: LSTMs have been used for recommendation tasks such as
recommending movies, music, and books. They can learn patterns in user behavior and
use them to make personalized recommendations.
• Video Analysis: LSTMs have been used for video analysis tasks such as object
detection, activity recognition, and action classification. They can be used in
combination with other neural network architectures, such as Convolutional Neural
Networks (CNNs), to analyze video data and extract useful information.
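To make the time series forecasting use case concrete, here is a hedged, minimal Keras sketch that predicts the next value of a synthetic sine-wave series. The data, the window length of 30, and the 32-unit LSTM layer are illustrative assumptions.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Build sliding windows over a toy sine-wave series
series = np.sin(np.linspace(0, 20 * np.pi, 2000)).astype("float32")
window = 30
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., None]  # shape: (samples, timesteps, features)

model = keras.Sequential([
    layers.Input(shape=(window, 1)),
    layers.LSTM(32),   # LSTM layer reads the window step by step
    layers.Dense(1),   # predict the next value in the series
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=64, verbose=0)
print(model.predict(X[:1], verbose=0))  # forecast for the first window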
Conclusion
Long Short-Term Memory (LSTM) is a powerful type of recurrent neural network (RNN) that is
well-suited for handling sequential data with long-term dependencies. It addresses the vanishing
gradient problem, a common limitation of RNNs, by introducing a gating mechanism that controls
the flow of information through the network. This allows LSTMs to learn and retain information
from the past, making them effective for tasks like machine translation, speech recognition, and
natural language processing.
Frequently Asked Questions (FAQs)
1. What is LSTM?
LSTM is a type of recurrent neural network (RNN) that is designed to address the vanishing
gradient problem, which is a common issue with RNNs. LSTMs have a special architecture that
allows them to learn long-term dependencies in sequences of data, which makes them well-suited
for tasks such as machine translation, speech recognition, and text generation.
2. How does LSTM work?
LSTMs use a cell state to store information about past inputs. This cell state is updated at each
step of the network, and the network uses it to make predictions about the current input. The cell
state is updated using a series of gates that control how much information is allowed to flow into
and out of the cell.
3. What is the major difference between LSTM and bidirectional LSTM?
A standard LSTM processes a sequence in one direction only, so at each step it can utilize only
past information. A bidirectional LSTM runs two LSTM layers over the sequence, one forward and
one backward, so it can utilize information from both past and future context (a minimal Keras
sketch follows these FAQs).
4. What is the difference between LSTM and Gated Recurrent Unit (GRU)?
Both LSTM and GRU address the vanishing gradient problem of the RNN, but they differ in a few ways:
• LSTM has a separate cell state and uses three gates to control information flow, whereas
GRU merges the cell and hidden state and employs only two gates.
• LSTM is more powerful but slower to train; GRUs are typically faster and simpler.
• GRUs are favored for small datasets, while LSTMs are preferable for large datasets.
5. What is the difference between LSTM and RNN?
• RNNs have a simple recurrent structure with unidirectional information flow.
• LSTMs have a gating mechanism that controls information flow and a cell state for
long-term memory.
• LSTMs generally outperform RNNs in tasks that require learning long-term
dependencies.
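As a minimal illustration of the bidirectional variant discussed in question 3, Keras provides a Bidirectional wrapper that runs an LSTM in both directions; the feature count, layer size, and 10-class output head below are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(None, 16)),          # variable-length sequences with 16 features
    layers.Bidirectional(layers.LSTM(32)),   # forward and backward LSTMs, outputs concatenated
    layers.Dense(10, activation="softmax"),  # e.g. a 10-class classification head
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()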

Autoencoder and decoder

• An autoencoder is a type of artificial neural network used to learn data encodings in an
unsupervised manner.

• The aim of an autoencoder is to learn a lower-dimensional representation (encoding) for
higher-dimensional data, typically for dimensionality reduction, by training the network to
capture the most important parts of the input.

The architecture of an autoencoder

Autoencoders consist of 3 parts:

• 1. Encoder: A module that compresses the input data (the train, validation, and test sets
alike) into an encoded representation that is typically several orders of magnitude smaller
than the input data.

• 2. Bottleneck: A module that contains the compressed knowledge representations and is
therefore the most important part of the network.

• 3. Decoder: A module that helps the network “decompress” the knowledge representations
and reconstructs the data back from its encoded form. The output is then compared with a
ground truth.
• Additionally, in almost all contexts where the term "autoencoder" is used, the compression
and decompression functions are implemented with neural networks.

• 1) Autoencoders are data-specific, which means that they will only be able to compress
data similar to what they have been trained on. This is different from, say, the MPEG-2
Audio Layer III (MP3) compression algorithm, which only holds assumptions about
"sound" in general, but not about specific types of sounds. An autoencoder trained on
pictures of faces would do a rather poor job of compressing pictures of trees, because the
features it would learn would be face-specific.

• 2) Autoencoders are lossy, which means that the decompressed outputs will be degraded
compared to the original inputs (similar to MP3 or JPEG compression). This differs from
lossless arithmetic compression.

• 3) Autoencoders are learned automatically from data examples, which is a useful
property: it means that it is easy to train specialized instances of the algorithm that will
perform well on a specific type of input. It doesn't require any new engineering, just
appropriate training data.

• To build an autoencoder, you need three things: an encoding function, a decoding
function, and a distance function that measures the information loss between the
compressed representation of your data and the decompressed representation (i.e. a
"loss" function). The encoder and decoder will be chosen to be parametric functions
(typically neural networks) and to be differentiable with respect to the distance function,
so the parameters of the encoding/decoding functions can be optimized to minimize the
reconstruction loss using stochastic gradient descent. It's simple, and you don't even
need to understand any of these words to start using autoencoders in practice.

1. Encoder

   1. The input layer takes the raw input data.

   2. The hidden layers progressively reduce the dimensionality of the input, capturing
      important features and patterns. These layers compose the encoder.

   3. The bottleneck layer (latent space) is the final hidden layer, where the
      dimensionality is significantly reduced. This layer represents the compressed
      encoding of the input data.

2. Decoder

   1. The bottleneck layer takes the encoded representation and expands it back to the
      dimensionality of the original input.

   2. The hidden layers progressively increase the dimensionality and aim to reconstruct
      the original input.

   3. The output layer produces the reconstructed output, which ideally should be as
      close as possible to the input data.

3. The loss function used during training is typically a reconstruction loss, measuring the
   difference between the input and the reconstructed output. Common choices include mean
   squared error (MSE) for continuous data or binary cross-entropy for binary data.

4. During training, the autoencoder learns to minimize the reconstruction loss, forcing the
   network to capture the most important features of the input data in the bottleneck layer.

5. After the training process, only the encoder part of the autoencoder is retained to encode
   a similar type of data to that used in the training process. The different ways to constrain
   the network are:

   1. Keep small hidden layers: If the size of each hidden layer is kept as small as possible,
      then the network will be forced to pick up only the representative features of the data,
      thus encoding the data.

   2. Regularization: In this method, a loss term is added to the cost function which encourages
      the network to train in ways other than copying the input.

   3. Denoising: Another way of constraining the network is to add noise to the input and teach
      the network how to remove the noise from the data.

   4. Tuning the activation functions: This method involves changing the activation
      functions of various nodes so that a majority of the nodes are dormant, thus effectively
      reducing the size of the hidden layers.
The relationship between the Encoder, Bottleneck, and Decoder

Encoder

• The encoder is a set of convolutional blocks followed by pooling modules that
compress the input to the model into a compact section called the bottleneck.

• The bottleneck is followed by the decoder, which consists of a series of upsampling
modules that bring the compressed feature back into the form of an image. In the case of
simple autoencoders, the output is expected to be the same as the input data with
reduced noise.

• However, for variational autoencoders the output is a completely new image, formed with
information the model has been provided as input.

Bottleneck

• The most important part of the neural network, and ironically the smallest one, is the
bottleneck. The bottleneck exists to restrict the flow of information to the decoder
from the encoder, thus allowing only the most vital information to pass through.

• Since the bottleneck is designed in such a way that the maximum information
possessed by an image is captured in it, we can say that the bottleneck helps us form
a knowledge-representation of the input.

• Thus, the encoder-decoder structure helps us extract the most from an image in the
form of data and establish useful correlations between various inputs within the
network.

• A bottleneck, as a compressed representation of the input, further prevents the neural
network from memorising the input and overfitting on the data.

• As a rule of thumb, remember this: The smaller the bottleneck, the lower the risk of
overfitting.

• However, very small bottlenecks would restrict the amount of information storable, which
increases the chances of important information slipping out through the pooling
layers of the encoder.

Decoder

• Finally, the decoder is a set of upsampling and convolutional blocks that reconstructs
the bottleneck's output.
• Since the input to the decoder is a compressed knowledge representation, the decoder
serves as a “decompressor” and builds back the image from its latent attributes.

How to train an autoencoder?

1. Code size: The code size or the size of the bottleneck is the most important hyperparameter
used to tune the autoencoder. The bottleneck size decides how much the data has to be
compressed. This can also act as a regularisation term.

2. Number of layers: Like all neural networks, an important hyperparameter to tune
autoencoders is the depth of the encoder and the decoder. While a higher depth increases
model complexity, a lower depth is faster to process.

3. Number of nodes per layer: The number of nodes per layer defines the weights we use per
layer. Typically, the number of nodes decreases with each subsequent layer in the
autoencoder as the input to each of these layers becomes smaller across the layers.

4. Reconstruction loss: The loss function we use to train the autoencoder is highly dependent
on the type of input and output we want the autoencoder to adapt to. For image data, the
most popular reconstruction losses are MSE loss and L1 loss. If the inputs and outputs are
within the range [0, 1], as in MNIST, we can also use binary cross-entropy as the
reconstruction loss (a minimal sketch follows this list).
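Tying these hyperparameters together, below is a minimal Keras sketch of an autoencoder with an explicit code size, a two-layer encoder and decoder, and MSE reconstruction loss. The 784-dimensional input (e.g. flattened 28x28 images), the 128-unit hidden layers, and the code size of 32 are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers

input_dim, code_size = 784, 32  # assumed sizes for illustration

inputs = keras.Input(shape=(input_dim,))
# Encoder: progressively reduce dimensionality down to the bottleneck
h = layers.Dense(128, activation="relu")(inputs)
code = layers.Dense(code_size, activation="relu")(h)  # bottleneck / code
# Decoder: expand the code back to the original dimensionality
h = layers.Dense(128, activation="relu")(code)
outputs = layers.Dense(input_dim, activation="sigmoid")(h)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction loss
# Training uses the input as its own target, e.g.:
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)

Note that the input is also the target: the network is trained purely to reconstruct its own input through the bottleneck.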

Types of Autoencoders

• The first applications date to the 1980s. Initially used for dimensionality reduction and
feature learning, the autoencoder concept has evolved over the years and is now widely
used for learning generative models of data.

• Here are five popular autoencoders that we will discuss:

1. Undercomplete autoencoders

2. Sparse autoencoders

3. Contractive autoencoders

4. Denoising autoencoders

5. Variational Autoencoders (for generative modelling)

Undercomplete autoencoders
• An undercomplete autoencoder takes in an image and tries to predict the same image as output,
thus reconstructing the image from the compressed bottleneck region.

• Undercomplete autoencoders are truly unsupervised as they do not take any form of label,
the target being the same as the input.

• The primary use of such autoencoders is the generation of the latent space or
bottleneck, which forms a compressed substitute of the input data and can be easily
decompressed back with the help of the network when needed.

• This form of compression in the data can be modelled as a form of dimensionality
reduction.

Sparse autoencoders

• This type of autoencoder typically contains more hidden units than the input, but only a few
are allowed to be active at once. This property is called the sparsity of the network. The
sparsity of the network can be controlled by manually zeroing the required hidden
units, tuning the activation functions, or adding a loss term to the cost function (a
minimal sketch of the loss-term approach follows this subsection).

• Advantages
1. The sparsity constraint in sparse autoencoders helps in filtering out noise and irrelevant
features during the encoding process.

2. These autoencoders often learn important and meaningful features due to their emphasis
on sparse activations.

• Disadvantages

1. The choice of hyperparameters plays a significant role in the performance of this
autoencoder. Different inputs should result in the activation of different nodes of the
network.

2. The application of sparsity constraint increases computational complexity.
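As one way to realize the loss-term approach described above, the sketch below adds an L1 activity regularizer on the code layer in Keras, which pushes most code activations toward zero. The layer sizes and the penalty weight of 1e-5 are illustrative assumptions.

from tensorflow import keras
from tensorflow.keras import layers, regularizers

inputs = keras.Input(shape=(784,))
# L1 penalty on the activations encourages sparse codes
code = layers.Dense(
    128, activation="relu",
    activity_regularizer=regularizers.l1(1e-5),
)(inputs)
outputs = layers.Dense(784, activation="sigmoid")(code)

sparse_ae = keras.Model(inputs, outputs)
sparse_ae.compile(optimizer="adam", loss="mse")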

Contractive autoencoders

• A contractive autoencoder learns representations that are robust to small changes in the
input. It does so by adding a penalty term to the reconstruction loss: the Frobenius norm of
the Jacobian of the encoder activations with respect to the input.

Denoising autoencoders
• A denoising autoencoder works on a partially corrupted input and is trained to recover the
original, undistorted image. As mentioned above, this method is an effective way to
constrain the network from simply copying the input, so that it learns the underlying structure
and important features of the data (a minimal sketch follows this subsection).

• Advantages

1. This type of autoencoder can extract important features and reduce the noise or the useless
features.

2. Denoising autoencoders can be used as a form of data augmentation, the restored images
can be used as augmented data thus generating additional training samples.

• Disadvantages

1. Selecting the right type and level of noise to introduce can be challenging and may require
domain knowledge.

2. The denoising process can result in the loss of some information needed from the original
input. This loss can impact the accuracy of the output.
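A minimal sketch of the denoising setup, assuming inputs scaled to [0, 1]: Gaussian noise corrupts the inputs and the network is trained to reconstruct the clean originals. The noise level of 0.2 and the layer sizes are illustrative assumptions.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def corrupt(x, noise_std=0.2):
    # Add Gaussian noise and clip back into the valid [0, 1] range
    return np.clip(x + noise_std * np.random.randn(*x.shape), 0.0, 1.0)

inputs = keras.Input(shape=(784,))
code = layers.Dense(64, activation="relu")(inputs)
outputs = layers.Dense(784, activation="sigmoid")(code)
denoising_ae = keras.Model(inputs, outputs)
denoising_ae.compile(optimizer="adam", loss="binary_crossentropy")

# Training pairs are (noisy input -> clean target), e.g.:
# denoising_ae.fit(corrupt(x_train), x_train, epochs=10, batch_size=256)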
Variational Autoencoders (for generative modelling)

• A variational autoencoder encodes an input not as a single point but as a probability
distribution over the latent space: the encoder outputs the mean and variance of that
distribution. Training combines the reconstruction loss with a KL-divergence term, and
sampling from the latent distribution lets the decoder generate entirely new data.
Differences between a Neural Network and a Deep Learning System

Now that we have talked about Neural Networks and Deep Learning Systems, we can move
forward and see how they differ from each other!

• 1. Definition: A neural network is a model of neurons inspired by the human brain, made up of many neurons that are interconnected with each other. Deep learning neural networks are distinguished from neural networks on the basis of their depth, i.e. the number of hidden layers.
• 2. Architecture: Neural networks: feed-forward neural networks, recurrent neural networks, symmetrically connected neural networks. Deep learning systems: recursive neural networks, unsupervised pre-trained networks, convolutional neural networks.
• 3. Structure: Neural networks: neurons, connections and weights, propagation function, learning rate. Deep learning systems: motherboards, PSU, RAM, processors.
• 4. Time & Accuracy: Neural networks generally take less time to train but have lower accuracy than deep learning systems; deep learning systems generally take more time to train but have higher accuracy.
• 5. Performance: Neural networks give low performance compared to deep learning networks; deep learning networks give high performance compared to neural networks.
• 6. Task Interpretation: A neural network interprets your task poorly; a deep learning network perceives your task more effectively.
• 7. Applications: The ability to model non-linear processes makes neural networks excellent tools for addressing a variety of issues, including classification, pattern recognition, prediction and analysis, clustering, decision making, machine learning, deep learning, and more. Deep learning models can be used in a variety of industries, including pattern recognition, speech recognition, natural language processing, computer games, self-driving cars, social network filtering, and more.
• 8. Critique: Neural network criticism has centered on training problems, theoretical problems, hardware problems, real-world counterexamples to criticisms, and hybrid techniques. Deep learning criticism has centered on theory, errors, cyberthreats, etc.
