Deep Learning
Deep Learning is a subset of Artificial Intelligence (AI) that helps machines learn from large datasets using multi-layered neural networks. It automatically finds patterns and makes predictions, eliminating the need for manual feature extraction. This Deep Learning tutorial covers the basics through advanced topics, making it suitable for beginners and experienced readers alike.
Neural networks are machine learning models that mimic the complex functions of the human
brain. These models consist of interconnected nodes or neurons that process data, learn patterns
and enable tasks such as pattern recognition and decision-making.
In this article, we will explore the fundamentals of neural networks, their architecture, how they
work and their applications in various fields. Understanding neural networks is essential for anyone
interested in the advancements of artificial intelligence.
3. Backpropagation
After forward propagation, the network evaluates its performance using a loss function which
measures the difference between the actual output and the predicted output. The goal of training
is to minimize this loss. This is where backpropagation comes into play:
1. Loss Calculation: The network calculates the loss which provides a measure of error in
the predictions. The loss function could vary; common choices are mean squared error
for regression tasks or cross-entropy loss for classification.
2. Gradient Calculation: The network computes the gradients of the loss function with
respect to each weight and bias in the network. This involves applying the chain rule of
calculus to find out how much each part of the output error can be attributed to each
weight and bias.
3. Weight Update: Once the gradients are calculated, the weights and biases are
updated using an optimization algorithm like stochastic gradient descent (SGD). The
weights are adjusted in the opposite direction of the gradient to minimize the loss. The
size of the step taken in each update is determined by the learning rate.
4. Iteration
This process of forward propagation, loss calculation, backpropagation and weight update is
repeated for many iterations over the dataset. Over time, this iterative process reduces the loss
and the network's predictions become more accurate.
Through these steps, neural networks can adapt their parameters to better approximate the
relationships in the data, thereby improving their performance on tasks such as classification,
regression or any other predictive modeling.
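As a rough illustration of this loop, here is a minimal NumPy sketch (with made-up data and a single linear neuron, not the article's network) showing forward propagation, loss calculation, gradient computation and a gradient-descent weight update:

import numpy as np

# One neuron trained with MSE loss and plain gradient descent (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])   # hypothetical "true" relationship

w = np.zeros(3)                      # weights
b = 0.0                              # bias
lr = 0.1                             # learning rate

for step in range(100):
    y_pred = X @ w + b                        # forward propagation
    loss = np.mean((y_pred - y) ** 2)         # loss calculation (MSE)
    grad_w = 2 * X.T @ (y_pred - y) / len(y)  # gradients via the chain rule
    grad_b = 2 * np.mean(y_pred - y)
    w -= lr * grad_w                          # step opposite to the gradient
    b -= lr * grad_b

print(round(loss, 6), w.round(2))             # loss shrinks, w approaches [1, -2, 0.5]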
To classify this email, we will create a feature vector based on the analysis of keywords such as "free", "win" and "offer".
The feature vector of the record can be presented as:
• "free": Present (1)
• "win": Absent (0)
• "offer": Present (1)
How Neurons Process Data in a Neural Network
In a neural network, input data is passed through multiple layers, including one or more hidden
layers. Each neuron in these hidden layers performs several operations, transforming the input into
a usable output.
1. Input Layer: The input layer contains 3 nodes, one indicating the presence of each keyword.
2. Hidden Layer: The input vector is passed through the hidden layer. Each neuron in the hidden
layer performs two primary operations: a weighted sum followed by an activation function.
Weights:
• Neuron H1: [0.5,−0.2,0.3]
• Neuron H2: [0.4,0.1,−0.5]
Input Vector: [1,0,1]
Weighted Sum Calculation
• For H1: (1×0.5)+(0×−0.2)+(1×0.3)=0.5+0+0.3=0.8
• For H2: (1×0.4)+(0×0.1)+(1×−0.5)=0.4+0−0.5=−0.1
Activation Function
Here we will use the ReLU activation function:
• H1 Output: ReLU(0.8) = 0.8
• H2 Output: ReLU(−0.1) = 0
3. Output Layer
The activated values from the hidden neurons are sent to the output neuron where they are again
processed using a weighted sum and an activation function.
• Output Weights: [0.7, 0.2]
• Input from Hidden Layer: [0.8, 0]
• Weighted Sum: (0.8×0.7)+(0×0.2)=0.56+0=0.56
• Activation (Sigmoid): σ(0.56) = 1 / (1 + e^(−0.56)) ≈ 0.636
4. Final Classification
• The output value of approximately 0.636 indicates the probability of the email being
spam.
• Since this value is greater than 0.5, the neural network classifies the email as spam (1).
Neural Network for Email Classification Example
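To make the arithmetic above concrete, here is a small NumPy sketch that reproduces the same forward pass; the weights are the illustrative values from the example, not learned parameters:

import numpy as np

# Reproducing the hand-worked email example with NumPy.
x = np.array([1, 0, 1])                  # "free" present, "win" absent, "offer" present

W_hidden = np.array([[0.5, -0.2, 0.3],   # neuron H1
                     [0.4,  0.1, -0.5]]) # neuron H2
w_out = np.array([0.7, 0.2])             # output weights

h = np.maximum(0, W_hidden @ x)          # ReLU -> [0.8, 0.0]
z = w_out @ h                            # weighted sum -> 0.56
p = 1 / (1 + np.exp(-z))                 # sigmoid -> ~0.636

print(h, z, round(p, 3), int(p > 0.5))   # classified as spam (1)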
Learning of a Neural Network
1. Learning with Supervised Learning
In supervised learning, a neural network learns from labeled input-output pairs provided by a
teacher. The network generates outputs based on inputs and by comparing these outputs to the
known desired outputs, an error signal is created. The network iteratively adjusts its parameters to
minimize errors until it reaches an acceptable performance level.
2. Learning with Unsupervised Learning
Unsupervised learning involves data without labeled output variables. The primary goal is to
understand the underlying structure of the input data (X). Unlike supervised learning, there is no
instructor to guide the process. Instead, the focus is on modeling data patterns and relationships,
with techniques like clustering and association commonly used.
3. Learning with Reinforcement Learning
Reinforcement learning enables a neural network to learn through interaction with its environment.
The network receives feedback in the form of rewards or penalties, guiding it to find an optimal
policy or strategy that maximizes cumulative rewards over time. This approach is widely used in
applications like gaming and decision-making.
Types of Neural Networks
There are several commonly used types of neural networks.
• Feedforward Networks: It is a simple artificial neural network architecture in which
data moves from input to output in a single direction.
• Single-layer Perceptron: It has a single layer that applies weights, sums the inputs and uses an activation function to produce the output.
• Multilayer Perceptron (MLP): It is a type of feedforward neural network with three or
more layers, including an input layer, one or more hidden layers and an output layer. It
uses nonlinear activation functions.
• Convolutional Neural Network (CNN): It is designed for image processing. It uses
convolutional layers to automatically learn features from input images, enabling
effective image recognition and classification.
• Recurrent Neural Network (RNN): Handles sequential data using feedback loops to
retain context over time.
• Long Short-Term Memory (LSTM): A type of RNN with memory cells and gates to
handle long-term dependencies and avoid vanishing gradients.
Implementation of Neural Network using TensorFlow
Here, we implement a simple feedforward neural network that trains on a sample dataset and makes predictions using the following steps:
Step 1: Import Necessary Libraries
Import necessary libraries, primarily TensorFlow and Keras, along with other required packages
such as NumPy and Pandas for data handling.
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
Step 2: Create and Load Dataset
• Create or load a dataset. Convert the data into a format suitable for training (usually
NumPy arrays).
• Define features (X) and labels (y).
data = {
'feature1': [0.1, 0.2, 0.3, 0.4, 0.5],
'feature2': [0.5, 0.4, 0.3, 0.2, 0.1],
'label': [0, 0, 1, 1, 1]
}
df = pd.DataFrame(data)
X = df[['feature1', 'feature2']].values
y = df['label'].values
Step 3: Create a Neural Network
Instantiate a Sequential model and add layers. The input layer and hidden layers are typically
created using Dense layers, specifying the number of neurons and activation functions.
model = Sequential()
model.add(Dense(8, input_dim=2, activation='relu')) # Hidden layer
model.add(Dense(1, activation='sigmoid')) # Output layer
Step 4: Compiling the Model
Compile the model by specifying the loss function, optimizer and metrics to evaluate during
training. Here we will use binary cross-entropy loss and the Adam optimizer.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Step 5: Train the Model
Fit the model on the training data, specifying the number of epochs and batch size. This step trains
the neural network to learn from the input data.
model.fit(X, y, epochs=100, batch_size=1, verbose=1)
Step 6: Make Predictions
Use the trained model to make predictions on new data. Process the output to interpret the
predictions like converting probabilities to binary outcomes.
test_data = np.array([[0.2, 0.4]])
prediction = model.predict(test_data)
predicted_label = (prediction > 0.5).astype(int)  # convert probability to a 0/1 label
print("Predicted label:", predicted_label[0][0])
Output:
Predicted label: 1
Advantages of Neural Networks
Neural networks are widely used in many different applications because of their many benefits:
• Adaptability: Neural networks are useful for activities where the link between inputs
and outputs is complex or not well defined because they can adapt to new situations
and learn from data.
• Pattern Recognition: Their proficiency in pattern recognition makes them effective in tasks such as audio and image identification, natural language processing and other intricate data patterns.
• Parallel Processing: Because neural networks are capable of parallel processing by
nature, they can process numerous jobs at once which speeds up and improves the
efficiency of computations.
• Non-Linearity: Neural networks are able to model and comprehend complicated
relationships in data by virtue of the non-linear activation functions found in neurons
which overcome the drawbacks of linear models.
Disadvantages of Neural Networks
Neural networks while powerful, are not without drawbacks and difficulties:
• Computational Intensity: Training large neural networks can be a slow and computationally demanding process that requires substantial computing power.
• Black box Nature: As "black box" models, neural networks pose a problem in
important applications since it is difficult to understand how they make decisions.
• Overfitting: Overfitting is a phenomenon in which neural networks commit training
material to memory rather than identifying patterns in the data. Although regularization
approaches help to alleviate this, the problem still exists.
• Need for Large datasets: For efficient training, neural networks frequently need
sizable, labeled datasets; otherwise, their performance may suffer from incomplete or
skewed data.
Applications of Neural Networks
Neural networks have numerous applications across various fields:
1. Image and Video Recognition: CNNs are extensively used in applications such as
facial recognition, autonomous driving and medical image analysis.
2. Natural Language Processing (NLP): RNNs and transformers power language
translation, chatbots and sentiment analysis.
3. Finance: Predicting stock prices, fraud detection and risk management.
4. Healthcare: Neural networks assist in diagnosing diseases, analyzing medical images
and personalizing treatment plans.
5. Gaming and Autonomous Systems: Neural networks enable real-time decision-
making, enhancing user experience in video games and enabling autonomous systems
like self-driving cars.
Biological Neural Networks (BNNs) vs Artificial Neural Networks (ANNs)
Neurons
• BNNs: Composed of biological structures like dendrites and axons, with complex
behavior and signal processing abilities.
• ANNs: Use simplified models of neurons with a single output, focusing on numerical
signal transformations through activation functions.
Learning
• BNNs: Adapt based on learning, experience and environmental factors.
• ANNs: Use fixed mathematical weights that are adjusted during training but remain
static during testing.
Neural Pathways
• BNNs: Feature a highly complex web of adaptable pathways influenced by learning
and memory.
• ANNs: Have predefined pathways determined by network architecture and model
design.
The Single Layer Perceptron (SLP) is inspired by biological neurons and their ability to process information. To understand the SLP, we first need to break down the workings of a single artificial neuron
which is the fundamental building block of neural networks. An artificial neuron is a simplified
computational model that mimics the behavior of a biological neuron. It takes inputs, processes
them and produces an output. Here's how it works step by step:
• Receive signal from outside.
• Process the signal and decide whether we need to send information or not.
• Communicate the signal to the target cell, which can be another neuron or gland.
Structure of a biological neuron
Artificial neural networks function in a similar manner. A single artificial neuron can be written as a one-layer Keras model:
import tensorflow as tf

# A single neuron with 2 inputs and a sigmoid activation
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation='sigmoid', input_shape=(2,))
])
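As a quick usage sketch (assuming binary labels and standard Keras defaults), the model can be compiled and inspected like this:

# Minimal usage sketch for the single-layer model above (assumed binary labels).
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()   # one Dense layer: 2 weights + 1 bias = 3 trainable parameters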
Multi-Layer Perceptron (MLP) consists of fully connected dense layers that transform input data
from one dimension to another. It is called multi-layer because it contains an input layer, one or
more hidden layers and an output layer. The purpose of an MLP is to model complex relationships
between inputs and outputs.
Components of Multi-Layer Perceptron (MLP)
• Input Layer: Each neuron or node in this layer corresponds to an input feature. For
instance, if you have three input features the input layer will have three neurons.
• Hidden Layers: MLP can have any number of hidden layers with each layer
containing any number of nodes. These layers process the information received from
the input layer.
• Output Layer: The output layer generates the final prediction or result. If there are
multiple outputs, the output layer will have a corresponding number of neurons.
Every connection in the diagram is a representation of the fully connected nature of an MLP. This
means that every node in one layer connects to every node in the next layer. As the data moves
through the network each layer transforms it until the final output is generated in the output
layer.
Working of Multi-Layer Perceptron
Let's see the working of the multi-layer perceptron through its key mechanisms: forward propagation, the loss function, backpropagation and optimization.
1. Forward Propagation
In forward propagation the data flows from the input layer to the output layer, passing through
any hidden layers. Each neuron in the hidden layers processes the input as follows:
1. Weighted Sum: The neuron computes the weighted sum of the inputs:
$z = \sum_{i} w_i x_i + b$
Where:
• $x_i$ is the input feature.
• $w_i$ is the corresponding weight.
• $b$ is the bias term.
2. Activation Function: The weighted sum z is passed through an activation function to introduce
non-linearity. Common activation functions include:
• Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
• ReLU (Rectified Linear Unit): $f(z) = \max(0, z)$
• Tanh (Hyperbolic Tangent): $\tanh(z) = \frac{2}{1 + e^{-2z}} - 1$
2. Loss Function
Once the network generates an output the next step is to calculate the loss using a loss function.
In supervised learning this compares the predicted output to the actual label.
For a classification problem the commonly used binary cross-entropy loss function is:
$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$
Where:
• $y_i$ is the actual label.
• $\hat{y}_i$ is the predicted label.
• $N$ is the number of samples.
For regression problems the mean squared error (MSE) is often used:
$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$
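As a small illustration, both losses can be computed directly with NumPy on made-up labels and predicted probabilities:

import numpy as np

# Binary cross-entropy and MSE on illustrative values.
y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.6])   # predicted probabilities

bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
mse = np.mean((y_true - y_prob) ** 2)
print(round(bce, 4), round(mse, 4))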
3. Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the network's weights
and biases. This is achieved through backpropagation:
1. Gradient Calculation: The gradients of the loss function with respect to each weight
and bias are calculated using the chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by layer.
3. Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss: $w = w - \eta \cdot \frac{\partial L}{\partial w}$
Where:
• $w$ is the weight.
• $\eta$ is the learning rate.
• $\frac{\partial L}{\partial w}$ is the gradient of the loss function with respect to the weight.
4. Optimization
MLPs rely on optimization algorithms to iteratively refine the weights and biases during training.
Popular optimization methods include:
• Stochastic Gradient Descent (SGD): Updates the weights based on a single sample
or a small batch of data: $w = w - \eta \cdot \frac{\partial L}{\partial w}$
• Adam Optimizer: An extension of SGD that incorporates momentum and adaptive learning rates for more efficient training:
o $m_t = \beta_1 m_{t-1} + (1 - \beta_1) \cdot g_t$
o $v_t = \beta_2 v_{t-1} + (1 - \beta_2) \cdot g_t^2$
• Here $g_t$ represents the gradient at time $t$ and $\beta_1, \beta_2$ are decay rates.
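The following NumPy sketch shows a single Adam update for one weight vector, assuming the standard default hyperparameters and an illustrative gradient (bias correction included, as in the usual formulation):

import numpy as np

# One Adam step for a small parameter vector (illustrative values only).
beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-8

w = np.array([0.5, -0.3])        # current weights
g = np.array([0.2, -0.1])        # gradient of the loss w.r.t. w at this step
m = np.zeros_like(w)             # first-moment estimate
v = np.zeros_like(w)             # second-moment estimate
t = 1                            # time step

m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g**2
m_hat = m / (1 - beta1**t)       # bias-corrected first moment
v_hat = v / (1 - beta2**t)       # bias-corrected second moment
w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
print(w)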
Now that we are done with the theory part of the multi-layer perceptron, let's go ahead and implement the code in Python using the TensorFlow library.
Implementing Multi Layer Perceptron
In this section, we will walk through building a neural network using TensorFlow.
1. Importing Modules and Loading Dataset
First we import necessary libraries such as TensorFlow, NumPy and Matplotlib for visualizing
the data. We also load the MNIST dataset.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
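The line that actually loads the MNIST data is not shown above; a minimal sketch of this step, assuming the standard Keras MNIST loader and simple 0-1 pixel scaling, would be:

# Load MNIST and scale pixel values to [0, 1] (assumed preprocessing).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0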
3. Visualizing Data
To understand the data better we plot the first 100 training samples each representing a digit.
fig, ax = plt.subplots(10, 10)
k = 0
for i in range(10):
    for j in range(10):
        ax[i][j].imshow(x_train[k].reshape(28, 28), aspect='auto')
        k += 1
plt.show()
Output:
4. Building the Neural Network Model
Here we build a Sequential neural network model. The model consists of:
• Flatten Layer: Reshapes 2D input (28x28 pixels) into a 1D array of 784 elements.
• Dense Layers: Fully connected layers with 256 and 128 neurons, both using the sigmoid activation function.
• Output Layer: The final layer with 10 neurons representing the 10 classes of digits (0-9), with softmax activation.
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(256, activation='sigmoid'),
    Dense(128, activation='sigmoid'),
    Dense(10, activation='softmax'),
])
5. Compiling the Model
Once the model is defined we compile it by specifying:
• Optimizer: Adam for efficient weight updates.
• Loss Function: Sparse categorical cross entropy, which is suitable for multi-class
classification.
• Metrics: Accuracy to evaluate model performance.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
6. Training the Model
We train the model on the training data using 10 epochs and a batch size of 2000. We also use
20% of the training data for validation to monitor the model’s performance on unseen data during
training.
mod = model.fit(x_train, y_train, epochs=10,
                batch_size=2000,
                validation_split=0.2)
print(mod)
Output:
7. Visualizing Training Results
We plot the training and validation accuracy and loss recorded in the training history.
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(mod.history['accuracy'], label='Training Accuracy', color='blue')
plt.plot(mod.history['val_accuracy'], label='Validation Accuracy', color='orange')
plt.title('Training and Validation Accuracy', fontsize=14)
plt.xlabel('Epochs', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.legend()
plt.grid(True)
plt.subplot(1, 2, 2)
plt.plot(mod.history['loss'], label='Training Loss', color='blue')
plt.plot(mod.history['val_loss'], label='Validation Loss', color='orange')
plt.title('Training and Validation Loss', fontsize=14)
plt.xlabel('Epochs', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Artificial Neural Networks (ANNs) are computer systems designed to mimic how the human brain
processes information. Just like the brain uses neurons to process data and make decisions, ANNs
use artificial neurons to analyze data, identify patterns and make predictions. These networks
consist of layers of interconnected neurons that work together to solve complex problems. The
key idea is that ANNs can "learn" from the data they process, just as our brain learns from
experience. They are used in various applications from recognizing images to making
personalized recommendations. In this article, we will see more about ANNs, how they function
and other core concepts.
Neural networks are computational models that mimic the way biological neural networks in the
human brain process information. They consist of layers of neurons that transform the input data
into meaningful outputs through a series of mathematical operations.
In this article, we are going to explore different types of neural networks.
1. Feedforward Neural Networks
Feedforward neural networks are a form of artificial neural network in which data flows in one direction, from the input nodes through the hidden layers to the output nodes, without forming any cycles between layers or nodes.
• Architecture: Made up of layers with unidirectional flow of data i.e from input
through hidden and the output layer.
• Training: Backpropagation is often used during training for the main aim of reducing
the prediction errors.
• Applications: Visual and voice recognition, NLP, financial forecasting and recommender systems.
• When to use: Best for general-purpose tasks like classification and regression. Ideal
when data is static and has no sequential dependencies.
A. Single-layer Feed Forward Network:
It is the simplest and most basic architecture of ANNs. It consists of only two layers: the input layer and the output layer. The input layer consists of 'm' input neurons connected to each of the 'n' output neurons, and the connections carry weights w11 and so on. The neurons of the input layer don't perform any processing; they simply pass the input signals to the output neurons. The computations are performed in the output layer. So, although the network has two layers of neurons, only one layer performs computation. This is why the network is known as a SINGLE-layer network. Also, the signals always flow from the input layer to the output layer, hence the network is known as FEED-FORWARD.
The net signal input to the output neurons is given by:
$y_{in\_k} = x_1 w_{1k} + x_2 w_{2k} + \dots + x_m w_{mk} = \sum_{i=1}^{m} x_i w_{ik}$
The signal output from each output neuron will depend on the activation function used.
B. Multi-layer Feed Forward Network:
Recurrent Network
In feed-forward networks, the signal always flows from the input layer towards the output layer (in one
direction only). In the case of recurrent neural networks, there is a feedback loop (from the neurons in the
output layer to the input layer neurons). There can be self-loops too.
Learning Process In ANN:
Learning process in ANN mainly depends on four factors, they are:
1. The number of layers in the network (Single-layered or multi-layered)
2. Direction of signal flow (Feedforward or recurrent)
3. Number of nodes in layers: The number of nodes in the input layer is equal to the number of features of the input data set. The number of output nodes depends on the number of possible outcomes, i.e. the number of classes in the case of supervised learning. The number of hidden layers and the number of nodes in each hidden layer are to be chosen by the user. A larger number of nodes in the hidden layers generally gives higher performance, but too many nodes may result in overfitting as well as increased computational expense.
4. Weight of Interconnected Nodes: Deciding the value of the weight attached to each interconnection between neurons so that a specific learning problem is solved correctly is quite a difficult problem by itself. Take an example to understand the problem: a multi-layered feed-forward network that we have to train using some data so that it can classify a new data point, say p_5(3,-2). Say we have deduced that p_1=(5,2) and p_2=(-1,12) belong to class C1 while p_3=(3,-5) and p_4=(-2,-1) belong to class C2. We assume the values of the synaptic weights w_0, w_1, w_2 to be -2, 1/2 and 1/4 respectively. But we will NOT get these weight values for every learning problem. For solving a learning problem with an ANN, we can start with a set of values for the synaptic weights and keep changing them over multiple iterations. The stopping criterion may be a misclassification rate below 1% or a maximum number of iterations, say 25 (a threshold value). There may be another problem: the misclassification rate may not reduce progressively.
So, we can summarize the learning process in ANN as the combination of deciding the number of hidden layers, the number of nodes in each hidden layer, the direction of signal flow and the connection weights.
Multi-layer feed-forward networks are a commonly used architecture. It has been observed that a neural network with even one hidden layer can reasonably approximate any continuous function. The learning methodology adopted to train a multi-layer feed-forward network is backpropagation.
Backpropagation:
In the above section, we saw that one of the most critical activities in training an ANN is assigning the inter-neuron connection weights. In 1986, an efficient way of training an ANN was introduced. In this method, the difference between the output values of the output layer and the expected values is propagated back from the output layer to the preceding layers. Hence, the algorithm implementing this method is known as BACKPROPAGATION, i.e. propagating the errors back to the preceding layers.
The backpropagation algorithm is applicable to multi-layer feed-forward networks. It is a supervised learning algorithm which keeps adjusting the weights of the connected neurons with the objective of reducing the deviation of the output signal from the target output. This algorithm consists of multiple iterations, known as
epochs. Each epoch consists of two phases:
• Forward Phase: Signals flow from the neurons in the input layer to the neurons in the output layer through the hidden layers. The weights of the interconnections and the activation functions are used during the flow. In the output layer, the output signals are generated.
• Backward Phase: The output signal is compared with the expected value. The computed errors are propagated backwards from the output to the preceding layers. The propagated errors are used to adjust the interconnection weights between the layers.
BACKPROPAGATION
The above diagram depicts a reasonably simplified version of the backpropagation algorithm. One main part of the algorithm is adjusting the interconnection weights, which is done using a technique termed Gradient Descent. In simple words, the algorithm calculates the partial derivative of the cost function with respect to each interconnection weight to identify the 'gradient', or extent of change of the weight, required to minimize the cost function.
In order to understand the backpropagation algorithm in detail, let us consider a multi-layer feed-forward network.
The net signal input to the hidden layer neurons is given by:
$y_{in\_k} = x_0 w_{0k} + x_1 w_{1k} + \dots + x_m w_{mk} = w_{0k} + \sum_{i=1}^{m} x_i w_{ik}$
If $f_y$ is the activation function of the hidden layer, then $y_{out\_k} = f_y(y_{in\_k})$.
The net signal input to the output layer neurons is given by:
$z_{in\_k} = y_0 w'_{0k} + y_{out\_1} w'_{1k} + \dots + y_{out\_n} w'_{nk} = w'_{0k} + \sum_{i=1}^{n} y_{out\_i} w'_{ik}$
BACKPROPAGATION NET
Note that the signals $x_0$ and $y_0$ are assumed to be 1. If $f_z$ is the activation function of the output layer, then $z_{out\_k} = f_z(z_{in\_k})$.
If $t_k$ is the target of the k-th output neuron, then the cost function defined as the squared error of the output layer is given by:
$E = \frac{1}{2}\sum_{k=1}^{n}\left(t_k - z_{out\_k}\right)^2$
$E = \frac{1}{2}\sum_{k=1}^{n}\left(t_k - f_z(z_{in\_k})\right)^2$
According to the gradient descent algorithm, the partial derivative of the cost function E has to be taken with respect to the interconnection weights. Mathematically it can be represented as:
$\frac{\partial E}{\partial w'_{jk}} = \frac{\partial}{\partial w'_{jk}}\left\{\frac{1}{2}\sum_{k=1}^{n}\left(t_k - f_z(z_{in\_k})\right)^2\right\}$
{The above expression is for the interconnection weight between the j-th neuron in the hidden layer and the k-th neuron in the output layer.} This expression can be reduced to
$\frac{\partial E}{\partial w'_{jk}} = -(t_k - z_{out\_k}) \cdot f_z'(z_{in\_k}) \cdot \frac{\partial}{\partial w'_{jk}}\left\{\sum_{i=0}^{n} y_{out\_i} \cdot w'_{ik}\right\}$
where $f_z'(z_{in\_k}) = \frac{\partial}{\partial z_{in\_k}} f_z(z_{in\_k})$, or
$\frac{\partial E}{\partial w'_{jk}} = -(t_k - z_{out\_k}) \cdot f_z'(z_{in\_k}) \cdot y_{out\_j}$
If we take $\delta w'_k = -(t_k - z_{out\_k}) \cdot f_z'(z_{in\_k})$ as a component of the weight adjustment needed for the weight $w'_{jk}$ corresponding to the k-th output neuron, then:
$\frac{\partial E}{\partial w'_{jk}} = \delta w'_k \cdot y_{out\_j}$
On the basis of this, the weights and bias need to be updated as follows:
• For weights: $\Delta w'_{jk} = -\alpha \cdot \frac{\partial E}{\partial w'_{jk}} = -\alpha \cdot \delta w'_k \cdot y_{out\_j}$
• Hence, $w'_{jk}(\text{new}) = w'_{jk}(\text{old}) + \Delta w'_{jk}$
• For bias: $\Delta w'_{0k} = -\alpha \cdot \delta w'_k$
• Hence, $w'_{0k}(\text{new}) = w'_{0k}(\text{old}) + \Delta w'_{0k}$
In the above expressions, $\alpha$ is the learning rate of the neural network. The learning rate is a user parameter which decreases or increases the speed with which the interconnection weights of a neural network are adjusted. If the learning rate is too high, the adjustment done as part of the gradient descent process may diverge rather than converge. On the other hand, if the learning rate is too low, the optimization may consume more time because of the small steps towards the minima.
{All the above calculations are for the interconnection weight between neurons in the hidden layer and neurons
in the output layer}
Like the above expressions, we can deduce the expressions for the interconnection weights between the input and hidden layers:
• For weights: $\Delta w_{ij} = -\alpha \cdot \frac{\partial E}{\partial w_{ij}} = -\alpha \cdot \delta w_j \cdot x_{out\_i}$
• Hence, $w_{ij}(\text{new}) = w_{ij}(\text{old}) + \Delta w_{ij}$
• For bias: $\Delta w_{0j} = -\alpha \cdot \delta w_j$
• Hence, $w_{0j}(\text{new}) = w_{0j}(\text{old}) + \Delta w_{0j}$
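To tie these update rules together, here is a compact NumPy sketch of one backpropagation step for a small 1-hidden-layer network; it assumes sigmoid activations for both $f_y$ and $f_z$ and uses illustrative random weights:

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# One backpropagation step on a single example, following the update rules above.
rng = np.random.default_rng(1)
x = np.array([0.5, -1.0])                       # inputs
t = np.array([1.0])                             # target
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden -> output weights and biases
alpha = 0.5                                     # learning rate

# Forward phase
y_in = W1 @ x + b1;  y_out = sigmoid(y_in)
z_in = W2 @ y_out + b2;  z_out = sigmoid(z_in)

# Backward phase: delta terms from the derivation above (sigmoid derivative = s * (1 - s))
delta_out = -(t - z_out) * z_out * (1 - z_out)          # delta w'_k
delta_hid = (W2.T @ delta_out) * y_out * (1 - y_out)    # delta w_j

# Weight and bias updates: w(new) = w(old) - alpha * delta * (input to that layer)
W2 -= alpha * np.outer(delta_out, y_out);  b2 -= alpha * delta_out
W1 -= alpha * np.outer(delta_hid, x);      b1 -= alpha * delta_hid
print(float(0.5 * np.sum((t - z_out) ** 2)))             # squared-error cost E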
1. Weights
Weights are numerical values assigned to the connections between neurons. They determine how much influence each input has on the network's final output.
• Purpose: During forward propagation, inputs are multiplied by their respective
weights before being passed through an activation function. This helps decide how
strongly an input will affect the output.
• Learning Mechanism: During training, weights are updated iteratively
through optimization algorithms like gradient descent to minimize the difference
between predicted and actual outcomes.
• Generalization: Well-tuned weights help the network not only make accurate
predictions on training data but also generalize to new, unseen data.
• Example: In a neural network predicting house prices, the weight for the "size of the house" determines how much the house size influences the price prediction. The larger the weight, the bigger the impact the size will have on the final result.
2. Biases
Biases are additional parameters that adjust the output of a neuron. Unlike weights, they are not
tied to any specific input but instead shift the activation function to better fit the data.
• Purpose: Biases help neurons activate even when the weighted sum of inputs is not
enough. This allows the network to recognize patterns that don't necessarily pass
through the origin.
• Functionality: Without biases, neurons would only activate when the input reaches a
specific threshold. It makes the network more flexible by enabling activation across a
wider range of conditions.
• Training: During training, biases are updated alongside weights through
backpropagation. Together, they fine-tune the model, improving prediction accuracy.
• Example: In a house price prediction network, the bias might ensure that even for a
house with a size of zero, the model predicts a non-zero price. This could reflect a fixed
value such as land value or other baseline costs.
How Neural Networks Learn?
Neural networks learn through a process involving forward propagation and backpropagation.
Let’s see each step:
1. Forward Propagation
Forward propagation is the initial phase of processing input data through the neural network to
produce an output or prediction. Let's see how it works:
1. Input Layer: The process starts with data entering the network’s input layer. This
could be anything from pixel values in an image to feature values in a dataset.
2. Weighted Sum: Each neuron calculates the weighted sum of the inputs. Each input is
multiplied by its corresponding weight which shows the importance of that input.
3. Adding Biases: A bias is added to the weighted sum. Bias helps shift the output and
provides flexibility, allowing the network to make better predictions even if all input
values are zero.
4. Activation Function: The sum of the weighted inputs plus bias is passed through an
activation function (e.g ReLU, sigmoid). The activation function decides if the neuron
should activate which means it will pass information to the next layer or stay inactive.
5. Propagation: This process is repeated across multiple layers. The output of one layer
becomes the input for the next, continuing until the network generates the final output
or prediction.
2. Backpropagation
Once the network has made a prediction, it's important to evaluate how accurate that prediction
is and make adjustments to improve future predictions. This is where backpropagation comes in:
1. Error Calculation: Once the network generates an output, it’s compared to the actual
result (the target). The difference between the predicted and actual values is the error
also called the loss.
2. Gradient Calculation: The error is propagated back through the network and the
gradient or slope of the error with respect to the weights and biases is calculated. This
tells the network how to adjust the parameters to minimize the error.
3. Updating Weights and Biases: Using the gradient, the network adjusts the weights
and biases. The goal is to reduce the error in future predictions. This step is done
through an optimization algorithm like gradient descent.
4. Iteration: This process of forward and backward propagation is repeated many times
on different batches of data. With each iteration, the network’s weights and biases get
closer to the optimal values, improving the model’s performance.
Real-World Applications of Neural Networks
Neural networks are increasingly used in various fields to solve complex problems. Let's see
various examples of how weights and biases play an important role in below applications:
1. Image Recognition
Neural networks are efficient at tasks like object and image classification. For example, in
detecting objects like cats, dogs or even specific facial features:
• Weights: These find which pixels are important. For example, in a picture of a cat, the
weights might give more importance to features like ears, whiskers and eyes, helping
the network correctly identify the object.
• Biases: They ensure the network remains adaptable despite changes in image
conditions. For example, slight shifts in lighting, position or orientation won’t stop the
network from recognizing the object.
By adjusting weights and biases, the network learns to recognize patterns in data and improve
its accuracy in classifying new, unseen images.
2. Natural Language Processing (NLP)
In tasks such as sentiment analysis, language translation and chatbots, neural networks analyze
and generate text. For example, understanding customer reviews or translating languages:
• Weights: These decide how important specific words or phrases are in a given
context. For example, recognizing the sentiment of the word “happy” in a review versus
“sad” helps the network understand the sentiment of the sentence.
• Biases: They help the network to adapt to different sentence structures and tones.
This helps the model recognize meaning even when the sentence might be phrased
differently.
Training the network on large datasets allows it to interpret language effectively, whether it's
classifying emotions in reviews or translating text between languages.
3. Autonomous Vehicles
In self-driving cars, neural networks process a range of sensor data (camera, radar, lidar) to make
driving decisions such as stopping at a red light or avoiding obstacles:
• Weights: The weights help the network focus on important input data such as
recognizing pedestrians, road signs and other vehicles, adjusting their significance
based on the car’s current needs.
• Biases: Biases ensure that the car can adapt to different driving conditions like fog or
night-time driving, ensuring safety and accuracy under varied circumstances.
By continuously adjusting the weights and biases, the system learns how to safely navigate
complex environments and make real-time decisions.
4. Healthcare and Medical Diagnosis
Neural networks are also applied in healthcare such as in diagnosing diseases from medical
images like X-rays, MRIs or CT scans:
• Weights: These help the network focus on important features in medical images such
as specific areas indicating a tumor or anomaly. This helps the network make more
accurate predictions regarding health conditions.
• Biases: Biases allow the network to remain flexible and adaptable to variations in
imaging techniques or the patient’s body type, making the system more reliable across
different scenarios.
By training on thousands of medical images, the neural network learns to identify patterns and
make precise diagnoses, aiding medical professionals in early disease detection.
Advantages of Weights and Biases
1. Learning from Data: Weights and biases help the network adjust to data patterns,
enabling it to make predictions based on input significance and flexibility.
2. Flexibility in Complex Data: Biases allow the network to adjust outputs even when
inputs are minimal, improving flexibility in tasks like image recognition or language
processing.
3. Improved Accuracy: Through iterative updates, weights and biases help reduce
prediction errors, leading to more accurate results over time.
4. Better Generalization: Properly tuned weights and biases help the network apply
learned patterns to new, unseen data, ensuring it performs well outside the training
set.
5. Enhanced Learning Capacity: They allow the network to capture complex patterns,
improving its ability to handle complex tasks that traditional algorithms struggle with.
While building a neural network, one key decision is selecting the Activation Function for both the hidden
layer and the output layer. It is a mathematical function applied to the output of a neuron. It introduces non-
linearity into the model, allowing the network to learn and represent complex patterns in the data. Without
this non-linearity feature a neural network would behave like a linear regression model no matter how many
layers it has.
Activation function decides whether a neuron should be activated by calculating the weighted sum of inputs
and adding a bias term. This helps the model make complex decisions and predictions by introducing non-
linearities to the output of each neuron.
Before diving into the activation function, you should have prior knowledge of the following topics: Neural
Networks, Backpropagation
Introducing Non-Linearity in Neural Network
Non-linearity means that the relationship between input and output is not a straight line. In simple terms the
output does not change proportionally with the input. A common choice is the ReLU function, defined as $\sigma(x) = \max(0, x)$.
Imagine you want to classify apples and bananas based on their shape and color.
• If we use a linear function it can only separate them using a straight line.
• But real-world data is often more complex like overlapping colors, different lighting, etc.
• By adding a non-linear activation function like ReLU, Sigmoid or Tanh the network can
create curved decision boundaries to separate them correctly.
Effect of Non-Linearity
The inclusion of the ReLU activation function $\sigma$ allows $h_1$ to introduce a non-linear decision boundary in the input space. This non-linearity enables the network to learn more complex patterns that are not possible with a purely linear model, such as:
• Modeling functions that are not linearly separable.
• Increasing the capacity of the network to form multiple decision boundaries based on the
combination of weights and biases.
Why is Non-Linearity Important in Neural Networks?
Neural networks consist of neurons that operate using weights, biases and activation functions.
In the learning process these weights and biases are updated based on the error produced at the output—a
process known as backpropagation. Activation functions enable backpropagation by providing gradients that
are essential for updating the weights and biases.
Without non-linearity even deep networks would be limited to solving only simple, linearly separable
problems. Activation functions help neural networks to model highly complex data distributions and solve
advanced deep learning tasks. Adding non-linear activation functions introduces flexibility and enables the network to learn more complex and abstract patterns from data.
Mathematical Proof of Need of Non-Linearity in Neural Networks
To illustrate the need for non-linearity in neural networks with a specific example, let's consider a network with two input nodes ($i_1$ and $i_2$), a single hidden layer containing neurons $h_1$ and $h_2$ and an output neuron (out).
We will use $w_1, w_2, w_3, w_4$ as the weights connecting the inputs to the hidden neurons and $w_5, w_6$ as the weights connecting the hidden neurons to the output. We'll also include biases ($b_1$ and $b_2$ for the hidden neurons and a bias for the output neuron) to complete the model.
1. Input Layer: Two inputs $i_1$ and $i_2$.
2. Hidden Layer: Two neurons $h_1$ and $h_2$.
3. Output Layer: One output neuron.
The input to the hidden neurons is calculated as a weighted sum of the inputs plus a bias:
$h_1 = i_1 \cdot w_1 + i_2 \cdot w_3 + b_1$
$h_2 = i_1 \cdot w_2 + i_2 \cdot w_4 + b_2$
The output neuron is then a weighted sum of the hidden neurons' outputs plus a bias:
$\text{output} = h_1 \cdot w_5 + h_2 \cdot w_6 + \text{bias}$
Here, $h_1$, $h_2$ and the output are all linear expressions.
In order to add non-linearity, we will be using the sigmoid activation function in the output layer:
$\sigma(x) = \frac{1}{1 + e^{-x}}$
$\text{final output} = \sigma(h_1 \cdot w_5 + h_2 \cdot w_6 + \text{bias})$
$\text{final output} = \frac{1}{1 + e^{-(h_1 \cdot w_5 + h_2 \cdot w_6 + \text{bias})}}$
This gives the final output of the network after applying the sigmoid activation function in the output layer, introducing the desired non-linearity.
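A small Python sketch of this tiny network (with made-up weights) shows the difference: without the sigmoid the output is a plain linear function of the inputs, while with it the output is squashed non-linearly into (0, 1):

import numpy as np

# The two-hidden-neuron network above with illustrative weights.
w1, w2, w3, w4, w5, w6 = 0.5, -0.3, 0.8, 0.1, 1.2, -0.7
b1, b2, bias = 0.1, -0.2, 0.05

def forward(i1, i2, nonlinear=True):
    h1 = i1 * w1 + i2 * w3 + b1
    h2 = i1 * w2 + i2 * w4 + b2
    out = h1 * w5 + h2 * w6 + bias
    return 1 / (1 + np.exp(-out)) if nonlinear else out

print(forward(1.0, 2.0, nonlinear=False))  # purely linear in (i1, i2)
print(forward(1.0, 2.0, nonlinear=True))   # squashed to (0, 1) by the sigmoid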
Types of Activation Functions in Deep Learning
1. Linear Activation Function
Linear Activation Function resembles a straight line defined by y = x. No matter how many layers the neural network contains, if they all use linear activation functions the output is a linear combination of the input.
• The range of the output spans from $(-\infty, +\infty)$.
• Linear activation function is used at just one place i.e. output layer.
• Using linear activation across all layers makes the network's ability to learn complex patterns
limited.
Linear activation functions are useful for specific tasks but must be combined with non-linear functions to
enhance the neural network’s learning and predictive capabilities.
Linear Activation Function or Identity Function returns the input as the output
2. Non-Linear Activation Functions
1. Sigmoid Function
Sigmoid Activation Function is characterized by an 'S' shape. It is mathematically defined as $A = \frac{1}{1 + e^{-x}}$. This formula ensures a smooth and continuous output that is essential for gradient-based optimization methods.
• It allows neural networks to handle and model complex patterns that linear equations cannot.
• The output ranges between 0 and 1, hence useful for binary classification.
• The function exhibits a steep gradient when x values are between -2 and 2. This sensitivity means
that small changes in input x can cause significant changes in output y which is critical during the
training process.
Sigmoid or Logistic Activation Function Graph
2. Tanh Activation Function
Tanh function (hyperbolic tangent function) is a shifted version of the sigmoid, allowing it to stretch across
the y-axis. It is defined as:
$f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$
Alternatively, it can be expressed using the sigmoid function:
$\tanh(x) = 2 \times \text{sigmoid}(2x) - 1$
• Value Range: Outputs values from -1 to +1.
• Non-linear: Enables modeling of complex data patterns.
• Use in Hidden Layers: Commonly used in hidden layers due to its zero-centered output,
facilitating easier learning for subsequent layers.
Tanh Activation Function
3. ReLU (Rectified Linear Unit) Function
ReLU activation is defined by $A(x) = \max(0, x)$; this means that if the input x is positive, ReLU returns x, and if the input is negative, it returns 0.
• Value Range: $[0, \infty)$, meaning the function only outputs non-negative values.
• Nature: It is a non-linear activation function, allowing neural networks to learn complex patterns
and making backpropagation more efficient.
• Advantage over other Activation: ReLU is less computationally expensive than tanh and
sigmoid because it involves simpler mathematical operations. At any given time only a few neurons are activated, making the network sparse and therefore efficient and easy to compute.
ReLU Activation Function
3. Other Non-Linear Activation Functions
1. Softmax Function
Softmax function is designed to handle multi-class classification problems. It transforms raw output scores
from a neural network into probabilities. It works by squashing the output values of each class into the range
of 0 to 1 while ensuring that the sum of all probabilities equals 1.
• Softmax is a non-linear activation function.
• The Softmax function ensures that each class is assigned a probability, helping to identify which
class the input belongs to.
2. SoftPlus Function
Softplus function is defined mathematically as: $A(x) = \log(1 + e^x)$.
This equation ensures that the output is always positive and differentiable at all points which is an advantage
over the traditional ReLU function.
• Nature: The Softplus function is non-linear.
• Range: The function outputs values in the range $(0, \infty)$, similar to ReLU, but without the hard zero threshold that ReLU has.
• Smoothness: Softplus is a smooth, continuous function, meaning it avoids the sharp
discontinuities of ReLU which can sometimes lead to problems during optimization.
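For reference, the activation functions discussed above can be written as plain NumPy functions (an illustrative sketch, not a library API):

import numpy as np

# The activation functions discussed above as plain NumPy functions.
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def softplus(x):
    return np.log1p(np.exp(x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), np.tanh(x), relu(x), softplus(x), softmax(x), sep="\n")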
A loss function is a mathematical way to measure how good or bad a model’s predictions are
compared to the actual results. It gives a single number that tells us how far off the predictions
are. The smaller the number, the better the model is doing. Loss functions are used to train
models. Loss functions are important because they:
1. Guide Model Training: During training, algorithms such as Gradient Descent use the
loss function to adjust the model's parameters and try to reduce the error and improve
the model’s predictions.
2. Measure Performance: By quantifying the difference between predicted and actual values, it can be used to evaluate the model's performance.
3. Affect learning behavior: Different loss functions can make the model learn in
different ways depending on what kind of mistakes they make.
There are many types of loss functions each suited for different tasks. Here are some common
methods:
1. Regression Loss Functions
These are used when your model needs to predict a continuous number such as predicting the
price of a product or age of a person. Popular regression loss functions are:
1. Mean Squared Error (MSE) Loss
Mean Squared Error (MSE) Loss is one of the most widely used loss functions for regression
tasks. It calculates the average of the squared differences between the predicted values and the
actual values. It is simple to understand and sensitive to outliers because the errors are squared
which can affect the loss.
$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
2. Mean Absolute Error (MAE) Loss
Mean Absolute Error (MAE) Loss is another commonly used loss function for regression. It
calculates the average of the absolute differences between the predicted values and the actual
values. It is less sensitive to outliers compared to MSE. But it is not differentiable at zero which
can cause issues for some optimization algorithms.
$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
3. Huber Loss
Huber Loss combines the advantages of MSE and MAE. It is less sensitive to outliers than MSE
and differentiable everywhere, unlike MAE. It requires tuning of the parameter $\delta$. Huber Loss is defined as:
$L_\delta = \begin{cases} \frac{1}{2}(y_i - \hat{y}_i)^2 & \text{for } |y_i - \hat{y}_i| \le \delta \\ \delta\,|y_i - \hat{y}_i| - \frac{1}{2}\delta^2 & \text{for } |y_i - \hat{y}_i| > \delta \end{cases}$
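A short NumPy sketch comparing the three regression losses on made-up values (with an assumed Huber threshold of 1.0):

import numpy as np

# MSE, MAE and Huber loss on illustrative predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 9.0])
delta = 1.0                                   # Huber threshold

err = y_true - y_pred
mse = np.mean(err ** 2)
mae = np.mean(np.abs(err))
huber = np.mean(np.where(np.abs(err) <= delta,
                         0.5 * err ** 2,
                         delta * np.abs(err) - 0.5 * delta ** 2))
print(mse, mae, huber)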
2. Classification Loss Functions
Classification loss functions are used to evaluate how well a classification model's predictions
match the actual class labels. There are different types of classification Loss functions:
1. Binary Cross-Entropy Loss (Log Loss)
Binary Cross-Entropy Loss is also known as Log Loss and is used for binary classification
problems. It measures the performance of a classification model whose output is a probability
value between 0 and 1.
$\text{Binary Cross-Entropy} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$
where:
• n is the number of data points
• $y_i$ is the actual binary label (0 or 1)
• $\hat{y}_i$ is the predicted probability.
2. Categorical Cross-Entropy Loss
Categorical Cross-Entropy Loss is used for multiclass classification problems. It measures the
performance of a classification model whose output is a probability distribution over multiple
classes.
$\text{Categorical Cross-Entropy} = -\sum_{i=1}^{n}\sum_{j=1}^{k} y_{ij} \log(\hat{y}_{ij})$
where:
• n is the number of data points
• k is the number of classes,
• $y_{ij}$ is the binary indicator (0 or 1) of whether class label j is the correct classification for data point i
• $\hat{y}_{ij}$ is the predicted probability for class j.
3. Sparse Categorical Cross-Entropy Loss
Sparse Categorical Cross-Entropy Loss is similar to Categorical Cross-Entropy Loss but is used
when the target labels are integers instead of one-hot encoded vectors. It is efficient for large
datasets with many classes.
$\text{Sparse Categorical Cross-Entropy} = -\sum_{i=1}^{n}\log(\hat{y}_{i, y_i})$
where $y_i$ is the integer representing the correct class for data point i.
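As a quick check of the difference in label encoding, the two cross-entropy variants can be compared with the Keras loss classes on a tiny made-up example; both should give the same value:

import numpy as np
import tensorflow as tf

# Categorical vs sparse categorical cross-entropy: same loss, different label encoding.
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8]])          # predicted class probabilities
y_onehot = np.array([[1., 0., 0.],
                     [0., 0., 1.]])           # one-hot labels
y_int = np.array([0, 2])                      # integer labels

cce = tf.keras.losses.CategoricalCrossentropy()
scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(float(cce(y_onehot, y_prob)), float(scce(y_int, y_prob)))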
4. Kullback-Leibler Divergence Loss (KL Divergence)
KL Divergence measures how one probability distribution diverges from a second expected
probability distribution. It is often used in probabilistic models. It is sensitive to small differences
in probability distributions.
$\text{KL Divergence} = \sum_{i=1}^{n}\sum_{j=1}^{k} y_{ij} \log\left(\frac{y_{ij}}{\hat{y}_{ij}}\right)$
5. Hinge Loss
Hinge Loss is used for training classifiers, especially support vector machines (SVMs). It is suitable for binary classification tasks, although it is not differentiable at zero.
$\text{Hinge Loss} = \frac{1}{n}\sum_{i=1}^{n}\max(0, 1 - y_i \cdot \hat{y}_i)$
where:
• $y_i$ is the actual label (-1 or 1)
• $\hat{y}_i$ is the predicted value.
3. Ranking Loss Functions
Ranking loss functions are used to evaluate models that predict the relative order of items. These
are commonly used in tasks such as recommendation systems and information retrieval.
1. Contrastive Loss
Contrastive Loss is used to learn embeddings such that similar items are closer in the embedding
space while dissimilar items are farther apart. It is often used in Siamese networks.
$\text{Contrastive Loss} = \frac{1}{2N}\sum_{i=1}^{N}\left(y_i \cdot d_i^2 + (1 - y_i) \cdot \max(0, m - d_i)^2\right)$
where:
• $d_i$ is the distance between a pair of embeddings
• $y_i$ is 1 for similar pairs and 0 for dissimilar pairs
• m is a margin.
2. Triplet Loss
Triplet Loss is used to learn embeddings by comparing the relative distances between triplets:
anchor, positive example and negative example.
$\text{Triplet Loss} = \frac{1}{N}\sum_{i=1}^{N}\left[\|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha\right]_+$
where:
• f(x) is the embedding function
• $x_i^a$ is the anchor
• $x_i^p$ is the positive example
• $x_i^n$ is the negative example
• $\alpha$ is a margin.
3. Margin Ranking Loss
Margin Ranking Loss measures the relative distances between pairs of items and ensures that
the correct ordering is maintained with a specified margin.
$\text{Margin Ranking Loss} = \frac{1}{N}\sum_{i=1}^{N}\max(0, -y_i \cdot (s_i^+ - s_i^-) + \text{margin})$
where:
• $s_i^+$ and $s_i^-$ are the scores for the positive and negative samples
• $y_i$ is the label indicating the correct ordering.
4. Image and Reconstruction Loss Functions
These loss functions are used to evaluate models that generate or reconstruct images ensuring
that the output is as close as possible to the target images.
1. Pixel-wise Cross-Entropy Loss
Pixel-wise Cross-Entropy Loss is used for image segmentation tasks where each pixel is
classified independently.
$\text{Pixel-wise Cross-Entropy} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$
where:
• N is the number of pixels,
• C is the number of classes
• $y_{i,c}$ is the binary indicator for the correct class of pixel i
• $\hat{y}_{i,c}$ is the predicted probability for class c.
2. Dice Loss
Dice Loss is used for image segmentation tasks and is particularly effective for imbalanced
datasets. It measures the overlap between the predicted segmentation and the ground truth.
$\text{Dice Loss} = 1 - \frac{2\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i}$
where:
• $y_i$ is the ground truth label
• $\hat{y}_i$ is the predicted label.
3. Jaccard Loss (Intersection over Union, IoU)
Jaccard Loss is also known as IoU Loss that measures the intersection over union of the predicted
segmentation and the ground truth.
$\text{Jaccard Loss} = 1 - \frac{\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i - \sum_{i=1}^{N} y_i \hat{y}_i}$
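Both segmentation losses can be sketched directly in NumPy on a tiny made-up mask (illustrative only, using soft predictions rather than thresholded ones):

import numpy as np

# Dice and Jaccard (IoU) losses on a tiny "segmentation" example.
y_true = np.array([1, 1, 0, 0, 1], dtype=float)   # ground-truth pixels
y_pred = np.array([0.9, 0.6, 0.2, 0.1, 0.4])       # predicted probabilities

intersection = np.sum(y_true * y_pred)
dice = 1 - 2 * intersection / (np.sum(y_true) + np.sum(y_pred))
jaccard = 1 - intersection / (np.sum(y_true) + np.sum(y_pred) - intersection)
print(round(dice, 4), round(jaccard, 4))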
4. Perceptual Loss
Perceptual Loss measures the difference between high-level features of images rather than
pixel-wise differences. It is often used in image generation tasks.
$\text{Perceptual Loss} = \sum_{i=1}^{N}\|\phi_j(y_i) - \phi_j(\hat{y}_i)\|_2^2$
where:
• $\phi_j$ is a layer in a pre-trained network
• $y_i$ and $\hat{y}_i$ are the ground truth and predicted images
5. Total Variation Loss
Total Variation Loss encourages spatial smoothness in images by penalizing differences between
adjacent pixels.
$\text{Total Variation Loss} = \sum_{i,j}\left((y_{i,j+1} - y_{i,j})^2 + (y_{i+1,j} - y_{i,j})^2\right)$
5. Adversarial Loss Functions
Adversarial loss functions are used in generative adversarial networks (GANs) to train the
generator and discriminator networks.
1. Adversarial Loss (GAN Loss)
The standard GAN loss function involves a minimax game between the generator and the
discriminator.
$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$
• The discriminator tries to maximize the probability of correctly classifying real and
fake samples.
• The generator tries to minimize the discriminator’s ability to tell its outputs are fake.
2. Least Squares GAN Loss
LSGAN modifies the standard GAN loss by using a least squares error instead of the log loss to make the training more stable:
Discriminator Loss: $\min_D \frac{1}{2}\mathbb{E}_{x \sim p_{\text{data}}(x)}[(D(x) - 1)^2] + \frac{1}{2}\mathbb{E}_{z \sim p_z(z)}[D(G(z))^2]$
Generator Loss: $\min_G \frac{1}{2}\mathbb{E}_{z \sim p_z(z)}[(D(G(z)) - 1)^2]$
6. Specialized Loss Functions
Specialized loss functions are designed for specific tasks such as sequence prediction, count data
and cosine similarity.
1. CTC Loss (Connectionist Temporal Classification)
CTC Loss is used for sequence prediction tasks where the alignment between input and output
sequences is unknown.
$\text{CTC Loss} = -\log\left(p(y \mid x)\right)$
where $p(y \mid x)$ is the probability of the correct output sequence given the input sequence.
2. Poisson Loss
Poisson Loss is used for count data modeling the distribution of the predicted values as a Poisson
distribution.
$\text{Poisson Loss} = \sum_{i=1}^{N}\left(\hat{y}_i - y_i \log(\hat{y}_i)\right)$
where $\hat{y}_i$ is the predicted count and $y_i$ is the actual count.
3. Cosine Proximity Loss
Cosine Proximity Loss measures the cosine similarity between the predicted and target vectors
encouraging them to point in the same direction.
$\text{Cosine Proximity Loss} = -\frac{1}{N}\sum_{i=1}^{N}\frac{y_i \cdot \hat{y}_i}{\|y_i\|\,\|\hat{y}_i\|}$
4. Earth Mover's Distance (Wasserstein Loss)
Earth Mover's Distance measures the distance between two probability distributions and is used
in Wasserstein GANs.
$\text{Wasserstein Loss} = \mathbb{E}_{x \sim p_r}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]$
How to Choose the Right Loss Function?
Choosing the right loss function is very important for training a deep learning model that works
well. Here are some guidelines to help you make the right choice:
• Understand the Task : The first step in choosing the right loss function is to
understand what your model is trying to do. Use MSE or MAE for regression, Cross-
Entropy for classification, Contrastive or Triplet Loss for ranking and Dice or Jaccard
Loss for image segmentation.
• Consider the Output Type: You should also think about the type of output your
model produces. If the output is a continuous number use regression loss functions like
MSE or MAE, classification losses for labels and CTC Loss for sequence outputs like
speech or handwriting.
• Handle Imbalanced Data: If your dataset is imbalanced (one class appears much more often than others), it's important to use a loss function that can handle this. Focal Loss is useful for such cases because it focuses more on the harder-to-predict or rare examples, helping the model learn better from them.
• Robust to Outliers: When your data has outliers it’s better to use a loss function
that’s less sensitive to them. Huber Loss is a good option because it combines the
strengths of both MSE and MAE and make it more robust and stable when outliers are
present.
• Performance and Convergence: Choose loss functions that help your model
converge faster and perform better. For example using Hinge Loss for SVMs can
sometimes lead to better performance than Cross-Entropy for classification.
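To make these guidelines concrete, here is a minimal sketch of how the common choices map to PyTorch's built-in losses (the tensors are purely illustrative):
import torch
import torch.nn as nn

preds = torch.randn(4, 3)                     # raw scores (logits) for 3 classes
targets = torch.tensor([0, 2, 1, 0])          # class labels

mse = nn.MSELoss()                            # regression
huber = nn.HuberLoss()                        # regression that is robust to outliers
ce = nn.CrossEntropyLoss()                    # multi-class classification

print(ce(preds, targets))                                  # classification loss on logits
print(mse(torch.randn(4), torch.randn(4)))                 # regression loss on continuous values
print(huber(torch.randn(4), torch.randn(4)))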
Back Propagation, also known as "Backward Propagation of Errors", is a method used to train
neural networks. Its goal is to reduce the difference between the model’s predicted output and
the actual output by adjusting the weights and biases in the network.
It works iteratively to adjust the weights and biases to minimize the cost function. In each epoch the
model adapts these parameters, reducing the loss by following the error gradient. It often uses
optimization algorithms like gradient descent or stochastic gradient descent. The algorithm
computes the gradient using the chain rule from calculus allowing it to effectively navigate
complex layers in the neural network to minimize the cost function.
Fig(a) A simple illustration of how the backpropagation works by adjustments of weights
Back Propagation plays a critical role in how neural networks improve over time. Here's why:
1. Efficient Weight Update: It computes the gradient of the loss function with respect
to each weight using the chain rule making it possible to update weights efficiently.
2. Scalability: The Back Propagation algorithm scales well to networks with multiple
layers and complex architectures making deep learning feasible.
3. Automated Learning: With Back Propagation the learning process becomes
automated and the model can adjust itself to optimize its performance.
Working of Back Propagation Algorithm
The Back Propagation algorithm involves two main steps: the Forward Pass and the Backward
Pass.
1. Forward Pass
In the forward pass, the input data is fed into the input layer. These inputs, combined with their
respective weights are passed to hidden layers. For example in a network with two hidden layers
(h1 and h2) the output from h1 serves as the input to h2. Before applying an activation function,
a bias is added to the weighted inputs.
Each hidden layer computes the weighted sum (`a`) of the inputs then applies an activation
function like ReLU (Rectified Linear Unit) to obtain the output (`o`). The output is passed to the
next layer where an activation function such as softmax converts the weighted outputs into
probabilities for classification.
The forward pass using weights and biases
2. Backward Pass
In the backward pass the error (the difference between the predicted and actual output) is
propagated back through the network to adjust the weights and biases. One common method
for error calculation is the Mean Squared Error (MSE) given by:
MSE = (Predicted Output − Actual Output)²
Once the error is calculated the network adjusts weights using gradients which are computed
with the chain rule. These gradients indicate how much each weight and bias should be adjusted
to minimize the error in the next iteration. The backward pass continues layer by layer ensuring
that the network learns and improves its performance. The activation function through its
derivative plays a crucial role in computing these gradients during Back Propagation.
Example of Back Propagation in Machine Learning
Let’s walk through an example of Back Propagation in machine learning. Assume the neurons
use the sigmoid activation function for the forward and backward pass. The target output is 0.5
and the learning rate is 1.
Example (1) of backpropagation sum
Forward Propagation
1. Initial Calculation
The weighted sum at each node is calculated using:
a_j = ∑ (w_{i,j} × x_i)
Where,
• a_j is the weighted sum of all the inputs and weights at each node
• w_{i,j} represents the weight between the i-th input and the j-th neuron
• x_i represents the value of the i-th input
O (output): After applying the activation function to a_j, we get the output of the neuron:
o_j = activation function(a_j)
2. Sigmoid Function
The sigmoid function returns a value between 0 and 1, introducing non-linearity into the model.
y_j = 1 / (1 + e^(−a_j))
To find the outputs of y3, y4 and y5
3. Computing Outputs
At h1 node:
a_1 = (w_{1,1} × x_1) + (w_{2,1} × x_2) = (0.2 × 0.35) + (0.2 × 0.7) = 0.21
Once we have calculated the a_1 value, we can proceed to find the y_3 value:
y_j = F(a_j) = 1 / (1 + e^(−a_1))
y_3 = F(0.21) = 1 / (1 + e^(−0.21))
y_3 = 0.56
Similarly, find the values of y_4 at h2 and y_5 at O3:
a_2 = (w_{1,2} × x_1) + (w_{2,2} × x_2) = (0.3 × 0.35) + (0.3 × 0.7) = 0.315
y_4 = F(0.315) = 1 / (1 + e^(−0.315)) = 0.59
a_3 = (w_{1,3} × y_3) + (w_{2,3} × y_4) = (0.3 × 0.57) + (0.9 × 0.59) = 0.702
y_5 = F(0.702) = 1 / (1 + e^(−0.702)) = 0.67
Values of y3, y4 and y5
4. Error Calculation
Our actual output is 0.5 but we obtained 0.67. To calculate the error we can use the below
formula:
Error_j = y_target − y_5
= 0.5 − 0.67 = −0.17
Using this error value we will be backpropagating.
Back Propagation
1. Calculating Gradients
The change in each weight is calculated as:
Δw_{ij} = η × δ_j × O_j
Where:
• δ_j is the error term for each unit,
• η is the learning rate.
2. Output Unit Error
For O3:
δ_5 = y_5(1 − y_5)(y_target − y_5)
= 0.67 × (1 − 0.67) × (−0.17) = −0.0376
3. Hidden Unit Error
For h1:
δ_3 = y_3(1 − y_3)(w_{1,3} × δ_5)
= 0.56 × (1 − 0.56) × (0.3 × −0.0376) = −0.0027
For h2:
δ_4 = y_4(1 − y_4)(w_{2,3} × δ_5)
= 0.59 × (1 − 0.59) × (0.9 × −0.0376) = −0.0819
4. Weight Updates
For the weights from hidden to output layer:
Δw_{2,3} = 1 × (−0.0376) × 0.59 = −0.022184
New weight:
w_{2,3}(new) = −0.022184 + 0.9 = 0.877816
For weights from input to hidden layer:
Δw_{1,1} = 1 × (−0.0027) × 0.35 = 0.000945
New weight:
w_{1,1}(new) = 0.000945 + 0.2 = 0.200945
Similarly other weights are updated:
• w_{1,2}(new) = 0.273225
• w_{1,3}(new) = 0.086615
• w_{2,1}(new) = 0.269445
• w_{2,2}(new) = 0.18534
The updated weights are illustrated below
Through backward pass the weights are updated
After updating the weights the forward pass is repeated yielding:
• y_3 = 0.57
• y_4 = 0.56
• y_5 = 0.61
Since y_5 = 0.61 is still not the target output, the process of calculating the error and
backpropagating continues until the desired output is reached.
This process demonstrates how Back Propagation iteratively updates weights by minimizing
errors until the network accurately predicts the output.
Error = y_target − y_5
= 0.5 − 0.61 = −0.11
This process is said to be continued until the actual output is gained by the neural network.
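The forward and backward arithmetic above can be reproduced with a few lines of NumPy. The article rounds intermediate values, so this sketch's numbers may differ slightly in the last decimal places (assumes the initial weights used in the example):
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

x1, x2 = 0.35, 0.7
w11, w21 = 0.2, 0.2          # input -> h1
w12, w22 = 0.3, 0.3          # input -> h2
w13, w23 = 0.3, 0.9          # hidden -> output
target, lr = 0.5, 1.0

# Forward pass
y3 = sigmoid(w11 * x1 + w21 * x2)
y4 = sigmoid(w12 * x1 + w22 * x2)
y5 = sigmoid(w13 * y3 + w23 * y4)

# Backward pass (derivative of the sigmoid output y is y * (1 - y))
d5 = y5 * (1 - y5) * (target - y5)
d3 = y3 * (1 - y3) * (w13 * d5)
d4 = y4 * (1 - y4) * (w23 * d5)

# Example weight updates: w_new = w_old + lr * delta * input_to_that_weight
w23_new = w23 + lr * d5 * y4
w11_new = w11 + lr * d3 * x1
print(round(y5, 2), round(d5, 4), round(w23_new, 4), round(w11_new, 4))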
Back Propagation Implementation in Python for XOR Problem
This code demonstrates how Back Propagation is used in a neural network to solve the XOR
problem. The network is defined, trained and tested in the following steps.
1. Defining Neural Network
We define a neural network as Input layer with 2 inputs, Hidden layer with 4 neurons, Output
layer with 1 output neuron and use Sigmoid function as activation function.
• self.input_size = input_size: stores the size of the input layer
• self.hidden_size = hidden_size: stores the size of the hidden layer
• self.weights_input_hidden = np.random.randn(self.input_size, self.hidden_size):
initializes weights for input to hidden layer
• self.weights_hidden_output = np.random.randn(self.hidden_size, self.output_size):
initializes weights for hidden to output layer
• self.bias_hidden = np.zeros((1, self.hidden_size)): initializes bias for hidden layer
• self.bias_output = np.zeros((1, self.output_size)): initializes bias for output layer
import numpy as np

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.weights_input_hidden = np.random.randn(
            self.input_size, self.hidden_size)
        self.weights_hidden_output = np.random.randn(
            self.hidden_size, self.output_size)
        self.bias_hidden = np.zeros((1, self.hidden_size))
        self.bias_output = np.zeros((1, self.output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        return x * (1 - x)

    def feedforward(self, X):
        # hidden layer: weighted sum + bias, then sigmoid activation
        self.hidden_activation = np.dot(X, self.weights_input_hidden) + self.bias_hidden
        self.hidden_output = self.sigmoid(self.hidden_activation)
        self.output_activation = np.dot(
            self.hidden_output, self.weights_hidden_output) + self.bias_output
        self.predicted_output = self.sigmoid(self.output_activation)
        return self.predicted_output
3. Defining Backward Network
In Backward pass or Back Propagation the errors between the predicted and actual outputs are
computed. The gradients are calculated using the derivative of the sigmoid function and weights
and biases are updated accordingly.
• output_error = y - self.predicted_output: calculates the error at the output layer
• output_delta = output_error *
self.sigmoid_derivative(self.predicted_output): calculates the delta for the output
layer
• hidden_error = np.dot(output_delta, self.weights_hidden_output.T): calculates the
error at the hidden layer
• hidden_delta = hidden_error *
self.sigmoid_derivative(self.hidden_output): calculates the delta for the hidden layer
• self.weights_hidden_output += np.dot(self.hidden_output.T, output_delta) *
learning_rate: updates weights between hidden and output layers
• self.weights_input_hidden += np.dot(X.T, hidden_delta) * learning_rate: updates
weights between input and hidden layers
def backward(self, X, y, learning_rate):
    # error and delta at the output layer
    output_error = y - self.predicted_output
    output_delta = output_error * self.sigmoid_derivative(self.predicted_output)

    # error and delta at the hidden layer
    hidden_error = np.dot(output_delta, self.weights_hidden_output.T)
    hidden_delta = hidden_error * self.sigmoid_derivative(self.hidden_output)

    # update weights and biases
    self.weights_hidden_output += np.dot(self.hidden_output.T, output_delta) * learning_rate
    self.bias_output += np.sum(output_delta, axis=0, keepdims=True) * learning_rate
    self.weights_input_hidden += np.dot(X.T, hidden_delta) * learning_rate
    self.bias_hidden += np.sum(hidden_delta, axis=0, keepdims=True) * learning_rate
4. Training Network
The network is trained over 10,000 epochs using the Back Propagation algorithm with a learning
rate of 0.1 progressively reducing the error.
• output = self.feedforward(X): computes the output for the current inputs
• self.backward(X, y, learning_rate): updates weights and biases using Back
Propagation
• loss = np.mean(np.square(y - output)): calculates the mean squared error (MSE) loss
def train(self, X, y, epochs, learning_rate):
    for epoch in range(epochs):
        output = self.feedforward(X)
        self.backward(X, y, learning_rate)
        if epoch % 4000 == 0:
            loss = np.mean(np.square(y - output))
            print(f"Epoch {epoch}, Loss:{loss}")
5. Testing Neural Network
• X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]): defines the input data
• y = np.array([[0], [1], [1], [0]]): defines the target values
• nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1): initializes the
neural network
• nn.train(X, y, epochs=10000, learning_rate=0.1): trains the network
• output = nn.feedforward(X): gets the final predictions after training
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1)
nn.train(X, y, epochs=10000, learning_rate=0.1)

output = nn.feedforward(X)
print("Predictions after training:")
print(output)
Output:
Trained Model
• The output shows the training progress of a neural network over 10,000 epochs.
Initially the loss was high (0.2713) but it gradually decreased as the network learned
reaching a low value of 0.0066 by epoch 8000.
• The final predictions are close to the expected XOR outputs: approximately 0 for [0,
0] and [1, 1] and approximately 1 for [0, 1] and [1, 0] indicating that the network
successfully learned to approximate the XOR function.
Advantages of Back Propagation for Neural Network Training
The key benefits of using the Back Propagation algorithm are:
1. Ease of Implementation: Back Propagation is beginner-friendly requiring no prior
neural network knowledge and simplifies programming by adjusting weights with
error derivatives.
2. Simplicity and Flexibility: Its straightforward design suits a range of tasks from basic
feedforward to complex convolutional or recurrent networks.
3. Efficiency: Back Propagation accelerates learning by directly updating weights based
on error especially in deep networks.
4. Generalization: It helps models generalize well to new data improving prediction
accuracy on unseen examples.
5. Scalability: The algorithm scales efficiently with larger datasets and more complex
networks making it ideal for large-scale tasks.
Challenges with Back Propagation
While Back Propagation is useful it does face some challenges:
1. Vanishing Gradient Problem: In deep networks the gradients can become very small
during Back Propagation making it difficult for the network to learn. This is common
when using activation functions like sigmoid or tanh.
2. Exploding Gradients: The gradients can also become excessively large causing the
network to diverge during training.
3. Overfitting: If the network is too complex it might memorize the training data instead
of learning general patterns.
Learning Rate in Neural Network
The learning rate is a key hyperparameter in neural networks that controls how quickly the model
learns during training. It determines the size of the steps taken towards a minimum of the loss
function during optimization, i.e., how much the weights change in response to the error each
time they are updated.
In mathematical terms, when using a method like Stochastic Gradient Descent (SGD), the
learning rate (often denoted as α or η) is multiplied by the gradient of the loss function to
update the weights:
w = w − α · ∇L(w)
Where:
• w represents the weights
• α is the learning rate
• ∇L(w) is the gradient of the loss function
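For instance, a single update step on a toy quadratic loss L(w) = (w − 3)² looks like this (a minimal sketch):
import numpy as np

w = np.array([0.0])          # current weight
alpha = 0.1                  # learning rate
grad = 2 * (w - 3)           # gradient of L(w) = (w - 3)^2 at the current w
w = w - alpha * grad
print(w)                     # moves from 0.0 toward the minimum at 3.0 -> [0.6]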
Impact of Learning Rate on Model
The learning rate is a critical hyperparameter that directly affects how a model learns during
training by controlling the magnitude of weight updates. Its value significantly affects both
convergence speed and model performance.
Low Learning Rate:
• Leads to slow convergence
• Requires more training epochs
• Can improve accuracy but increases computation time
High Learning Rate:
• Speeds up training
• Risks of overshooting optimal weights
• May cause instability or divergence of the loss function
Optimal Learning Rate:
• Balances training speed and model accuracy
• Ensures stable convergence without excessive training time
Best Practices:
• Fine-tune the learning rate based on the task and model
• Use techniques like learning rate scheduling or adaptive optimizers to improve
performance and stability
Identifying the ideal learning rate can be challenging but is important for improving performance
without wasting resources.
Techniques for Adjusting the Learning Rate
1. Fixed Learning Rate
• A constant learning rate is maintained throughout training.
• Simple to implement and commonly used in basic models.
• Its limitation is that it cannot adapt to different training phases, which may lead to
sub-optimal results.
2. Learning Rate Schedules
These techniques reduce the learning rate over time based on predefined rules to improve
convergence:
• Step Decay: Reduces the learning rate by a fixed factor at set intervals (every few
epochs).
• Exponential Decay: Continuously decreases the learning rate exponentially over
training time.
• Polynomial Decay: Learning rate decays polynomially, offering smoother transitions
compared to step or exponential methods.
3. Adaptive Learning Rate Methods
Adaptive methods adjust the learning rate dynamically based on gradient information, allowing
better updates per parameter:
• AdaGrad: AdaGrad adapts the learning rate per parameter based on the squared
gradients. It is effective for sparse data but may decay too quickly.
• RMSprop: RMSprop builds on AdaGrad by using a moving average of squared
gradients to prevent aggressive decay.
• Adam (Adaptive Moment Estimation): Adam combines RMSprop with momentum to
provide stable and fast convergence; widely used in practice.
4. Cyclic Learning Rate
• The learning rate oscillates between a minimum and maximum value in a cyclic
manner throughout training.
• It increases and then decreases the learning rate linearly in each cycle.
• Benefits include better exploration of the loss surface and leading to faster
convergence.
5. Decaying Learning Rate
• Gradually reduces the learning rate as training progresses.
• Helps the model take more precise steps towards the minimum. This improves
stability in later epochs.
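As a rough sketch of how step decay could be set up in PyTorch (the model and training loop here are placeholders):
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)                                   # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)
# Step decay: multiply the learning rate by gamma every step_size epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... forward pass, loss.backward() and optimizer.step() would go here ...
    optimizer.zero_grad()
    scheduler.step()                                       # advance the schedule once per epoch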
First-Order Algorithms
First-order optimization algorithms are methods that rely on the first derivative (gradient) of the
objective function to find the minimum or maximum. They use gradient information to decide the
direction and size of updates for model parameters. These algorithms are widely used in machine
learning due to their simplicity and efficiency, especially for large-scale problems. Below are
some First-Order Algorithms:
1. Gradient Descent and Its Variants
Gradient Descent is an optimization algorithm used for minimizing the objective function by
iteratively moving towards the minimum. It is a first-order iterative algorithm for finding a local
minimum. The algorithm works by taking repeated steps in the opposite direction of the gradient
of the function at the current point because it will be the direction of steepest descent.
Let's assume we want to minimize the function f(x) = x² using gradient descent.
• The main function gradient_descent takes the gradient, a starting point, learning rate,
number of iterations and a convergence tolerance.
• In each iteration, it calculates the gradient at the current point and updates the point
in the opposite direction of the gradient (descent), scaled by the learning rate.
• The update continues until either the maximum number of iterations is reached or the
update magnitude falls below the specified tolerance.
• The final result is printed which should be a value close to the minimum of the function.
import numpy as np

# Gradient of the objective f(x) = x^2
def gradient(x):
    return 2 * x

def gradient_descent(gradient, start, learn_rate, n_iter, tolerance):
    x = start
    for _ in range(n_iter):
        step = learn_rate * gradient(x)
        if abs(step) < tolerance:   # stop when the update becomes negligible
            break
        x -= step
    return x

# Initial point, learning rate, number of iterations and tolerance for convergence
start, learn_rate, n_iter, tolerance = 5.0, 0.1, 50, 1e-6
result = gradient_descent(gradient, start, learn_rate, n_iter, tolerance)
print("Minimum found at x =", result)
Output of Gradient
Variants of Gradient Descent
• Stochastic Gradient Descent (SGD): This variant suggests model update using a
single training example at a time which does not require a large amount of computation
and therefore is suitable for large datasets.
• Mini-Batch Gradient Descent: This method computes the gradient on small mini-batches of
data, striking a balance between computation time and precision. It converges faster than
SGD and is widely used in practice to train many deep learning models.
• Momentum: Momentum improves SGD by carrying information from previous steps into the
next update. By adding a fraction of the previous update vector to the current one, it
enables the algorithm to move through flat areas and noisy gradients, reducing training
time and speeding up convergence, as sketched below.
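A rough illustration of the momentum update (variable names are illustrative):
import numpy as np

def momentum_update(w, grad, velocity, lr=0.01, beta=0.9):
    # keep a fraction of the previous update and add the new gradient step
    velocity = beta * velocity - lr * grad
    w = w + velocity
    return w, velocity

w, velocity = np.zeros(3), np.zeros(3)
grad = np.array([0.5, -0.2, 0.1])
w, velocity = momentum_update(w, grad, velocity)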
2. Stochastic Optimization Techniques
Stochastic optimization techniques introduce randomness to the search process which can be
advantageous for tackling complex optimization problems where traditional methods might
struggle.
• Simulated Annealing: Similar to the annealing process in metallurgy this technique
starts with a high temperature (high randomness) that allows exploration of the search
space widely. Over time, the temperature decreases (randomness decreases) which
helps the algorithm converge towards better solutions while avoiding local minima.
• Random Search: This simple method randomly chooses points in the search space
then evaluates them. Random search is actually quite effective particularly for
optimization problems that are high-dimensional. The ease of implementation and its
ability to work with complex algorithms makes this approach widely used.
When using stochastic optimization algorithms, we consider the following practical aspects:
• Repeated Evaluations: Stochastic optimization algorithms often need repeated
evaluations of the objective function which is time-consuming. Therefore, we have to
balance the number of evaluations with the computational resources available.
• Problem Structure: The choice of stochastic optimization algorithm depends on the
structure of the problem. For example, simulated annealing is suitable for problems
with multiple local optima while random search is effective for high-dimensional
optimization landscapes.
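As a minimal sketch of the simulated annealing idea described above, here it is applied to the toy objective f(x) = x² (all parameters are illustrative):
import numpy as np

def simulated_annealing(f, x0, temp=10.0, cooling=0.95, n_iter=1000):
    x, best = x0, x0
    for _ in range(n_iter):
        candidate = x + np.random.normal(scale=0.5)        # random neighbour
        delta = f(candidate) - f(x)
        # always accept improvements; accept worse moves with a temperature-dependent probability
        if delta < 0 or np.random.rand() < np.exp(-delta / temp):
            x = candidate
        if f(x) < f(best):
            best = x
        temp *= cooling                                    # cool down: reduce randomness over time
    return best

print("Approximate minimum at:", simulated_annealing(lambda x: x ** 2, x0=5.0))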
3. Evolutionary Algorithms
In evolutionary algorithms we take inspiration from natural selection and include techniques such
as Genetic Algorithms and Differential Evolution. They are often used to solve complex
optimization problems that are difficult to solve using traditional methods.
Key Components:
• Population: Set of candidate solutions to the optimization problem.
• Fitness Function: A function that evaluates the quality of each candidate solution.
• Selection: Mechanism for selecting the fittest candidates to reproduce.
• Genetic Operators: Operators that modify the selected candidates to create new
offspring such as crossover and mutation.
• Termination: A condition for stopping the algorithm.
1. Genetic Algorithms
These algorithms use crossover and mutation operators to evolve the candidate population. It is
commonly used to generate solutions to optimization/search problems by relying on biologically
inspired operators such as mutation, crossover and selection. In the code example below we
implement a Genetic Algorithm to minimize:
f(x) = ∑_{i=1}^{n} x_i²
• fitness_func returns the negative sum of squares to convert minimization into
maximization.
• generate_population creates random individuals between 0 and 1.
• Each generation, the top 50% (fittest) are selected as parents.
• Offspring are created via single-point crossover between two parents.
• Mutation randomly alters one gene with a small probability.
• The process repeats for a fixed number of generations.
• Outputs the best individual and its minimized objective value.
import numpy as np

# Fitness: negative sum of squares (turns minimization into maximization)
def fitness_func(individual):
    return -np.sum(individual ** 2)

# Random initial population with genes in [0, 1)
def generate_population(size, dimension):
    return np.random.rand(size, dimension)

# Genetic algorithm
def genetic_algorithm(population, fitness_func, n_generations=100, mutation_rate=0.01):
    for _ in range(n_generations):
        population = sorted(population, key=fitness_func, reverse=True)
        next_generation = population[:len(population)//2].copy()
        while len(next_generation) < len(population):
            parents_indices = np.random.choice(len(next_generation), 2, replace=False)
            parent1, parent2 = next_generation[parents_indices[0]], next_generation[parents_indices[1]]
            crossover_point = np.random.randint(1, len(parent1))
            child = np.concatenate((parent1[:crossover_point], parent2[crossover_point:]))
            if np.random.rand() < mutation_rate:
                mutate_point = np.random.randint(len(child))
                child[mutate_point] = np.random.rand()
            next_generation.append(child)
        population = np.array(next_generation)
    return population[0]

# Parameters
population_size = 10
dimension = 5
n_generations = 50
mutation_rate = 0.05

# Initialize population and run the genetic algorithm
population = generate_population(population_size, dimension)
best = genetic_algorithm(population, fitness_func, n_generations, mutation_rate)
print("Best individual:", best)
print("Minimized objective value:", np.sum(best ** 2))
Particle Swarm Optimization (PSO) is inspired by the social behavior of bird flocks: each particle
adjusts its velocity and position using its own best-known position and the swarm's global best.
Below, PSO is used to minimize the Rastrigin function:
def rastrigin(x):
    return 10 * len(x) + sum([(xi ** 2 - 10 * np.cos(2 * np.pi * xi)) for xi in x])

class Particle:
    def __init__(self, bounds):
        self.position = np.random.uniform(bounds[:, 0], bounds[:, 1], len(bounds))
        self.velocity = np.random.uniform(-1, 1, len(bounds))
        self.pbest_position = self.position.copy()
        self.pbest_value = float('inf')

def particle_swarm_optimization(objective_func, bounds, n_particles=30, max_iter=100,
                                w=0.5, c1=1.5, c2=1.5):
    particles = [Particle(bounds) for _ in range(n_particles)]
    gbest_position, gbest_value = particles[0].position.copy(), float('inf')
    for _ in range(max_iter):
        for particle in particles:
            fitness = objective_func(particle.position)
            if fitness < particle.pbest_value:
                particle.pbest_value = fitness
                particle.pbest_position = particle.position.copy()
            if fitness < gbest_value:
                gbest_value, gbest_position = fitness, particle.position.copy()
        # velocity: inertia + pull toward personal best + pull toward global best
        for particle in particles:
            r1, r2 = np.random.rand(len(bounds)), np.random.rand(len(bounds))
            particle.velocity = (w * particle.velocity
                                 + c1 * r1 * (particle.pbest_position - particle.position)
                                 + c2 * r2 * (gbest_position - particle.position))
            particle.position = particle.position + particle.velocity
    return gbest_position, gbest_value

# Define bounds
bounds = np.array([[-5.12, 5.12]] * 10)
# Run PSO
best_solution, best_fitness = particle_swarm_optimization(rastrigin, bounds, n_particles=30, max_iter=100)
print("Best fitness found:", best_fitness)
PSO output
2. Ant Colony Optimization (ACO)
Ant Colony Optimization is inspired by the behavior of ants. Ants find the shortest path between
their colony and food sources by laying down pheromones which guide other ants to the path.
Here’s a basic implementation of ACO for the Traveling Salesman Problem (TSP):
• Each ant constructs a complete tour by selecting unvisited cities based on pheromone
intensity and inverse distance.
• The transition probability combines pheromone influence (α) and heuristic
desirability (β).
• After each iteration, the best tour is updated if a shorter path is found.
• Pheromone levels are globally evaporated (rate ρ) and reinforced in proportion to
the quality (1/length) of each ant’s tour.
• The algorithm iterates over multiple generations to converge toward an optimal or
near-optimal solution.
• Returns the shortest tour and its total length discovered during the search.
import numpy as np

def ant_colony_optimization(distance_matrix, n_ants=10, n_iterations=100,
                            alpha=1.0, beta=2.0, rho=0.5, Q=1.0):
    n_cities = len(distance_matrix)
    pheromone = np.ones((n_cities, n_cities))
    best_path, best_length = None, float('inf')
    for _ in range(n_iterations):
        all_paths = []
        for _ in range(n_ants):
            path = [np.random.randint(n_cities)]
            while len(path) < n_cities:
                current = path[-1]
                unvisited = [c for c in range(n_cities) if c not in path]
                # transition probability ~ pheromone^alpha * (1/distance)^beta
                weights = np.array([pheromone[current][c] ** alpha
                                    * (1.0 / distance_matrix[current][c]) ** beta
                                    for c in unvisited])
                path.append(np.random.choice(unvisited, p=weights / weights.sum()))
            length = sum(distance_matrix[path[i]][path[(i + 1) % n_cities]]
                         for i in range(n_cities))
            all_paths.append((path, length))
            if length < best_length:
                best_path, best_length = path, length
        # evaporate pheromone globally, then reinforce each tour by its quality (1/length)
        pheromone *= (1 - rho)
        for path, length in all_paths:
            for i in range(n_cities):
                pheromone[path[i]][path[(i + 1) % n_cities]] += Q / length
    return best_path, best_length

# Example symmetric distance matrix (illustrative values for 4 cities)
distance_matrix = np.array([[0, 2, 9, 10],
                            [2, 0, 6, 4],
                            [9, 6, 0, 8],
                            [10, 4, 8, 0]], dtype=float)

# Run ACO
best_path, best_length = ant_colony_optimization(distance_matrix)
print("Best path:", best_path, "Length:", best_length)
The excerpt below sketches a second-order (Newton-style) update, which also uses the second
derivative to pick the step size; here the first and second derivatives correspond to
f(x) = x³ − 2x².
def f_prime(x):
    return 3*x**2 - 4*x

def f_double_prime(x):
    return 6*x - 4

def newtons_method(x0, tol=1e-6, max_iter=100):
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_double_prime(x)   # Newton step: f'(x) / f''(x)
        if abs(step) < tol:                     # tolerance for convergence
            break
        x -= step
    return x

# Initial point, tolerance and maximum iterations
x0, tol, max_iter = 3.0, 1e-6, 100
print("Stationary point found at:", newtons_method(x0, tol, max_iter))
Gradient descent is the backbone of the learning process for various algorithms, including linear
regression, logistic regression, support vector machines and neural networks. It serves as a
fundamental optimization technique that minimizes the cost function of a model by iteratively
adjusting the model parameters to reduce the difference between predicted and actual
values, improving the model's performance. Let's see its role in machine learning:
Prerequisites: Understand the working and math of gradient descent.
1. Training Machine Learning Models
Neural networks are trained using Gradient Descent (or its variants) in combination
with backpropagation. Backpropagation computes the gradients of the loss function with
respect to each parameter (weights and biases) in the network by applying the chain rule. The
process involves:
• Forward Propagation: Computes the output for a given input by passing data
through the layers.
• Backward Propagation: Uses the chain rule to calculate gradients of the loss with
respect to each parameter (weights and biases) across all layers.
Gradients are then used by Gradient Descent to update the parameters layer-by-layer,
moving toward minimizing the loss function.
Neural networks often use advanced variants of Gradient Descent. If you want to read more
about variants, please refer : Gradient Descent Variants.
2. Minimizing the Cost Function
The algorithm minimizes a cost function, which quantifies the error or loss of the model's
predictions compared to the true labels for:
1. Linear Regression
Gradient descent minimizes the Mean Squared Error (MSE) which serves as the loss function to
find the best-fit line. Gradient Descent is used to iteratively update the weights (coefficients) and
bias by computing the gradient of the MSE with respect to these parameters.
Since MSE is a convex function gradient descent guarantees convergence to the global
minimum if the learning rate is appropriately chosen. For each iteration:
The algorithm computes the gradient of the MSE with respect to the weights and biases, and
then updates the weights (w) and bias (b) using the formula:
w = w − α · ∂J(w, b)/∂w,   b = b − α · ∂J(w, b)/∂b
This is the parameter update rule for gradient descent, which adjusts the weights w
and biases b to minimize the cost function. The process iteratively adjusts the line's slope and
intercept to minimize the error.
2. Logistic Regression
In logistic regression, gradient descent minimizes the Log Loss (Cross-Entropy Loss) to optimize
the decision boundary for binary classification. Since the output is probabilistic (between 0 and
1), the sigmoid function is applied. The process involves:
• Calculating the gradient of the log-loss with respect to the weights.
• Updating weights and biases iteratively to maximize the likelihood of the correct
classification:
w = w − α · ∂J(w)/∂w
This adjustment shifts the decision boundary to separate classes more effectively.
3. Support Vector Machines (SVMs)
For SVMs, gradient descent optimizes the hinge loss, which ensures a maximum-margin
hyperplane. The algorithm:
• Calculates gradients for the hinge loss and the regularization term (if used, such as
L2 regularization).
• Updates the weights to maximize the margin between classes while minimizing
misclassification penalties with same formula provided above.
Gradient descent ensures the optimal placement of the hyperplane to separate classes with
the largest possible margin.
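As a hedged sketch of this idea, a linear SVM can be trained with sub-gradient descent on the hinge loss plus an L2 regularization term (the data, names and hyperparameters below are illustrative):
import numpy as np

def svm_subgradient_descent(X, y, lr=0.01, lam=0.01, epochs=100):
    # X: (n_samples, n_features), y: labels in {-1, +1}
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (np.dot(w, xi) + b)
            if margin >= 1:
                w -= lr * (lam * w)               # only the regularization term contributes
            else:
                w -= lr * (lam * w - yi * xi)     # hinge-loss sub-gradient pushes the margin out
                b += lr * yi
    return w, b

X = np.array([[2.0, 2.0], [1.5, 1.8], [-1.0, -1.2], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = svm_subgradient_descent(X, y)
print("w:", w, "b:", b)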
Gradient Descent Python Implementation
Diving further into the concept, let's understand in depth, with practical implementation.
Import the necessary libraries
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
Set the input and output data
# set random seed for reproducibility
torch.manual_seed(42)
# create random weights and bias for the linear regression model
true_weights = torch.tensor([1.3, -1.0])
true_bias = torch.tensor([-3.5])
# input features (assumed here: 100 samples with 2 features each; the excerpt omits this step)
x = torch.rand(100, 2)
# Target variable
y = x @ true_weights + true_bias

# scatter plots of each input feature against the target
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].scatter(x[:, 0], y)
ax[1].scatter(x[:, 1], y)
ax[0].set_xlabel('X1')
ax[0].set_ylabel('Y')
ax[1].set_xlabel('X2')
ax[1].set_ylabel('Y')
plt.show()
Output:
X vs Y
Let's first try with a linear model:
y_p = xWᵀ + b
# Define the model
class LinearRegression(nn.Module):
    def __init__(self, input_size, output_size):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x):
        return self.linear(x)

# Instantiate the model and the loss (assumed setup; the excerpt omits it)
model = LinearRegression(input_size=2, output_size=1)
criterion = nn.MSELoss()

# Learning Rate
learning_rate = 0.001
num_epochs = 1000

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

for epoch in range(num_epochs):
    # Forward pass
    y_pred = model(x)
    loss = criterion(y_pred, y.reshape(-1, 1))

    # Backpropagation: find the gradients using
    loss.backward()

    # Gradient descent step on the parameters
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
    model.zero_grad()

    # Model Parameters
    w = model.linear.weight
    b = model.linear.bias

    if (epoch+1) % 100 == 0:
        ax1.plot(w.detach().numpy(), loss.item(), 'r*-')
        ax2.plot(b.detach().numpy(), loss.item(), 'g+-')
        print('Epoch [{}/{}], weight:{}, bias:{} Loss: {:.4f}'.format(
            epoch+1, num_epochs,
            w.detach().numpy(),
            b.detach().numpy(),
            loss.item()))

ax1.set_xlabel('weight')
ax2.set_xlabel('bias')
ax1.set_ylabel('Loss')
ax2.set_ylabel('Loss')
plt.show()
Output:
Epoch [100/1000], weight:[[-0.2618025 0.44433367]], bias:[-0.17722966] Loss:
14.1803
Epoch [200/1000], weight:[[-0.21144074 0.35393423]], bias:[-0.7892358] Loss:
10.3030
Epoch [300/1000], weight:[[-0.17063744 0.28172654]], bias:[-1.2897989] Loss:
7.7120
Epoch [400/1000], weight:[[-0.13759881 0.22408141]], bias:[-1.699218] Loss:
5.9806
Epoch [500/1000], weight:[[-0.11086453 0.17808875]], bias:[-2.0340943] Loss:
4.8235
Epoch [600/1000], weight:[[-0.08924612 0.14141548]], bias:[-2.3080034] Loss:
4.0502
Epoch [700/1000], weight:[[-0.0717768 0.11219224]], bias:[-2.5320508] Loss:
3.5333
Epoch [800/1000], weight:[[-0.0576706 0.08892148]], bias:[-2.7153134] Loss:
3.1878
Epoch [900/1000], weight:[[-0.04628877 0.07040432]], bias:[-2.8652208] Loss:
2.9569
Epoch [1000/1000], weight:[[-0.0371125 0.05568104]], bias:[-2.9878428] Loss:
2.8026
Next, let's implement linear regression trained with Stochastic Gradient Descent (SGD) using
NumPy.
1. Generating Data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
For a linear regression with one feature, the model is described by the equation:
y = θ_0 + θ_1 · X
Where:
• θ_0 is the intercept (the bias term),
• θ_1 is the slope or coefficient associated with the input feature X.
2. Defining the SGD Function
Here we define the core function for Stochastic Gradient Descent (SGD). The function takes the
input data X and y. It initializes the model parameters, performs stochastic updates for a specified
number of epochs and records the cost at each step.
• theta (θ) is the parameter vector (intercept and slope) initialized randomly.
• X_bias is the augmented X with a column of ones added for the bias term (intercept).
In each epoch, the data is shuffled and for each mini-batch (or single sample), the gradient is
calculated and the parameters are updated. The cost is calculated as the mean squared error and
the history of the cost is recorded to monitor convergence.
def sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=1):
    m = len(X)
    theta = np.random.randn(2, 1)
    X_bias = np.c_[np.ones((m, 1)), X]           # add a column of ones for the bias term
    cost_history = []
    for epoch in range(epochs):
        indices = np.random.permutation(m)       # shuffle the data each epoch
        for i in range(0, m, batch_size):
            idx = indices[i:i + batch_size]
            X_batch, y_batch = X_bias[idx], y[idx]
            gradients = 2 / len(idx) * X_batch.T.dot(X_batch.dot(theta) - y_batch)
            theta -= learning_rate * gradients   # stochastic parameter update
        predictions = X_bias.dot(theta)
        cost = np.mean((predictions - y) ** 2)
        cost_history.append(cost)
        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Cost: {cost}")
    return theta, cost_history

theta_final, cost_history = sgd(X, y)

plt.plot(cost_history)
plt.xlabel('Epochs')
plt.ylabel('Cost (MSE)')
plt.title('Cost Function during Training')
plt.show()
Output:
Visualize the Cost Function
5. Plotting the Data and Regression Line
We will visualize the data points and the fitted regression line after training. We plot the data
points as blue dots and the predicted line (from the final θθ) as a red line.
plt.scatter(X, y, color='blue', label='Data points')
plt.plot(X, np.c_[np.ones((X.shape[0], 1)), X].dot(theta_final), color='red', label='SGD fit line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression using Stochastic Gradient Descent')
plt.legend()
plt.show()
Output:
Plot the Data and Regression Line
6. Printing the Final Model Parameters
After training, we print the final parameters of the model, which include the slope and intercept.
These values are the result of optimizing the model using SGD.
print(f"Final parameters: {theta_final}")
Output:
Final parameters: [[4.35097872] [3.45754277]]
The final parameters returned by the model are:
θ_0 = 4.35, θ_1 = 3.45
Then the fitted linear regression model will be:
y = 4.35 + 3.45 · X
This means:
• When X = 0, y = 4.35 (the intercept or bias term).
• For each unit increase in X, y will increase by 3.45 units (the slope or coefficient).
Advantages of Stochastic Gradient Descent
1. Efficiency: Because it uses only one or a few data points to calculate the gradient,
SGD can be much faster, especially for large datasets. Each step requires fewer
computations, leading to quicker convergence.
2. Memory Efficiency: Since it does not require storing the entire dataset in memory for
each iteration, SGD can handle much larger datasets than traditional gradient descent.
3. Escaping Local Minima: The noisy updates in SGD, caused by the stochastic nature of
the algorithm, can help the model escape local minima or saddle points, potentially
leading to better solutions in non-convex optimization problems (common in deep
learning).
4. Online Learning: SGD is well-suited for online learning, where the model is trained
incrementally as new data comes in, rather than on a static dataset.
Challenges of Stochastic Gradient Descent
1. Noisy Convergence: Since the gradient is estimated based on a single data point (or a
small batch), the updates can be noisy, causing the cost function to fluctuate rather
than steadily decrease. This makes convergence slower and more erratic than in batch
gradient descent.
2. Learning Rate Tuning: SGD is highly sensitive to the choice of learning rate. A
learning rate that is too large may cause the algorithm to diverge, while one that is too
small can slow down convergence. Adaptive methods like Adam and RMSprop address
this by adjusting the learning rate dynamically during training.
3. Long Training Times: While each individual update is fast, the convergence might
take a longer time overall since the steps are more erratic compared to batch gradient
descent.
Variants of Stochastic Gradient Descent
While traditional SGD works well, there are several improvements and variants designed to
improve convergence and stability:
• Mini-batch SGD: Instead of using a single data point, mini-batch SGD uses a small
batch of data points to calculate the gradient. This strikes a balance between the
efficiency of SGD and the stability of batch gradient descent. It reduces the noise in the
updates while maintaining the computational efficiency.
• Momentum: Momentum helps accelerate SGD by adding a fraction of the previous
update to the current one. This allows the algorithm to keep moving in the same
direction and can help overcome oscillations in the cost function.
• Adaptive Methods (example: Adam, RMSprop): These methods dynamically adjust
the learning rate for each parameter. Adam, for example, uses both the average of the
gradients (first moment) and the average of the squared gradients (second moment) to
compute an adaptive learning rate, improving convergence and stability.
Applications of Stochastic Gradient Descent
SGD and its variants are widely used across various domains of machine learning:
• Deep Learning: In training deep neural networks, SGD is the default optimizer due to
its efficiency with large datasets and its ability to work with large models. Deep
learning frameworks like TensorFlow and PyTorch typically use variants
like Adam or RMSprop, which are based on SGD.
• Natural Language Processing (NLP): Models like Word2Vec and transformers are
trained using SGD variants to optimize large models on vast text corpora.
• Computer Vision: For tasks such as image classification, object detection and
segmentation, SGD has been fundamental in training convolutional neural networks
(CNNs).
• Reinforcement Learning: SGD is also used to optimize the parameters of models
used in reinforcement learning, such as deep Q-networks (DQNs) and policy gradient
methods.
From the above graph and data, we can observe that the loss decreases as the weights and bias
are adjusted.
Now we have found the optimal weight and bias values. Print the optimal weight and bias:
w = model.linear.weight
b = model.linear.bias
print('Optimal weight:', w.detach().numpy())
print('Optimal bias:', b.detach().numpy())
Batch normalization is a technique used to improve the training of deep neural networks by
stabilizing the learning process. It addresses the issue of internal covariate shift where the
distribution of each layer's inputs changes during training as the parameters of the previous
layers change.
Batch Normalization in CNN addresses several challenges encountered during training.
1. Addressing Internal Covariate Shift: Internal covariate shift occurs when the
distribution of network activations changes as parameters are updated during training.
It addresses this by normalizing the activations in each layer. This stabilizes training
and speeds up convergence.
2. Improving Gradient Flow: It contributes to stabilizing the gradient flow
during backpropagation by reducing the gradients' dependence on the scale of the parameters.
As a result, training becomes more stable, enabling effective training of deeper networks
without facing issues like vanishing or exploding gradients.
3. Regularization Effect: During training it introduces noise to the network activations.
This noise helps to handle overfitting by adding randomness to the system.
How Does Batch Normalization Work in CNN?
Batch normalization works in convolutional neural networks (CNNs) by normalizing the
activations of each layer across mini-batch during training. The working is discussed below:
1. Normalization within Mini-Batch
In a CNN, each layer receives inputs from multiple channels (feature maps) and processes them
through convolutional filters. Batch Normalization operates on each feature map separately,
normalizing the activations across the mini-batch.
During training, batch normalization (BN) normalizes the activations of each layer by subtracting
the mean and dividing by the standard deviation of each mini-batch.
• Mean Calculation: μ_B = (1/m) ∑_{i=1}^{m} x_i
• Variance Calculation: σ_B² = (1/m) ∑_{i=1}^{m} (x_i − μ_B)²
• Normalization: x̂_i = (x_i − μ_B) / √(σ_B² + ε)
2. Scaling and Shifting
After normalization, it adjusts the normalized activations using learned scaling and shifting
parameters. These parameters enable the network to adaptively scale and shift the
activations, thereby maintaining the network's ability to represent complex patterns in the data.
• Scaling: y_i = γ x̂_i
• Shifting: z_i = y_i + β
3. Learnable Parameters
The parameters γ and β are learned during training through backpropagation. This allows the
network to adjust the normalization and ensure that the activations are in the appropriate range
for learning.
4. Applying Batch Normalization
It is typically applied after the convolutional and activation layers in a CNN before passing the
outputs to the next layer. It can also be applied before or after the activation function, depending
on the network architecture.
5. Training and Inference
During training, Batch Normalization calculates the mean and variance of each mini-batch. During
testing, it uses the averaged mean and variance that are calculated during training to normalize
the activations. This ensures consistent normalization between training and testing.
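In PyTorch, for example, this switch between batch statistics and running averages is controlled by the module's training mode (a minimal sketch):
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(16)
x = torch.randn(8, 16)

bn.train()            # training mode: normalize with batch statistics, update running mean/var
out_train = bn(x)

bn.eval()             # inference mode: normalize with the stored running statistics
out_eval = bn(x)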
Applying Batch Normalization in CNN model using TensorFlow
For applying batch normalization layers after the convolutional layers, we use TensorFlow's
'tf.keras.layers.BatchNormalization()'.
1. Importing Required Libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, BatchNormalization
2. Creating Sequential Model
First Convolutional Block
• Conv2D(32): Extracts low-level features (edges, textures).
• BatchNormalization(): Stabilizes and speeds up training.
• MaxPooling2D(): Reduces spatial size.
Second Convolutional Block
• Conv2D(64): Learns deeper patterns.
• BatchNormalization(): Normalizes activations.
• MaxPooling2D(): Further reduces size.
Dense Layers(Classifier)
• Flatten(): Converts 3D feature map to 1D.
• Dense(64): Learns high-level combinations.
• Dense(10, softmax): Outputs probabilities for 10 classes.
model = Sequential([
#First convolutional block
Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
BatchNormalization(),
MaxPooling2D((2, 2)),
#Second convolutional block
Conv2D(64, (3, 3), activation='relu'),
BatchNormalization(),
MaxPooling2D((2, 2)),
#Connecting layers
Flatten(),
Dense(64, activation='relu'),
Dense(10, activation='softmax')
])
Applying Batch Normalization in 1D CNN model using PyTorch
In PyTorch, we can easily apply batch normalization in a CNN model. For applying BN in 1D
Convolutional Neural Network model, we use 'nn.BatchNorm1d()'.
1. Importing Libraries
import torch
import torch.nn as nn
2. Defining the Model
In this step, we structure our model:
• Conv1d(3, 16): First convolution layer that transforms 3 input channels to 16 feature
maps using 3-sized filters.
• BatchNorm1d(16): Normalizes the 16 output channels to improve training stability.
• Conv1d(16, 32): Second convolutional layer, increasing feature channels to 32.
• BatchNorm1d(32): Normalizes the output from the second conv layer.
• Linear(32 * 28, 10): Fully connected layer that maps the flattened feature map to 10
output classes.
class CNN1D(nn.Module):
def __init__(self):
super(CNN1D, self).__init__()
self.conv1 = nn.Conv1d(3, 16, kernel_size=3, stride=1, padding=1)
self.bn1 = nn.BatchNorm1d(16)
self.conv2 = nn.Conv1d(16, 32, kernel_size=3, stride=1, padding=1)
self.bn2 = nn.BatchNorm1d(32)
self.fc = nn.Linear(32 * 28, 10)
3. Forward Pass
Prediction step and input flow:
• self.conv1 -> bn1 -> ReLU: Applies the first convolution and activates.
• self.conv2 -> bn2 -> ReLU: Applies the second convolution and activates.
• view(-1, 32 * 28): Flattens the 3D tensor into 2D for the dense layer.
• self.fc: Final layer that outputs a vector of size 10 (class scores).
def forward(self, x):
x = torch.relu(self.bn1(self.conv1(x)))
x = torch.relu(self.bn2(self.conv2(x)))
x = x.view(-1, 32 * 28)
x = self.fc(x)
return x
4. Initialize Model
model = CNN1D()
This code defines and creates a simple 1D Convolutional Neural Network (CNN1D) in PyTorch
for classification tasks.
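As a quick sanity check (assuming a sequence length of 28, which the 32 * 28 in the fully connected layer implies), a dummy batch can be passed through the model:
# Dummy batch of shape (batch_size, channels, sequence_length)
x = torch.randn(8, 3, 28)
out = model(x)
print(out.shape)   # torch.Size([8, 10])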
Applying Batch Normalization in 2D CNN model using PyTorch
For applying Batch Normalization in 2D Convolutional Neural Network model, we use
'nn.BatchNorm2d()'.
1. Importing Libraries
import torch
import torch.nn as nn
2. Structuring Model
In this step, we define the model:
• Conv2d(3, 16, 3, 1, 1): Applies 16 filters to a 3-channel image using 3×3 kernels.
Padding keeps the image size unchanged.
• BatchNorm2d(16): Normalizes the 16 output feature maps from conv1.
• Conv2d(16, 32, 3, 1, 1): Applies 32 filters, again preserving spatial dimensions.
• BatchNorm2d(32): Normalizes the output of conv2.
• Linear(32*28*28, 10): Fully connected layer that flattens the feature map and
outputs scores for 10 classes.
class CNN(nn.Module):
def __init__(self):
super(CNN, self).__init__()
self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
self.bn1 = nn.BatchNorm2d(16)
self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
self.bn2 = nn.BatchNorm2d(32)
self.fc = nn.Linear(32 * 28 * 28, 10)
3. Forward Pass
Prediction step and input flow:
• Conv1 -> BN -> ReLU: First feature extraction block.
• Conv2 -> BN -> ReLU: Second feature extraction block.
• view(-1, 32*28*28): Flattens the 3D output to 1D for the dense layer.
• fc: Maps the extracted features to 10 output classes.
def forward(self, x):
x = torch.relu(self.bn1(self.conv1(x)))
x = torch.relu(self.bn2(self.conv2(x)))
x = x.view(-1, 32 * 28 * 28)
x = self.fc(x)
return x
4. Model Initialization
model = CNN()
This code defines a 2D Convolutional Neural Network (CNN) in PyTorch for image
classification into 10 classes.
In conclusion, batch normalization is a valuable technique for enhancing the training and
performance of convolutional neural networks (CNNs).
Gradient Descent is an optimization algorithm in machine learning used to determine the optimal
parameters such as weights and bias for models. The idea is to minimize the model's error by
iteratively updating the parameters in the direction of the steepest descent as determined by the
gradient of the loss function.
Depending on how much data is used to compute the gradient during each update, gradient
descent comes in three main variants:
• Batch Gradient Descent
• Stochastic Gradient Descent (SGD)
• Mini-Batch Gradient Descent
Each variant has its own strengths and trade-offs in terms of speed, stability and convergence
behavior.
Convergence in BGD, SGD & MBGD
Working of Mini-Batch Gradient Descent
Mini-batch gradient descent is an optimization method that updates model parameters using small
subsets of the training data called mini-batches. This technique offers a middle path between
the high variance of stochastic gradient descent and the high computational cost of batch
gradient descent. Small batches are used to perform each update, making training faster and
more memory-efficient. It also helps stabilize convergence and introduces beneficial randomness
during learning.
It is often preferred in modern machine learning applications because it combines the benefits of
both batch and stochastic approaches.
Key advantages of mini-batch gradient descent:
• Computational Efficiency: Supports parallelism and vectorized operations on GPUs
or TPUs.
• Faster Convergence: Provides more frequent updates than full-batch which improves
speed.
• Noise Reduction: Less noisy than stochastic updates which leads to smoother
convergence.
• Better Generalization: Introduces slight randomness to help escape local minima.
• Memory Efficiency: Doesn’t require loading the entire dataset into memory.
Algorithm:
Let:
• θ = model parameters
• max_iters = number of epochs
• η = learning rate
For itr = 1, 2, 3, …, max_iters:
• Shuffle the training data. It is optional but often done for better randomness in mini-
batch selection.
• Split the dataset into mini-batches of size b.
For each mini-batch (X_mini, y_mini):
1. Forward Pass on the batch X_mini:
Make predictions on the mini-batch
ŷ = f(X_mini, θ)
Compute the error in predictions J(θ) with the current values of the parameters
J(θ) = L(ŷ, y_mini)
2. Backward Pass:
Compute gradient:
∇_θ J(θ) = ∂J(θ)/∂θ
3. Update parameters:
Gradient descent rule:
θ = θ − η ∇_θ J(θ)
Python Implementation
Here we will use Mini-Batch Gradient Descent for Linear Regression.
1. Importing Libraries
We begin by importing NumPy and Matplotlib's pyplot.
import numpy as np
import matplotlib.pyplot as plt
2. Generating Synthetic 2D Data
Here, we generate 8000 two-dimensional data points sampled from a multivariate normal
distribution:
• The data is centered at the point (5.0, 6.0).
• The cov matrix defines the variance and correlation between the features. A value
of 0.95 indicates a strong positive correlation between the two features.
mean = np.array([5.0, 6.0])
cov = np.array([[1.0, 0.95], [0.95, 1.2]])
data = np.random.multivariate_normal(mean, cov, 8000)
3. Visualizing Generated Data
plt.scatter(data[:500, 0], data[:500, 1], marker='.')
plt.title("Scatter Plot of First 500 Samples")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()
Output:
4. Splitting Data
We split the data into training and testing sets:
• Original data shape: (8000, 2)
• New shape after adding bias: (8000, 3)
• 90% of the data is used for training and 10% for testing.
data = np.hstack((np.ones((data.shape[0], 1)), data)) # shape: (8000, 3)
split_factor = 0.90
split = int(split_factor * data.shape[0])
X_train = data[:split, :-1]
y_train = data[:split, -1].reshape((-1, 1))
X_test = data[split:, :-1]
y_test = data[split:, -1].reshape((-1, 1))
5. Displaying Datasets
print("Number of examples in training set = %d" % X_train.shape[0])
print("Number of examples in testing set = %d" % X_test.shape[0])
Output:
results
6. Defining Core Functions of Linear Regression
• Hypothesis(X, theta): Computes the predicted output using the linear model h(X)=X⋅θ
• Gradient(X, y, theta): Calculates the gradient of the cost function which is used to
update model parameters during training.
• Cost(X, y, theta): Computes the Mean Squared Error (MSE).
# Hypothesis function
def hypothesis(X, theta):
    return np.dot(X, theta)
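The gradient and cost helpers described in the bullets above are not shown in the excerpt; a minimal sketch consistent with those descriptions could be:
# Gradient of the MSE cost for a (mini-)batch (X, y)
def gradient(X, y, theta):
    h = hypothesis(X, theta)
    return (1 / y.size) * np.dot(X.T, (h - y))

# Mean Squared Error cost
def cost(X, y, theta):
    h = hypothesis(X, theta)
    return (1 / (2 * y.size)) * np.sum((h - y) ** 2)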
The excerpts below show how an adaptive optimizer such as Adagrad can be configured, first in
TensorFlow/Keras and then in PyTorch.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1))  # flatten 28x28 to 784
])

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(784, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = SimpleModel()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adagrad(model.parameters(), lr=0.01)
7 Best Deep Learning Frameworks You Should Know in 2025
As technology moves at this speed, we are becoming more adaptive toward it. In recent years there
has been a lot of buzz about Deep Learning, especially in data science, and it is widely used today
in various industries. Deep Learning has become one of the most significant technologies of our
time, powering self-driving cars, automated tasks, AI-based voice-overs and more; it operates in
almost every domain to make work simpler and more advanced.
By the end of 2025, the deep learning market is projected to reach approximately USD 34.29 billion, with
a CAGR of 38.8%.
What is a Deep Learning Framework?
A deep learning framework is a software library or tool that includes a set of APIs (Application
Programming Interfaces), abstractions, and tools to assist developers in building and training deep
learning models. Deep learning frameworks could help you upload data and train a deep learning model that
would lead to accurate and intuitive predictive analysis. These frameworks simplify the process of creating
and deploying neural networks, allowing researchers and engineers to focus on complicated machine-learning
tasks.
7 Best Deep Learning Frameworks You Should Know :
Explore these deep-learning frameworks designed to advance your projects and boost your results. Whether
you're a beginner eager to work on your first project or an experienced developer trying to find something
new in AI innovation, this list provides you with the knowledge to choose the perfect framework for your
needs.
Table of Content
• 1. TensorFlow
• 2. PyTorch
• 3. Keras
• 4. Theano
• 5. Deeplearning4j (DL4J)
• 6. Scikit-learn
• 7. Sonnet
1. TensorFlow
TensorFlow is one of the most popular, open-source libraries that is being heavily used for numerical
computation in deep learning. Google introduced it in 2015 for its internal R&D work, but after
seeing the framework's capabilities they decided to open-source it; the repository is available
at TensorFlow Repository. Learning deep learning is fairly complex, but frameworks like this make
implementing models far easier and getting to the desired outcomes much smoother.
How Does it Work?
This framework allows you to create dataflow graphs that specify how data travels through a
graph, with inputs represented as tensors (multi-dimensional arrays). TensorFlow lets users
define such a graph and, based on their inputs, it generates the output.
Applications of Tensor Flow:
• Text-Based Applications: Text-based apps are heavily used in the market,
including language detection and sentiment analysis (for social media, to block abusive posts).
• Image Recognition (I-R) Based Systems: Most sectors have introduced this technology in their
systems for motion, facial and photo-clustering models.
• Video Detection: Real-time object detection is a computer vision technique that detects motion
(from both images and video) to track objects in the provided data.
2. PyTorch
One of the most famous frameworks, which even powers "Tesla Autopilot", is PyTorch, built on deep
learning technology. It was first introduced in 2016 by a group of people (Adam Paszke, Sam Gross,
Soumith Chintala and Gregory Chanan) under Facebook's AI lab. An interesting aspect of PyTorch is
that it can be used from both C++ and Python, though the Python interface is the most polished.
Unsurprisingly, PyTorch is backed by some of the top giants in the tech industry (Google, Salesforce,
Uber, etc.). It was introduced to achieve two major goals: the first is to remove the dependency
on NumPy (so that tensor computation can be powered by the GPU) and the second is to offer an
automatic differentiation library (useful for implementing neural networks).
How Does it Work?
This framework uses a dynamic computational graph that is built as variables are declared. Besides
this, it uses Python's basic concepts like loops, structures, etc. We often use NLP functions on our
smartphones (such as Apple's Siri or Google Assistant), and they rely on deep learning algorithms
known as RNNs or Recurrent Neural Networks.
Applications of PyTorch:
• Weather Forecast: To predict and highlight the pattern of a particular set of data, PyTorch is
being used (not only for forecast but also for real-time analysis).
• Text Auto Detection: You might have noticed that sometimes when we try to search for
something on Google or any other search engine, it starts showing some “auto-suggestions”, that’s
where the algorithm works and PyTorch is being used.
• Fraud Detection: To prevent any unauthorized activities on credit/debit cards, this algorithm is
being used to detect anomalous behavior and outliers.
3. Keras
Keras is another highly productive library that focuses on solving deep learning problems. Besides this, Keras
also helps engineers to take full advantage of the scalability and cross-platform capabilities to apply within
their projects. It was first introduced in 2015 under the ONEIROS (Open-ended Neuro-Electronic
Intelligent Robot Operating System) project. Keras is an open-source platform and is being actively used
as a part of Python’s interface in machine learning and deep neural learning. Today, big tech giants
like Netflix, Uber, etc. are using Keras actively to improve their scalability.
How Does it Work?
The architecture of Keras is designed so that it acts as a high-level neural network API
(written in Python). It works as a wrapper for low-level libraries such
as TensorFlow or Theano. It was introduced with the idea of enabling fast testing and
experimentation before scaling up to full production.
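A minimal sketch (illustrative only) of how a small model is defined and compiled through this high-level interface:
from tensorflow import keras

# a tiny fully connected network wrapped behind Keras' high-level API
model = keras.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(4,)),
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()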
Applications of Keras:
• Today, companies use Keras to power machine learning and deep learning features on
smartphones. Apple is one of the biggest giants to have incorporated this technology in the
past few years.
• In the healthcare industry, developers have built predictive technology where the machine can
predict a patient's diagnosis and can also raise alerts for pre-heart-attack conditions.
• Face Mask Detection: During the pandemic, many companies offered various contributions,
and some built deep learning systems that use facial recognition to detect whether a person is
wearing a face mask or not. (Nokia was among the companies to initiate this using the Keras
library.)
4. Theano
To define mathematical expressions in deep learning, we can use the Python library Theano. It was named
after the Greek mathematician Theano. It was released in 2007 by MILA (Montreal Institute for
Learning Algorithms), and it uses a host of clever code optimizations to squeeze as much performance
as possible out of your hardware. Besides this, there are two salient features at the core of any deep-
learning library:
• The tensor operations, and
• The capability to run the code on a CPU or a Graphics Processing Unit (GPU).
These two features enable us to work with large volumes of data. Moreover, Theano offers automatic
differentiation, a very useful feature that also makes it applicable to numeric optimization problems
beyond deep learning.
How Does it Work?
Theano itself is effectively dead, but the deep learning frameworks built on top of it are still
functioning. These include the more user-friendly frameworks Keras, Lasagne and Blocks, which offer
a high-level interface for fast prototyping and model testing in deep learning and machine learning.
Applications of Theano:
• Implementation Cycle: Theano works in 3 steps: it starts by defining the objects/variables,
then defines the mathematical expressions (in the form of functions) and finally evaluates those
expressions by passing values to them (see the short sketch after this list).
• Companies like IBM have used Theano for implementing neural networks and to enhance their
efficiency.
• To use Theano, make sure you have pre-installed the following
dependencies: Python, NumPy, SciPy, and BLAS (for matrix operations).
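A minimal sketch of that three-step cycle (Theano is no longer maintained, so this is for illustration only):
import theano
import theano.tensor as T

# step 1: define the variable, step 2: define the expression, step 3: compile and evaluate
x = T.dscalar('x')
y = x ** 2
square = theano.function([x], y)
print(square(4.0))                          # 16.0
grad = theano.function([x], T.grad(y, x))   # automatic differentiation
print(grad(4.0))                            # 8.0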
5. Deeplearning4j (DL4J)
Deeplearning4j (DL4J) is a free tool, a deep learning framework for building applications using Java and
Scala. It was created by Skymind. The reason behind its popularity is that it works well with existing Java-
based systems, thanks to its compatibility with the Java Virtual Machine (JVM). DL4J lets developers make
and use strong models for things like recognizing images and speech, understanding language, and making
predictions.
How Does it Work?
Deeplearning4j (DL4J) helps developers by providing a set of libraries for Java and Scala programmers to
build and deploy deep learning models. It takes advantage of the Java Virtual Machine (JVM) for
compatibility and supports various neural network architectures. Its emphasis on distributed computing
enables efficient training of large-scale models across multiple machines.
Application of Deeplearning4j (DL4J)
• DL4J is suited for integration with existing systems due to its compatibility with the JVM.
• It is good at training large deep-learning models because it can work on many machines at once.
• It is widely used in various domains such as image and speech recognition, natural language
processing, and predictive analytics, making it a versatile choice for different tasks.
6. Scikit-learn
Originating as a SciPy Toolkit (scikit), Scikit-learn was designed to operate on top of high-performance
linear algebra libraries. It was first introduced back in 2007 during the Google Summer of Code project
by David Cournapeau. The library is built on frameworks such as NumPy, SciPy and Matplotlib and is
written in Python. The objective of scikit-learn is to offer efficient tools for deep learning,
machine learning and statistical modeling, including:
• Regression (Linear and Logistic)
• Classification (K-Nearest Neighbors)
• Clustering (K-means and K-means++)
• Model selection
• Preprocessing (min-max normalization), and
• Dimensionality reduction (used for visualization, summarization and feature selection)
Moreover, it offers both supervised and unsupervised algorithms.
How Does it Work?
The sole purpose of introducing this library is to achieve the level of robustness and support required for
use in production systems, which means a deep focus on concerns that include ease of use, code quality,
collaboration, documentation and performance. Although the interface is Python, C libraries such as NumPy
are leveraged under the hood for fast array and matrix operations.
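As a quick illustration (a minimal sketch, not tied to any particular application above), a classifier can be trained and evaluated in a few lines:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# load a small built-in dataset, split it, fit a model and score it on held-out data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print(clf.score(X_test, y_test))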
Application of Scikit-learn
• Companies like Spotify, Inria and J.P. Morgan actively use this framework for statistical
analysis and machine learning tasks.
• It can model user behavior and display outputs based on their activity.
• It helps in collecting data, analyzing the statistics and producing outputs that reflect what
users would want to see.
7. Sonnet
Sonnet, crafted by DeepMind, is a high-level toolkit for creating sophisticated neural network architectures
in TensorFlow. This deep learning framework is built on top of TensorFlow. It seeks to construct and generate
Python objects that correspond to certain parts of a neural network. These objects are then individually linked
to the computational TensorFlow graph. This approach of independently building Python objects and attaching
them to a graph simplifies the construction of high-level structures. This is one of the greatest deep-learning
frameworks available.
How Does it Work?
It simplifies the creation of models through high-level abstractions, modular design, and efficient
parameter management. Sonnet also works well with TensorFlow, making it easier for scientists and
developers to create and train smart systems for different jobs. It's like a friendly assistant that makes
constructing and optimizing sophisticated models straightforward.
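For illustration (a minimal sketch, not from the original text), a Sonnet module such as snt.nets.MLP is just a Python object that builds its variables when first called on a tensor:
import tensorflow as tf
import sonnet as snt   # Sonnet 2

# a two-layer MLP module; variables are created on the first call
mlp = snt.nets.MLP([64, 10])
out = mlp(tf.ones([8, 32]))   # batch of 8 examples, 32 features each
print(out.shape)              # (8, 10)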
Application of Sonnet
• Sonnet plays a crucial role in advanced neural network research, offering a flexible framework
for quickly trying out new ideas in model architectures and optimization methods.
• In NLP, it is used to build language models like transformers, making it great for tasks such as
understanding and generating text.
• For computer vision tasks such as recognizing images and finding objects, Sonnet is very
valuable. It smoothly works with TensorFlow and supports GPU acceleration, making designing
and training models efficient.
Deep learning is going to further transform the world as we know it and lead the way in most
industries across the globe. The most important part of this technology is the algorithms used
to create and train the models. These algorithms are starting to dominate sectors as diverse as
healthcare, autonomous vehicles and finance by analyzing and learning from huge datasets. The
availability of advanced algorithms, powerful computing technology and a wealth of data has made
deep learning the leading subfield of AI, paving the way for new and better solutions and, thus,
for technological progress.
Top 10 Deep Learning Algorithms
In this article, we highlight the top 10 deep learning algorithms in 2025. From Convolutional
Neural Networks (CNNs) to Generative Adversarial Networks (GANs), these algorithms are
driving innovations in various industries. We will also take a look at their key mechanisms which
define them and their key functionalities. But before we deep-dive into those algorithms, let us
familiarize ourselves with the concept of deep learning.
Table of Content
• What is Deep Learning?
• What are Deep Learning Algorithms?
• 1. Convolutional Neural Network (CNN)
• 2. Recurrent Neural Network (RNN)
• 3. Long Short-Term Memory (LSTM)
• 4. Auto-Encoders
• 5. Deep Belief Network (DBN)
• 6. Generative Adversarial Network (GAN)
• 7. Self-Organizing Map (SOM)
• 8. Variational Autoencoders (VAEs)
• 9. Graph Neural Networks (GNNs)
• 10. Transformers
What is Deep Learning?
Deep learning is a subfield of machine learning, which is itself a part of artificial intelligence,
that focuses on the use of many layered neural networks to train themselves on large amounts
of data. Developed based on the idea of biological brains, these networks are able to learn from
data without being programmed explicitly, which makes deep learning particularly effective for
tasks that involve images, speech, natural language, and many other kinds of input data.
Traditional machine learning is less efficient at dealing with complex and unstructured data;
deep learning models excel here, and their effectiveness keeps improving with the size of the
dataset and the computational resources available.
Learn more: Deep Learning Tutorial
Emergence of Deep Learning: A Quick Look Back
The fascinating field of Deep Learning has been around longer than you might think. It was first
introduced in the 1940s, with the development of the perceptron in the late 1950s acting as a
cornerstone of modern deep learning. The evolution of deep learning has been marked by
remarkable breakthroughs, often spurred by progress in computer processing power, the
availability of vast amounts of data, and algorithmic refinements.
What are Deep Learning Algorithms?
Deep learning algorithms are a specific type of machine learning model based on the
principles of the human brain. These algorithms use artificial neural networks to process data,
where each network consists of connected nodes, or neurons. Deep learning algorithms differ from
regular machine learning models in that they can learn complex patterns from datasets without
manual feature extraction. Because of this, they are very successful in application areas such
as image classification, speech recognition and natural language processing.
Top 10 Deep Learning Algorithms in 2025
1. Convolutional Neural Network (CNN)
Convolutional Neural Networks are advanced forms of neural networks which are primarily
employed in various tasks that involve images and videos. They are designed to learn features
directly from the data, automatically detecting patterns such as edges, textures and shapes,
thus making them very useful for applications like object detection, medical imaging and facial
recognition.
Key Mechanisms:
• Convolution Layer: It applies filters (kernels) on the input data (e.g. an image) to
identify basic features like edges or corners. Each filter slides over the image to capture
local patterns.
• Pooling Layer: After detecting the features, the pooling layer down samples the
data, retaining only the most significant features, thereby enhancing the computational
efficiency of the model.
• Fully Connected Layer: After the convolution and pooling operations, the extracted
features are passed through a fully connected layer to make the prediction about the
class of the input.
• Activation Function: An activation function is a mathematical function that is used in
neural networks to introduce non-linearity, and thereby enables the model to learn
complex patterns and make better predictions.
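A minimal Keras sketch (with illustrative sizes) showing these four mechanisms stacked together:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(16, (3, 3), padding='same', activation='relu'),  # convolution + activation
    layers.MaxPooling2D((2, 2)),                                   # pooling (downsampling)
    layers.Flatten(),
    layers.Dense(10, activation='softmax')                         # fully connected prediction
])
model.summary()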
2. Recurrent Neural Network (RNN)
RNNs are designed for sequential data such as time series or natural language. Traditional
neural networks differ from RNNs as RNNs have a memory that keeps information from the
previous steps, making them suitable for applications like speech recognition, language
translation, and stock price prediction.
Key Mechanisms:
• Sequential Processing: RNNs process data one step at a time, and output at each
step depends on the current input and the previous step's output, effectively
capturing temporal patterns.
• Hidden States: RNNs maintain hidden states that are updated after each step, enabling the
network to remember past information. These states are also fed into the next step in the
sequence.
• Weight Sharing: RNNs use the same weights across time steps, which is useful
when dealing with sequences of varying length, and make the models more efficient.
• Backpropagation Through Time (BPTT): During training, the error is propagated backwards
through the time steps and the shared weights are adjusted, so the network learns to better
predict each part of the sequence.
3. Long Short-Term Memory (LSTM)
LSTM is a particular kind of RNN designed to overcome the vanishing gradient problem. It can
learn long-term dependencies in data and therefore finds application in language modeling, text
generation and video analysis.
Key Mechanisms:
• Cell State: The LSTMs keep a state called cell state which is the long term memory
of the network. It can store, update or forget information over time, helping the
network keep track of important information.
• Forget Gate: This gate decides what information from the previous cell state
should be discarded, allowing the network to forget some information.
• Input Gate: It controls the input of new information to the cell state, and hence
what is added to the memory.
• Output Gate: This gate controls what information from the cell state is outputted
to the next layer or time step.
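A minimal Keras sketch (illustrative sizes) of an LSTM applied to sequence classification; the gates described above are handled internally by the LSTM layer:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(20, 8)),              # 20 time steps, 8 features per step
    layers.LSTM(32),                          # cell state and gates managed internally
    layers.Dense(1, activation='sigmoid')     # e.g. one binary prediction per sequence
])
model.compile(optimizer='adam', loss='binary_crossentropy')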
4. Auto-Encoders
Auto-encoders are unsupervised learning models used to reduce the dimensionality of data.
They learn to compress input data into a lower-dimensional representation and then reconstruct
it back to its original form, making them useful for tasks like data compression and anomaly
detection.
Key Mechanisms:
• Encoder: The encoder part of the network compresses the input data into a lower-dimensional
representation. It learns the most important characteristics of the
input data.
• Bottleneck: The bottleneck layer is implemented to make the network learn a
compact representation of the input, identifying crucial characteristics.
• Decoder: The decoder attempts to synthesize the original input from the encoded
data, trying to make the output match the original input as much as possible.
• Loss Function: The model uses a loss function, such as Mean Squared Error, for
defining the error between the input and output of the model.
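A minimal Keras sketch (illustrative sizes) of the encoder, bottleneck and decoder structure described above:
from tensorflow import keras
from tensorflow.keras import layers

autoencoder = keras.Sequential([
    layers.Input(shape=(64,)),
    layers.Dense(32, activation='relu'),    # encoder
    layers.Dense(8, activation='relu'),     # bottleneck (compact representation)
    layers.Dense(32, activation='relu'),    # decoder
    layers.Dense(64, activation='sigmoid')  # reconstruction of the input
])
autoencoder.compile(optimizer='adam', loss='mse')   # reconstruction (Mean Squared Error) loss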
5. Deep Belief Network (DBN)
Deep Belief Networks are composed of multiple layers of Restricted Boltzmann Machines
(RBMs) stacked together. They are often used for feature learning, image
recognition, and unsupervised pretraining.
Key Mechanisms:
• Layered Structure: DBNs are a kind of deep neural networks (DNNs) which are
constructed by stacking several layers of Restricted Boltzmann machines (RBMs).
Each RBM is responsible for learning features from the data and increasing the level of
complexity with each subsequent layer.
• Unsupervised Pretraining: The layers are pretrained in an unsupervised manner, and
each RBM tries to learn the distribution of the data.
• Fine-Tuning: After that, the network is fine-tuned for actual labeled data in order to
enhance the performance on certain tasks, like classification.
• Stochastic Units: The RBMs utilize stochastic (probabilistic) units, which determine
the activation of each unit by probability, enabling the network to learn complicated,
non-linear relationships.
6. Generative Adversarial Network (GAN)
GANs use two models: a Generator and a Discriminator. The Generator produces fake
data (e.g., images), and the Discriminator checks whether the data is real or fake. GANs are probably
the most popular model for creating realistic images, videos and even deepfakes.
Key Mechanisms:
• Generator: The Generator takes random noise as input and learns to create synthetic
data that looks similar to real data, e.g. images or text.
• Discriminator: The Discriminator evaluates the generated data, compares it to real
data and provides feedback to the Generator.
• Adversarial Training: The Generator and Discriminator are trained together in
an adversarial training process where each is attempting to fool the other. The
Generator wants to create more plausible data while the Discriminator tries to get
better at telling real data from fake.
• Loss Function: The models are trained with a specific type of loss function that
determines the discrepancy between the Discriminator's output and the actual class
labels to further enhance both networks' training process.
7. Self-Organizing Map (SOM)
Self-Organizing Maps are a type of unsupervised learning model used to map high-
dimensional data to a lower-dimensional grid. They are particularly useful for clustering and
visualizing complex data.
Key Mechanisms:
• Neuron Grid: The network has a grid of neurons, each neuron being a representation
of a cluster of similar data points.
• Competitive Learning: Neurons respond to input data by competing for it, updating
the weights of the 'winner' neuron with the input.
• Neighborhood Function: Neurons near the winner also update their weights,
helping the network learn the similarities in the data and preserve its structure.
• Topological Preservation: SOMs maintain the topological relationships of the data,
so that the similar data points end up near each other on the map.
8. Variational Autoencoders (VAEs)
Variational Autoencoders are a probabilistic version of autoencoders used for generative
tasks. VAEs learn a distribution of the data and generate new data by sampling from that
distribution.
Key Mechanisms:
• Encoder: The encoder is in charge of learning a compressed representation of the
input in the form of a probabilistic distribution, typically in terms of the mean and
variance.
• Latent Space: This distribution is then sampled to draw new points from the
latent space, enabling the model to generate entirely new data.
• Decoder: The decoder learns to reconstruct the data from the sampled latent
variables, creating synthetic output.
• KL Divergence: The model learns to minimize the Kullback-Leibler divergence, so
that the learned distribution is close to a standard prior distribution, like a normal
distribution.
9. Graph Neural Networks (GNNs)
Graph Neural Networks are designed to work with graph-structured data, such as social
networks, molecular structures, and recommendation systems. They capture relationships
between nodes and edges in the graph to make predictions or understand the structure.
Key Mechanisms:
• Node Aggregation: Nodes collect information from their neighboring nodes to build a
better representation of their context.
• Message Passing: Information is passed between adjacent nodes in the graph to
capture dependencies and relationships between entities, helping the model learn
them.
• Graph Pooling: This mechanism creates a global representation of the graph by
learning the information from all of the nodes.
• Backpropagation: The optimization of the node features is achieved through
the standard backpropagation so as to enhance the learning process and the
prediction of graph-based tasks.
10. Transformers
Transformers are widely used in Natural Language Processing (NLP) tasks like machine
translation, text generation, and sentiment analysis. They are based on self-attention
mechanisms that help models capture long-range dependencies in data.
Key Mechanisms:
• Self-Attention: Every token in the input sequence can attend to every other token,
capturing long-range dependencies without processing the sequence strictly in
order.
• Multi-Head Attention: Multiple attention mechanisms work in parallel, capturing
different types of relationships between tokens.
• Positional Encoding: Since transformers are not sequential in processing the data,
positional encodings are employed to provide information about the position of the
tokens in the sequence.
• Feedforward Layers: After the attention mechanisms, the information is then
passed through fully connected layers, which are able to process and transform the
data for further tasks like classification or generation.
In a complete CNN, the convolution and pooling layers are typically followed by these stages:
• Flattening: The resulting feature maps are flattened into a one-dimensional vector
after the convolution and pooling layers so they can be passed into a fully connected
layer for classification or regression.
• Fully Connected Layers: They take the input from the previous layer and compute the
final classification or regression output.
• Output Layer: The output from the fully connected layers is fed into a logistic
function for classification tasks, such as sigmoid or softmax, which converts the raw
output into a probability score for each class.
Example: Applying CNN to an Image
Steps:
• Import the necessary libraries.
• Set the parameters.
• Define the kernel.
• Load the image and plot it.
• Reformat the image.
• Apply the convolution layer operation and plot the output image.
• Apply the activation layer operation and plot the output image.
• Apply the pooling layer operation and plot the output image.
# import the necessary libraries
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from itertools import product
# set the parameters (figure size here is an assumed choice)
plt.rcParams['figure.figsize'] = (15, 5)
# define the kernel (an example edge-detection filter)
kernel = tf.constant([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]])
# load the image as grayscale and plot it ('image.jpg' is a placeholder path)
image = tf.io.decode_jpeg(tf.io.read_file('image.jpg'), channels=1)
plt.imshow(tf.squeeze(image), cmap='gray')
plt.axis('off')
plt.show()
# Reformat
image = tf.image.convert_image_dtype(image, dtype=tf.float32)
image = tf.expand_dims(image, axis=0)
kernel = tf.reshape(kernel, [*kernel.shape, 1, 1])
kernel = tf.cast(kernel, dtype=tf.float32)
# convolution layer
conv_fn = tf.nn.conv2d
image_filter = conv_fn(
input=image,
filters=kernel,
strides=1, # or (1, 1)
padding='SAME',
)
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.imshow(tf.squeeze(image_filter))
plt.axis('off')
plt.title('Convolution')
# activation layer
relu_fn = tf.nn.relu
# Image detection
image_detect = relu_fn(image_filter)
plt.subplot(1, 3, 2)
plt.imshow(
# Reformat for plotting
tf.squeeze(image_detect)
)
plt.axis('off')
plt.title('Activation')
# Pooling layer
pool = tf.nn.pool
image_condense = pool(input=image_detect,
window_shape=(2, 2),
pooling_type='MAX',
strides=(2, 2),
padding='SAME',
)
plt.subplot(1, 3, 3)
plt.imshow(tf.squeeze(image_condense))
plt.axis('off')
plt.title('Pooling')
plt.show()
Digital image processing means processing a digital image by means of a digital computer. We
can also say that it is the use of computer algorithms to obtain an enhanced image or to
extract useful information from it.
Digital image processing is the use of algorithms and mathematical models to process and
analyze digital images. The goal of digital image processing is to enhance the quality of images,
extract meaningful information from images, and automate image-based tasks.
The basic steps involved in digital image processing are:
1. Image acquisition: This involves capturing an image using a digital camera or scanner,
or importing an existing image into a computer.
2. Image enhancement: This involves improving the visual quality of an image, such as
increasing contrast, reducing noise, and removing artifacts.
3. Image restoration: This involves removing degradation from an image, such as blurring,
noise, and distortion.
4. Image segmentation: This involves dividing an image into regions or segments, each of
which corresponds to a specific object or feature in the image.
5. Image representation and description: This involves representing an image in a way
that can be analyzed and manipulated by a computer, and describing the features of
an image in a compact and meaningful way.
6. Image analysis: This involves using algorithms and mathematical models to extract
information from an image, such as recognizing objects, detecting patterns, and
quantifying features.
7. Image synthesis and compression: This involves generating new images or
compressing existing images to reduce storage and transmission requirements.
Digital image processing is widely used in a variety of applications, including medical
imaging, remote sensing, computer vision, and multimedia.
A 16-bit format is actually divided into three further channels, Red, Green and Blue: the
famous RGB format.
Image as a Matrix
As we know, images are represented in rows and columns, so an image can be written as a matrix
of intensity values:
f(x, y) = [ f(0, 0)     f(0, 1)     ...  f(0, N−1)
            f(1, 0)     f(1, 1)     ...  f(1, N−1)
            ...
            f(M−1, 0)   f(M−1, 1)   ...  f(M−1, N−1) ]
The right side of this equation is a digital image by definition. Every element of this matrix is
called an image element, picture element, or pixel.
1. ACQUISITION - It could be as simple as being given an image that is already in digital form. The main
work involves:
a) Scaling
b) Color conversion (RGB to grayscale or vice versa)
2. IMAGE ENHANCEMENT - It is among the simplest and most appealing areas of image
processing. It is used to bring out hidden details in an image and is subjective.
3. IMAGE RESTORATION - It also deals with improving the appearance of an image, but it is objective
(restoration is based on mathematical or probabilistic models of image degradation).
4. COLOR IMAGE PROCESSING - It deals with pseudocolor and full-color image processing; color
models are applicable to digital image processing.
5. WAVELETS AND MULTI-RESOLUTION PROCESSING - It is the foundation for representing
images at various degrees of resolution.
6. IMAGE COMPRESSION - It involves developing functions to perform this operation. It
mainly deals with image size or resolution.
7. MORPHOLOGICAL PROCESSING - It deals with tools for extracting image components that
are useful in the representation and description of shape.
8. SEGMENTATION PROCEDURE - It includes partitioning an image into its constituent parts or
objects. Autonomous segmentation is the most difficult task in image processing.
9. REPRESENTATION & DESCRIPTION - It follows the output of the segmentation stage; choosing a
representation is only part of the solution for transforming raw data into processed data.
10. OBJECT DETECTION AND RECOGNITION - It is a process that assigns a label to an object
based on its descriptor.
According to block 1, if the input is an image and we get an image as output, then it is termed
Digital Image Processing.
According to block 2, if the input is an image and we get some kind of information or description
as output, then it is termed Computer Vision.
According to block 3, if the input is some description or code and we get an image as output, then
it is termed Computer Graphics.
According to block 4, if the input is a description, some keywords or some code and we get a
description or some keywords as output, then it is termed Artificial Intelligence.
1. Improved image quality: Digital image processing algorithms can improve the visual
quality of images, making them clearer, sharper, and more informative.
2. Automated image-based tasks: Digital image processing can automate many image-
based tasks, such as object recognition, pattern detection, and measurement.
3. Increased efficiency: Digital image processing algorithms can process images much
faster than humans, making it possible to analyze large amounts of data in a short
amount of time.
4. Increased accuracy: Digital image processing algorithms can provide more accurate
results than humans, especially for tasks that require precise measurements or
quantitative analysis.
Convolution layers are equivariant to input transformations: when the input shifts, the output
changes in a predictable, reliable way.
Convolution layers are key building blocks of convolutional neural networks (CNNs) which are
used in computer vision and image processing. They apply convolution operation to the input
data which involves a filter (or kernel) that slides over the input data, performing element-wise
multiplications and summing the results to produce a feature map. This process allows the
network to detect patterns such as edges, textures and shapes in the input images.
Key Components of a Convolution Layer
1. Filters(Kernels):
• Small matrices that extract specific features from the input.
• For example, one filter might detect horizontal edges while another detects vertical
edges.
• The values of filters are learned and updated during training.
2. Stride:
• Refers to the step size with which the filter moves across the input data.
• Larger strides result in smaller output feature maps and faster computation.
3. Padding:
• Zeros or other values may be added around the input to control the spatial
dimensions of the output.
• Common types: "valid" (no padding) and "same" (pads the input so the output feature map
dimensions match the input).
4. Activation Function:
• After convolution, a non-linear function like ReLU (Rectified Linear Unit) is often
applied allowing the network to learn complex relationships in data.
• Common activations: ReLU, Tanh, Leaky ReLU.
Types of Convolution Layers
• 2D Convolution (Conv2D): Most common for image data where filters slide in two
dimensions (height and width) across the image.
• Depthwise Separable Convolution: Used for computational efficiency, applying
depthwise and pointwise convolutions separately to reduce parameters and speed up
computation.
• Dilated (Atrous) Convolution: Inserts spaces (zeros) between kernel elements to
increase the receptive field without increasing computation, useful for tasks requiring
context aggregation over larger areas.
Steps in a Convolution Layer
1. Initialize Filters: Randomly initialize a set of filters with learnable parameters.
2. Convolve Filters with Input: Slide the filters across the width and height of the input
data, computing the dot product between the filter and the input sub-region.
3. Apply Activation Function: Apply a non-linear activation function to the convolved
output to introduce non-linearity.
4. Pooling (Optional): Often followed by a pooling layer (like max pooling) to reduce
the spatial dimensions of the feature map and retain the most important information.
Example Of Convolution Layer
Consider an input image of size 32x32x3 (32x32 pixels with 3 color channels). A convolution
layer with ten 5x5 filters, a stride of 1 and 'same' padding will produce an output feature map of
size 32x32x10. Each of the 10 filters detects different features in the input image.
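This can be checked with a quick Keras sketch (illustrative code, not from the original article):
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))   # one 32x32 RGB image
conv = tf.keras.layers.Conv2D(filters=10, kernel_size=5, strides=1, padding='same')
print(conv(x).shape)                    # (1, 32, 32, 10): a 32x32x10 feature map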
Pooling layer is used in CNNs to reduce the spatial dimensions (width and height) of the input
feature maps while retaining the most important information. It involves sliding a two-
dimensional filter over each channel of a feature map and summarizing the features within the
region covered by the filter.
For a feature map with dimensions n_h × n_w × n_c, the dimensions of the output after a
pooling layer are:
((n_h − f + 1) / s) × ((n_w − f + 1) / s) × n_c
where:
• n_h → height of the feature map
• n_w → width of the feature map
• n_c → number of channels in the feature map
• f → size of the pooling filter
• s → stride length
A typical CNN model architecture consists of multiple convolution and pooling layers stacked
together.
Why are Pooling Layers Important?
1. Dimensionality Reduction: Pooling layers reduce the spatial size of the feature maps,
which decreases the number of parameters and computations in the network. This
makes the model faster and more efficient.
2. Translation Invariance: Pooling helps the network become invariant to small
translations or distortions in the input image. For example, even if an object in an image
is slightly shifted, the pooled output will remain relatively unchanged.
3. Overfitting Prevention: By reducing the spatial dimensions, pooling layers help
prevent overfitting by providing a form of regularization.
4. Feature Hierarchy: Pooling layers help build a hierarchical representation of features,
where lower layers capture fine details and higher layers capture more abstract and
global features.
Types of Pooling Layers
1. Max Pooling
Max pooling selects the maximum element from the region of the feature map covered by the
filter. Thus, the output after max-pooling layer would be a feature map containing the most
prominent features of the previous feature map.
Max pooling layer preserves the most important features (edges, textures, etc.) and provides
better performance in most cases.
Max Pooling in Keras:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import MaxPooling2D

# 4x4 feature map, reshaped to (batch, height, width, channels)
feature_map = np.array([[1, 3, 2, 9], [5, 6, 1, 7], [4, 2, 8, 6], [3, 5, 7, 2]],
                       dtype=np.float32).reshape(1, 4, 4, 1)
output = MaxPooling2D(pool_size=(2, 2), strides=(2, 2))(feature_map)
print(output.numpy().reshape(2, 2).astype(int))
Output:
[[6 9]
[5 8]]
2. Average Pooling
Average pooling computes the average of the elements present in the region of feature map
covered by the filter. Thus, while max pooling gives the most prominent feature in a particular
patch of the feature map, average pooling gives the average of features present in a patch.
Average pooling provides a more generalized representation of the input. It is useful in the cases
where preserving the overall context is important.
Average Pooling using Keras:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import AveragePooling2D

# same 4x4 feature map, reshaped to (batch, height, width, channels)
feature_map = np.array([
    [1, 3, 2, 9],
    [5, 6, 1, 7],
    [4, 2, 8, 6],
    [3, 5, 7, 2]
], dtype=np.float32).reshape(1, 4, 4, 1)

output = AveragePooling2D(pool_size=(2, 2), strides=(2, 2))(feature_map)
print(output.numpy().reshape(2, 2))   # [[3.75 4.75] [3.5  5.75]]
Fully Connected (FC) layers, also known as dense layers, are used in neural networks,
especially in deep learning. They are a type of neural network layer where every neuron in the
layer is connected to every neuron in the previous and subsequent layers. The "fully connected"
descriptor comes from the fact that each neuron in these layers is connected to every
activation in the previous layer, creating a highly interconnected network.
• In CNNs, fully connected layers often follow convolutional and pooling layers and are used to
interpret the feature maps generated by those layers into the final output categories or
predictions.
• In fully connected feedforward networks these layers are the main building blocks
that directly process the input data into outputs.
Structure of Fully Connected Layers
The structure of FC layers is one of the most significant factors in how they work within a
neural network: every neuron in one layer is connected to every neuron in the subsequent layer.
Dense (Fully Connected) Layer
Key Components of Fully Connected Layers
A Fully Connected layer is characterized by its dense interconnectivity. Here’s a breakdown of its
key components:
• Neurons: Basic units that receive inputs from all neurons of the previous layer and
send outputs to all neurons of the subsequent layer.
• Weights: Each connection between neurons has an associated weight indicating the
strength and influence of one neuron on another.
• Biases: A bias term for each neuron helps adjust the output along with the weighted
sum of inputs.
• Activation Function: Functions like ReLU, Sigmoid or Tanh introduce non-linearity to
the model helping it to learn complex patterns and behaviors.
Working and Structure of Fully Connected Layers in Neural Networks
The extensive connectivity allows for comprehensive information processing and feature
integration making FC layers essential for tasks requiring complex pattern recognition.
Key Operations in Fully Connected Layers
1. Input Processing
Each neuron in an FC layer receives inputs from all neurons of the previous layer with each
connection having a specific weight and each neuron incorporating a bias. The input to each
neuron is a weighted sum of these inputs plus a bias:
z_j = Σ_i (w_ij · x_i) + b_j
Here w_ij is the weight from neuron i of the previous layer to neuron j, x_i is the input from
neuron i and b_j is the bias for neuron j.
2. Activation
The weighted sum is then processed through a non-linear activation function such as ReLU,
Sigmoid or Tanh. This step introduces non-linearity enabling the network to learn complex
functions:
a_j = f(z_j)
f denotes the activation function transforming the linear combination of inputs into a non-linear
output.
Example Configuration
Consider a neural network transition from a layer with 4 neurons to an FC layer with 3 neurons:
• Previous Layer (4 neurons) → Fully Connected Layer (3 neurons)
Each neuron in the FC layer receives inputs from all four neurons of the previous layer resulting
in a configuration that involves 12 weights and 3 biases. This design of FC layer helps in
transforming and combining features from the input layer hence helping in network's ability to
perform complex decision-making tasks.
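A quick check of this count (illustrative code, not from the original article):
import tensorflow as tf

fc = tf.keras.layers.Dense(3, activation='relu')
fc.build(input_shape=(None, 4))          # 4 inputs feeding 3 neurons
print(fc.kernel.shape, fc.bias.shape)    # (4, 3) and (3,): 12 weights and 3 biases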
Key Role of Fully Connected Layers in Neural Networks
The key roles of fully connected layers in neural networks are discussed below:
1. Feature Integration and Abstraction
FC layers consolidate features extracted by earlier layers (e.g., convolutional or recurrent),
transforming them into a form suitable for accurate prediction by capturing complex patterns and
relationships.
2. Decision Making and Output Generation
Typically used as the final layer in classification or regression tasks, FC layers convert high-level
features into output scores. For classification, these scores are passed through Softmax to yield
class probabilities.
3. Introduction of Non-Linearity
Activation functions like ReLU, Sigmoid, or Tanh applied in FC layers introduce non-linearity,
allowing the network to learn complex, non-linear patterns and generalize effectively.
4. Universal Approximation
According to the Universal Approximation Theorem, an FC layer with enough neurons can
approximate any continuous function, showcasing its power in modeling diverse problems.
5. Flexibility across Domains
FC layers are input-agnostic and versatile, applicable to various domains like vision, speech, and
NLP, supporting both shallow and deep architectures.
6. Regularization and Overfitting Control
Techniques like Dropout and L2 regularization are crucial in FC layers to prevent overfitting,
promoting generalization by reducing reliance on specific neurons or large weights.
Advantages of Fully Connected Layers
• Integration of Features: They are capable of combining all features before making
predictions, essential for complex pattern recognition.
• Flexibility: FC layers can be incorporated into various network architectures and
handle any form of input data provided it is suitably reshaped.
• Simplicity: These layers are straightforward to implement and are supported by all
major deep learning frameworks.
Limitations of Fully Connected Layers
Despite their benefits FC layers have several drawbacks:
• High Computational Cost: The dense connections can lead to a large number of
parameters, increasing both computational complexity and memory usage.
• Prone to Overfitting: Due to the high number of parameters they can easily overfit on
smaller datasets unless techniques like dropout or regularization are used.
• Inefficiency with Spatial Data: Unlike convolutional layers, FC layers do not exploit
the spatial hierarchy of images or other structured data, which can lead to less effective
learning.
Backpropagation in Convolutional Neural Networks
Understanding Backpropagation
Backpropagation, short for "backward propagation of errors," is an algorithm used to calculate
the gradient of the loss function of a neural network with respect to its weights. It is essentially
a method to update the weights to minimize the loss. Backpropagation is crucial because it tells
us how to change our weights to improve our network’s performance.
Role of Backpropagation in CNNs
In a CNN, backpropagation plays a crucial role in fine-tuning the filters and weights during
training, allowing the network to better differentiate features in the input data. CNNs typically
consist of multiple layers, including convolutional layers, pooling layers, and fully connected
layers. Each of these layers has weights and biases that are adjusted via backpropagation.
Fundamentals of Backpropagation
Backpropagation, in essence, is an application of the chain rule from calculus used to compute
the gradients (partial derivatives) of a loss function with respect to the weights of the network.
The process involves three main steps: the forward pass, loss calculation, and the backward pass.
The Forward Pass
During the forward pass, input data (e.g., an image) is passed through the network to compute
the output. For a CNN, this involves several key operations:
1. Convolutional Layers: Each convolutional layer applies numerous filters to the input.
For a given layer l with filters denoted by F, input I, and bias b, the output O is
given by: O = (I * F) + b. Here, * denotes the convolution operation.
2. Activation Functions: After convolution, an activation function σ (e.g., ReLU)
is applied element-wise to introduce non-linearity: O = σ((I * F) + b)
3. Pooling Layers: Pooling (e.g., max pooling) reduces dimensionality, summarizing the
features extracted by the convolutional layers.
Loss Calculation
After computing the output, a loss function L is calculated to assess the error in prediction.
Common loss functions include mean squared error for regression tasks or cross-entropy loss for
classification:
L = −Σ y log(ŷ)
Here, y is the true label, and ŷ is the predicted label.
The Backward Pass (Backpropagation)
The backward pass computes the gradient of the loss function with respect to each weight in the
network by applying the chain rule:
1. Gradient with respect to output: First, calculate the gradient of the loss function
with respect to the output of the final layer: ∂L/∂O
2. Gradient through activation function: Apply the chain rule through the activation
function: ∂L/∂I = (∂L/∂O) · (∂O/∂I). For ReLU, ∂O/∂I is 1 for I > 0 and 0
otherwise.
3. Gradient with respect to filters in convolutional layers: Continue applying the chain
rule to find the gradients with respect to the filters: ∂L/∂F = (∂L/∂O) * rot180(I).
Here, rot180(I) rotates the input by 180 degrees, aligning it for the
convolution operation used to calculate the gradient with respect to the filter.
Weight Update
Using the gradients calculated, the weights are updated using an optimization algorithm such as
SGD:
F_new = F_old − η · (∂L/∂F)
Here, η is the learning rate, which controls the step size during the weight update.
Challenges in Backpropagation
Vanishing Gradients
In deep networks, backpropagation can suffer from the vanishing gradient problem, where
gradients become too small to make significant changes in weights, stalling the training.
Advanced activation functions like ReLU and optimization techniques such as batch
normalization are used to mitigate this issue.
Exploding Gradients
Conversely, gradients can become excessively large; this is known as exploding gradients. This
can be controlled by techniques such as gradient clipping.
# imports needed by the snippets below (the earlier setup steps are omitted here);
# the `normalize_transform` used in the dataset calls is also defined in those omitted steps,
# typically a transforms.Compose([transforms.ToTensor(), transforms.Normalize(...)])
import ssl
import numpy as np
import torch
import torchvision
import matplotlib.pyplot as plt

ssl._create_default_https_context = ssl._create_unverified_context
plt.rcParams['figure.figsize'] = 14, 6
train_dataset = torchvision.datasets.CIFAR10(
root="./CIFAR10/train", train=True, transform=normalize_transform,
download=True)
test_dataset = torchvision.datasets.CIFAR10(
root="./CIFAR10/test", train=False, transform=normalize_transform,
download=True)
Step 3: Creating Data Loaders
• Set batch size to 128 for efficiency.
• Create data loaders for both train and test sets to manage batching and easy iteration.
batch_size = 128
train_loader = torch.utils.data.DataLoader(
train_dataset, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size)
Step 4: Visualizing Sample Images
• Obtain a batch of images and labels from the train loader.
• Display a grid of 25 training images for visual confirmation of the data pipeline.
dataiter = iter(train_loader)
images, labels = next(dataiter)
plt.imshow(np.transpose(torchvision.utils.make_grid(
images[:25], normalize=True, padding=1, nrow=5).numpy(), (1, 2, 0)))
plt.axis('off')
plt.show()
Step 5: Analyzing Dataset Class Distribution
• Collect all class labels from training data.
• Count occurrences for every class and visualize with a bar chart, revealing class balance.
classes = []
for batch_idx, data in enumerate(train_loader):
    x, y = data
    classes.extend(y.tolist())
num_epochs = 50
learning_rate = 0.001
weight_decay = 0.01
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
model.parameters(), lr=learning_rate, weight_decay=weight_decay)
plt.imshow(np.transpose(torchvision.utils.make_grid(
images[:num_images].cpu(), normalize=True, padding=1).numpy(), (1, 2, 0)))
plt.title(title)
plt.axis("off")
plt.show()
X = np.array(sequences)
y = np.array(labels)
4. Converting Sequences and Labels to One-Hot Encoding
For training we convert X and y into one-hot encoded tensors.
X_one_hot = tf.one_hot(X, len(chars))
y_one_hot = tf.one_hot(y, len(chars))
for i in range(50):
    x = np.array([[char_to_index[char] for char in generated_text[-seq_length:]]])
    x_one_hot = tf.one_hot(x, len(chars))
    prediction = model.predict(x_one_hot)
    next_index = np.argmax(prediction)
    next_char = index_to_char[next_index]
    generated_text += next_char
print("Generated Text:")
print(generated_text)
Output:
Neural networks have become essential tools in solving complex machine learning tasks. Among
them most widely used architectures are Feed-Forward Neural Networks (FNNs) and Recurrent
Neural Networks (RNNs). While both are capable of learning patterns from data, they are
structurally and functionally different.
Feed-Forward Neural Networks
A feed-forward neural network is a type of neural network where the connections between nodes
do not form cycles. It processes input data in one direction, i.e. from input to output, without
any feedback loops.
• No memory of previous inputs.
• Best suited for static data (e.g., images).
• Simple and fast to train.
• Cannot handle sequences or time dependencies.
Basic Example:
Used in classification tasks like identifying handwritten digits using the MNIST dataset.
Recurrent Neural Networks
Recurrent neural networks add the element that feed-forward networks lack: memory. They
can remember information from previous steps, making them ideal for sequential data where
context matters.
• Has memory of previous inputs using hidden states.
• Ideal for sequential data like text, speech, time series.
• Can suffer from vanishing gradient problems.
• More complex and slower to train.
Basic Example:
Used in language modeling such as predicting the next word in a sentence.
Key Differences
• Data Flow: FNN is one-way (input → output); RNN is cyclic (can loop over previous states).
• Memory: FNN has no memory; RNN has memory via hidden states.
• Best For: FNN suits static input (images, tabular data); RNN suits sequential input (text, audio, time series).
• Complexity: FNN is lower; RNN is higher.
• Training Time: FNN is faster; RNN is slower due to time dependencies.
• Gradient Issues: FNN is less prone; RNN can suffer from vanishing/exploding gradients.
• Example Use Cases: FNN for image classification and object detection; RNN for sentiment analysis and speech recognition.
When to Use Each Architecture
Feed-Forward Networks are ideal for:
• Image classification where each image is independent
• Medical diagnosis where patient symptoms don't depend on previous patients
• Credit scoring as current application doesn't depend on previous applications
• Any problem where inputs are independent
RNNs are ideal for:
• Language translation where word order matters
• Stock price prediction as today's price depends on yesterday's
• Weather forecasting as tomorrow's weather depends on today's
• Speech recognition
Computational Considerations
Feed-Forward Networks
• Simple Structure: Feed-forward networks follow a straight path from input to output.
This makes them easier to implement and tune.
• Parallel Computation: Inputs can be processed in batches, enabling fast training
using modern hardware.
• Efficient Backpropagation: They use standard backpropagation which is stable and
well-supported across frameworks.
• Lower Resource Use: No memory of past inputs means less overhead during training
and inference.
Recurrent Neural Networks
• Sequential Nature: RNNs process data step-by-step, this limits parallelism and
slows down training.
• Harder to Train: Training uses Backpropagation Through Time (BPTT) which can be
unstable and slower.
• Captures Temporal Patterns: They are suited for sequential data but require careful
tuning to learn long-term dependencies.
• Higher Compute Demand: Maintaining hidden states and learning over time steps
makes RNNs more resource-intensive.
Limitations and Challenges
• Input Handling: FNN cannot handle variable-length input sequences; RNN supports sequences but struggles with long ones.
• Memory: FNN has no memory of previous inputs; RNN has limited memory and forgets long-term context.
• Temporal Modeling: FNN is ineffective at capturing time-based patterns; RNN can model temporal patterns, but with difficulty.
• Performance Issues: FNN has good parallelism but lacks temporal context; RNN's sequential nature slows training and inference.
• Training Challenges: FNN is relatively stable; RNN is prone to vanishing gradients and unstable training.
RNN Architecture
At each timestep t, the RNN maintains a hidden state S_t, which acts as the network's memory,
summarizing information from previous inputs. The hidden state S_t is updated by combining the
current input X_t and the previous hidden state S_{t−1}, applying an activation function to
introduce non-linearity. Then the output Y_t is generated by transforming this hidden state.
S_t = g_1(W_x X_t + W_s S_{t−1})
Y_t = g_2(W_y S_t)
• S_t represents the hidden state (memory) at time t.
• X_t is the input at time t.
• Y_t is the output at time t.
• W_s, W_x, W_y are weight matrices for hidden states, inputs and outputs,
respectively.
• g_1 and g_2 are activation functions.
1. Adjusting Output Weight W_y
2. Adjusting Hidden State Weight W_s
The hidden state weight W_s influences not just the current hidden state but all previous ones,
because each hidden state depends on the previous one. To update W_s, we must consider how
changes to W_s affect all hidden states S_1, S_2, S_3 and consequently the output at time 3.
The gradient for W_s therefore sums over all previous hidden states:
∂E_3/∂W_s = Σ_{i=1..3} (∂E_3/∂Y_3) × (∂Y_3/∂S_i) × (∂S_i/∂W_s)
Breaking down:
• Start with the error gradient at the output Y_3.
• Propagate gradients back through all hidden states S_3, S_2, S_1 since they
affect Y_3.
• Each S_i depends on W_s, so we differentiate accordingly.
df[dep] = labels.label.astype(int)
conts = [
'CHILDREN', 'Family_Members', 'Annual_income',
'Age', 'EmployedDaysOnly', 'UnemployedDaysOnly'
]
def proc_data():
    df['Age'] = -df.Birthday_count // 365
    df['EmployedDaysOnly'] = df.Employed_days.apply(lambda x: x if x > 0 else 0)
    df['UnemployedDaysOnly'] = df.Employed_days.apply(lambda x: abs(x) if x < 0 else 0)
proc_data()
Step 5: Oversampling due to heavily skewed data and Data Splitting
X = df[cats + conts]
y = df[dep]
X_over, y_over = RandomOverSampler().fit_resample(X, y)
model2.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001),
metrics=['accuracy'])
It is observed that the loss does not converge and keeps fluctuating which shows we have
encountered an exploding gradient problem.
Solution for Exploding Gradient Problem
Below methods can be used to modify the model:
1. Weight Initialization: The weight initialization is changed to 'glorot_uniform,' which is
a commonly used initialization for neural networks.
2. Gradient Clipping: The clipnorm parameter in the Adam optimizer is set to 1.0, which
performs gradient clipping. This helps prevent exploding gradients.
3. Kernel Constraint: The max_norm constraint is applied to the kernel weights of each
layer with a maximum norm of 2.0. This further helps in preventing exploding
gradients.
model = Sequential()
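# A minimal sketch (not the original architecture) showing the three fixes above applied:
# glorot_uniform initialization, a max_norm kernel constraint and gradient clipping via clipnorm.
from tensorflow.keras.layers import Dense
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.optimizers import Adam

model.add(Dense(64, activation='relu', input_shape=(X_over.shape[1],),
                kernel_initializer='glorot_uniform', kernel_constraint=max_norm(2.0)))
model.add(Dense(1, activation='sigmoid',
                kernel_initializer='glorot_uniform', kernel_constraint=max_norm(2.0)))
model.compile(loss='binary_crossentropy',
              optimizer=Adam(learning_rate=0.001, clipnorm=1.0),  # clipnorm performs gradient clipping
              metrics=['accuracy'])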
1. Importing Libraries
We will be importing Pandas, NumPy, Matplotlib, Seaborn, TensorFlow, Keras, NLTK and Scikit-
learn for the implementation.
import warnings
from tensorflow.keras.utils import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import numpy as np
import re
import nltk
nltk.download('all')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()
warnings.filterwarnings("ignore")
2. Loading the Dataset
data = pd.read_csv("Clothing Review.csv")
data.head(7)
Countplot
We create a figure with size 12x5 inches using plt.subplots():
• Using plt.subplot(1, 2, 1) we plot a countplot of the Rating column
• Using plt.subplot(1, 2, 2) we plot a countplot of the Recommended IND column.
plt.subplots(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.countplot(data=data, x='Rating',palette="deep")
plt.subplot(1, 2, 2)
sns.countplot(data=data, x="Recommended IND", palette="deep")
plt.show()
Output:
The histogram on the bottom shows age distribution with green bars for recommended individuals and red
bars for non-recommended ones. The box plots at the top display the spread and outliers of ages for each
recommendation group helping to visualize differences in age distribution between the two groups.
We can visualize the distribution of the age columns data along with the Rating.
fig = px.histogram(data,
x="Age",
marginal='box',
title="Age Group",
color="Rating",
nbins=65-18,
color_discrete_sequence
=['black', 'green', 'blue', 'red', 'yellow'])
fig.update_layout(bargap=0.2)
Output:
The histogram at the bottom represents the count of individuals in each age group with bars color coded by
rating from 1 to 5. The boxplots at the top provide a summary of age distribution for each rating showing the
median, interquartile range and outliers. It helps to analyze how ratings vary with age groups.
4. Prepare the Data to build Model
Since we are working on an NLP-based dataset, it makes sense to use the text columns as features. So we
select the text features, and the Rating column is used for sentiment analysis. From the Rating countplot
above we can observe that there is too much imbalance between the ratings, so all ratings above 3 are
mapped to 1 and the rest to 0.
def filter_score(rating):
    return int(rating > 3)
X = data[features]
y = data['Rating']
y = y.apply(filter_score)
5. Text Preprocessing
The text data we have comes with a lot of noise. This noise can be in the form of repeated words or
commonly used sentences. In text preprocessing we need the text in a consistent format, so we first convert
the entire text to lowercase and then perform lemmatization to reduce words to their base forms. Since we
need clean text, we also remove common words (stopwords) and punctuation.
Note: Lemmatization and stop words are NLP concepts and are separate from the RNN itself.
def toLower(data):
    if isinstance(data, float):
        return '<UNK>'
    else:
        return data.lower()
stop_words = stopwords.words("english")
def remove_stopwords(text):
    no_stop = []
    for word in text.split(' '):
        if word not in stop_words:
            no_stop.append(word)
    return " ".join(no_stop)
def remove_punctuation_func(text):
    return re.sub(r'[^a-zA-Z0-9]', ' ', text)
X['Title'] = X['Title'].apply(toLower)
X['Review Text'] = X['Review Text'].apply(toLower)
X['Title'] = X['Title'].apply(remove_stopwords)
X['Review Text'] = X['Review Text'].apply(remove_stopwords)
X['Title'] = X['Title'].apply(remove_punctuation_func)
X['Review Text'] = X['Review Text'].apply(remove_punctuation_func)
train_pad = pad_sequences(train_seq,
maxlen=40,
truncating="post",
padding="post")
test_pad = pad_sequences(test_seq,
maxlen=40,
truncating="post",
padding="post")
8. Building a Recurrent Neural Network (RNN) in TensorFlow
Now that the data is ready, the next step is building a simple recurrent neural network. Before training with
SimpleRNN, the data is passed through an Embedding layer to produce equally sized word vectors.
Note: We use return_sequences=True only when we need to stack another recurrent layer on top.
from tensorflow import keras
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=10000, output_dim=128,
input_length=40))
model.add(keras.layers.SimpleRNN(64, return_sequences=True))
model.add(keras.layers.SimpleRNN(64))
model.add(keras.layers.Dense(128, activation="relu"))
model.add(keras.layers.Dropout(0.4))
model.add(keras.layers.Dense(1, activation="sigmoid"))
model.build(input_shape=(None, 40))
model.summary()
Output:
history = model.fit(train_pad,
y_train,
epochs=5)
Output:
Types of Recurrent Neural Networks: There are various types of RNNs, which are as follows:
• Bidirectional RNNs
• Long Short-Term Memory (LSTM)
• Bidirectional Long Short-Term Memory (Bi-LSTM)
• Gated Recurrent Units (GRU)
X = np.random.rand(1000, 1, 10)
y = np.random.randint(0, 2, (1000, 1))
One-to-Many RNN
Code Implementation:
• Used in image captioning where a single input vector generates a word sequence.
• Dense transforms image features before repetition.
• RepeatVector duplicates input across time steps.
• SimpleRNN decodes the repeated vector into a sequence.
• Final layer predicts word probabilities at each step (vocab_size outputs).
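A minimal sketch matching these bullets (the feature size, caption length and vocabulary size below are placeholder values) is shown next; the compile call that follows then applies to this model.
import tensorflow as tf

# one input vector (e.g. 128-d image features) -> a sequence of 10 word predictions
feature_dim, seq_len, vocab_size = 128, 10, 5000   # placeholder sizes
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(feature_dim,)),   # transform image features
    tf.keras.layers.RepeatVector(seq_len),                                       # repeat across time steps
    tf.keras.layers.SimpleRNN(64, return_sequences=True),                        # decode into a sequence
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(vocab_size, activation='softmax'))                 # word probabilities per step
])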
import tensorflow as tf
import numpy as np
model.compile(optimizer='adam', loss='categorical_crossentropy')
Many-to-One RNN
Code Implementation:
• Designed for sequence classification (e.g., sentiment analysis).
• Inputs: word embeddings of a sentence (shape: (sequence_length, input_dim)).
• SimpleRNN encodes the sequence into a single hidden state.
• Dense layers decode that state to predict one of the num_classes.
• Categorical crossentropy used for multiclass classification.
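A minimal sketch of such a many-to-one classifier (the sequence length, input dimension and number of classes are placeholder values):
import tensorflow as tf

sequence_length, input_dim, num_classes = 20, 100, 3   # placeholder sizes
many_to_one = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(64, input_shape=(sequence_length, input_dim)),   # encode the sequence into one state
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax')                   # one prediction per sequence
])
many_to_one.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])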
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Bidirectional, Dense

embedding_dim = 128
hidden_units = 64

model = Sequential()
model.add(Bidirectional(SimpleRNN(hidden_units)))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
3. Training the Model
As we have compiled our model successfully and the data pipeline is ready, we can move on to training our BRNN.
• batch_size=32 defines how many samples are processed together in one iteration.
• epochs=5 sets the number of times the model will train on the entire dataset.
• model.fit() trains the model on the training data and evaluates it using the provided
validation data.
batch_size = 32
epochs = 5
model.fit(X_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(X_test, y_test))
Output:
y_pred = model.predict(X_test)
batch_size = 32
train_dataset = train_dataset.shuffle(10000).batch(batch_size)
test_dataset = test_dataset.batch(batch_size)
Printing a sample review and its label from the training set.
example, label = next(iter(train_dataset))
print('Text:\n', example.numpy()[0])
print('\nLabel: ', label.numpy()[0])
Output:
Text: b "Having seen men Behind the Sun ... 1 as a treatment of the subject)."
Label: 0
3. Performing Text Vectorization
We will first perform text vectorization and let the encoder map all the words in the training dataset to a token.
We can also see in the example below how we can encode and decode the sample review into a vector of
integers.
• vectorize_layer : tokenizes and normalizes the text. It converts words into numeric values for
the neural network to process easily.
vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int', output_sequence_length=100)
vectorize_layer.adapt(train_dataset.map(lambda x, y: x))
4. Defining Model Architecture (BiLSTM Layers)
We define the model for sentiment analysis. The first layer, TextVectorization, converts input text into token indices. These tokens go through an embedding layer that maps words into trainable 64-dimensional vectors. During training, these vectors adjust so that words with similar meanings have similar representations.
The Bidirectional LSTM layers process these sequences from both directions to capture context:
• The first Bidirectional LSTM has 64 units and outputs full sequences.
• The second Bidirectional LSTM has 32 units and condenses the sequence into a single feature vector.
The Dense layers then perform classification:
• A dense layer with 64 neurons and ReLU activation learns patterns from the LSTM output.
• The final dense layer with a single neuron outputs the sentiment prediction (a logit, since the loss below uses from_logits=True).
model = tf.keras.Sequential([
vectorize_layer,
tf.keras.layers.Embedding(len(vectorize_layer.get_vocabulary()), 64, mask_zero=True),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(1)
])
model.build(input_shape=(None,))
model.compile(
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
optimizer=tf.keras.optimizers.Adam(),
metrics=['accuracy']
)
model.summary()
Output:
Architecture of GAN
GANs consist of two main models that work together to create realistic synthetic data, which are as follows:
1. Generator Model
The generator is a deep neural network that takes random noise as input to generate realistic
data samples like images or text. It learns the underlying data patterns by adjusting its internal
parameters during training through backpropagation. Its objective is to produce samples that the
discriminator classifies as real.
Generator Loss Function: The generator tries to minimize this loss:

$J_G = -\frac{1}{m}\sum_{i=1}^{m} \log D(G(z_i))$

where
• $J_G$ measures how well the generator is fooling the discriminator.
• $G(z_i)$ is the generated sample from random noise $z_i$.
• $D(G(z_i))$ is the discriminator's estimated probability that the generated sample is real.
The generator aims to maximize $D(G(z_i))$, meaning it wants the discriminator to classify its fake data as real (probability close to 1).
2. Discriminator Model
The discriminator acts as a binary classifier that distinguishes between real and generated data. It improves its classification ability through training, refining its parameters to detect fake samples more accurately. When dealing with image data, the discriminator uses convolutional layers or other relevant architectures that help extract features and enhance the model's ability.
Discriminator Loss Function: The discriminator tries to minimize this loss:

$J_D = -\frac{1}{m}\sum_{i=1}^{m} \log D(x_i) - \frac{1}{m}\sum_{i=1}^{m} \log\bigl(1 - D(G(z_i))\bigr)$

• $J_D$ measures how well the discriminator classifies real and fake samples.
• $x_i$ is a real data sample.
• $G(z_i)$ is a fake sample from the generator.
• $D(x_i)$ is the discriminator's probability that $x_i$ is real.
• $D(G(z_i))$ is the discriminator's probability that the fake sample is real.
The discriminator wants to correctly classify real data as real (maximize $\log D(x_i)$) and fake data as fake (maximize $\log(1 - D(G(z_i)))$).
MinMax Loss
GANs are trained using a MinMax loss between the generator and discriminator:

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

where,
• $G$ is the generator network and $D$ is the discriminator network
• $p_{data}(x)$ = true data distribution
• $p_z(z)$ = distribution of random noise (usually normal or uniform)
• $D(x)$ = discriminator's estimate of real data
• $D(G(z))$ = discriminator's estimate of generated data
The generator tries to minimize this loss (to fool the discriminator) and the discriminator tries to maximize it (to detect fakes accurately).
import torch
import torch.nn as nn

# Generator: maps a latent noise vector to a 32x32 RGB image
# (the surrounding nn.Module class is added here so the snippet is self-contained)
class Generator(nn.Module):
    def __init__(self, latent_dim):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128, momentum=0.78),
            nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64, momentum=0.78),
            nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
            nn.Tanh()
        )

    def forward(self, z):
        return self.model(z)
# Discriminator: classifies 32x32 RGB images as real or fake
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.25),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ZeroPad2d((0, 1, 0, 1)),
            nn.BatchNorm2d(64, momentum=0.82),
            nn.LeakyReLU(0.25),
            nn.Dropout(0.25),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128, momentum=0.82),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.25),
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(256, momentum=0.8),
            nn.LeakyReLU(0.25),
            nn.Dropout(0.25),
            nn.Flatten(),
            nn.Linear(256 * 5 * 5, 1),
            nn.Sigmoid()
        )

    def forward(self, img):
        return self.model(img)
import torch.optim as optim

# instantiate the networks (latent_dim, device, lr, beta1 and beta2 are assumed
# to be defined earlier in the tutorial)
generator = Generator(latent_dim).to(device)
discriminator = Discriminator().to(device)

adversarial_loss = nn.BCELoss()
optimizer_G = optim.Adam(generator.parameters(), lr=lr, betas=(beta1, beta2))
optimizer_D = optim.Adam(discriminator.parameters(), lr=lr, betas=(beta1, beta2))
Step 8: Training the GAN
Train the discriminator on real and fake images, then update the generator to improve its fake
image quality. Track losses and visualize generated images after each epoch.
• valid = torch.ones(real_images.size(0), 1, device=device): Create a tensor of ones
representing real labels for the discriminator.
• fake = torch.zeros(real_images.size(0), 1, device=device): Create a tensor of zeros
representing fake labels for the discriminator.
• z = torch.randn(real_images.size(0), latent_dim, device=device): Generate random
noise vectors as input for the generator.
• g_loss = adversarial_loss(discriminator(gen_images), valid): Calculate generator
loss based on the discriminator classifying fake images as real.
• grid = torchvision.utils.make_grid(generated, nrow=4, normalize=True): Arrange
generated images into a grid for display, normalizing pixel values.
for epoch in range(num_epochs):
    for i, batch in enumerate(dataloader):
        real_images = batch[0].to(device)

        # real / fake labels for the discriminator
        valid = torch.ones(real_images.size(0), 1, device=device)
        fake = torch.zeros(real_images.size(0), 1, device=device)

        # ---- train the discriminator ----
        optimizer_D.zero_grad()
        z = torch.randn(real_images.size(0), latent_dim, device=device)
        fake_images = generator(z)
        real_loss = adversarial_loss(discriminator(real_images), valid)
        fake_loss = adversarial_loss(discriminator(fake_images.detach()), fake)
        d_loss = (real_loss + fake_loss) / 2
        d_loss.backward()
        optimizer_D.step()

        # ---- train the generator ----
        optimizer_G.zero_grad()
        gen_images = generator(z)
        g_loss = adversarial_loss(discriminator(gen_images), valid)
        g_loss.backward()
        optimizer_G.step()

        if (i + 1) % 100 == 0:
            print(
                f"Epoch [{epoch+1}/{num_epochs}] "
                f"Batch {i+1}/{len(dataloader)} "
                f"Discriminator Loss: {d_loss.item():.4f} "
                f"Generator Loss: {g_loss.item():.4f}"
            )

    if (epoch + 1) % 10 == 0:
        with torch.no_grad():
            z = torch.randn(16, latent_dim, device=device)
            generated = generator(z).detach().cpu()
            grid = torchvision.utils.make_grid(generated, nrow=4, normalize=True)
            plt.imshow(np.transpose(grid.numpy(), (1, 2, 0)))
            plt.axis("off")
            plt.show()
Autoencoders in Machine Learning
Autoencoders are a special type of neural network that learns to compress data into a compact
form and then reconstruct it to closely match the original input. They consist of an:
• Encoder that captures important features by reducing dimensionality.
• Decoder that rebuilds the data from this compressed representation.
The model trains by minimizing reconstruction error using loss functions like Mean Squared
Error or Binary Cross-Entropy. These are applied in tasks such as noise removal, error detection
and feature extraction where capturing efficient data representations is important.
Architecture of Autoencoder
An autoencoder’s architecture consists of three main components that work together to
compress and then reconstruct data which are as follows:
1. Encoder
It compresses the input data into a smaller, more manageable form by reducing its dimensionality while preserving important information. It has three kinds of layers:
• Input Layer: This is where the original data enters the network. It can be images, text features or any other structured data.
• Hidden Layers: These layers perform a series of transformations on the input data. Each hidden layer applies weights and activation functions to capture important patterns, progressively reducing the data's size and complexity.
• Output (Latent Space): The encoder outputs a compressed vector known as the latent representation or encoding. This vector captures the important features of the input data in a condensed form, which helps filter out noise and redundancy.
2. Bottleneck (Latent Space)
It is the smallest layer of the network and represents the most compressed version of the input data. It serves as an information bottleneck that forces the network to prioritize the most significant features. This compact representation helps the model learn the underlying structure and key patterns of the input, enabling better generalization and efficient data encoding.
3. Decoder
It is responsible for taking the compressed representation from the latent space and
reconstructing it back into the original data form.
• Hidden Layers: These layers progressively expand the latent vector back into a higher-dimensional space. Through successive transformations, the decoder attempts to restore the original shape and details of the data.
• Output Layer: The final layer produces the reconstructed output which aims to
closely resemble the original input. The quality of reconstruction depends on how well
the encoder-decoder pair can minimize the difference between the input and output
during training.
Loss Function in Autoencoder Training
During training an autoencoder’s goal is to minimize the reconstruction loss which measures how
different the reconstructed output is from the original input. The choice of loss function depends
on the type of data being processed:
• Mean Squared Error (MSE): This is commonly used for continuous data. It measures
the average squared differences between the input and the reconstructed data.
• Binary Cross-Entropy: Used for binary data (0 or 1 values). It calculates the
difference in probability between the original and reconstructed output.
During training the network updates its weights using backpropagation to minimize this
reconstruction loss. By doing this it learns to extract and retain the most important features of
the input data which are encoded in the latent space.
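As a small illustration, both losses can be computed directly with Keras loss objects; which one to use depends on whether the data is continuous or binary (a toy sketch with made-up numbers):
import tensorflow as tf

original = tf.constant([[0.0, 0.5, 1.0]])
reconstructed = tf.constant([[0.1, 0.4, 0.9]])

mse = tf.keras.losses.MeanSquaredError()(original, reconstructed)     # suited to continuous data
bce = tf.keras.losses.BinaryCrossentropy()(original, reconstructed)   # suited to values in [0, 1]
print(float(mse), float(bce))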
Efficient Representations in Autoencoders
Constraining an autoencoder helps it learn meaningful and compact features from the input data, which leads to more efficient representations. After training, only the encoder part is used to encode similar data for future tasks. Various techniques used to achieve this are as follows:
• Keep Small Hidden Layers: Limiting the size of each hidden layer forces the network to focus on the most important features. Smaller layers reduce redundancy and allow efficient encoding.
• Regularization: Techniques like L1 or L2 regularization add penalty terms to the loss function. This prevents overfitting by penalizing excessively large weights, which helps ensure the model learns general and useful representations.
• Denoising: In denoising autoencoders random noise is added to the input during
training. It learns to remove this noise during reconstruction which helps it focus on
core, noise-free features and helps in improving robustness.
• Tuning the Activation Functions: Adjusting activation functions can promote sparsity
by activating only a few neurons at a time. This sparsity reduces model complexity and
forces the network to capture only the most relevant features.
Types of Autoencoders
Lets see different types of Autoencoders which are designed for specific tasks with unique
features:
1. Denoising Autoencoder
A Denoising Autoencoder is trained to handle corrupted or noisy inputs; it learns to remove the noise and reconstruct clean data. This prevents the network from simply memorizing the input and encourages it to learn the core features.
2. Sparse Autoencoder
Sparse Autoencoder contains more hidden units than input features but only allows a few
neurons to be active simultaneously. This sparsity is controlled by zeroing some hidden units,
adjusting activation functions or adding a sparsity penalty to the loss function.
3. Variational Autoencoder
A Variational Autoencoder (VAE) makes assumptions about the probability distribution of the data and tries to learn a better approximation of it. It uses stochastic gradient descent to optimize and learn the distribution of latent variables. VAEs are used for generating new data, such as creating realistic images or text.
It assumes that the data is generated by a directed graphical model and learns an approximation $q_\phi(z|x)$ to the true posterior $p_\theta(z|x)$, where $\phi$ and $\theta$ are the parameters of the encoder and the decoder respectively.
4. Convolutional Autoencoder
A Convolutional Autoencoder uses convolutional neural networks (CNNs), which are designed for processing images. The encoder extracts features using convolutional layers and the decoder reconstructs the image through deconvolution, also called upsampling.
Implementation of Autoencoders
We will create a simple autoencoder with two Dense layers: an encoder that compresses images
into a 64-dimensional latent vector and a decoder that reconstructs the original image from this
compressed form.
Step 1: Import necessary libraries
We will be using Matplotlib, NumPy, TensorFlow and the MNIST dataset loader for this.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, losses
from tensorflow.keras.models import Model
from keras.datasets import mnist
Step 2: Load the MNIST dataset
We will load the MNIST dataset, which is available as an inbuilt dataset, normalize the pixel values to [0, 1] and reshape the data to fit the model.
(x_train, _), (x_test, _) = mnist.load_data()
self.decoder = tf.keras.Sequential([
layers.Dense(28 * 28, activation='sigmoid'),
layers.Reshape((28, 28, 1))
])
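The encoder and the surrounding model class are abridged above; a minimal sketch of the full model, assuming a subclassed Keras Model with a 64-dimensional latent vector as described earlier, is:
# normalize pixel values to [0, 1] and add a channel dimension
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

latent_dim = 64

class Autoencoder(Model):
    def __init__(self, latent_dim):
        super(Autoencoder, self).__init__()
        # encoder: flatten the image and compress it to the latent vector
        self.encoder = tf.keras.Sequential([
            layers.Flatten(),
            layers.Dense(latent_dim, activation='relu')
        ])
        # decoder: expand the latent vector back to a 28x28 image
        self.decoder = tf.keras.Sequential([
            layers.Dense(28 * 28, activation='sigmoid'),
            layers.Reshape((28, 28, 1))
        ])

    def call(self, x):
        encoded = self.encoder(x)
        return self.decoder(encoded)

autoencoder = Autoencoder(latent_dim)
autoencoder.compile(optimizer='adam', loss=losses.MeanSquaredError())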
autoencoder.fit(x_train, x_train,
epochs=10,
batch_size=256,
shuffle=True,
validation_data=(x_test, x_test))
Output:
Training
Step 5: Visualize original and reconstructed data
Now compare original images and their reconstructions from the autoencoder.
• encoded_imgs = autoencoder.encoder(x_test).numpy(): Passes test images through
the encoder to get their compressed latent representations as NumPy arrays.
• decoded_imgs = autoencoder.decoder(encoded_imgs).numpy(): Reconstructs
images by passing the latent representations through the decoder and converts them
to NumPy arrays.
encoded_imgs = autoencoder.encoder(x_test).numpy()
decoded_imgs = autoencoder.decoder(encoded_imgs).numpy()
n = 6
plt.figure(figsize=(12, 6))
for i in range(n):
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28), cmap='gray')
    plt.title("Original")
    plt.axis('off')

    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28), cmap='gray')
    plt.title("Reconstructed")
    plt.axis('off')
plt.show()
Understanding GANs
Generative Adversarial Networks (GANs) are a framework consisting of two competing neural
networks: a generator that creates fake data and a discriminator that tries to differentiate
between real and fake data. The generator learns to produce increasingly realistic data by trying
to fool the discriminator, while the discriminator becomes better at detecting fake data. This
adversarial training process continues until the generator produces data so realistic that the
discriminator can barely tell the difference from real data.
GAN architecture
GANs consist of two neural networks trained in opposition to one another:
• Generator: Produces synthetic data that mimics the distribution of real training data.
• Discriminator: Attempts to distinguish between real and generated (fake) samples.
The underlying training objective is modeled as a minimax optimization problem, where the Generator seeks to minimize the Discriminator's accuracy and the Discriminator aims to maximize it. This dynamic leads to an equilibrium in which the generated data becomes statistically indistinguishable from the real data.
Understanding Transformers
Transformers are neural networks that use self-attention mechanisms to process data
sequences in parallel. They can focus on all parts of an input simultaneously, which makes them
effective at capturing relationships between elements in sequential data. This architecture
powers modern models like GPT, BERT and ChatGPT, enabling strong performance in language understanding, generation and various other tasks.
Transformer Architecture
Key components of transformers include:
• Self-Attention: Allows each position to attend to all other positions in the sequence
• Encoder-Decoder Architecture: Processes input and generates output sequences
• Positional Encoding: Provides sequence order information since attention is position-
agnostic
Attention mechanism computes relationships between all pairs of positions in a sequence,
enabling the model to focus on relevant parts. This parallel processing capability makes
Transformers highly efficient for training on modern hardware.
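As a rough illustration of the idea (a sketch of single-head scaled dot-product attention, not the full multi-head machinery used in real Transformers):
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # similarity between every pair of positions, scaled by the key dimension
    scores = tf.matmul(q, k, transpose_b=True)
    scores /= tf.math.sqrt(tf.cast(tf.shape(k)[-1], tf.float32))
    weights = tf.nn.softmax(scores, axis=-1)   # how much each position attends to every other
    return tf.matmul(weights, v)

# toy example: a batch of 1 sequence with 4 positions and 8-dimensional representations
x = tf.random.normal((1, 4, 8))
out = scaled_dot_product_attention(x, x, x)    # self-attention: q, k and v all come from x
print(out.shape)   # (1, 4, 8)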
Differences between GAN and Transformers
Aspect | GANs | Transformers
Training Paradigm | Unsupervised adversarial training with competing networks | Supervised learning with next-token prediction
Data Processing | Fixed-size inputs/outputs | Variable-length sequences processed in parallel
Architecture | Generator vs Discriminator competition | Encoder-decoder with attention mechanisms
Training Challenges | Training instability, delicate balance | High computational requirements, quadratic complexity
Pretrained Models | Rarely used, train from scratch | Commonly used (BERT, GPT, T5)
Best Applications | Image/video generation, creative tasks, data augmentation | NLP, sequential data, language modeling
Dependency Modeling | Short-range, local patterns | Long-range, contextual relationships
Real-World Applications
GANs (Generative Adversarial Networks) are ideal when the goal is to create realistic synthetic
data, particularly in visual domains. They perform well in tasks like:
• High-quality image and video generation
• Style transfer and creative applications
• Data augmentation when labeled samples are limited
• Synthetic dataset creation for training deep models
• Deepfakes and media synthesis, where realism is important
Transformers are best suited for tasks involving sequential or structured input. They work well
in:
• Natural language processing such as translation, summarization and sentiment
analysis
• Conversational AI and question answering
• Code generation and programming assistance
• Document understanding and information retrieval
Choosing the Right Architecture
Choose GANs if:
• Creating visual content is the priority
• Unsupervised generation is needed
• A fixed output size is acceptable
• Visual quality matters more than interpretability

Choose Transformers if:
• Processing text or sequential data
• Understanding context is crucial
• Variable input/output lengths are required
• Leveraging pretrained models is preferred
Types of Generative Adversarial Networks (GANs): GANs consist of two neural networks, the
generator and the discriminator that compete with each other. Variants of GANs include:
• Deep Convolutional GAN (DCGAN)
• Conditional GAN (cGAN)
• Cycle-Consistent GAN (CycleGAN)
Deep Convolutional GAN (DCGAN) was proposed by researchers from indico Research and Facebook AI Research. It is widely used in many convolution-based generation techniques. The focus of the paper was to make the training of GANs stable, so the authors proposed several architectural changes suited to computer vision problems. In this article, we will be using DCGAN on the Fashion MNIST dataset to generate images related to clothes.
Need for DCGANs:
DCGANs were introduced to reduce the problem of mode collapse. Mode collapse occurs when the generator becomes biased towards a few outputs and cannot produce outputs covering every variation in the dataset. For example, take the MNIST digits dataset (digits from 0 to 9): we want the generator to generate all types of digits, but sometimes the generator becomes biased towards two or three digits and produces only those. Because of that, the discriminator also becomes optimized for those particular digits only; this state is known as mode collapse. This problem can be reduced by using DCGANs.
Architecture:
The generator of the DCGAN architecture takes a 100-dimensional noise vector sampled from a uniform distribution as input. First, it changes the dimension to 4x4x1024 and then performs a fractionally-strided convolution 4 times with a stride of 1/2 (this means that every time it is applied, it doubles the spatial dimension of the image while reducing the number of output channels). The generated output has dimensions of (64, 64, 3). There are some architectural changes proposed in the generator, such as the removal of all fully connected layers and the use of Batch Normalization, which helps in stabilizing training. In the paper, the authors use the ReLU activation function in all layers of the generator except for the output layer, which uses tanh. We will implement the generator with similar guidelines but not exactly the same architecture.
The role of the discriminator here is to determine whether an image comes from the real dataset or from the generator. The discriminator can be designed much like a convolutional neural network that performs an image classification task. However, the authors of the paper suggested some changes in the discriminator architecture: instead of fully connected layers, they used only strided convolutions with LeakyReLU as the activation function. The input of the discriminator is a single image, either from the dataset or generated, and the output is a score that determines whether the image is real or generated.
Implementation:
In this section we will be discussing the implementation of DCGAN in Keras. Since our dataset is the Fashion MNIST dataset, which contains images of size (28, 28) with 1 color channel instead of (64, 64) with 3 color channels, we need to make a few changes to the architecture; we will discuss these changes as we go along.
In the first step, we need to import the necessary classes such as TensorFlow, Keras, matplotlib, etc. We will
be using TensorFlow version 2. This version of TensorFlow provides inbuilt support for the Keras library as
its default High-level API.
%matplotlib inline
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
from IPython import display
generator = keras.models.Sequential([
keras.layers.Dense(7 * 7 * 128, input_shape =[num_features]),
keras.layers.Reshape([7, 7, 128]),
keras.layers.BatchNormalization(),
keras.layers.Conv2DTranspose(
64, (5, 5), (2, 2), padding ="same", activation ="selu"),
keras.layers.BatchNormalization(),
keras.layers.Conv2DTranspose(
1, (5, 5), (2, 2), padding ="same", activation ="tanh"),
])
generator.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 6272) 633472
_________________________________________________________________
reshape (Reshape) (None, 7, 7, 128) 0
_________________________________________________________________
batch_normalization (BatchNo (None, 7, 7, 128) 512
_________________________________________________________________
conv2d_transpose (Conv2DTran (None, 14, 14, 64) 204864
_________________________________________________________________
batch_normalization_1 (Batch (None, 14, 14, 64) 256
_________________________________________________________________
conv2d_transpose_1 (Conv2DTr (None, 28, 28, 1) 1601
=================================================================
Total params: 840,705
Trainable params: 840,321
Non-trainable params: 384
_________________________________________________________________
Now, we define discriminator architecture, the discriminator takes an image of size 28*28 with 1 color
channel and outputs a scalar value representing an image from either dataset or generated image.
discriminator = keras.models.Sequential([
keras.layers.Conv2D(64, (5, 5), (2, 2), padding ="same", input_shape =[28, 28, 1]),
keras.layers.LeakyReLU(0.2),
keras.layers.Dropout(0.3),
keras.layers.Conv2D(128, (5, 5), (2, 2), padding ="same"),
keras.layers.LeakyReLU(0.2),
keras.layers.Dropout(0.3),
keras.layers.Flatten(),
keras.layers.Dense(1, activation ='sigmoid')
])
discriminator.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 14, 14, 64) 1664
_________________________________________________________________
leaky_re_lu (LeakyReLU) (None, 14, 14, 64) 0
_________________________________________________________________
dropout (Dropout) (None, 14, 14, 64) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 7, 7, 128) 204928
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 7, 7, 128) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 7, 7, 128) 0
_________________________________________________________________
flatten (Flatten) (None, 6272) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 6273
=================================================================
Total params: 212,865
Trainable params: 212,865
Non-trainable params: 0
_________________________________________________________________
Now we need to compile our DCGAN model (the combination of generator and discriminator). We first compile the discriminator on its own, then set its trainable attribute to False so that only the generator is updated when the combined GAN model is trained.
# compile discriminator using binary cross entropy loss and adam optimizer
discriminator.compile(loss ="binary_crossentropy", optimizer ="adam")
# make discriminator no-trainable as of now
discriminator.trainable = False
# Combine both generator and discriminator
gan = keras.models.Sequential([generator, discriminator])
# compile the combined GAN using binary cross entropy loss and the adam optimizer
gan.compile(loss="binary_crossentropy", optimizer="adam")

bce_loss = tf.keras.losses.BinaryCrossentropy()

def generator_loss(preds):
    return bce_loss(tf.ones_like(preds), preds)
def build_generator():
    # label branch: embed the class label and project it to an 8x8 feature map
    in_label = tf.keras.layers.Input(shape=(1,))
    li = tf.keras.layers.Embedding(n_class, 50)(in_label)
    n_nodes = 8 * 8
    li = tf.keras.layers.Dense(n_nodes)(li)
    li = tf.keras.layers.Reshape((8, 8, 1))(li)
    # latent noise branch
    in_lat = tf.keras.layers.Input(shape=(noise_dim,))
    n_nodes = 128 * 8 * 8
    gen = tf.keras.layers.Dense(n_nodes)(in_lat)
    gen = tf.keras.layers.LeakyReLU(alpha=0.2)(gen)
    gen = tf.keras.layers.Reshape((8, 8, 128))(gen)
    # merge the two branches, then upsample to a 32x32x3 image
    merge = tf.keras.layers.Concatenate()([gen, li])
    gen = tf.keras.layers.Conv2DTranspose(
        128, (4, 4), strides=(2, 2), padding='same')(merge)
    gen = tf.keras.layers.LeakyReLU(alpha=0.2)(gen)
    gen = tf.keras.layers.Conv2DTranspose(
        128, (4, 4), strides=(2, 2), padding='same')(gen)
    gen = tf.keras.layers.LeakyReLU(alpha=0.2)(gen)
    out_layer = tf.keras.layers.Conv2D(
        3, (8, 8), activation='tanh', padding='same')(gen)
    model = tf.keras.Model([in_lat, in_label], out_layer)
    return model
g_model = build_generator()
g_model.summary()
Output:
Building the Generator Model
Step 6: Building the Discriminator Model
• Input: image and label.
• Embed label into a 50-dimensional vector.
• Reshape and concatenate label embedding with the input image.
• Apply two Conv2D layers with LeakyReLU activations to extract features.
• Flatten features, apply dropout to prevent overfitting.
• Final dense layer with sigmoid activation outputs probability of real or fake.
def build_discriminator():
    # layer sizes below are assumptions consistent with the bullets above and the
    # 32x32x3 generator output
    in_label = tf.keras.layers.Input(shape=(1,))
    li = tf.keras.layers.Embedding(n_class, 50)(in_label)
    li = tf.keras.layers.Dense(32 * 32)(li)
    li = tf.keras.layers.Reshape((32, 32, 1))(li)
    in_image = tf.keras.layers.Input(shape=(32, 32, 3))
    merge = tf.keras.layers.Concatenate()([in_image, li])
    fe = tf.keras.layers.Conv2D(64, (3, 3), strides=(2, 2), padding='same')(merge)
    fe = tf.keras.layers.LeakyReLU(alpha=0.2)(fe)
    fe = tf.keras.layers.Conv2D(128, (3, 3), strides=(2, 2), padding='same')(fe)
    fe = tf.keras.layers.LeakyReLU(alpha=0.2)(fe)
    fe = tf.keras.layers.Flatten()(fe)
    fe = tf.keras.layers.Dropout(0.4)(fe)
    out_layer = tf.keras.layers.Dense(1, activation='sigmoid')(fe)
    model = tf.keras.Model([in_image, in_label], out_layer)
    return model
d_model = build_discriminator()
d_model.summary()
Step 7: Creating Training Step Function
• Use TensorFlow’s Gradient Tape to calculate and apply gradients for both networks.
• Alternate training discriminator on real and fake data.
• Train generator to fool discriminator.
• Use @tf.function for efficient graph execution.
@tf.function
def train_step(dataset):
    # the body of this step is expanded in the sketch below
    ...

# Display a grid of generated samples, one row per class label
# (num_samples, tags and the keras image helper are assumed from the original tutorial)
fig, axes = plt.subplots(10, num_samples, figsize=(num_samples, 10))
for l in np.arange(10):
    random_noise = tf.random.normal(shape=(num_samples, noise_dim))
    label = tf.ones(num_samples) * l
    gen_imgs = g_model.predict([random_noise, label])
    for j in range(gen_imgs.shape[0]):
        img = image.array_to_img(gen_imgs[j], scale=True)
        axes[l, j].imshow(img)
        axes[l, j].yaxis.set_ticks([])
        axes[l, j].xaxis.set_ticks([])
        if j == 0:
            axes[l, j].set_ylabel(tags[l])
plt.show()
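The body of the training step is abridged above; a minimal sketch that is consistent with these bullets (it assumes g_model, d_model, bce_loss and noise_dim from the earlier snippets, separate Adam optimizers, and batches that yield (images, labels) pairs) could look like this:
g_optimizer = tf.keras.optimizers.Adam(1e-4)
d_optimizer = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(image_batch):
    images, labels = image_batch
    labels = tf.reshape(labels, (-1, 1))
    noise = tf.random.normal((tf.shape(images)[0], noise_dim))
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = g_model([noise, labels], training=True)
        real_preds = d_model([images, labels], training=True)
        fake_preds = d_model([fake_images, labels], training=True)
        # discriminator: real -> 1, fake -> 0; the generator wants its fakes classified as 1
        d_loss = bce_loss(tf.ones_like(real_preds), real_preds) + \
                 bce_loss(tf.zeros_like(fake_preds), fake_preds)
        g_loss = bce_loss(tf.ones_like(fake_preds), fake_preds)
    d_grads = d_tape.gradient(d_loss, d_model.trainable_variables)
    g_grads = g_tape.gradient(g_loss, g_model.trainable_variables)
    d_optimizer.apply_gradients(zip(d_grads, d_model.trainable_variables))
    g_optimizer.apply_gradients(zip(g_grads, g_model.trainable_variables))
    return d_loss, g_loss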
Step 9: Train the Model
• At the final step we will start training the model for specified epochs.
• Print losses regularly to monitor performance.
• Longer training typically results in higher quality images.
d_loss_list, g_loss_list = [], []

def train(dataset, epochs=epoch_count):
    for epoch in range(epochs):
        for image_batch in tqdm(dataset):
            d_loss, g_loss = train_step(image_batch)
            d_loss_list.append(d_loss)
            g_loss_list.append(g_loss)
        # print the losses at the end of every epoch to monitor training
        print(f"Epoch {epoch + 1}/{epochs}  d_loss: {float(d_loss):.4f}  g_loss: {float(g_loss):.4f}")

train(dataset, epochs=epoch_count)
2. Backward Cycle Consistency Loss: Ensures that when we apply F and then G to an image we get back the original image.
For example: $y \rightarrow F(y) \rightarrow G(F(y)) \approx y$
Generator Architecture
Each CycleGAN generator has three main sections:
1. Encoder: The input image is passed through three convolution layers which extract
features and compress the image while increasing the number of channels. For
example a 256×256×3 image is reduced to 64×64×256 after this step.
2. Transformer: The encoded image is processed through 6 or 9 residual blocks
depending on the input size which helps retain important image details.
3. Decoder: The transformed image is up-sampled using two deconvolution layers, restoring it to its original size.
Generator Structure:
c7s1-64 → d128 → d256 → R256 (×6 or 9) → u128 → u64 → c7s1-3
• c7s1-k: 7×7 convolution layer with k filters.
• dk: 3×3 convolution with stride 2 (down-sampling).
• Rk: Residual block with two 3×3 convolutions.
• uk: Fractional-stride deconvolution (up-sampling).
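For instance, the Rk block in this notation can be sketched in Keras as follows (the paper uses instance normalization and reflection padding; batch normalization and 'same' padding are used here to keep the sketch dependency-free):
import tensorflow as tf

def residual_block(x, k=256):
    # Rk: two 3x3 convolutions with k filters and a skip connection
    shortcut = x
    y = tf.keras.layers.Conv2D(k, 3, padding='same')(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(k, 3, padding='same')(y)
    y = tf.keras.layers.BatchNormalization()(y)
    return tf.keras.layers.Add()([shortcut, y])

# e.g. applied to the 64x64x256 feature map produced by the encoder stage
features = tf.keras.Input(shape=(64, 64, 256))
out = residual_block(features)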
4. Photo Generation from Paintings: CycleGAN can transform a painting into a photo and vice versa. This is useful for artistic applications where you want to blend the look of photos with artistic styles. An additional identity loss is used for this task, defined as:

$L_{identity}(G, F) = \mathbb{E}_{y \sim p(y)}[\lVert G(y) - y \rVert_1] + \mathbb{E}_{x \sim p(x)}[\lVert F(x) - x \rVert_1]$
5. Photo Enhancement: CycleGAN can enhance photos taken with smartphone cameras which
typically have a deeper depth of field to look like those taken with DSLR cameras which have a
shallower depth of field. This application is valuable for image quality improvement.
Evaluating CycleGAN’s Performance
• AMT Perceptual Studies: These involve real people reviewing generated images to see if they look real. This is like a voting system where participants on Amazon Mechanical Turk compare AI-created images with actual ones.
• FCN Scores: These help measure accuracy, especially on datasets like Cityscapes. The scores check how well the AI understands objects in images by evaluating pixel accuracy and IoU (Intersection over Union), which measures how well the shapes of objects match the real ones.
state = next_state
episode_reward += reward
if done:
break
for _ in range(num_eval_episodes):
    state = env.reset()
    eval_reward = 0
    for _ in range(max_steps_per_episode):
        action = np.argmax(dqn_agent(state[np.newaxis, :]))
        next_state, reward, done, _ = env.step(action)
        eval_reward += reward
        state = next_state
        if done:
            break
    eval_rewards.append(eval_reward)
average_eval_reward = np.mean(eval_rewards)
print(f"Average Evaluation Reward: {average_eval_reward}")
Output:
Average Evaluation Reward: 180.1
Reinforcement Learning
Reinforcement Learning (RL) is a branch of machine learning that focuses on how agents can
learn to make decisions through trial and error to maximize cumulative rewards. RL allows
machines to learn by interacting with an environment and receiving feedback based on their
actions. This feedback comes in the form of rewards or penalties.
Reinforcement Learning revolves around the idea that an agent (the learner or decision-maker)
interacts with an environment to achieve a goal. The agent performs actions and receives
feedback to optimize its decision-making over time.
• Agent: The decision-maker that performs actions.
• Environment: The world or system in which the agent operates.
• State: The situation or condition the agent is currently in.
• Action: The possible moves or decisions the agent can make.
• Reward: The feedback or result from the environment based on the agent’s action.
How Reinforcement Learning Works?
The RL process involves an agent performing actions in an environment, receiving rewards or
penalties based on those actions, and adjusting its behavior accordingly. This loop helps the
agent improve its decision-making over time to maximize the cumulative reward.
Here’s a breakdown of RL components:
• Policy: A strategy that the agent uses to determine the next action based on the
current state.
• Reward Function: A function that provides feedback on the actions taken, guiding the
agent towards its goal.
• Value Function: Estimates the future cumulative rewards the agent will receive from
a given state.
• Model of the Environment: A representation of the environment that predicts future
states and rewards, aiding in planning.
Reinforcement Learning Example: Navigating a Maze
Imagine a robot navigating a maze to reach a diamond while avoiding fire hazards. The goal is to
find the optimal path with the least number of hazards while maximizing the reward:
• Each time the robot moves correctly, it receives a reward.
• If the robot takes the wrong path, it loses points.
The robot learns by exploring different paths in the maze. By trying various moves, it evaluates
the rewards and penalties for each path. Over time, the robot determines the best route by
selecting the actions that lead to the highest cumulative reward.
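A tiny tabular Q-learning sketch of this idea on a toy one-dimensional corridor (states 0–4 with the diamond at state 4; the rewards and hyperparameters are made up for illustration):
import numpy as np

n_states, n_actions = 5, 2                    # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))           # table of action values
alpha, gamma, epsilon = 0.1, 0.9, 0.2         # learning rate, discount, exploration rate

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 10 if next_state == n_states - 1 else -1   # diamond on the right, small penalty per move
    return next_state, reward, next_state == n_states - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # explore occasionally, otherwise pick the currently best action
        action = np.random.randint(n_actions) if np.random.rand() < epsilon else int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate toward reward + discounted future value
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(Q)   # the learned values favour moving right toward the diamond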
print(f"Action: {action}, Reward: {reward}, Next State: {next_state}, Done: {done}, Info: {info}")
if terminated:
    state = env.reset()  # Reset the environment if the episode is finished
Markov Decision Process (MDP) is a way to describe how a decision-making agent like a robot or
game character moves through different situations while trying to achieve a goal. MDPs rely on
variables such as the environment, agent’s actions and rewards to decide the system’s next
optimal action. It helps us answer questions like:
• What actions should the agent take?
• What happens after an action?
• Is the result good or bad?
In artificial intelligence, Markov Decision Processes (MDPs) are used to model situations where decisions are made one after another and the results of actions are uncertain. They help in designing smart machines or agents that need to work in environments where each action might lead to different outcomes.
Key Components of an MDP
An MDP has five main parts:
Types of Autoencoders
Autoencoders are a type of neural network designed to learn efficient data representations. They
work by compressing input data into a smaller, dense format called the latent space using an
encoder and then reconstructing the original input from this compressed form using a decoder.
This makes autoencoders useful for tasks such as dimensionality reduction, feature extraction
and noise removal. In this article, we’ll see various types of autoencoders and their core concepts.
1. Vanilla Autoencoder
n = 10
encoding_dim = 32
input_img = tf.keras.Input(shape=(784,))
encoded = tf.keras.layers.Dense(encoding_dim, activation='relu')(input_img)
decoded = tf.keras.layers.Dense(784, activation='sigmoid')(encoded)
autoencoder = tf.keras.Model(input_img, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(x_train_flat, x_train_flat,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test_flat, x_test_flat))
decoded_imgs = autoencoder.predict(x_test_flat)
plt.figure(figsize=(20, 4))
for i in range(n):
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test_flat[i].reshape(28, 28), cmap='gray')
    ax.axis('off')

    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()
2. Sparse Autoencoder
• Sparse Autoencoder add sparsity constraints that encourage only a small subset of
neurons in the hidden layer to activate at once helps in creating a more efficient and
focused representation.
• Unlike vanilla models, they include regularization methods like L1 penalty and
dropout to enforce sparsity.
• KL Divergence is used to maintain the sparsity level by matching the latent
distribution to a predefined sparse target.
• This selective activation helps in feature selection and learning meaningful patterns
while ignoring irrelevant noise.
Applications of Sparse Autoencoders
1. Feature Selection: Highlights the most relevant features by encouraging sparse
activation helps in improving interpretability.
2. Dimensionality Reduction: Creates efficient, low-dimensional representations by
limiting active neurons.
3. Noise Reduction: Reduces irrelevant information and noise by activating only key
neurons helps in improving model generalization.
Now lets see the practical implementation.
• encoded = tf.keras.layers.Dense(encoding_dim, activation='relu',
activity_regularizer=tf.keras.regularizers.l1(1e-5))(input_img): Creates the encoded
layer with ReLU activation and adds L1 regularization to encourage sparsity.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist
n = 10
encoding_dim = 32
input_img = tf.keras.Input(shape=(784,))
encoded = tf.keras.layers.Dense(encoding_dim, activation='relu',
activity_regularizer=tf.keras.regularizers.l1(1e-5))(input_img)
decoded = tf.keras.layers.Dense(784, activation='sigmoid')(encoded)
sparse_autoencoder = tf.keras.Model(input_img, decoded)
sparse_autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
sparse_autoencoder.fit(x_train_flat, x_train_flat,
                       epochs=50,
                       batch_size=256,
                       shuffle=True,
                       validation_data=(x_test_flat, x_test_flat))
decoded_imgs = sparse_autoencoder.predict(x_test_flat)
plt.figure(figsize=(20, 4))
for i in range(n):
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test_flat[i].reshape(28, 28), cmap='gray')
    ax.axis('off')

    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()
Training
3. Denoising Autoencoder
• Denoising Autoencoders are designed to handle corrupted or noisy inputs by
learning to reconstruct the clean, original data.
• Training involves feeding intentionally corrupted inputs and minimizing the
reconstruction error against the clean version.
• This approach forces the model to capture robust features that are invariant to noise.
Applications of Denoising Autoencoders
1. Image Denoising: Removes noise from images to increase quality and improve
downstream processing.
2. Signal Cleaning: Filters noise from audio and sensor signals helps in boosting
detection accuracy.
3. Data Preprocessing: Cleans corrupted data before input to other models helps in
increasing robustness and performance.
Now lets see the practical implementation.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist
n = 10
encoding_dim = 32
input_img = tf.keras.Input(shape=(784,))
encoded = tf.keras.layers.Dense(encoding_dim, activation='relu')(input_img)
decoded = tf.keras.layers.Dense(784, activation='sigmoid')(encoded)
denoising_autoencoder = tf.keras.Model(input_img, decoded)
denoising_autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
noise_factor = 0.5
x_train_noisy = x_train_flat + noise_factor * np.random.normal(loc=0.0, scale=1.0,
size=x_train_flat.shape)
x_test_noisy = x_test_flat + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_test_flat.shape)
x_train_noisy = np.clip(x_train_noisy, 0., 1.)
x_test_noisy = np.clip(x_test_noisy, 0., 1.)
denoising_autoencoder.fit(x_train_noisy, x_train_flat,
epochs=50,
batch_size=256,
shuffle=True,
validation_data=(x_test_noisy, x_test_flat))
decoded_imgs = denoising_autoencoder.predict(x_test_noisy)
plt.figure(figsize=(20, 6))
for i in range(n):
    ax = plt.subplot(3, n, i + 1)
    plt.imshow(x_test_flat[i].reshape(28, 28), cmap='gray')
    ax.axis('off')

    ax = plt.subplot(3, n, i + 1 + n)
    plt.imshow(x_test_noisy[i].reshape(28, 28), cmap='gray')
    ax.axis('off')

    ax = plt.subplot(3, n, i + 1 + 2 * n)
    plt.imshow(decoded_imgs[i].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()
4. Undercomplete Autoencoder
• Undercomplete Autoencoders intentionally restrict the size of the hidden layer to be
smaller than the input layer.
• This bottleneck forces the model to compress the data helps in learning only the
most significant features and discarding redundant information.
• The model is trained by minimizing the reconstruction error while ensuring the latent
space remains compact.
Applications of Undercomplete Autoencoders
• Anomaly Detection: Detects unusual data points by capturing deviations in
compressed features.
• Feature Extraction: Focuses on key data characteristics to improve classification and
analysis.
• Data Compression: Encodes input data efficiently to save storage and speed up
transmission.
Now lets see the practical implementation.
• encoded = tf.keras.layers.Dense(encoding_dim, activation='relu')(input_img): Builds the encoder layer with ReLU activation.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist
n = 10
encoding_dim = 16
input_img = tf.keras.Input(shape=(784,))
encoded = tf.keras.layers.Dense(encoding_dim, activation='relu')(input_img)
decoded = tf.keras.layers.Dense(784, activation='sigmoid')(encoded)
undercomplete_autoencoder = tf.keras.Model(input_img, decoded)
undercomplete_autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
undercomplete_autoencoder.fit(x_train_flat, x_train_flat,
                              epochs=50,
                              batch_size=256,
                              shuffle=True,
                              validation_data=(x_test_flat, x_test_flat))
decoded_imgs = undercomplete_autoencoder.predict(x_test_flat)
plt.figure(figsize=(20, 4))
for i in range(n):
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test_flat[i].reshape(28, 28), cmap='gray')
    ax.axis('off')

    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()
5. Contractive Autoencoder
• Contractive Autoencoders introduce an additional penalty during training to make
the learned representations robust to small changes in input data.
• They minimize both reconstruction error and a regularization term that penalizes
sensitivity to input perturbations.
• This results in stable, invariant features useful in noisy or fluctuating environments.
Applications of Contractive Autoencoders
1. Stable Representation: Learns features that remain consistent despite small input
variations.
2. Transfer Learning: Provides robust feature vectors for tasks with limited labeled data.
3. Data Augmentation: Generates stable variants of input data to increase training
diversity.
Now lets see the practical implementation.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist
n = 10
encoding_dim = 32
input_img = tf.keras.Input(shape=(784,))
encoded = tf.keras.layers.Dense(encoding_dim, activation='relu')(input_img)
decoded = tf.keras.layers.Dense(784, activation='sigmoid')(encoded)
contractive_autoencoder = tf.keras.Model(input_img, decoded)
contractive_autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
contractive_autoencoder.fit(x_train_flat, x_train_flat,
epochs=50,
batch_size=256,
shuffle=True,
validation_data=(x_test_flat, x_test_flat))
decoded_imgs = contractive_autoencoder.predict(x_test_flat)
plt.figure(figsize=(20, 4))
for i in range(n):
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test_flat[i].reshape(28, 28), cmap='gray')
    ax.axis('off')

    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28), cmap='gray')
    ax.axis('off')
6. Convolutional Autoencoder
• Convolutional Autoencoders use convolutional layers to effectively capture spatial
and hierarchical features in high-dimensional data such as images.
• These models optimize reconstruction error using loss functions suited for images like
mean squared error or binary cross-entropy.
• The architecture helps in handling structured inputs by preserving spatial relationships.
Applications of Convolutional Autoencoders
Convolutional autoencoders find applications in various domains where hierarchical features are
important. Some applications include:
1. Image Reconstruction: Restores high-quality images from compressed latent codes.
2. Image Denoising: Removes noise while preserving spatial detail in images.
3. Feature Extraction: Captures hierarchical spatial features for tasks like classification
and segmentation.
Now lets see the practical implementation.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist
n = 10
input_img = tf.keras.Input(shape=(28, 28, 1))
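The convolutional layers themselves are abridged here; a minimal encoder–decoder sketch consistent with the surrounding code (the preprocessing into x_train_cnn and x_test_cnn shown first is an assumption) is:
# assumed preprocessing: scale to [0, 1] and add a channel dimension
(x_train, _), (x_test, _) = fashion_mnist.load_data()
x_train_cnn = x_train.astype('float32').reshape(-1, 28, 28, 1) / 255.
x_test_cnn = x_test.astype('float32').reshape(-1, 28, 28, 1) / 255.

# encoder: two Conv2D + MaxPooling stages (28 -> 14 -> 7)
x = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
x = tf.keras.layers.MaxPooling2D((2, 2), padding='same')(x)
x = tf.keras.layers.Conv2D(16, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.MaxPooling2D((2, 2), padding='same')(x)

# decoder: mirror the encoder with upsampling (7 -> 14 -> 28)
x = tf.keras.layers.Conv2D(16, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.UpSampling2D((2, 2))(x)
x = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.UpSampling2D((2, 2))(x)
decoded = tf.keras.layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

conv_autoencoder = tf.keras.Model(input_img, decoded)
conv_autoencoder.compile(optimizer='adam', loss='binary_crossentropy')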
conv_autoencoder.fit(x_train_cnn, x_train_cnn,
epochs=50,
batch_size=256,
shuffle=True,
validation_data=(x_test_cnn, x_test_cnn))
decoded_imgs = conv_autoencoder.predict(x_test_cnn)
plt.figure(figsize=(20, 4))
for i in range(n):
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test_cnn[i].reshape(28, 28), cmap='gray')
    ax.axis('off')

    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()
7. Variational Autoencoder
Variational Autoencoder (VAEs) extend traditional autoencoders by learning probabilistic latent
distributions instead of fixed representations. Training optimizes the Evidence Lower Bound
(ELBO) which balances:
1. Reconstruction loss to ensure accurate data reconstruction.
2. KL Divergence to regularize the latent space towards a standard Gaussian helps in
preventing overfitting and smooth latent structure.
By balancing these two terms VAEs can generate meaningful outputs while keeping the latent
space structured.
Applications of Variational Autoencoders (VAEs)
Here are some common applications:
1. Image Generation: Creates new realistic images by sampling from learned latent
distributions.
2. Anomaly Detection: Identifies anomalies by measuring how well input data is
reconstructed.
3. Dimensionality Reduction: Produces low-dimensional latent spaces useful for
visualization and clustering.
Now lets see the practical implementation.
• x_train = np.reshape(x_train, (len(x_train), 28, 28, 1)) : Reshapes training images to
28x28 with 1 channel for Conv2D input.
• input_img = tf.keras.Input(shape=(28, 28, 1)) : Defines input layer for grayscale
images with shape 28x28x1.
• tf.keras.layers.MaxPooling2D((2, 2), padding='same')(x) : Reduces spatial
dimensions by half using max pooling with same padding.
• decoded = tf.keras.layers.Conv2D(1, (3, 3), activation='sigmoid',
padding='same')(x) : Outputs reconstructed image with 1 channel and sigmoid
activation for pixel values between 0 and 1.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.datasets import fashion_mnist
latent_dim = 2
n = 10
def sampling(args):
    z_mean, z_log_var = args
    batch = tf.shape(z_mean)[0]
    dim = tf.shape(z_mean)[1]
    epsilon = tf.random.normal(shape=(batch, dim))
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon
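The convolutional encoder that produces z_mean and z_log_var is abridged in the text; a minimal sketch (the layer sizes mirror the decoder defined below) is:
encoder_inputs = tf.keras.Input(shape=(28, 28, 1))
x = tf.keras.layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(encoder_inputs)
x = tf.keras.layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(16, activation='relu')(x)
z_mean = tf.keras.layers.Dense(latent_dim, name='z_mean')(x)
z_log_var = tf.keras.layers.Dense(latent_dim, name='z_log_var')(x)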
z = tf.keras.layers.Lambda(sampling)([z_mean, z_log_var])
latent_inputs = tf.keras.Input(shape=(latent_dim,))
x = tf.keras.layers.Dense(7 * 7 * 64, activation='relu')(latent_inputs)
x = tf.keras.layers.Reshape((7, 7, 64))(x)
x = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding='same',
activation='relu')(x)
x = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding='same',
activation='relu')(x)
decoder_outputs = tf.keras.layers.Conv2DTranspose(1, 3, padding='same',
                                                  activation='sigmoid')(x)
decoder = tf.keras.Model(latent_inputs, decoder_outputs, name='decoder')
outputs = decoder(z)
class VAELossLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(VAELossLayer, self).__init__(**kwargs)

    def call(self, inputs):
        x, x_decoded, z_mean, z_log_var = inputs
        reconstruction_loss = tf.keras.losses.binary_crossentropy(
            K.flatten(x), K.flatten(x_decoded)
        )
        reconstruction_loss *= 28 * 28
        # KL divergence pushes the latent distribution towards a standard Gaussian
        kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
        self.add_loss(K.mean(reconstruction_loss + kl_loss))
        return x_decoded
vae_outputs = VAELossLayer()([encoder_inputs, outputs, z_mean, z_log_var])
vae = tf.keras.Model(encoder_inputs, vae_outputs)
vae.compile(optimizer='adam')
decoded_imgs = vae.predict(x_test_cnn)
plt.figure(figsize=(20, 4))
for i in range(n):
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test_cnn[i].reshape(28, 28), cmap='gray')
    ax.axis('off')

    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28), cmap='gray')
    ax.axis('off')
plt.show()