B19ADT602 – DEEP LEARNING
UNIT – I BASICS OF DEEP LEARNING
Linear Algebra: Scalars
• In linear algebra, scalars are single numbers. They can be positive, negative, zero, fractions, or decimals: any real number. Scalars are used to scale vectors, matrices, or other quantities by multiplying them. Unlike vectors (which have direction and magnitude) or matrices (which organize numbers in rows and columns), scalars are just numbers without any additional structure.
• Real-Life Analogy
• Imagine you’re making lemonade. The recipe
says:
• For 1 glass of lemonade, you need 2 tablespoons
of lemon juice.
• Now, if you want to make 3 glasses of lemonade,
you need to scale up the ingredients:
• Multiply 2 tablespoons by 3 glasses.
• Lemon juice required = 2 × 3 = 6 tablespoons.
• Here, the 3 is the scalar — it scales the amount
of lemon juice.
• Example in Linear Algebra
• Let’s say we have a vector representing a car’s motion:
v = [2, 3]
Here:
2 could represent movement 2 meters to the right (x-direction).
3 could represent movement 3 meters upward (y-direction).
Now, if we multiply this vector by a scalar k = 3, we get:
k·v = [2×3, 3×3] = [6, 9]
This means:
The car moves 6 meters to the right and 9 meters upward.
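The scaling above can be checked in a few lines of NumPy; this is just a sketch using the vector [2, 3] and scalar k = 3 from the motion example:

```python
import numpy as np

# The car's motion: 2 m right (x-direction), 3 m up (y-direction)
v = np.array([2, 3])
k = 3  # the scalar

scaled = k * v  # scalar multiplication scales every component
print(scaled)   # [6 9]: 6 m right, 9 m up
```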
• Real-Life Example
• Think about zooming in or out on a photo:
• If you scale the photo by k = 2, everything becomes twice as large (zoom in).
• If you scale by k = 0.5, everything becomes half as large (zoom out).
• In this case:
• The scalar k represents the scaling factor.
Scalars in Deep Learning
• 1. Learning Rate (Scalar to Adjust Model Updates)
• The learning rate is a scalar value that controls how much a
model adjusts its parameters (weights and biases) during training.
• Real-Life Example:
• Imagine you’re learning to ride a bicycle:
• If you steer too hard (large adjustments), you might fall or go off
track.
• If you steer too little (tiny adjustments), you’ll take a long time to
learn.
• Similarly:
• A small learning rate slows down training but ensures accuracy.
• A large learning rate speeds up training but risks overshooting the
correct solution.
• 2. Normalization (Scaling Input Data)
• In deep learning, input data is often scaled using a
scalar value to ensure that all features (like
height, weight, or age) are in a similar range. This
helps the model learn faster and perform better.
• Real-Life Example:
• Imagine you’re comparing heights in meters (e.g.,
1.75) and weights in kilograms (e.g., 75). The
numbers are very different in size, making it hard
to compare. By dividing each feature by a scalar
(e.g., maximum value in the dataset), both can be
scaled to a similar range (0 to 1).
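A minimal sketch of this divide-by-the-maximum scaling, assuming a small hypothetical dataset of heights and weights:

```python
import numpy as np

# Hypothetical features on very different scales
heights = np.array([1.60, 1.75, 1.90])   # metres
weights = np.array([55.0, 75.0, 95.0])   # kilograms

# Divide each feature by its maximum (a scalar) so both lie in (0, 1]
heights_scaled = heights / heights.max()
weights_scaled = weights / weights.max()
```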
• 3. Weights and Biases (Scaling Data Inside the Model)
• Deep learning models use weights and biases, which are
scalar values, to transform input data into meaningful
outputs.
• Real-Life Example:
• Imagine you’re baking cookies. The weight of the flour
you add determines the size of the cookie. Adjusting
these weights in the right way ensures that your cookies
turn out perfect!
• In a neural network:
• Weights decide how much importance to give to a
feature.
• Biases act as a “starting point” for the model’s prediction.
• 4. Loss Function and Gradients (Scaling Error)
• A loss function calculates how far off the model’s
prediction is from the actual answer. The scalar loss value
is used to update the model to improve its accuracy.
• Real-Life Example:
• If you’re practicing for a math test and score 70 out of
100:
• The “30” marks you missed represent the error (loss).
• You use this scalar loss value to focus on the weak areas
in your preparation.
• In deep learning:
• The scalar loss guides the model to improve predictions
in the next step.
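As a sketch, the mean squared error below produces exactly such a scalar loss; the predictions and targets are illustrative numbers:

```python
# Mean squared error: reduces many prediction errors to one scalar
def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# One scalar number summarises how far off the model is
loss = mse([0.7, 0.2], [1.0, 0.0])
```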
• 5. Dropout and Regularization (Scaling Weights to
Avoid Overfitting)
• Regularization techniques like dropout or L2
regularization use scalars to control how much
adjustment is applied to weights during training. This
prevents the model from memorizing the training data
(overfitting).
• Real-Life Example:
• Imagine studying for a test:
• If you only memorize answers, you’ll fail when
questions are slightly different.
• By practicing more generally (like scaling focus across
topics), you perform better in new scenarios.
VECTORS
Vectors are ordered arrays of single numbers and are an example of a 1st-order tensor. Vectors are members of objects known as vector spaces.
A vector space can be thought of as the entire
collection of all possible vectors of a particular
length (or dimension). The three-dimensional real-valued vector space, denoted by R^3, is often used to represent our real-world notion of three-dimensional space mathematically.
More formally a vector space is an n-
dimensional Cartesian product of a set with
itself, along with proper definitions on how to
add vectors and multiply them with scalar
values. If all of the scalars in a vector are real-valued, then the notation x ∈ R^n states that the (boldface lowercase) vector value x is a member of the n-dimensional vector space of real numbers, R^n.
Sometimes it is necessary to identify
the components of a vector explicitly. The ith
scalar element of a vector is written as xi
Notice that this is non-bold lowercase since the
element is a scalar. An n-dimensional vector itself
can be explicitly written using the following
notation:
x = [x1, x2, …, xn]^T
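In code, an n-dimensional vector is simply a 1-D array; the element values below are arbitrary:

```python
import numpy as np

x = np.array([4.0, -2.0, 7.0])  # a vector in R^3

x_1 = x[0]      # the 1st scalar element (0-indexed in code)
n = x.shape[0]  # the dimension n of the vector
```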
Given that scalars exist to represent values why are
vectors necessary? One of the primary use cases
for vectors is to represent physical quantities that
have both a magnitude and a direction. Scalars are
only capable of representing magnitudes.
For instance scalars and vectors encode the
difference between the speed of a car and
its velocity. The velocity contains not only its speed
but also its direction of travel. It is not difficult to
imagine many more physical quantities that
possess similar characteristics such as gravitational
and electromagnetic forces or wind velocity.
In machine learning vectors often
represent feature vectors, with their individual
components specifying how important a
particular feature is. Such features could include
relative importance of words in a text
document, the intensity of a set of pixels in a
two-dimensional image or historical price values
for a cross-section of financial instruments.
Matrices
• Matrices are rectangular arrays consisting of numbers and are an example of 2nd-order tensors. If m and n are positive integers, that is m, n ∈ N, then the m×n matrix contains mn numbers, with m rows and n columns.
• If all of the scalars in a matrix are real-valued, then the matrix is denoted with uppercase boldface letters, such as A ∈ R^(m×n).
• That is, the matrix lives in an m×n-dimensional real-valued vector space. Hence matrices are really vectors that are just written in a two-dimensional table-like manner.
Its components are now identified by two indices i and j. i represents the index to the
matrix row, while j represents the index to the matrix column. Each component of A is
identified by aij.
• It is often useful to abbreviate the full matrix
component display into the following
expression:
A = [aij]m×n
where aij is referred to as the (i, j)-element of the matrix A. The subscript m×n can be dropped if the dimension of the matrix is clear from the context.
Note that a column vector is a size m×1 matrix,
since it has m rows and 1 column. Unless
otherwise specified all vectors will be
considered to be column vectors.
Matrices represent a type of function known as
a linear map. Based on rules that will be
outlined in subsequent articles, it is possible to
define multiplication operations between
matrices or between matrices and vectors. Such
operations are immensely important across the
physical sciences, quantitative finance,
computer science and machine learning.
In deep learning neural network weights are
stored as matrices, while feature inputs are
stored as vectors. Formulating the problem in
terms of linear algebra allows compact handling
of these computations. By casting the problem
in terms of tensors and utilising the machinery
of linear algebra, rapid training times on modern
GPU hardware can be obtained.
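A toy sketch of this formulation: a weight matrix applied to a feature vector in one matrix-vector product. The numbers are made up for illustration:

```python
import numpy as np

# A layer's weights: W in R^{2x3} maps 3 input features to 2 outputs
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 2.0,  0.0]])
x = np.array([3.0, 1.0, 2.0])  # feature vector in R^3

y = W @ x  # the linear part of a layer, as one matrix-vector product
```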
EIGEN DECOMPOSITION
• Eigen decomposition is a method used in
linear algebra to break down a square matrix
into simpler components called eigenvalues
and eigenvectors. This process helps us
understand how a matrix behaves and how it
transforms data.
• Fundamental Theory of Eigen Decomposition
• Eigen decomposition separates a matrix into its
eigenvalues and eigenvectors. Mathematically, for
a square matrix A, if there exists a scalar λ
(eigenvalue) and a non-zero vector v (eigenvector)
such that:
Av = λv
• Where:
• A is the matrix.
• λ is the eigenvalue.
• v is the eigenvector.
Then the matrix A can be represented as:
A = VΛV^(-1)
• Where:
• V is the matrix whose columns are the eigenvectors.
• Λ is the diagonal matrix of eigenvalues.
• V^(-1) is the inverse of V.
• This decomposition is significant because it
transforms matrix operations into simpler,
scalar operations involving eigenvalues,
making computations easier.
How to Perform Eigen decomposition?
• To perform Eigen decomposition on a matrix, follow these
steps:
Step 1: Find the Eigenvalues:
• Solve the characteristic equation:
• det(A − λI) = 0
• Here, A is the square matrix, λ is the eigenvalue, and I is the
identity matrix of the same dimension as A.
Step 2: Find the Eigenvectors:
• For each eigenvalue λ, substitute it back into the equation:
• (A−λI)v=0
• This represents a system of linear equations where v is the
eigenvector corresponding to the eigenvalue λ.
Step 3: Construct the Eigenvector Matrix V:
• Place all the eigenvectors as columns in the matrix V. If there
are n distinct eigenvalues, V will be an n×n matrix.
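The steps above can be sketched with NumPy: `np.linalg.eig` performs steps 1 and 2, and the reconstruction checks A = VΛV^(-1). The matrix here is an arbitrary example with distinct eigenvalues:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Steps 1-2: eigenvalues and eigenvectors (columns of V)
eigenvalues, V = np.linalg.eig(A)
Lam = np.diag(eigenvalues)  # step 3: diagonal matrix of eigenvalues

# Eigen decomposition: A = V Λ V^{-1}
A_rebuilt = V @ Lam @ np.linalg.inv(V)
```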
Probability Distribution: Marginal and Conditional Probability
• Probability is a fundamental concept in statistics that
helps us understand the likelihood of different events
occurring. Within probability theory, there are three key
types of probabilities: joint, marginal, and conditional
probabilities.
• Marginal Probability refers to the probability of a single
event occurring, without considering any other events.
• Joint Probability is the probability of two or more events
happening at the same time. It is the probability of the
intersection of these events.
• Conditional Probability deals with the probability of an
event occurring given that another event has already
occurred.
• Probability of an Event
• Probability of an event quantifies how likely it
is for that event to occur. It is a measure that
ranges from 0 to 1, where 0 indicates the
event cannot happen and 1 indicates the
event is certain to happen.
• The probability of an event A, denoted as P(A),
is defined as:
• P(A) = (Number of favorable outcomes) / (Total number of possible outcomes)
• Sample Space (S)
• The set of all possible outcomes of a random
experiment. For example, if you roll a die, the
sample space S is {1, 2, 3, 4, 5, 6}.
• Event (A)
• A subset of the sample space is called event in
probability.
• Event is the specific outcome or set of
outcomes that we are interested in. For
instance, getting an even number when rolling
a die is an event A = {2, 4, 6}.
• Joint Probability
• Joint probability is the probability of two (or
more) events happening simultaneously. It is
denoted as P(A∩B) for two events A and B,
which reads as the probability of both A and B
occurring.
• For two events A and B, the joint probability is
defined as:
• P(A ∩ B) = P(both A and B occur)
• Examples of Joint Probability
• Rolling Two Dice
• Let A be the event that the first die shows a 3.
• Let B be the event that the second die shows a
5.
• The joint probability P(A∩B) is the probability
that the first die shows a 3 and the second die
shows a 5. Since the outcomes are independent,
• P(A∩B) = P(A) ⋅ P(B).
• Given: P(A) = 1/6 and P(B) = 1/6, so
• ⇒ P(A∩B) = 1/6 × 1/6 = 1/36.
• Marginal Probability
• Marginal probability refers to the probability of an
event occurring, irrespective of the outcomes of other
variables. It is obtained by summing or integrating the
joint probabilities over all possible values of the other
variables.
• For two events A and B, the marginal probability of
event A is defined as:
• P(A) = Σ_B P(A, B)
• Where P(A, B) is the joint probability of both events A
and B occurring together. If the variables are
continuous, the summation is replaced by integration:
• P(A) = ∫ P(A, B) dB
• Examples of Marginal Probability
• Consider a table showing the joint probability
distribution of two discrete random variables
X and Y:
X/Y    Y = 1   Y = 2
X = 1   0.1     0.2
X = 2   0.3     0.4
To find the marginal probability of X = 1:
P(X = 1) = P(X = 1, Y = 1) + P(X = 1, Y = 2) = 0.1 + 0.2 = 0.3
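The same marginalisation can be written directly from the joint table:

```python
# Joint distribution P(X, Y) from the table above
joint = {
    (1, 1): 0.1, (1, 2): 0.2,
    (2, 1): 0.3, (2, 2): 0.4,
}

# Marginal P(X = 1): sum the joint probability over all values of Y
p_x1 = sum(p for (x, y), p in joint.items() if x == 1)  # 0.1 + 0.2 = 0.3
```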
• Conditional Probability
• Conditional probability is the probability of an event
occurring given that another event has already occurred.
It provides a way to update our predictions or beliefs
about the occurrence of an event based on new
information.
• The conditional probability of event A given event B is
denoted as P(A∣B) and is defined by the formula:
• P(A|B) = P(A ∩ B) / P(B)
• Where:
• P(A∩B) is the joint probability of both events A and B
occurring.
• P(B) is the probability of event B occurring.
• Examples of Conditional Probability
• Suppose we have a deck of 52 cards, and we
want to find the probability of drawing an Ace
given that we have drawn a red card.
• Let A be the event of drawing an Ace.
• Let B be the event of drawing a red card.
• There are 2 red Aces in a deck (Ace of hearts
and Ace of diamonds) and 26 red cards in total.
• P(A|B) = P(A ∩ B) / P(B) = (2/52) / (26/52) = 2/26 = 1/13
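The card computation can be verified with exact fractions:

```python
from fractions import Fraction

p_b = Fraction(26, 52)       # P(B): drawing a red card
p_a_and_b = Fraction(2, 52)  # P(A ∩ B): drawing a red Ace

p_a_given_b = p_a_and_b / p_b  # P(A | B) = 1/13
```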
Probability-Bayes rule
• Bayes’ Theorem is used to determine the
conditional probability of an event. It is used
to find the probability of an event, based on
prior knowledge of conditions that might be
related to that event.
• Bayes’ Theorem and Conditional Probability
• Bayes theorem (also known as the Bayes Rule or Bayes
Law) is used to determine the conditional probability
of event A when event B has already occurred.
• The general statement of Bayes’ theorem is: the conditional probability of an event A, given the occurrence of another event B, is equal to the product of the probability of B given A and the probability of A, divided by the probability of event B, i.e., P(A|B) = P(B|A)P(A) / P(B)
• where,
• P(A) and P(B) are the probabilities of events A and B
• P(A|B) is the probability of event A when event B
happens
• P(B|A) is the probability of event B when A happens
• For example, if we want to find the probability that a
white marble drawn at random came from the first
bag, given that a white marble has already been
drawn, and there are three bags each containing
some white and black marbles, then we can use
Bayes’ Theorem.
• Bayes Theorem Statement
• Bayes’ Theorem for n set of events is defined as,
• Let E1, E2,…, En be a set of events associated with
the sample space S, in which all the events E1, E2,
…, En have a non-zero probability of occurrence.
All the events E1, E2,…, En form a partition of S. Let
A be an event from space S for which we have to
find probability, then according to Bayes’
theorem,
• P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)
• for k = 1, 2, 3, …., n
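A sketch of this formula applied to the three-bag marble example mentioned earlier; the priors and likelihoods here are assumed numbers purely for illustration:

```python
# Assumed numbers: each bag equally likely, with different shares of white marbles
priors = [1/3, 1/3, 1/3]         # P(E_i)
likelihoods = [0.5, 0.25, 0.75]  # P(white | E_i), assumed for the sketch

# Denominator: total probability of drawing white, summed over all bags
p_white = sum(p * l for p, l in zip(priors, likelihoods))

# Bayes' theorem: P(E_1 | white)
posterior_bag1 = priors[0] * likelihoods[0] / p_white
```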
• Bayes Theorem Formula
• For any two events A and B, the formula for the Bayes theorem is given by:
• P(A|B) = P(B|A)P(A) / P(B)
• where,
• P(A) and P(B) are the probabilities of events A
and B also P(B) is never equal to zero.
• P(A|B) is the probability of event A when
event B happens
• P(B|A) is the probability of event B when A
happens
• Terms Related to Bayes Theorem
• After learning about Bayes theorem in detail, let us understand
some important terms related to the concepts we covered in
formula and derivation.
• Hypotheses: The events E1, E2,…, En in the sample space are called the hypotheses.
• Prior Probability: The prior probability is the initial probability of an event occurring before any new data is taken into account. P(Ei) is the prior probability of hypothesis Ei.
• Posterior Probability: The posterior probability is the updated probability of an event after considering new information. P(Ei|A) is the posterior probability of hypothesis Ei.
Numerical computation : Gradient Based
Optimization
In neural networks, we have the concept of loss functions, which tell us about the performance of our neural network, i.e., at the current instant, how well or poorly the model is performing. To train our network to perform better on unseen datasets, we use this loss: we aim to minimize it, as a lower loss implies that our model will perform better. Optimization, then, means minimizing (or maximizing) a mathematical expression. In this unit, we’ll explore and dive into the world of gradient-based optimizers for deep learning models.
• Role of an Optimizer
• As discussed in the introduction, Optimizers
update the parameters of neural networks,
such as weights and learning rate, to minimize
the loss function. Here, the loss function
guides the terrain, telling the optimizer if it is
moving in the right direction to reach the
bottom of the valley, the global minimum.
• The Intuition Behind Optimizers with an Example
• Let us imagine a climber hiking down the hill with
no direction. He doesn’t know the right way to
reach the valley in the hills, but he can
understand whether he is moving closer (going
downhill) or further away (uphill) from his final
destination. If he keeps taking steps in the correct direction, he will reach his aim, i.e., the valley.
• This is the intuition behind optimizers- to reach a
global minimum concerning the loss function.
• Instances of Gradient-Based Optimizers
• Different instances of Gradient descent
Optimizers are as follows:
• Batch Gradient Descent or Vanilla Gradient
Descent or Gradient Descent (GD)
• Stochastic Gradient Descent (SGD)
• Mini batch Gradient Descent (MB-GD)
• Batch Gradient Descent
• Gradient descent is an optimization algorithm
used when training deep learning models. It’s
based on a convex function and updates its
parameters iteratively to minimize a given
function to its local minimum.
• The gradient descent update rule is:
• ϴj := ϴj − α · ∂J(ϴ)/∂ϴj
• In the above formula,
• α is the learning rate,
• J is the cost function, and
• ϴ is the parameter to be updated.
• As you can see, the gradient represents the partial derivative
of J(cost function) with respect to ϴj
• Note that as we get closer to the global minimum, the slope or gradient of the curve becomes less and less steep, which results in a smaller value of the derivative, which in turn reduces the step size automatically.
• It is the most basic but most used optimizer. It directly uses the derivative of the loss function and the learning rate to reduce the loss and tries to reach the global minimum; it is the basis of backpropagation training in neural networks.
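The update rule can be sketched on a one-parameter cost function; J(θ) = (θ − 3)² is a made-up example whose minimum is at θ = 3:

```python
# Gradient descent on J(theta) = (theta - 3)^2
def grad_J(theta):
    return 2 * (theta - 3)  # dJ/dtheta

theta = 0.0  # initial parameter
alpha = 0.1  # learning rate

for _ in range(200):
    theta -= alpha * grad_J(theta)  # theta := theta - alpha * dJ/dtheta
```

Note how the step alpha * grad_J(theta) shrinks automatically as theta approaches 3, because the gradient itself shrinks.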
• Role of Gradient
• In general, a Gradient represents the slope of
the equation, while gradients are partial
derivatives. They describe the change
reflected in the loss function with respect to
the small change in the function’s parameters.
This slight change in loss functions can tell us
about the next step to reduce the loss
function’s output.
• Role of Learning Rate
• The learning rate determines the size of the steps our optimization algorithm takes toward the global minimum. To ensure that the gradient descent algorithm converges to the minimum, we must set the learning rate to an appropriate value that is neither too low nor too high.
• Taking very large steps, i.e., a large learning rate
value, may skip the global minima, and the model
will never reach the optimal value for the loss
function. On the contrary, taking very small steps,
i.e., a small learning rate value, will take forever to
converge.
The gradient represents the direction of
increase. However, we aim to find the minimum
point in the valley, so we have to go in the
opposite direction of the gradient. Therefore,
we update parameters in the negative gradient
direction to minimize the loss.
• Advantages of Batch Gradient Descent
• Efficient Computation: By processing the entire
dataset in one go, batch gradient descent efficiently
computes gradients, especially with matrix
operations, optimizing performance on large datasets.
• Simple Implementation: The straightforward
approach of calculating gradients on all data points
makes batch gradient descent easy to code, especially
with frameworks like TensorFlow and PyTorch.
• Enhanced Convergence Stability: With gradients
computed over the full dataset, batch gradient
descent offers a smoother path to convergence,
reducing fluctuations in updates and aiding in reliable
model training.
Stochastic Gradient Descent
• To overcome some of the disadvantages of the GD
algorithm, the SGD algorithm comes into the
picture as an extension of the Gradient Descent.
One of the disadvantages of the gradient descent algorithm is that it requires a lot of memory to load the entire dataset at once to compute the derivative of the loss function. So, in the SGD algorithm, we compute the derivative by taking one data point at a time, i.e., we update the model’s parameters more frequently. The model parameters are therefore updated after the loss computation on each training example.
So, let’s have a dataset that contains 1000 rows,
and when we apply SGD, it will update the
model parameters 1000 times in one complete
cycle of a dataset instead of one time as in
Gradient Descent.
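A minimal sketch of SGD fitting a one-weight model y = w·x, with one parameter update per training example; the toy dataset (true w = 2) is assumed for illustration:

```python
import random

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy dataset for y = 2x
w, alpha = 0.0, 0.05
random.seed(0)

for epoch in range(100):
    random.shuffle(data)            # visit examples in random order
    for x, y in data:               # one parameter update per example
        grad = 2 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
        w -= alpha * grad
```

With 3 rows, each epoch performs 3 updates, whereas batch gradient descent would perform only one.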
• Advantages of Stochastic Gradient Descent
• Faster Convergence: With frequent updates to
model parameters, SGD converges more quickly
than other methods, ideal for large datasets.
• Lower Memory Usage: SGD processes one data
point at a time, eliminating the need to store
the entire loss function and saving memory.
• Better Minima Exploration: The random
updates allow SGD to escape local minima
potentially, increasing the chance of reaching a
better global minimum.
• Mini-Batch Gradient Descent
• To overcome the problem of the large time complexity of the SGD algorithm, the MB-GD algorithm comes into the picture as an extension of SGD. Not only that, it also overcomes the memory problem of batch gradient descent. It is therefore considered the best among all the variations of gradient descent algorithms. The MB-GD algorithm takes a batch of points, i.e., a subset of the dataset, to compute the derivative.
• It is observed that the derivative of the loss function for MB-GD is almost the same as the derivative of the loss function for GD after several iterations. However, the number of iterations needed to achieve the minimum is larger for MB-GD compared to GD, and the total computation across those iterations is correspondingly larger.
• The weight update therefore depends on the derivative of the loss for a batch of points. The updates in the case of MB-GD are noisier than in GD because the batch derivative does not always point towards the minimum.
• It updates the model parameters after every batch: the algorithm divides the dataset into batches and performs one parameter update per batch.
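A sketch of MB-GD on the same kind of one-weight toy problem (assumed data, true w = 2), with the dataset reshuffled and split into mini-batches each epoch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(1.0, 9.0)  # 8 toy samples
Y = 2.0 * X              # targets for the true weight w = 2
w, alpha, batch_size = 0.0, 0.01, 4

for epoch in range(200):
    order = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = order[start:start + batch_size]  # indices of one mini-batch
        grad = np.mean(2 * (w * X[b] - Y[b]) * X[b])  # batch-averaged gradient
        w -= alpha * grad
```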
• Advantages of Mini Batch Gradient Descent
• Frequent, Stable Updates: Mini-batch
gradient descent offers frequent updates to
model parameters while lowering variance,
balancing speed and stability.
• Moderate Memory Requirement: It balances
memory usage, needing only a medium
amount to store mini-batches, making it
feasible for large datasets.
Constrained Optimization
• What is Constrained Optimization?
Constrained optimization is a technique used to find the
optimal solution to a problem subject to a set of
constraints. The goal is to maximize or minimize an
objective function while satisfying a set of constraints.
The constraints can be either equality or inequality
constraints, and they limit the feasible region of the
problem. Constrained optimization problems can be
solved using various methods, including linear
programming, quadratic programming, and nonlinear
programming
How Can Constrained Optimization Be Used?
• Constrained optimization can be used in various applications, such
as:
• Portfolio Optimization: Constrained optimization can be used to
optimize investment portfolios subject to constraints such as risk
and return.
• Process Optimization: Constrained optimization can be used to
optimize manufacturing processes subject to constraints such as
time and resources.
• Supply Chain Optimization: Constrained optimization can be used
to optimize supply chain operations subject to constraints such as
inventory and transportation.
• Machine Learning: Constrained optimization can be used in
machine learning to optimize model parameters subject to
constraints such as regularization
Benefits of Constrained Optimization
• Constrained optimization has various benefits,
including:
• Efficient Resource Allocation: Constrained
optimization can be used to allocate resources
efficiently, maximizing the desired outcome while
satisfying constraints.
• Improved Decision-Making: Constrained
optimization can help decision-makers make
informed decisions by providing them with the
optimal solution to a problem subject to constraints.
• Reduced Costs: Constrained optimization can help
reduce costs by optimizing processes and operations
subject to constraints.
UNIT - II
• The process of receiving an input to produce some kind of output to make some kind of prediction is known as "feed forward." The feed-forward neural network is the core of many other important neural networks, such as the convolutional neural network.
• In the feed-forward neural network, there are no feedback loops or connections in the network. There is simply an input layer, a hidden layer, and an output layer.
There can be multiple hidden layers, depending on what kind of data you are dealing with. The number of hidden layers is known as the depth of the neural network. A deeper neural network can learn more complex functions. The input layer first provides the neural network with data, and the output layer then makes predictions on that data based on a series of functions. The ReLU function is the most commonly used activation function in deep neural networks.
To gain a solid understanding of the feed-
forward process, let's see this mathematically.
1) The first input is fed to the network, represented as the vector [x1, x2, 1], where 1 is the bias input.
2) Each input is multiplied by weight with
respect to the first and second model to obtain
their probability of being in the positive region
in each model.
So, we will multiply our inputs by a matrix of
weight using matrix multiplication.
3) After that, we will take the sigmoid of our scores, which gives us the probability of the point being in the positive region in both models.
4)We multiply the probability which we have obtained from
the previous step with the second set of weights. We always
include a bias of one whenever taking a combination of
inputs.
And as we know to obtain the probability of the point being in the
positive region of this model, we take the sigmoid and thus
producing our final output in a feed-forward process.
Let’s take the neural network we had previously, with the following linear models and the hidden layer, which combine to form the non-linear model in the output layer.
So, what we will do is use our non-linear model to produce an output that describes the probability of the point being in the positive region. The point is represented by (2, 2). Along with the bias, we represent the input as [2, 2, 1].
Recall the first linear model in the hidden layer and the equation that defined it. In the first layer, to obtain the linear combination, the inputs are multiplied by -4 and -1, and the bias value is multiplied by twelve. In our second model, the inputs are multiplied by -1/5 and 1, and the bias is multiplied by three to obtain the linear combination of that same point.
Now, to obtain the probability that the point is in the positive region relative to both models, we apply the sigmoid to both scores.
• The second layer contains the weights that dictate how the linear models in the first layer are combined to obtain the non-linear model in the second layer. The weights are 1.5 and 1, with a bias value of 0.5.
• Now, we have to multiply our probabilities
from the first layer with the second set of
weights as
Now, we will take the sigmoid of our final score
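The whole worked example can be put together as code, using the weights stated above (-4, -1, bias 12 for the first model; -1/5, 1, bias 3 for the second; 1.5, 1, bias 0.5 for the output layer) and the input point (2, 2):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2 = 2.0, 2.0  # the input point (2, 2); bias inputs are 1

# Hidden layer: the two linear models, then sigmoid
h1 = sigmoid(-4.0 * x1 - 1.0 * x2 + 12.0)  # sigmoid(2)
h2 = sigmoid(-0.2 * x1 + 1.0 * x2 + 3.0)   # sigmoid(4.6)

# Output layer: combine the two probabilities, then sigmoid
output = sigmoid(1.5 * h1 + 1.0 * h2 + 0.5)
```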
Regularization for Deep Learning
• Regularization is a technique used to address overfitting by modifying the model’s training process or directly changing the architecture of the model. The following are the commonly used regularization techniques:
• L2 regularization
• L1 regularization
• Dropout regularization
L2 regularization
L2 regularization is also called ridge regression. In this type of regularization, the squared magnitude of the coefficients or weights, multiplied by a regularizer term, is added to the loss or cost function. L2 regularization can be represented by the following equation:
Loss = Loss(data) + λ Σ wi²
• Lambda is the hyperparameter that is tuned to
prevent overfitting i.e. penalize the
insignificant weights by forcing them to be
small but not zero.
• L2 regularization works best when all the
weights are roughly of the same size, i.e.,
input features are of the same range.
• This technique also helps the model to learn
more complex patterns from data without
overfitting easily.
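A sketch of the L2 penalty term; lambda and the weights below are arbitrary illustrative values:

```python
import numpy as np

def l2_penalty(weights, lam):
    # lambda times the sum of squared weights, added to the data loss
    return lam * np.sum(weights ** 2)

w = np.array([0.5, -1.0, 2.0])
data_loss = 0.8                               # hypothetical loss before regularization
total_loss = data_loss + l2_penalty(w, 0.01)  # 0.8 + 0.01 * 5.25
```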
• L1 regularization
L1 regularization is also referred to as lasso regression. In this type of regularization, the absolute value of the magnitude of the coefficients or weights, multiplied by a regularizer term, is added to the loss or cost function. It can be represented by the following equation:
Loss = Loss(data) + λ Σ |wi|
• A fraction of the sum of absolute values of weights to the
loss function is added in the L1 regularization. In this way,
you will be able to eliminate some coefficients with
lesser values by pushing those values towards 0. You can
observe the following by using L1 regularization:
• Since the L1 regularization adds an absolute value as a
penalty to the cost function, the feature selection will be
done by retaining only some important features and
eliminating the lower or unimportant features.
• This technique is also robust to outliers, i.e., the model is less strongly affected by extreme values in the dataset.
• This technique will not be able to learn complex patterns
from the input data.
• Dropout regularization
• Dropout regularization is the technique in which some of the neurons are randomly disabled during training so that the model can extract more useful, robust features from the data. This prevents overfitting. You can see dropout regularization in the following diagram:
• In figure (a), the neural network is fully connected. If all the
neurons are trained with the entire training dataset, some
neurons might memorize the patterns occurring in training
data. This leads to overfitting since the model is not
generalizing well.
• In figure (b), the neural network is sparsely connected, i.e.,
only some neurons are active during the model training. This
forces the neurons to extract robust features/patterns from
training data to prevent overfitting.
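A sketch of one common variant (inverted dropout) during training: each activation is zeroed with probability p, and the survivors are rescaled so the expected activation is unchanged:

```python
import numpy as np

def dropout(activations, p, rng):
    mask = rng.random(activations.shape) >= p  # keep each unit with prob. 1 - p
    return activations * mask / (1.0 - p)      # rescale the surviving activations

rng = np.random.default_rng(0)
a = np.ones(10_000)            # pretend layer activations
dropped = dropout(a, 0.5, rng)
```

At test time no units are dropped; the rescaling during training is what keeps the two phases consistent.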
Optimization Rule in Deep Neural Networks
• There are various optimization techniques to
change model weights and learning rates, like
Gradient Descent, Stochastic Gradient Descent,
Stochastic Gradient descent with momentum,
Mini-Batch Gradient Descent, AdaGrad,
RMSProp, AdaDelta, and Adam. These
optimization techniques play a critical role in the
training of neural networks, as they help improve
the model by adjusting its parameters to
minimize the loss of function value. Choosing the
best optimizer depends on the application.
• Before we proceed, it's essential to acquaint
yourself with a few terms
• An epoch is one complete pass of the algorithm over the entire training dataset.
• Batch size refers to the number of samples used for one update of the model parameters.
• A sample is a single record of data in a dataset.
• Learning Rate is a parameter determining the
scale of model weight updates
• Weights and Bias are learnable parameters in a
model that regulate the signal between two
neurons.
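The terms above fit together as in this illustrative sketch (a toy 1-D regression with hypothetical hyperparameters): each epoch is one pass over the dataset, each batch produces one parameter update, and the learning rate scales that update.

```python
import numpy as np

# Illustrative sketch of epochs, batches, and the basic SGD update
# w <- w - learning_rate * gradient, on made-up 1-D data with true weight 3.
rng = np.random.default_rng(0)
X = rng.normal(size=64)
y = 3.0 * X                        # toy problem: the true weight is 3

w, lr, batch_size = 0.0, 0.1, 16
for epoch in range(50):                        # one epoch = one full pass
    for i in range(0, len(X), batch_size):     # each slice is one batch
        xb, yb = X[i:i+batch_size], y[i:i+batch_size]
        grad = 2 * np.mean((w * xb - yb) * xb)  # gradient of squared loss
        w -= lr * grad                          # update scaled by learning rate

print(round(w, 3))   # converges toward the true weight 3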
UNIT - III
CONVOLUTIONAL NEURAL NETWORKS
Artificial Networks: Convolutional Neural
Networks (CNN)
• A Convolutional Neural Network (CNN) is a type of Deep Learning neural network
architecture commonly used in Computer Vision. Computer vision is a field of Artificial
Intelligence that enables a computer to understand and interpret images and other visual data.
• Neural Networks are used on various kinds of data, like images, audio, and text. Different types
of Neural Networks are used for different purposes: for example, for predicting a
sequence of words we use Recurrent Neural Networks (more precisely, an LSTM), while for
image classification we use Convolutional Neural Networks.
Neural Networks: Layers and Functionality
• In a regular Neural Network there are three types of layers:
• Input Layers: It’s the layer in which we give input to our model. The
number of neurons in this layer is equal to the total number of features in
our data (number of pixels in the case of an image).
• Hidden Layer: The input from the Input layer is then fed into the hidden
layer. There can be many hidden layers depending on the model and data
size. Each hidden layer can have a different number of neurons, generally
greater than the number of features. The output of each layer is computed
by matrix multiplication of the previous layer's output with that layer's
learnable weights, followed by the addition of learnable biases and an
activation function, which makes the network nonlinear.
• Output Layer: The output from the hidden layer is then fed into a logistic
function like sigmoid or softmax which converts the output of each class
into the probability score of each class.
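The three layers described above can be sketched as follows (the shapes, random initialization, and class count are illustrative, not from the source):

```python
import numpy as np

# Sketch of the regular network described above: the hidden layer is a
# matrix multiply plus bias plus activation, and the output layer applies
# softmax to turn scores into class probabilities. All sizes are made up.
rng = np.random.default_rng(1)
x = rng.normal(size=(1, 4))          # input layer: 4 features
W1 = rng.normal(size=(4, 8))         # learnable weights, 8 hidden neurons
b1 = np.zeros(8)                     # learnable biases
hidden = np.maximum(0, x @ W1 + b1)  # ReLU makes the network nonlinear

W2 = rng.normal(size=(8, 3))         # output layer: 3 classes
logits = hidden @ W2
probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> probabilities

print(probs.shape, round(float(probs.sum()), 6))   # (1, 3) 1.0
```

The softmax output sums to 1, so each entry can be read directly as the probability score of its class, as the Output Layer bullet states.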
Convolution Neural Network
• A Convolutional Neural Network (CNN) is an extended version of the artificial
neural network (ANN), predominantly used to extract features from grid-like
data. Examples are visual datasets like images or videos, where spatial data
patterns play an extensive role.
• How Convolutional Layers Work
• Convolutional Neural Networks (convnets) are neural networks that share
their parameters. Imagine you have an image: it can be represented as a
cuboid with a width and height (the spatial dimensions of the image) and a
depth (the channels, as images generally have red, green, and blue channels).
• Mathematical Overview of Convolution
• Convolution layers consist of a set of learnable filters (or kernels)
having small widths and heights and the same depth as that of input
volume (3 if the input layer is image input).
• For example, if we have to run convolution on an image with
dimensions 34 x 34 x 3, the possible size of the filters is a x a x 3, where
'a' can be 3, 5, or 7, but small compared to the
image dimension.
• During the forward pass, we slide each filter across the whole input
volume step by step where each step is called stride (which can have
a value of 2, 3, or even 4 for high-dimensional images) and compute
the dot product between the kernel weights and patch from input
volume.
• As we slide our filters we get a 2-D output for each filter; stacking
them together, we get an output volume with a
depth equal to the number of filters. The network learns all the
filters.
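The sliding-filter forward pass described above can be sketched as follows (sizes match the 34 x 34 x 3 example with a = 5; the choice of four filters and stride 1 is an illustrative assumption):

```python
import numpy as np

# Sketch of the convolution forward pass: slide each filter over the input
# volume and take the dot product with each patch. Random values are used
# purely to demonstrate the shapes.
rng = np.random.default_rng(0)
image = rng.normal(size=(34, 34, 3))        # H x W x channels
filters = rng.normal(size=(4, 5, 5, 3))     # 4 filters of size 5x5x3 (a = 5)

stride = 1
out_h = (34 - 5) // stride + 1              # 30 valid positions per axis
out = np.zeros((out_h, out_h, 4))           # depth = number of filters
for f in range(4):
    for i in range(out_h):
        for j in range(out_h):
            patch = image[i:i+5, j:j+5, :]
            out[i, j, f] = np.sum(patch * filters[f])   # dot product

print(out.shape)   # (30, 30, 4)
```

Each filter produces one 2-D feature map, and stacking the four maps gives the output volume whose depth equals the number of filters, exactly as the bullet above states.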
• Layers Used to Build ConvNets
• A complete Convolution Neural Networks architecture is also known as
covnets. A covnets is a sequence of layers, and every layer transforms one
volume to another through a differentiable function.
Types of layers
Let's take an example by running a convnet on an image of dimension 32 x
32 x 3.
• Input Layers: It’s the layer in which we give input to our model. In CNN,
Generally, the input will be an image or a sequence of images. This layer
holds the raw input of the image with width 32, height 32, and depth 3.
• Convolutional Layers: This is the layer used to extract features
from the input dataset. It applies a set of learnable filters, known as
kernels, to the input images. The filters/kernels are small matrices, usually
of shape 2x2, 3x3, or 5x5. Each kernel slides over the input image and computes
the dot product between the kernel weights and the corresponding input
image patch. The output of this layer is referred to as feature maps. Suppose
we use a total of 12 filters for this layer (with padding to preserve the
spatial size); we'll get an output volume of dimension 32 x 32 x 12.
• Activation Layer: By adding an activation function to the output of
the preceding layer, activation layers add nonlinearity to the
network. It applies an element-wise activation function to the
output of the convolution layer. Some common activation
functions are ReLU: max(0, x), Tanh, Leaky ReLU, etc. The volume
remains unchanged, hence the output volume will have dimensions 32
x 32 x 12.
• Pooling Layer: This layer is periodically inserted in the convnet.
Its main function is to reduce the size of the volume, which makes
computation fast, reduces memory use, and also helps prevent overfitting.
Two common types of pooling layers are max pooling and average
pooling. If we use max pooling with 2 x 2 filters and stride 2, the
resultant volume will be of dimension 16 x 16 x 12.
• Flattening: The resulting feature maps are
flattened into a one-dimensional vector after
the convolution and pooling layers so they can
be passed into a fully connected layer for
classification or regression.
• Fully Connected Layers: This layer takes the input
from the previous layer and computes the
final classification or regression output.
• Output Layer: The output from the fully
connected layers is then fed into a logistic
function for classification tasks, like sigmoid or
softmax, which converts the output for each class
into a probability score for that class.
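The pooling step above (2 x 2 max pooling with stride 2 shrinking a 32 x 32 x 12 volume to 16 x 16 x 12) can be sketched as follows; the random volume is purely illustrative:

```python
import numpy as np

# Sketch of 2x2 max pooling with stride 2: split each spatial axis into
# non-overlapping 2x2 blocks and keep the maximum of each block.
rng = np.random.default_rng(3)
volume = np.asarray(rng.normal(size=(32, 32, 12)))

# Reshape (32, 32, 12) -> (16, 2, 16, 2, 12), then max over the 2x2 blocks.
pooled = volume.reshape(16, 2, 16, 2, 12).max(axis=(1, 3))

print(pooled.shape)   # (16, 16, 12)
```

The channel depth is untouched; only the spatial size is halved, which is why the computation after pooling is faster and uses less memory.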
• Advantages of CNNs:
• Good at detecting patterns and features in
images, videos, and audio signals.
• Robust, to a degree, to translation, rotation, and
scaling.
• End-to-end training, no need for manual feature
extraction.
• Can handle large amounts of data and achieve
high accuracy.
Data Types
• In convolutional neural networks (CNNs), "1D", "2D", and "3D" images
refer to the dimensionality of the input data, indicating how many spatial
dimensions the data has, with 1D representing data with a single
dimension (like a time series), 2D representing data with two dimensions
(like a standard image), and 3D representing data with three dimensions
(like a video sequence) where you have width, height, and depth
information.
Breakdown:
• 1D Images:
• Used for data that has only one spatial dimension, like a signal over time
(e.g., audio waveforms, stock market data).
• Convolutional filters in a 1D CNN slide along this single dimension to
extract features.
• 2D Images:
• Most commonly used for standard images where you
have both width and height information.
• Filters in a 2D CNN slide across both the width and
height dimensions to identify patterns within an image.
• 3D Images:
• Used for data with three spatial dimensions, like
volumetric medical scans (CT scans) or video sequences
where each frame is considered a "slice".
• 3D convolutional filters move across all three
dimensions (width, height, and depth) to extract
features.
• Key points to remember:
• Data type determines dimension:
• The type of data you are working with dictates
which type of convolution (1D, 2D, or 3D) is
most appropriate.
• Applications:
• 1D CNNs: Time series analysis, natural
language processing
• 2D CNNs: Image recognition, object detection
• 3D CNNs: Video analysis, medical imaging
Efficient Convolution Algorithms:
1) Output-side algorithm:
We begin by introducing the most naive and straightforward algorithm that
implements convolution. It is called the "output-side algorithm"
because it computes the individual samples of the output time series one by
one.
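The output-side algorithm can be sketched in Python as follows (the function and variable names are illustrative; the source itself gives no code):

```python
# Minimal sketch of the output-side algorithm: each output sample y[i] is
# computed one by one as a sum over the kernel taps that overlap it.
def conv_output_side(x, h):
    n, m = len(x), len(h)
    y = [0] * (n + m - 1)             # full convolution length
    for i in range(len(y)):           # compute each output sample in turn
        for j in range(m):
            if 0 <= i - j < n:        # skip taps that fall outside x
                y[i] += h[j] * x[i - j]
    return y

print(conv_output_side([1, 2, 3], [1, 1]))   # [1, 3, 5, 3]
```

The two nested loops make the cost proportional to the product of the two signal lengths, which is the quadratic behavior the faster algorithms below improve on.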
Input side algorithm
• In the previous example, the input is a nine point signal and the impulse
response is a four point signal. In our next example, shown in Fig. 6-7, we
reverse the situation by making the input a four point signal and the
impulse response a nine point signal. The same two waveforms are used;
they are just swapped. As shown by the output signal components, the four
input samples result in four shifted and scaled versions of the nine point
impulse response. Just as before, leading and trailing zeros are added as
place holders.
• But wait just one moment! The output signal in Fig. 6-7 is identical to
the output signal in Fig. 6-5. This isn't a mistake, but an important
property. Convolution is commutative: a[n]*b[n] = b[n]*a[n]. The
mathematics does not care which is the input signal and which is the
impulse response, only that two signals are convolved with each other.
Although the mathematics may allow it, exchanging the two signals
has no physical meaning in system theory. The input signal and
impulse response are two totally different things and exchanging them
doesn't make sense. What the commutative property provides is
a mathematical tool for manipulating equations to achieve various
results.
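The commutative property a[n]*b[n] = b[n]*a[n] discussed above is easy to check numerically (the random nine point and four point signals are illustrative):

```python
import numpy as np

# Sketch of the commutative property of convolution: convolving a nine
# point signal with a four point signal gives the same result in either
# order, matching the identical outputs of Fig. 6-5 and Fig. 6-7.
rng = np.random.default_rng(7)
signal = rng.normal(size=9)    # nine point signal
kernel = rng.normal(size=4)    # four point signal

a = np.convolve(signal, kernel)
b = np.convolve(kernel, signal)

print(np.allclose(a, b), len(a))   # True, 12 output samples (9 + 4 - 1)
```

The mathematics does not distinguish the two operands, even though in system theory one is the input and the other the impulse response.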
Karatsuba (adapted) algorithm
The Karatsuba algorithm achieves fast multiplication of two
numbers by reducing the number of elementary operations
required to perform a traditional long multiplication. The long
multiplication algorithm is similar to the Input-side trivial
algorithm discussed in the previous section, with time series
being represented by integer numbers and samples by digits. It
also involves an extra step of carry shifting, which takes linear
time in terms of the length of the output. Thus, assuming that
both input factors of the multiplication have at most n digits, the
traditional long multiplication algorithm belongs to O(n² + 2n)
≈ O(n²) as well.
In 1960, Andrey Kolmogorov conjectured that the quadratic complexity was
asymptotically optimal for the problem of multiplication, but was soon
proven wrong by Anatoly Karatsuba, then a 23-year-old student.
Karatsuba resorted to the following trick:
given two n-digit numbers x and y in base B, he rewrote them as follows:
x = x1·B^m + x0,  y = y1·B^m + y0
He then expressed their product x·y as:
x·y = (x1·B^m + x0)(y1·B^m + y0) = z2·B^2m + z1·B^m + z0
where z2 = x1·y1,  z1 = x1·y0 + x0·y1,  z0 = x0·y0
It might seem that x·y can only be computed using 4 multiplications, this
being the number of multiplications required to calculate the zi.
However, Karatsuba observed that they can be computed with just 3
multiplications, at the cost of a few extra additions (subtractions), as follows:
z1 = x1·y0 + x0·y1 = (x0 − x1)·(y1 − y0) + z0 + z2
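Karatsuba's trick can be sketched in Python as follows (base B = 10 for readability; the function name and the simple base case are illustrative choices):

```python
# Minimal sketch of Karatsuba multiplication: three recursive
# multiplications (z2, z0, and one for the cross term) instead of four.
def karatsuba(x, y):
    if x < 10 or y < 10:                 # single digit (or negative part):
        return x * y                     # multiply directly
    m = max(len(str(x)), len(str(y))) // 2
    B = 10 ** m
    x1, x0 = divmod(x, B)                # x = x1*B^m + x0
    y1, y0 = divmod(y, B)                # y = y1*B^m + y0
    z2 = karatsuba(x1, y1)
    z0 = karatsuba(x0, y0)
    # cross term with a single extra multiplication:
    z1 = karatsuba(x0 - x1, y1 - y0) + z0 + z2
    return z2 * B * B + z1 * B + z0

print(karatsuba(1234, 5678))   # 7006652
```

The recursion yields T(n) = 3·T(n/2) + O(n), i.e. O(n^log2(3)) ≈ O(n^1.585), beating the O(n²) of long multiplication.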
Depth-wise Separable Convolutional Neural Networks
• Convolution is a very important mathematical operation in
artificial neural networks (ANNs).
Convolutional neural networks (CNNs) can be used to learn
features as well as classify data with the help of image frames.
There are many types of CNNs. One class of CNNs is depth-wise
separable convolutional neural networks.
• These types of CNNs are widely used for the following two
reasons:
• They have fewer parameters to adjust compared to
standard CNNs, which reduces overfitting
• They are computationally cheaper because of fewer computations,
which makes them suitable for mobile vision applications
Understanding Normal Convolution operation
Suppose there is input data of size Df x Df x
M, where Df x Df is the image size and M is
the number of channels (3 for an RGB image).
Suppose there are N filters/kernels of size Dk x
Dk x M. If a normal convolution operation is
done, the output size will be Dp x Dp x N.
• The number of multiplications in 1 convolution
operation = size of filter = Dk x Dk x M
• Since there are N filters and each filter slides
vertically and horizontally Dp times,
• the total number of multiplications becomes N x
Dp x Dp x (multiplications per convolution)
• So for the normal convolution operation,
• Total no. of multiplications = N x Dp² x Dk² x M
• Depth-Wise Separable Convolutions
• Now look at depth-wise separable convolutions. This
process is broken down into 2 operations
• Depth-wise convolutions
• Point-wise convolutions
• Depth-wise convolution: In the depth-wise operation,
convolution is applied to a single channel at a time,
unlike standard CNNs in which it is done for all M
channels. So here the filters/kernels will be of size Dk x
Dk x 1. Given there are M channels in the input data,
M such filters are required.
• The output will be of size Dp x Dp x M.
Point-wise convolution: In the point-wise
operation, a 1 x 1 convolution operation is
applied on the M channels. So the filter size for
this operation will be 1 x 1 x M. Say we use N
such filters; the output size becomes Dp x Dp x
N.
• Cost of this operation:
• A single point-wise convolution operation requires 1 x 1 x M
multiplications.
• Since each filter slides Dp x Dp times and there are N filters,
• the total number of multiplications is equal to N x Dp x Dp x M.
• So for the point-wise convolution operation,
• Total no. of multiplications = M x Dp² x N
• Therefore, for the overall operation:
• Total multiplications = depth-wise conv. multiplications + point-wise
conv. multiplications
• Total multiplications = M x Dk² x Dp² + M x Dp² x N = M x Dp² x (Dk² + N)
• So for the depth-wise separable convolution operation,
• Total no. of multiplications = M x Dp² x (Dk² + N)
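Plugging illustrative sizes into the multiplication counts derived above makes the saving concrete (M, N, Dk, Dp are hypothetical values, not from the source):

```python
# Sketch of the multiplication counts derived above, with illustrative
# sizes: M input channels, N filters, Dk x Dk kernels, Dp x Dp output.
M, N, Dk, Dp = 3, 64, 3, 32

standard = N * Dp**2 * Dk**2 * M      # normal convolution
separable = M * Dp**2 * (Dk**2 + N)   # depth-wise + point-wise

# ~7.9x fewer multiplications for the separable version at these sizes
print(standard, separable, round(standard / separable, 1))
```

The ratio standard/separable simplifies to N·Dk² / (Dk² + N), so the saving grows with the number of filters N, which is why the technique suits mobile vision applications.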
The Fast Fourier Convolution Network: A Fast
and Efficient Approach for Convolutional
Neural Networks
• The Fast Fourier Transform (FFT) is an efficient
algorithm for computing the discrete Fourier
transform (DFT) of a sequence.
• The DFT is a mathematical transformation
that decomposes a signal into its frequency
components, which can be used to analyze the
spectral content of the signal.
• The FFT is a fast implementation of the DFT
that can compute the DFT of a sequence in
O(n log n) time, compared to the O(n²) time
required by the naive algorithm.
• The FFT works by expressing the DFT of a
sequence as a sum of complex exponential
functions.
• These complex exponentials can be computed
efficiently using a divide-and-conquer approach,
which is what makes the FFT algorithm fast.
• The FFT is widely used in signal processing and
has many applications, including filtering, spectral
analysis, and image processing.
• It is also used in many scientific and engineering
fields, such as meteorology, medical imaging, and
geophysics.
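The convolution use case above can be sketched with NumPy's FFT routines (the signals are illustrative): multiplying the zero-padded spectra and transforming back reproduces direct convolution.

```python
import numpy as np

# Sketch of FFT-based convolution: zero-pad both signals to the full
# linear-convolution length, multiply in the frequency domain, and
# transform back. The result matches direct convolution up to rounding.
x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, 0.0, -1.0])

n = len(x) + len(h) - 1          # full linear-convolution length
X = np.fft.rfft(x, n)            # zero-pad to avoid circular wrap-around
H = np.fft.rfft(h, n)
y_fft = np.fft.irfft(X * H, n)   # pointwise product <-> convolution

y_direct = np.convolve(x, h)
print(np.allclose(y_fft, y_direct))   # True
```

For long signals the O(n log n) FFT route is dramatically cheaper than the O(n²) direct sum, which is why FFT convolution is standard in signal and image processing.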