AI Unit II Lec Notes Deep Learning

This document provides an overview of feedforward networks, particularly focusing on multilayer perceptrons (MLPs) and their significance in machine learning. It discusses the architecture of MLPs, the backpropagation algorithm for training neural networks, and various types of gradient descent optimization methods. Additionally, it covers the Kohonen Self-Organizing Feature Map as a competitive learning model in neural networks.

Uploaded by

Praneeth B

UNIT II

Feedforward Networks

P Jyothi,Asst. Prof., CSE Dept.

Introduction

➢ Deep feedforward networks are also often called feedforward neural networks or multilayer perceptrons (MLPs).
➢ These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y.
➢ There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks.

Multilayer Perceptron

➢ Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network. Feedforward networks are also a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.

➢ Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together.
➢ For example, we might have three functions $f^{(1)}$, $f^{(2)}$, and $f^{(3)}$ connected in a chain to form $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$. These chain structures are the most commonly used structures of neural networks. In this case, $f^{(1)}$ is called the first layer of the network, $f^{(2)}$ is called the second layer, and so on.

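The chain of three functions can be sketched in a few lines of Python. The particular layer functions below (a tanh layer, a ReLU layer, and a scaling layer) are illustrative assumptions and are not taken from the notes; only the composition pattern matters.

```python
import numpy as np

def f1(x):
    """First layer (hypothetical): affine map followed by tanh."""
    return np.tanh(2.0 * x + 1.0)

def f2(h):
    """Second layer (hypothetical): shift followed by ReLU."""
    return np.maximum(0.0, h - 0.5)

def f3(h):
    """Third layer (hypothetical): plain scaling, acting as the output layer."""
    return 3.0 * h

def f(x):
    """The network is the composition f3(f2(f1(x)))."""
    return f3(f2(f1(x)))

layers = [f1, f2, f3]
depth = len(layers)          # length of the chain
y = f(np.array([0.0, 1.0]))  # information flows x -> f1 -> f2 -> f3 -> y
```

Here `depth` is the length of the chain, and `y` is obtained purely by passing `x` forward through each function in turn.
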
➢ The overall length of the chain gives the depth of the model. It is from this terminology that the name “deep learning” arises. The final layer of a feedforward network is called the output layer.
➢ Feedforward networks introduce the concept of a hidden layer, and this requires us to choose the activation functions that will be used to compute the hidden layer values.

➢ We must also design the architecture of the network, including how many layers the network should contain, how these layers should be connected to each other, and how many units should be in each layer.
➢ Learning in deep neural networks requires computing the gradients of complicated functions. We present the back-propagation algorithm and its modern generalizations, which can be used to efficiently compute these gradients.

Multilayer Perceptron

Perceptrons resemble biological neurons in structure: a perceptron takes inputs and gives an output in the same fashion as a neuron does. Perceptrons are the building blocks of all the architectures in deep learning. The perceptron first computes the dot product of the weights and the input. An activation function takes this value and gives the output: if the value is greater than 0, the final output (ŷ) is 1, otherwise it is 0. You can choose any function as an activation function.

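A minimal sketch of the perceptron described above, using the step rule "output 1 if the value is greater than 0, else 0". The bias term `b` and the AND-gate weights are assumptions added for illustration; the notes mention only the dot product of weights and input.

```python
import numpy as np

def perceptron(x, w, b=0.0):
    """Return 1 if the weighted sum w.x + b is greater than 0, else 0.

    The bias b is an assumption for this sketch; with b = 0 this is
    exactly the dot-product rule from the text.
    """
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# Hypothetical weights that make the perceptron behave like a logical AND.
w = np.array([1.0, 1.0])
b = -1.5
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(np.array(p, dtype=float), w, b) for p in points]
```

Only the input (1, 1) gives a weighted sum above 0, so it is the only point mapped to 1.
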
[Figure: Multilayer perceptron (MLP)]


➢ In an MLP we have multiple layers of perceptrons. MLPs are feed-forward artificial neural networks. An MLP has at least 3 layers: the first layer is called the input layer, the next ones are called hidden layers, and the last one is called the output layer. The nodes in the input layer have no activation function; they simply represent the data point. If the data point is represented by a d-dimensional vector, the input layer has d nodes.

➢ In the above diagram, we have one input layer, 2 hidden layers, and the final output layer. All layers are fully connected, which means each node is connected to every node in the previous layer. Each layer has a weight matrix that stores all the weights for that layer. These weight matrices are essentially what we obtain once training is over. All of these weights are updated during training using back-propagation.


➢ The multilayer perceptron falls under the category of feedforward algorithms, because inputs are combined with the initial weights in a weighted sum and passed through the activation function, just like in the perceptron. The difference is that the result of each layer is propagated to the next layer.
➢ Each layer feeds the next one with the result of its computation, its internal representation of the data. This goes all the way through the hidden layers to the output layer.

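The layer-by-layer flow can be sketched as a forward pass in NumPy. The layer sizes, the sigmoid activation, the bias vectors, and the random initialization are all assumptions chosen for illustration; the notes only state that each fully connected layer has a weight matrix and feeds its result to the next layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    """Propagate x through every layer and return all layer activations."""
    a = x                      # input nodes just hold the data point
    activations = [a]
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b) # weighted sum, then activation
        activations.append(a)  # this result feeds the next layer
    return activations

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]  # d = 4 inputs, two hidden layers, 2 outputs (assumed)
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

acts = mlp_forward(rng.normal(size=4), weights, biases)
```

`acts[0]` is the d-dimensional input, `acts[1]` and `acts[2]` are the hidden representations, and `acts[-1]` is the output of the network.
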
Gradient Descent

➢ Gradient descent is one of the most commonly used optimization algorithms for minimizing the error between actual and expected results. It is also used to train neural networks.
➢ In mathematical terminology, optimization refers to the task of minimizing or maximizing an objective function f(x) parameterized by x. Similarly, in machine learning, optimization is the task of minimizing the cost function parameterized by the model's parameters. The main objective of gradient descent is to minimize this function through iterative parameter updates; when the function is convex, the minimum it approaches is the global one. Once machine learning models are optimized, they can be used as powerful tools for artificial intelligence and various computer science applications.

➢ Gradient descent is also called steepest descent.
➢ Gradient descent is one of the most commonly used iterative optimization algorithms for training machine learning and deep learning models. It helps in finding a local minimum of a function.

➢ Moving against or along the gradient leads to a local minimum or a local maximum of a function as follows:
• If we move in the direction of the negative gradient, that is, away from the gradient of the function at the current point, we approach a local minimum of that function (gradient descent).
• If we move in the direction of the positive gradient, that is, towards the gradient of the function at the current point, we approach a local maximum of that function (gradient ascent).

➢ The main objective of the gradient descent algorithm is to minimize the cost function iteratively. To achieve this goal, it performs two steps repeatedly:
• Calculate the first-order derivative of the function to obtain its gradient (slope) at the current point.
• Move in the direction opposite to the gradient, stepping from the current point by alpha times the gradient, where alpha is the learning rate. The learning rate is a tuning parameter of the optimization process that decides the length of the steps.

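The two steps can be sketched for a one-dimensional function. The quadratic f(x) = (x - 3)^2, the starting point, the learning rate 0.1, and the fixed number of steps are assumptions chosen only to make the example concrete.

```python
def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Repeat: evaluate the gradient, then step against it by alpha * gradient."""
    x = x0
    for _ in range(steps):
        g = grad(x)        # step 1: first-order derivative at the current point
        x = x - alpha * g  # step 2: move opposite to the gradient
    return x

# Assumed example: f(x) = (x - 3)^2 has derivative 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With this learning rate each step shrinks the distance to the minimum by a constant factor, so `x_min` ends up very close to 3. A much larger alpha would overshoot, which is why the learning rate is treated as a tuning parameter.
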
➢ The cost function measures the difference, or error, between actual values and expected values at the current position, expressed as a single real number. It improves machine learning efficiency by providing feedback to the model so that it can minimize the error and find a local or global minimum. Gradient descent keeps iterating along the direction of the negative gradient until the cost function stops decreasing, ideally close to zero. At that point the gradient is approximately zero and the model stops learning.

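As one common example of such a cost function (an assumption here, since the notes do not fix a particular one), the mean squared error over $n$ training examples with predictions $\hat{y}_i$ and actual values $y_i$ collapses all individual errors into a single real number:

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2$$

In this notation, $\theta$ stands for the model's parameters, and one gradient descent iteration with learning rate $\alpha$ is

$$\theta \leftarrow \theta - \alpha \, \nabla_{\theta} J(\theta)$$

so the iteration stops changing $\theta$ once $\nabla_{\theta} J(\theta)$ is approximately zero.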
Types of Gradient Descent

➢ Based on how much of the training data is used to compute the error for each update, gradient descent can be divided into batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

1. Batch Gradient Descent

➢ Batch gradient descent (BGD) computes the error for each point in the training set and updates the model only after evaluating all training examples. One such pass over the whole training set is known as a training epoch. In simple words, it is a greedy approach where we have to sum over all examples for each update.
➢ Advantages of batch gradient descent:
• It produces less noise than the other types of gradient descent.
• It produces stable gradient descent convergence.
• It is computationally efficient, as all training samples are processed together for each update.

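A minimal sketch of batch gradient descent, assuming a linear regression model with a mean squared error cost. The synthetic noise-free data, the learning rate, and the epoch count are illustrative assumptions, not values from the notes.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, epochs=500):
    """One parameter update per epoch, using the gradient summed over ALL examples."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        error = X @ w - y              # error for every training point
        grad = 2.0 * X.T @ error / n   # MSE gradient averaged over the full set
        w -= alpha * grad              # single update after seeing all examples
    return w

# Assumed toy data: y = 1 + 2 * x, with a column of ones for the intercept.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1.0, 1.0, 50)])
y = X @ np.array([1.0, 2.0])

w_bgd = batch_gradient_descent(X, y)
```

Because every update uses the whole training set, the path of `w` is smooth and free of sampling noise, at the cost of touching all 50 examples for each single step.
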

2. Stochastic Gradient Descent

➢ Stochastic gradient descent (SGD) is a type of gradient descent that processes one training example per iteration. In other words, within each epoch it goes through the dataset example by example and updates the model's parameters after each one. As it requires only one training example at a time, it is easier to fit in allocated memory.
➢ However, it loses some computational efficiency compared with batch gradient descent because of its frequent updates. The frequent updates also make the gradient noisy. This noise can sometimes be helpful for escaping a local minimum and finding the global minimum.

➢ Advantages of stochastic gradient descent:
➢ In stochastic gradient descent (SGD), learning happens on every example, which gives it a few advantages over the other types of gradient descent:
• It is easier to fit in allocated memory.
• Each update is faster to compute than in batch gradient descent.
• It is more efficient for large datasets.

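A minimal sketch of stochastic gradient descent, again assuming a linear regression model with a squared error cost. The toy data, the learning rate, the epoch count, and the per-epoch shuffling are illustrative assumptions.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, epochs=200, seed=0):
    """Update the parameters after every single training example."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):           # visit examples in random order
            error = X[i] @ w - y[i]            # error on ONE example
            w -= alpha * 2.0 * error * X[i]    # immediate, noisy update
    return w

# Assumed toy data: y = 1 + 2 * x, with a column of ones for the intercept.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1.0, 1.0, 50)])
y = X @ np.array([1.0, 2.0])

w_sgd = stochastic_gradient_descent(X, y)
```

Only one row of `X` is needed in memory for each update, which is the memory advantage listed above. On noisy real data the iterates would keep fluctuating around the minimum instead of settling exactly.
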

3. Mini-Batch Gradient Descent

➢ Mini-batch gradient descent combines batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and performs an update on each batch separately. Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we obtain a type of gradient descent with higher computational efficiency and a less noisy gradient.
➢ Advantages of mini-batch gradient descent:
• It is easier to fit in allocated memory.
• It is computationally efficient.
• It produces stable gradient descent convergence.

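A minimal sketch of mini-batch gradient descent under the same assumptions as before (linear regression, squared error cost, synthetic data). The batch size of 10 is an arbitrary illustrative choice.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size=10, alpha=0.1, epochs=300, seed=0):
    """Split each epoch into small batches and update once per batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of one mini-batch
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)
            w -= alpha * grad                      # one update per mini-batch
    return w

# Assumed toy data: y = 1 + 2 * x, with a column of ones for the intercept.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1.0, 1.0, 50)])
y = X @ np.array([1.0, 2.0])

w_mb = minibatch_gradient_descent(X, y)
```

Setting `batch_size` to the dataset size recovers batch gradient descent, and setting it to 1 recovers stochastic gradient descent, which is why mini-batch is described as a combination of the two.
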

Backpropagation

➢ Backpropagation is one of the important concepts of a neural network. Our task is to classify our data as well as possible. For this, we have to update the weights and biases, but how can we do that in a deep neural network? In the linear regression model, we use gradient descent to optimize the parameters. Similarly, here we also use the gradient descent algorithm, with the gradients computed by backpropagation.

➢ The main feature of backpropagation is that it is an iterative, recursive, and efficient method for calculating the updated weights, improving the network until it is able to perform the task for which it is being trained.

➢ Backpropagation is the essence of neural network training. It is the method of fine-tuning the weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights reduces the error rate and makes the model more reliable by improving its generalization.
➢ Backpropagation is short for “backward propagation of errors.” It is a standard method of training artificial neural networks. It calculates the gradient of a loss function with respect to all the weights in the network.

How Backpropagation Algorithm Works

➢ The backpropagation algorithm computes the gradient of the loss function with respect to each weight by the chain rule. It efficiently computes the gradient one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.

1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, to the hidden layers, to the output layer.
4. Calculate the error in the outputs.
5. Travel back from the output layer to the hidden layers to adjust the weights such that the error is decreased.
➢ Keep repeating the process until the desired output is achieved.

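The five steps can be sketched for a small network with one hidden layer. The XOR data, the sigmoid activation, the mean squared error loss, the layer sizes, and the learning rate are assumptions made for illustration; the notes do not prescribe any of them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Step 3: compute outputs layer by layer."""
    H = sigmoid(X @ W1 + b1)        # hidden layer activations
    Y_hat = sigmoid(H @ W2 + b2)    # output layer activations
    return H, Y_hat

def loss_fn(Y_hat, Y):
    """Step 4: error in the outputs (mean squared error)."""
    return np.mean((Y_hat - Y) ** 2)

def backward(X, Y, H, Y_hat, W2):
    """Step 5: chain rule applied from the output layer back to the hidden layer."""
    n = X.shape[0]
    d2 = 2.0 * (Y_hat - Y) / n * Y_hat * (1.0 - Y_hat)  # dLoss/d(output net input)
    gW2, gb2 = H.T @ d2, d2.sum(axis=0)
    d1 = (d2 @ W2.T) * H * (1.0 - H)                    # error sent back to hidden layer
    gW1, gb1 = X.T @ d1, d1.sum(axis=0)
    return gW1, gb1, gW2, gb2

# Step 1: inputs X (assumed XOR data) with targets Y.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Step 2: randomly selected real weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

alpha = 0.5
initial_loss = loss_fn(forward(X, W1, b1, W2, b2)[1], Y)
for _ in range(2000):               # keep repeating steps 3 to 5
    H, Y_hat = forward(X, W1, b1, W2, b2)
    gW1, gb1, gW2, gb2 = backward(X, Y, H, Y_hat, W2)
    W1 -= alpha * gW1; b1 -= alpha * gb1
    W2 -= alpha * gW2; b2 -= alpha * gb2
final_loss = loss_fn(forward(X, W1, b1, W2, b2)[1], Y)
```

Note that `backward` only computes gradients; the weight adjustment itself is the separate gradient descent update inside the loop, which matches the remark that backpropagation does not define how the gradient is used.
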

Most prominent advantages of backpropagation are:

• Backpropagation is fast, simple, and easy to program.
• It has no parameters to tune apart from the number of inputs.
• It is a flexible method, as it does not require prior knowledge about the network.
• It is a standard method that generally works well.
• It does not need any special mention of the features of the function to be learned.

➢ The two types of backpropagation networks are:
• Static backpropagation
• Recurrent backpropagation
➢ Static backpropagation
➢ It is a kind of backpropagation network that produces a mapping of a static input to a static output. It is useful for solving static classification problems such as optical character recognition.
➢ Recurrent backpropagation
➢ In recurrent backpropagation, activations are fed forward until a fixed value is reached. After that, the error is computed and propagated backward.
➢ The main difference between the two methods is that the mapping is rapid in static backpropagation, while it is nonstatic in recurrent backpropagation.

Kohonen Self- Organizing Feature Map

➢ The Kohonen self-organizing feature map (SOM) is a neural network that is trained using competitive learning. Basic competitive learning implies that the competition process takes place before the cycle of learning. In the competition process, some criterion selects a winning processing element. After the winning processing element is selected, its weight vector is adjusted according to the learning law in use.

➢ The self-organizing map is typically represented as a two-dimensional sheet of processing elements, as depicted in the figure given below. Each processing element has its own weight vector, and learning in the SOM depends on the adaptation of these vectors.
➢ The processing elements of the network are made competitive in a self-organizing process, and a specific criterion picks the winning processing element whose weights are updated. Generally, this criterion is to minimize the Euclidean distance between the input vector and the weight vector.

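Selecting the winner by minimum Euclidean distance can be sketched as follows. The 2 x 2 grid, the three-dimensional weight vectors, and the function name `find_winner` are hypothetical choices for illustration.

```python
import numpy as np

def find_winner(weights, x):
    """Return the (row, col) of the processing element whose weight vector is closest to x.

    weights has shape (rows, cols, dim): one weight vector per grid position.
    """
    dists = np.linalg.norm(weights - x, axis=-1)  # Euclidean distance per element
    row, col = np.unravel_index(np.argmin(dists), dists.shape)
    return int(row), int(col)

# Assumed toy map: a 2 x 2 sheet whose weight vectors are all zero except one.
weights = np.zeros((2, 2, 3))
weights[1, 0] = [1.0, 1.0, 1.0]

winner = find_winner(weights, np.array([0.9, 1.1, 1.0]))
```

The input lies much closer to the weight vector stored at grid position (1, 0) than to the zero vectors, so that element wins the competition.
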
➢ The SOM differs from basic competitive learning in that, instead of adjusting only the weight vector of the winning processing element, the weight vectors of neighboring processing elements are adjusted as well. At first the neighborhood is large, which produces a rough ordering of the SOM, and its size is reduced as time goes on.
➢ In the end, only the winning processing element is adjusted, which makes fine-tuning of the SOM possible. The use of a neighborhood makes the topological ordering procedure possible and, together with competitive learning, makes the process non-linear.

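A minimal training sketch showing the shrinking neighborhood. The hard circular neighborhood, the linear decay of the radius and learning rate, the 5 x 5 grid, and the update rule "move the weight vector a fraction of the way towards the input" are assumptions for illustration; the notes do not specify the learning law.

```python
import numpy as np

def train_som(data, rows=5, cols=5, epochs=20, lr0=0.5, radius0=2.5, seed=0):
    """Train a SOM whose neighborhood radius shrinks linearly towards zero."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # grid[r, c] holds the (r, c) coordinates of each processing element
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / total
            lr = lr0 * (1.0 - frac)          # learning rate decays over time
            radius = radius0 * (1.0 - frac)  # large at first, small at the end
            dists = np.linalg.norm(weights - x, axis=-1)
            winner = np.unravel_index(np.argmin(dists), dists.shape)
            grid_dist = np.linalg.norm(grid - np.array(winner), axis=-1)
            mask = grid_dist <= radius       # winner plus its current neighbors
            weights[mask] += lr * (x - weights[mask])
            step += 1
    return weights

rng = np.random.default_rng(1)
data = rng.random((100, 2))  # assumed toy inputs in the unit square
som = train_som(data)
```

Early on, `mask` covers a wide patch of the grid around the winner, which gives the rough ordering. Once `radius` falls below 1, the mask contains only the winner itself, which corresponds to the fine-tuning phase.
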
➢ The self-organizing map is an unsupervised learning model proposed for applications in which maintaining a topology between input and output spaces is important.
➢ It is fundamentally a method for dimensionality reduction, as it maps high-dimensional inputs to a low-dimensional discretized representation and preserves the basic structure of its input space.

➢ The entire learning process occurs without supervision because the nodes are self-organizing. Self-organizing maps are also known as feature maps, as they essentially retain the features of the input data and simply group themselves according to the similarity between each other. They have practical value for visualizing complex or huge quantities of high-dimensional data and showing the relationships within them in a low-dimensional, usually two-dimensional, field to check whether the given unlabeled data have any structure.

➢ A self-organizing map uses competitive learning instead of error-correction learning to modify its weights. This implies that only an individual node is activated at each cycle in which the features of an instance of the input vector are presented to the neural network, as all nodes compete for the privilege to respond to the input.
➢ The architecture of the self-organizing map with two clusters and n input features of any sample is given below:

Learning Vector Quantization (LVQ)

➢ This algorithm stands at the intersection of clustering and classification, offering a unique approach to solving multi-class classification problems.
➢ Architecture
➢ The LVQ network is a two-layered network. The first layer can be called the competitive layer and the second layer can be called the linear layer. The names of the layers come from the activation function used in each layer. The LVQ network is illustrated in the diagram.
➢ Notation used in this diagram:
• R is the size of the input vector
• W is the weight matrix
• S is the number of neurons
• n is the net input to the activation function
• a is the net output of the activation function
➢ All the superscript numbers denote the neural network layer. Example: W¹ is the weight matrix of the first layer and W² is the weight matrix of the second layer.

➢ Activation Functions
➢ Since LVQ is a two-layered network, we have two activation functions, one for each layer.
1. Competitive
➢ The main purpose of the competitive activation function is to identify the “winning” neuron or prototype vector that best matches the input data.
2. Linear
➢ The linear activation function is very straightforward: the output of the layer is the same as the net input for the layer.

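The two layers can be sketched as a forward pass using the notation above. Two details are assumptions not stated in the notes: the net input of the competitive layer is taken as the negative Euclidean distance to each prototype (so the best match has the largest net input), and W² is taken as a 0/1 matrix that assigns each prototype to a class. The prototype values are hypothetical.

```python
import numpy as np

def lvq_forward(p, W1, W2):
    """Forward pass of a two-layer LVQ network for one input vector p."""
    n1 = -np.linalg.norm(W1 - p, axis=1)  # net input, layer 1 (assumed: negative distance)
    a1 = np.zeros(W1.shape[0])
    a1[np.argmax(n1)] = 1.0               # competitive: only the winning neuron outputs 1
    n2 = W2 @ a1                          # net input, layer 2
    a2 = n2                               # linear: output equals net input
    return a1, a2

# Assumed example with R = 2 inputs, S1 = 3 prototypes, S2 = 2 classes.
W1 = np.array([[0.0, 0.0],
               [1.0, 1.0],
               [0.0, 1.0]])
W2 = np.array([[1.0, 1.0, 0.0],   # prototypes 0 and 1 belong to class 0
               [0.0, 0.0, 1.0]])  # prototype 2 belongs to class 1

a1, a2 = lvq_forward(np.array([0.9, 0.8]), W1, W2)
```

The input is closest to the second prototype, so `a1` is one-hot at index 1, and the linear layer passes that through W² to mark class 0 in `a2`.
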

➢ LVQ serves well for simpler classification tasks with moderate-sized datasets and distinct class separations. However, its constraints should be taken into account when tackling more intricate scenarios.
➢ Advantages of Using LVQ:
1. Ease of Understanding: LVQ’s straightforward nature makes it accessible and suitable for those new to machine learning.
2. Clear Interpretation: By assigning labels to prototypes, LVQ offers insights into the reasoning behind classification decisions.
3. Partial Labeling Support: LVQ can handle situations where only a portion of the data is labeled, enhancing its applicability.

➢ Disadvantages of the LVQ Algorithm:
1. Sensitivity to Starting Points: LVQ’s performance can vary based on where prototypes are initially placed, affecting results.
2. Complex Boundary Limitation: When dealing with intricate decision boundaries, LVQ might struggle to accurately model them using existing prototypes.
3. Susceptibility to Noise: Noise within training data may misposition prototypes, leading to compromised classification quality.
4. Scalability Challenges: Managing an adequate number of prototypes becomes demanding as the number of classes or features increases.
5. Bias towards Dominant Classes: In cases of imbalanced training data, LVQ may exhibit a bias towards the more prevalent classes.

COUNTER-PROPAGATION (CPN) NETWORK

➢ Data compression reduces the amount of data to be sent or stored, usually with the possibility of full reproduction (decompression). Image data compression is used to encode large amounts of image data for transmission over limited-capacity channels. Recently, neural network algorithms have been developed for data compression, with performance reported to be superior to classical techniques.
➢ For efficient compression, the image data are passed through a network that produces binary vectors as their compressed version. After decompression, the output data are expected to be very close to the original image data. The compression ratio depends greatly on the tolerated amount of error. The algorithm consists of two major steps, namely network training (or learning) and processing (data compression and decompression).