Deep learning
Their parallel processing is what makes the brain's abilities possible. Figure 1 represents a human biological neural cell (nervous unit). The various parts of the biological neural network (BNN) are marked in Figure 1.
Information flow in a neural cell
The input/output behaviour and the propagation of information through the neural cell are shown in the figure.
1.3. Artificial neuron model
An artificial neuron is a mathematical function conceived as a simple model of a real (biological)
neuron.
The McCulloch-Pitts Neuron
This is a simplified model of a real neuron, known as a Threshold Logic Unit.
• A set of input connections brings in activations from other neurons.
• A processing unit sums the inputs and then applies a non-linear activation function (i.e. a squashing/transfer/threshold function).
• An output line transmits the result to other neurons.
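To make this concrete, here is a minimal Python sketch of such a threshold logic unit; the weights and threshold (chosen to realize an AND gate) are illustrative assumptions, not values from the text.

```python
# A minimal McCulloch-Pitts (threshold logic) unit.
# The unit fires (outputs 1) only when the weighted sum of its
# inputs reaches the threshold; otherwise it outputs 0.

def mcculloch_pitts(inputs, weights, threshold):
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0

# Example: a 2-input AND gate (weights 1, 1 and threshold 2).
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", mcculloch_pitts([a, b], [1, 1], threshold=2))
```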
1.3.1 Basic Elements of ANN:
A neuron consists of three basic components: weights, thresholds, and a single activation function. An artificial neural network (ANN) model based on biological neural systems is shown in Figure 2.
The different training/learning procedures available in ANNs are:
Supervised learning
Unsupervised learning
Reinforced learning
Hebbian learning
Gradient descent learning
Competitive learning
Stochastic learning
1.4.1. Requirements of Learning Laws:
• The learning law should lead to convergence of the weights.
• The learning or training time should be short for capturing the information from the training pairs.
• Learning should use only locally available information.
• The learning process should be able to capture the complex non-linear mapping between the input and output pairs.
• Learning should be able to capture as many patterns as possible.
• The storage of the pattern information gathered at the time of learning should be high for the given network.
1.4.1.1. Supervised learning:
Every input pattern that is used to train the network is associated with an output pattern, which is the target or desired pattern. A teacher is assumed to be present during the training process, and a comparison is made between the network's computed output and the correct expected output to determine the error. The error can then be used to change the network parameters, which results in an improvement in performance.
1.4.1.2 Unsupervised learning:
In this learning method the target output is not presented to the network. It is as if there is no teacher to present the desired patterns, and hence the system learns on its own by discovering and adapting to structural features in the input patterns.
1.4.1.3 Reinforced learning:
In this method a teacher, though available, does not present the expected answer, but only indicates whether the computed output is correct or incorrect. This information helps the network in the learning process.
1.4.1.4 Hebbian learning:
This rule was proposed by Hebb and is based on correlative weight adjustment. It is the oldest learning mechanism inspired by biology. In it, the input-output pattern pairs $(x_i, y_i)$ are associated by the weight matrix $W$, known as the correlation matrix.
It is computed as
$W = \sum_{i=1}^{n} x_i y_i^T$ ------------ eq(1)
Here $y_i^T$ is the transpose of the associated output vector $y_i$. Numerous variants of the rule have been proposed.
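As an illustration of eq(1), the correlation matrix can be accumulated with NumPy as below; the two bipolar pattern pairs are made-up examples, not data from the text.

```python
import numpy as np

# Hebbian learning as in eq(1): W = sum over pairs of x_i * y_i^T.
# The pattern pairs below are arbitrary bipolar examples for illustration.
x_patterns = [np.array([1, -1, 1]), np.array([-1, 1, 1])]
y_patterns = [np.array([1, -1]),    np.array([-1, 1])]

W = np.zeros((3, 2))
for x, y in zip(x_patterns, y_patterns):
    W += np.outer(x, y)   # x_i y_i^T accumulated into the correlation matrix

print(W)
```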
1.4.1.5 Gradient descent learning:
This is based on the minimization of the error E, defined in terms of the weights and the activation function of the network. It is also required that the activation function employed by the network be differentiable, as the weight update depends on the gradient of the error E.
Thus if $\Delta w_{ij}$ is the weight update of the link connecting the $i$th and $j$th neurons of two neighbouring layers, then $\Delta w_{ij}$ is defined as
$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$ ----------- eq(2)
where $\eta$ is the learning rate parameter and $\frac{\partial E}{\partial w_{ij}}$ is the error gradient with reference to the weight $w_{ij}$ (the negative sign makes the update a descent step, so that E decreases).
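A small numerical sketch of eq(2) in Python follows; the quadratic error function and starting weight are assumptions chosen purely to make the descent visible.

```python
# Gradient descent on a toy error function E(w) = (w - 3)^2,
# whose gradient dE/dw = 2 * (w - 3) vanishes at the minimum w = 3.
eta = 0.1          # learning rate
w = 0.0            # arbitrary starting weight

for step in range(25):
    grad = 2 * (w - 3)      # dE/dw
    w = w - eta * grad      # eq(2): delta_w = -eta * dE/dw

print(round(w, 4))  # approaches 3.0, the minimizer of E
```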
1.5.1. Single layer perceptron:
To classify the input patterns into only two categories, all we need is a single output neuron. Here we will use bipolar neurons. The simplest architecture that can do the job consists of a layer of N input neurons, an output layer with a single output neuron, and no hidden layers. This is the same architecture as we saw before for Hebb learning. However, we will use a different transfer function here for the output neuron, as given below in eq(7). Figure 7 represents a single layer perceptron network.
$f(x) = \begin{cases} +1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}$ --------------------- eq(7)
Equation 7 gives the bipolar activation function, which is the most common function used in perceptron networks. The inputs arising from the problem space are collected by the sensors and fed to the association units. Association units are the units responsible for associating the inputs based on their similarities; this unit groups the similar inputs, hence the name association unit. A single input from each group is given to the summing unit. Weights are randomly fixed initially and assigned to these inputs. The net value is calculated using the expression
$x = \sum_i w_i a_i - \theta$ ___________________ eq(8)
This value is given to the activation function unit to get the final output response. The actual output is compared with the target (desired) output. If they are the same, we can stop training; otherwise the weights have to be updated, which means there is an error. The error is given as $\delta = b - s$, where $b$ is the desired/target output and $s$ is the actual outcome of the machine. Here the weights are updated based on the perceptron learning law as given in equation 9.
The weight change is given as $\Delta w = \eta\,\delta\,a_i$, so the new weight is given as
$w_i(\text{new}) = w_i(\text{old}) + \Delta w$ _________eq(9)
1.5.2. Perceptron Algorithm
• Step 1: Initialize the weights and bias. For simplicity, set the weights and bias to zero. Set the learning rate in the range of zero to one.
• Step 2: While the stopping condition is false, do Steps 3-7.
• Step 3: For each training pair s:t, do Steps 4-7.
• Step 4: Set the activations of the input units: $x_i = a_i$.
• Step 5: Calculate the summing-part value: $\text{Net} = \sum_i a_i w_i - \theta$.
• Step 6: Compute the response of the output unit based on the activation function.
• Step 7: Update the weights and bias if an error occurred for this pattern (i.e. if $y$ is not equal to $t$):
$w_i(\text{new}) = w_i(\text{old}) + \eta t x_i$ and $b(\text{new}) = b(\text{old}) + \eta t$;
else $w_i(\text{new}) = w_i(\text{old})$ and $b(\text{new}) = b(\text{old})$.
• Step 8: Test the stopping condition.
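As a concrete illustration, the following Python sketch implements Steps 1-8 for a two-input AND problem with bipolar inputs and targets; the training data, learning rate, and epoch limit are illustrative assumptions, not taken from the text.

```python
# Perceptron training (Steps 1-8) on a bipolar AND problem.
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])  # input pairs s
T = np.array([1, -1, -1, -1])                        # targets t (bipolar AND)

w = np.zeros(2)   # Step 1: weights ...
b = 0.0           # ... and bias start at zero
eta = 0.1         # learning rate in (0, 1]

for epoch in range(100):                 # Step 2: loop until no errors
    errors = 0
    for x, t in zip(X, T):               # Step 3: each training pair
        net = b + np.dot(x, w)           # Step 5: summing part (bias form)
        y = 1 if net >= 0 else -1        # Step 6: bipolar activation, eq(7)
        if y != t:                       # Step 7: update only on error
            w += eta * t * x
            b += eta * t
            errors += 1
    if errors == 0:                      # Step 8: stopping condition
        break

print(w, b)
```

Because bipolar AND is linearly separable, the loop reaches zero errors within a few epochs.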
1.5.3. Limitations of single layer perceptrons:
• It uses only a binary (threshold) activation function.
• It can be used only for linearly separable problems.
• Since it uses supervised learning, an optimal solution is provided.
• Training time is long.
• It cannot solve linearly inseparable problems.
Figure 5: Multi-Layer Perceptron
Training algorithm:
1. Initialize the weights ($W_i$) and bias ($b_0$) to small random values near zero.
2. Set the learning rate η (or α) in the range of 0 to 1.
3. Check the stop condition. If the stop condition is false, do Steps 4 to 8.
4. For each training pair, do Steps 5 to 8.
5. Set the activations of the input units: $x_i = s_i$ for i = 1 to N.
6. Calculate the output response:
$y_{in} = b_0 + \sum_i x_i w_i$
7. The activation function used is the bipolar sigmoid or the bipolar step function.
For multi-layer networks, Steps 6 and 7 are repeated for each layer, as sketched in the code below.
8. If the target (t) is not equal to the actual output (y), then update the weights and bias based on the perceptron learning law:
$W_i(\text{new}) = W_i(\text{old}) + \Delta w_i$, where $\Delta w_i = \eta t_i x_i$
Here η is the learning rate, $t_i$ is the target output of the ith unit, and $x_i$ is the ith input.
$b_0(\text{new}) = b_0(\text{old}) + \Delta b_0$, where $\Delta b_0 = \eta t_i$
Else $W_i(\text{new}) = W_i(\text{old})$ and $b_0(\text{new}) = b_0(\text{old})$.
9. Test for the stop condition.
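To illustrate how Steps 6 and 7 repeat per layer in a multi-layer network, here is a small forward-pass sketch in Python with a bipolar sigmoid; the layer sizes, input pattern, and random weights are assumptions for demonstration only.

```python
import numpy as np

def bipolar_sigmoid(x):
    # The bipolar sigmoid squashes values into (-1, 1).
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

rng = np.random.default_rng(0)
# Small random weights near zero, as in Step 1 (layer sizes are illustrative).
W1, b1 = rng.normal(0, 0.1, (3, 4)), np.zeros(4)   # input(3) -> hidden(4)
W2, b2 = rng.normal(0, 0.1, (4, 1)), np.zeros(1)   # hidden(4) -> output(1)

x = np.array([1.0, -1.0, 0.5])          # an example input pattern
h = bipolar_sigmoid(b1 + x @ W1)        # Steps 6-7 for the hidden layer
y = bipolar_sigmoid(b2 + h @ W2)        # Steps 6-7 repeated for the output layer
print(y)
```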
1.6. Linearly separable and linearly inseparable tasks:
Perceptrons are successful only on problems with a linearly separable solution space. Figure 9 represents both a linearly separable and a linearly inseparable problem. Perceptrons cannot handle, in particular, tasks which are not linearly separable (known as the linear inseparability problem). Sets of points in a two-dimensional space are linearly separable if the sets can be separated by a straight line. Generalizing, a set of points in an n-dimensional space that can be separated by a hyperplane is called linearly separable, as represented in Figure 9.
A single-layer perceptron can be used for linear separation, for example the AND gate. But it cannot be used for nonlinear, inseparable problems, for example the XOR gate. Consider Figure 10.
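A quick way to see the XOR difficulty concretely is to search exhaustively for a separating line; the weight/threshold grid below is an arbitrary assumption for the demonstration.

```python
import itertools

# Bipolar XOR: the output is +1 exactly when the two inputs differ.
points  = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
targets = [-1, 1, 1, -1]

def separates(w1, w2, theta):
    # The line w1*x1 + w2*x2 = theta separates the classes if the sign of
    # the net input matches the target for every point.
    return all((1 if w1*x1 + w2*x2 - theta >= 0 else -1) == t
               for (x1, x2), t in zip(points, targets))

grid = [i / 4 for i in range(-8, 9)]   # coarse grid of candidate parameters
found = any(separates(w1, w2, th)
            for w1, w2, th in itertools.product(grid, repeat=3))
print("separating line found:", found)   # prints False: XOR is inseparable
```

Swapping in the bipolar AND targets (1, -1, -1, -1) makes the same search succeed, matching the claim that AND is linearly separable.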
Convex regions can be created by the multiple decision lines arising from multi-layer networks. A single-layer network cannot be used to solve an inseparable problem. Hence we go for a multi-layer network, thereby creating convex regions which solve the inseparable problem.
1.6.1 Convex Region:
Select any two points in a region and draw a straight line between them. If, for every such pair, the points selected and the line joining them lie inside the region, then that region is known as a convex region.
1.6.2. Types of convex regions
(a) Open Convex region (b) Closed Convex region
Figure 9 A: Circle - closed convex region
Figure 9 B: Triangle - closed convex region
1.7. Logistic Regression
Logistic regression is a probabilistic model that organizes the instances in terms of
probabilities. Because the classification is probabilistic, a natural method for optimizing the
parameters is to ensure that the predicted probability of the observed class for each training
occurrence is as large as possible. This goal is achieved by using the notion of maximum likelihood
estimation in order to learn the parameters of the model. The likelihood of the training data is defined
as the product of the probabilities of the observed labels of each training instance. Clearly, larger
values of this objective function are better. By using the negative logarithm of this value, one obtains
a loss function in minimization form. Therefore, the output node uses the negative log-likelihood as
a loss function. This loss function replaces the squared error used in the Widrow-Hoff method. The
output layer can be formulated with the sigmoid activation function, which is very common in neural
network design.
Logistic regression is another supervised learning algorithm used to solve classification problems. In classification problems, the dependent variable is in a binary or discrete format, such as 0 or 1.
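To tie these ideas together, here is a minimal Python sketch of logistic regression trained by minimizing the negative log-likelihood with gradient descent; the toy one-feature dataset and step size are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary data: one-feature instances with 0/1 labels (illustrative only).
X = np.array([[0.5], [1.5], [-1.0], [-2.0]])
y = np.array([1, 1, 0, 0])
w, b = np.zeros(1), 0.0

for _ in range(500):
    p = sigmoid(X @ w + b)                    # predicted probability of class 1
    # Negative log-likelihood: the loss function being minimized.
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_w = X.T @ (p - y) / len(y)           # gradient of the loss w.r.t. w
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w                          # gradient descent step
    b -= 0.5 * grad_b

print(loss, w, b)
```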
1.8. Support Vector Machines
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However, primarily,
it is used for Classification problems in Machine Learning. The goal of the SVM algorithm is to
create the best line or decision boundary that can segregate n-dimensional space into classes so
that we can easily put the new data point in the correct category in the future. This best decision
boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane:
Non-linear SVM: A non-linear SVM is used for non-linearly separable data; if a dataset cannot be classified by using a straight line, such data is termed non-linear data, and the classifier used is called a non-linear SVM classifier.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
1.8.2. Linear SVM:
The working of the SVM algorithm can be understood using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify a pair (x1, x2) of coordinates as either green or blue; consider Figure 11. Since it is a 2-D space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that separate these classes, as the image shows.
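As a concrete sketch of a linear SVM, scikit-learn's SVC with a linear kernel (the library choice is an assumption; the text names none) can fit such a two-feature, two-class problem and expose its support vectors:

```python
from sklearn.svm import SVC
import numpy as np

# Toy two-feature dataset standing in for the green/blue example.
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])        # 0 = "green", 1 = "blue"

clf = SVC(kernel="linear")               # maximal-margin linear hyperplane
clf.fit(X, y)

print(clf.support_vectors_)              # the extreme points defining the margin
print(clf.predict([[3, 2], [7, 6]]))     # classify new (x1, x2) pairs
```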
1.9. Gradient Descent
1.9.1. Types of Gradient Descent:
Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
1.9.2. Stochastic Gradient Descent (SGD):
The word 'stochastic' refers to a system or a process that involves random probability.
Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole
data set for each iteration. In Gradient Descent, there is a term called “batch” which denotes the total
number of samples from a dataset that is used for calculating the gradient for each iteration. In typical
Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the whole
dataset. Although using the whole dataset is useful for reaching the minimum in a less noisy and less random manner, the problem arises when our dataset gets big.
Suppose you have a million samples in your dataset; if you use a typical Gradient Descent optimization technique, you will have to use all one million samples to complete one iteration of Gradient Descent, and this has to be done for every iteration until the minimum is reached. Hence, it becomes computationally very expensive to perform.
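The following Python sketch contrasts this with the batch approach on a toy linear-regression objective: each update uses one randomly selected sample rather than all of them. The synthetic data and step size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: y = 4x + noise (an illustrative stand-in for a large dataset).
x = rng.uniform(-1, 1, 1000)
y = 4 * x + rng.normal(0, 0.1, 1000)

w, eta = 0.0, 0.1

# Stochastic Gradient Descent: one randomly chosen sample per update,
# instead of the gradient over the entire "batch" of 1000 samples.
for _ in range(2000):
    i = rng.integers(len(x))
    grad = 2 * (w * x[i] - y[i]) * x[i]   # gradient of (w*x_i - y_i)^2
    w -= eta * grad

print(w)   # a noisy estimate near the true slope 4
```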
Note: For further reference, kindly refer to the class notes, PPTs, and video lectures available in the Learning Management System (Moodle).