Ann mod1

Uploaded by

imnavinbabu
20MCA283

DEEP LEARNING
Module 1
Artificial Neural Networks
Introduction
• Two major Problem Solving Techniques are
• Hard computing
• Soft Computing
Introduction
Hard computing
• Hard computing deals with precise models where
accurate solutions are achieved quickly.
• Hard computing techniques require exact input
data.
• It is strictly sequential and provides precise
answers to complex problems.
Introduction
Soft computing
• Soft Computing deals with approximate models.
• The term "soft computing" was introduced by
Professor Lotfi Zadeh in 1994.
• Provides solutions to complex problems and deals
with imprecise, uncertain and partial truth data.
• Soft computing is a combination of Neural
Network, Fuzzy Logic and Genetic Algorithm.
Deep learning
[Figure: example with the labels Area, Facilities, Price, Age and Location]
Deep learning
• Deep learning is a subset of machine learning in artificial
intelligence.
• It mimics the working of the brain by processing data and creating patterns, i.e. by learning and
encoding significant features from the input data.
• It uses artificial neural networks with many layers to address complex problems.

• Problems in Deep Learning
• Categorization (Classification)
• Prediction
Applications of Deep learning

• Computer Vision: Computer vision deals with algorithms that let
computers understand the world using image and video data.
• Eg: Image recognition, image classification, object detection, image
segmentation, image restoration, etc.
• Speech and Natural Language Processing: Natural language
processing deals with algorithms for computers to understand,
interpret, and manipulate human language. NLP algorithms work with
text and audio data and transform them into audio or text output.
• Eg: Sentiment analysis, speech recognition, language translation, natural
language generation, etc.
Applications of Deep learning

• Autonomous Vehicles: For driverless cars, deep learning models are
trained with a huge amount of data; some models specialize in
identifying street signs, others in identifying pedestrians, etc.
• Text Generation: Deep learning models trained on language,
grammar and types of text can be used to create new text with
correct spelling and grammar.
• Image Filtering: Deep learning models can perform tasks such as adding
color to black-and-white images, which take much more time if done
manually.
Deep learning
Biological Neuron
• The human brain consists of billions of neural cells
called neurons (approximately 10^11).
• Neurons process information.
• Each cell works like a single processor.
• Interaction between cells and their parallel
processing helps the brain to learn, reorganize
itself from experience and adapt to the
environment.
• Information is transported between neurons in
the form of electrical signals.
Biological Neuron
Neuron consists of the following four parts
• Dendrites − They are responsible for receiving
information from the other neurons the neuron is
connected to.
• Soma (Cell body) − It is responsible for
processing the information received
from the dendrites. The nucleus is located here.
• Axon − It is just like a cable through which
neurons send information.
• Synapses − They are the connections between the
axon and the dendrites of other neurons.
Biological Neuron

• The incoming information from the dendrites is
added up at the nucleus and then delivered to the synapse
via the axon.
• If the incoming stimulation exceeds a
certain threshold, the neuron is activated.
• If the stimulation is too low, the neuron is
inhibited and the information is not
transported any further.
Artificial Neuron
Artificial Neuron

• An artificial neuron is a mathematical model of the
biological neuron.
• The artificial neuron is the basic unit of an Artificial
Neural Network (ANN).
• A neuron can accept any number of inputs and
sends a single output signal.
Artificial Neuron – Mathematical Model

• Let the inputs be x1, x2, …, xn
• Inputs are connected to the cell body through
links having weights w1, w2, …, wn
respectively
• Weights represent the connection strength,
similar to synaptic strength in a biological
neuron.
Artificial Neuron – Net Input Calculation
• Net input
$y_{in} = x_1 w_1 + x_2 w_2 + \dots + x_n w_n = \sum_{i=1}^{n} x_i w_i$

• Output Calculation
To calculate the output, an activation function is applied
over the net input yin.
Eg: Activation functions
Hard Limiter
Sign
Sigmoid
Ramp Function
Hyperbolic Tangent
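The net input and output calculation can be sketched in a few lines of Python. This is only an illustration: the input and weight values are hypothetical (they do not come from the slides), and the binary sigmoid is picked as one of the activation functions listed above.

```python
import math

def net_input(x, w):
    """Net input y_in = sum of x_i * w_i over all inputs."""
    return sum(xi * wi for xi, wi in zip(x, w))

def binary_sigmoid(y_in):
    """One possible activation function; output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y_in))

# Hypothetical inputs and weights, chosen only for illustration.
x = [1.0, 0.5, -1.0]
w = [0.2, 0.4, 0.1]

y_in = net_input(x, w)      # 1.0*0.2 + 0.5*0.4 + (-1.0)*0.1
y = binary_sigmoid(y_in)    # output of the neuron
print(y_in, y)
```

Any other activation function from the list could be passed the same net input.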
Biological Neuron Vs. Artificial Neuron
Terminology relationship between biological and
artificial neuron

Biological Neuron | Artificial Neuron
Cell | Neuron
Dendrites | Weight or Interconnection
Soma | Net Input
Axon | Output
Biological Neuron Vs. Artificial Neuron

Criterion | Biological Neuron | Artificial Neuron
1. Speed of execution | Milliseconds | Nanoseconds
2. Processing of data | Can perform massive parallel operations | Can perform massive parallel operations
3. Size and complexity | The total number of neurons in the brain is about 10^11 and the total number of interconnections is about 10^15; complexity is high | Size and complexity are based on the application and the designer
4. Tolerance | Fault tolerant: can store and retrieve data even if interconnections get disconnected | Information gets corrupted if interconnections get disconnected
5. Storage or memory | Stores data in the synapses; new data can be added by adjusting synaptic strength without destroying old data | Stores data in the weight matrix; adding new data may destroy old data
6. Control mechanism | No control unit | A control unit is present in the CPU, which transfers values from one unit to another
Eg: ANN Models
• McCulloch-Pitts Model (1943)
• Hebb Network (1949)
• Perceptron (1958)
• Adaline (1960)
• Back Propagation Network(1986)
Characteristics of ANN

• Mathematical model.
• Contains interconnected processing elements(neurons).
• WEIGHTED LINKS (interconnections) hold information.
• Neurons can learn, recall, generalize data by adjusting
weights.
• No single neuron carries specific information; only a
collection of neurons hold data.
ANN
• In Neural networks neurons are
organized in layers.
Input layer
Hidden Layer
Output layer
• When some data is fed to the
ANN, it is processed via layers
of neurons to produce desired
output.
• Data is presented to the network
via the input layer.
• Input layer communicates to
hidden layers
• The hidden layers then link to an
output layer where the answer is
output
Basic Models of Artificial Neural Networks (ANN)
• ANN models are specified by three entities
1. Inter Connections
2. Learning
3. Activation Function
Basic Models of ANN
1. Inter Connections

• The connection pattern formed within and between layers is called the
network architecture.
• There exist five basic types of neuron connection architectures.
1. single-layer feed-forward network;
2. multilayer feed-forward network;
3. single node with its own feedback;
4. single-layer recurrent network;
5. multilayer recurrent network.
1.1 Single-layer feed-forward network
• In feed-forward networks the data flows in a single direction, from the input layer to
the output layer.
• A single-layer feed-forward network contains an input layer and an output layer.
• Here inputs are given directly to the neurons of the output layer.
• Each neuron of the output layer calculates the NET INPUT (yin), and an ACTIVATION
FUNCTION is applied over it to produce the output (y).
[Figure: input layer X1, X2, X3 connected to output layer y1, y2 through weights w11, w12, w21, w22, w31, w32]
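A forward pass through a single-layer feed-forward network with three inputs and two output neurons, as in the figure, could look like the sketch below. The weight values and the choice of a step activation are assumptions made for illustration only.

```python
def step(y_in, threshold=0.0):
    """Hard limiter: 1 if the net input reaches the threshold, else 0."""
    return 1 if y_in >= threshold else 0

def single_layer_forward(x, w, activation=step):
    """w[i][j] is the weight from input i to output neuron j."""
    n_out = len(w[0])
    outputs = []
    for j in range(n_out):
        # Net input of output neuron j
        y_in = sum(x[i] * w[i][j] for i in range(len(x)))
        outputs.append(activation(y_in))
    return outputs

# Hypothetical input pattern and weights (not from the slides).
x = [1, 0, 1]
w = [[0.5, -0.6],   # w11, w12
     [0.3,  0.8],   # w21, w22
     [-0.2, 0.1]]   # w31, w32

y = single_layer_forward(x, w)
print(y)
```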
1.2 Multi layer Feed Forward network
• This type of network contains one or more layers (hidden layers) between the input and
output layers.
• The more hidden layers there are, the greater the complexity of the network.
[Figure: input layer X1, X2, X3; hidden layer Z1, Z2 (weights v11, v12, v21, v22, v31, v32); output layer Y1 (weights w11, w21) with output y]
1.3 Single layer Recurrent Network (Feed back Network)
• Networks in which output links are directed back as inputs to the same or preceding
layer nodes are called feedback networks.
• Recurrent networks are feedback networks with a closed loop.
[Figure: single-layer recurrent network with inputs X1, X2, X3, outputs y1, y2 and weights w11, w12, w21, w22, w31, w32]
1.4 Multi layer Recurrent Network (Feed back Network)
• Networks in which output links are directed back as inputs to the same or preceding
layer nodes are called feedback networks.
• Recurrent networks are feedback networks with a closed loop.
[Figure: multilayer recurrent network with inputs X1, X2, X3, hidden units Z1, Z2 (weights v11, v12, v21, v22, v31, v32) and output unit Y1 (weights w11, w21) with output y]
1.5 Single node with its own feedback (Feed back Network)
• A single-node recurrent network has a single neuron with
feedback to itself.
• Recurrent networks are feedback networks with a
closed loop.
2. Learning or Training
• Learning is the process which improves ANN’s performance
and is applied repeatedly over the network.
• Learning is done by updating weights till the desired output is
obtained.
• For learning there is a learning algorithm.
• Data called as training data set is fed to the learning algorithm
and the algorithm draws inferences from the training data set.
• Types of learning are:
• Supervised learning
• Unsupervised learning
• Reinforcement learning
2. Learning or Training – Supervised learning
• Learn with the help of a Teacher/ Supervisor.
• In this method both input and output patterns (training pair)
are provided.
• During training, input is given to ANN, which gives an
output(actual output).
• This actual output is compared with the target output(output
given in the training pair).
• If there exists a difference an error signal is generated.
• This error signal is used for adjusting weights.
• This process repeats until actual output matches the target
output or we attain some accuracy.
• Supervisor helps to reduce error, so it is called supervised
learning.
2. Learning or Training – Unsupervised learning

• The learning process is independent; there is no teacher.
• Inputs are grouped into clusters without the use of target outputs;
the network finds the hidden structure in the input.
• In the training process, the network receives the input
patterns and organizes these patterns to form clusters.
• When an input is applied to the ANN, it gives a response
indicating the cluster to which that input belongs.
• If an input does not belong to any cluster, a new cluster is formed.
• There is no mechanism to check whether the outputs are
correct or not.
2. Learning or Training – Unsupervised learning

• The ANN must itself discover patterns, regularities and features in the input data, and the
relations between the input data and the output.
• While discovering these features the network undergoes changes in its weights. This
process is called self-organizing.
[Figure label: changes in weight values]
2. Learning or Training – Reinforcement Learning

• This is similar to supervised learning.
• Exact information about the target output is not
available, but critic information is
available.
• Feedback is sent as a reinforcement signal.
• This reinforcement signal is processed in
an error signal generator.
• The ANN adjusts weights according to the
error signal and the training process
repeats.
3. Activation Functions

1. Hard Limiter or STEP function or
Binary Step function

$f(y_{in}) = \begin{cases} 0 & \text{if } y_{in} < T \\ 1 & \text{if } y_{in} \ge T \end{cases}$ where T is the threshold.
[Figure: plot of the function for T = 0]
3. Activation Functions
2. Sign function or Bipolar Step

$f(y_{in}) = \begin{cases} -1 & \text{if } y_{in} < T \\ 1 & \text{if } y_{in} \ge T \end{cases}$ where T is the threshold.
Here y is -1 or +1
[Figure: plots of the function for T = 0 and T = 1]
3. Activation Functions

3. Binary Sigmoid

$f(y_{in}) = \frac{1}{1 + e^{-y_{in}}}$

Here y lies between 0 and 1


3. Activation Functions

4. Bipolar Sigmoid

$f(y_{in}) = \frac{1 - e^{-y_{in}}}{1 + e^{-y_{in}}}$

Here y lies between -1 and 1


3. Activation Functions

5. RAMP function

$f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} > 1 \\ y_{in} & \text{if } 0 \le y_{in} \le 1 \\ 0 & \text{if } y_{in} < 0 \end{cases}$

Here y lies between 0 and 1


3. Activation Functions

6. Tanh (hyperbolic tangent)

The mathematical formula is
$f(y_{in}) = \frac{e^{2 y_{in}} - 1}{e^{2 y_{in}} + 1}$
3. Activation Functions

7. ReLU (Rectified Linear Unit)

Most modern deep learning models use ReLU.

The mathematical formula is:

R(x) = max(0, x)

i.e. if x < 0, R(x) = 0 and if x >= 0, R(x) = x.

• It is very simple and efficient.

• ReLU is generally used only within the hidden layers of a neural network model.


3. Activation Functions

8. Softmax activation function

• The softmax function would squeeze the outputs for each class between 0 and
1 and would also divide by the sum of the outputs.

• This gives the probability of the input being in a particular class.
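The eight activation functions described above can be written directly from their formulas. The sketch below uses only the Python standard library. The function names are illustrative, and subtracting the maximum inside `softmax` is a common numerical-stability trick that is not part of the slides.

```python
import math

def hard_limiter(y_in, t=0.0):
    return 1 if y_in >= t else 0

def sign(y_in, t=0.0):
    return 1 if y_in >= t else -1

def binary_sigmoid(y_in):
    return 1.0 / (1.0 + math.exp(-y_in))

def bipolar_sigmoid(y_in):
    return (1.0 - math.exp(-y_in)) / (1.0 + math.exp(-y_in))

def ramp(y_in):
    return max(0.0, min(1.0, y_in))

def tanh_activation(y_in):
    return (math.exp(2 * y_in) - 1.0) / (math.exp(2 * y_in) + 1.0)

def relu(x):
    return max(0.0, x)

def softmax(values):
    # Shift by the maximum (assumption: added only for numerical stability).
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

The softmax outputs are all between 0 and 1 and sum to 1, which is why they can be read as class probabilities.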


Terminologies of ANN

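• Weight : Each neuron is connected to another neuron via links (weights). The
weights contain information about the input signal.

• Bias : Bias is an additional input (x0 = 1) with some weight (bj).
Bias has an impact in calculating the net input (yin).
If the bias is positive, the net input increases.
If the bias is negative, the net input decreases.

Net input $y_{in} = x_0 b_j + \sum_{i=1}^{n} x_i w_i$
[Figure: neuron Y1 with bias input X0 = 1 (weight b0) and inputs x1 ... xn (weights w1 ... wn)]

Weight matrices for a network with inputs X1, X2, X3, hidden units Z1, Z2 and output unit Y1:
$V = \begin{bmatrix} V_1 \\ V_2 \\ V_3 \end{bmatrix} = \begin{bmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \\ v_{31} & v_{32} \end{bmatrix}$, $W = \begin{bmatrix} w_{11} \\ w_{21} \end{bmatrix}$
[Figure: the corresponding network, with weights v11 ... v32 into the hidden layer and w11, w21 into the output unit with output y]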
Terminologies of ANN

• Threshold : The threshold is a set value based upon which the final
output of the network is calculated.
The threshold value is used in the activation function.
A comparison is made between the net input and the
threshold to obtain the actual output.
$f(y_{in}) = \begin{cases} 0 & \text{if } y_{in} < \theta \\ 1 & \text{if } y_{in} \ge \theta \end{cases}$

• Learning Rate : It is used to control the amount of weight
adjustment at each step of the training.
The learning rate α ranges from 0 to 1.
It determines the rate of learning at each time step.
McCulloch-Pitts Neuron (MP Neuron Model)

• Warren McCulloch and Walter Pitts introduced the MP neuron model in 1943.
• MP neuron models are used to implement logic functions like AND, OR, NOT, NAND,
NOR etc.
• The MP neuron accepts binary input (0/1).
• The output of the MP neuron model is also binary (0/1).
• Weights can be positive (excitatory) or negative (inhibitory).
• Weight values used are +1 or -1.
• The activation function is the Hard Limiter.
• The threshold value is constant for an MP neuron.
• No bias is used in the MP model.
McCulloch-Pitts Neuron (MP Neuron Model)

• When the net input yin reaches or exceeds the threshold, the
neuron FIRES (y = 1); otherwise the neuron won't
FIRE (y = 0).
$y = f(y_{in}) = \begin{cases} 1 & \text{if } y_{in} \ge \text{Threshold} \\ 0 & \text{if } y_{in} < \text{Threshold} \end{cases}$
[Figure: MP neuron with inputs x1, x2, weights w1, w2, summation unit Σ, threshold T and output y]
• The M-P neuron has no particular training
algorithm.
• An analysis has to be performed to determine the
values of the weights and the threshold.
• Weights of the neuron are set along with the threshold.
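As an illustration of how such an analysis turns out, the sketch below implements an MP neuron and fixes weights and thresholds by hand for AND, OR and NOT. The particular thresholds are one possible choice found by inspection, not values given in the slides.

```python
def mp_neuron(inputs, weights, threshold):
    """Fires (returns 1) when the net input reaches the threshold."""
    y_in = sum(x * w for x, w in zip(inputs, weights))
    return 1 if y_in >= threshold else 0

def AND(x1, x2):
    # Both excitatory inputs must be active: 1 + 1 >= 2
    return mp_neuron([x1, x2], [1, 1], threshold=2)

def OR(x1, x2):
    # One active excitatory input is enough: 1 >= 1
    return mp_neuron([x1, x2], [1, 1], threshold=1)

def NOT(x):
    # A single inhibitory input: 0 >= 0 fires, -1 >= 0 does not
    return mp_neuron([x], [-1], threshold=0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
print(NOT(0), NOT(1))
```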
Perceptron Network

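• Inputs are directly connected to the neurons in the output layer via adjustable weight
values.
• The activation function used at the output layer is a modified SIGN function.
$y = \begin{cases} 1 & \text{if } y_{in} > \theta \\ 0 & \text{if } -\theta \le y_{in} \le \theta \\ -1 & \text{if } y_{in} < -\theta \end{cases}$ where θ is the threshold.
[Figure: perceptron with bias input X0 = 1 (weight b0), inputs x1 ... xn (weights w1 ... wn), summation unit Σ, activation function AF and output y at unit Y1]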
Training Perceptron Network
• Learning is the process of updating the weight values.
• Perceptron can be trained by perceptron Learning rule.
• Updation of weight is done by calculating the ERROR between the
desired output (Target) and the calculated output (actual).
ERROR = Target - Actual
• If ERROR is zero goal has been achieved; otherwise update weight values.
• The perceptron networks are used to classify input pattern as a
‘member’ or ‘not a member’ to a particular class.

• The perceptron algorithm can be used for either binary or bipolar
input vectors, having bipolar targets.
Perceptron Training Algorithm

Step1: Initialise weights, learning parameter α and threshold θ.
Step2: For each training pair, activate inputs.
Step3: Calculate output of the network.
1. Obtain net input (yin)
$y_{in} = b + \sum_{i=1}^{n} x_i w_i$
2. Apply activation function over net input to get output (y)
$y = \begin{cases} 1 & \text{if } y_{in} > \theta \\ 0 & \text{if } -\theta \le y_{in} \le \theta \\ -1 & \text{if } y_{in} < -\theta \end{cases}$
Perceptron Training Algorithm

Step4: Compare the calculated output y and the target output t.
Update weights if necessary.
If y ≠ t, an ERROR exists; update the weights as follows
wi(new) = wi(old) + α xi t
b(new) = b(old) + α t
else there is no need to update the weights
wi(new) = wi(old)
b(new) = b(old)
Step5: If there is no change in weights for all training pairs, stop
training.
Else go to step 2
Perceptron Testing Algorithm

Once the training process is complete, test the performance of the
perceptron network.
Step1 : Initialise weights as the final weights obtained during training.
Step2 : For each input perform the following.
$y_{in} = b + \sum_{i=1}^{n} x_i w_i$
$y = \begin{cases} 1 & \text{if } y_{in} > \theta \text{ (fire)} \\ 0 & \text{if } -\theta \le y_{in} \le \theta \text{ (not fire)} \\ -1 & \text{if } y_{in} < -\theta \text{ (not fire)} \end{cases}$
AND GATE using Perceptron

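Threshold θ = 0, learning rate α = 1, initial weights w1 = 0, w2 = 0, b = 0.
The columns w1, w2 and b show the weights after the pattern has been processed.

Epoch | x1 | x2 | x0 = 1 | target | yin | actual output y | w1 | w2 | b
1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1
1 | 1 | -1 | 1 | -1 | 1 | 1 | 0 | 2 | 0
1 | -1 | 1 | 1 | -1 | 2 | 1 | 1 | 1 | -1
1 | -1 | -1 | 1 | -1 | -3 | -1 | 1 | 1 | -1
2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -1
2 | 1 | -1 | 1 | -1 | -1 | -1 | 1 | 1 | -1
2 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | -1
2 | -1 | -1 | 1 | -1 | -3 | -1 | 1 | 1 | -1

[Figure: final network with bias weight b = -1 on X0 = 1 and weights w1 = 1, w2 = 1 into unit Y1 with output y]

In epoch 2 the weights stay constant for all the input patterns, so we can stop the training.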
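The perceptron training algorithm can be sketched in Python and run on the same bipolar AND data, starting from zero weights with α = 1 and θ = 0 as in the table. The function names and the `max_epochs` safety limit are assumptions added for this illustration.

```python
def activation(y_in, theta):
    """Modified sign function used by the perceptron."""
    if y_in > theta:
        return 1
    if y_in < -theta:
        return -1
    return 0

def train_perceptron(samples, alpha=1.0, theta=0.0, max_epochs=100):
    n = len(samples[0][0])
    w = [0.0] * n   # initial weights
    b = 0.0         # initial bias
    for epoch in range(1, max_epochs + 1):
        changed = False
        for x, t in samples:
            y_in = b + sum(xi * wi for xi, wi in zip(x, w))
            y = activation(y_in, theta)
            if y != t:
                # Perceptron learning rule
                w = [wi + alpha * xi * t for wi, xi in zip(w, x)]
                b += alpha * t
                changed = True
        if not changed:      # no weight change in a full epoch: stop
            return w, b, epoch
    return w, b, max_epochs

# Bipolar AND gate: (inputs, target)
and_samples = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]
w, b, epochs = train_perceptron(and_samples)
print(w, b, epochs)
```

With these settings the run ends with w1 = 1, w2 = 1 and b = -1 after the second epoch, matching the table.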
Backpropagation Networks (BPN)

• A backpropagation network is a multilayer feedforward neural network
containing an input layer, one or more hidden layers and an output
layer.
• This network is also called a multilayer perceptron.
• Neurons present in the hidden layers and output layer have bias inputs.
• The input and output can be either binary or bipolar.
• The activation functions are either binary sigmoid or bipolar
sigmoid.
Backpropagation Networks
Binary Sigmoid: $y = f(y_{in}) = \frac{1}{1 + e^{-y_{in}}}$. Here y lies between 0 and 1.
Bipolar Sigmoid: $y = f(y_{in}) = \frac{1 - e^{-y_{in}}}{1 + e^{-y_{in}}}$. Here y lies between -1 and 1.

The sigmoid activation function is used in BPN because
• It is continuous
• Monotonically non-decreasing
• Differentiable
Backpropagation Networks (BPN): Learning Rule

• The learning algorithm provides a procedure for changing weights.
• The basic concept for this weight update algorithm is the gradient
descent method used in simple perceptron networks with differentiable
units.
• To update weights, the error must be calculated.
• In BPN the error is propagated back to the hidden units.
• The error at the output layer can be calculated easily (i.e., target - actual).
• The calculation of the error at a hidden layer is difficult because the desired (target)
output of the hidden layer is not known.
Backpropagation Networks (BPN): Learning Rule

• When the number of hidden layers is increased, training becomes more
complex.
• The back propagation algorithm works in three phases.
• Feed forward phase
• Backpropagation of errors
• Weight and bias updation.
Backpropagation Networks (BPN): Learning Rule
x - training i/p vector (x1, x2, …, xn)
t - target o/p vector (t1, t2, …, tm)
α - learning rate
Xi - i/p unit i
v0j - bias on jth hidden unit
w0k - bias on kth o/p unit
Zj - jth hidden unit
vij - weight of link from ith i/p unit to jth hidden unit
wjk - weight of link from jth hidden unit to kth output unit
Backpropagation Networks (BPN): Learning Rule
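1. Net i/p to Zj
$z_{inj} = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}$
2. o/p of Zj
$z_j = f(z_{inj}) = \frac{1}{1 + e^{-z_{inj}}}$ or $\frac{1 - e^{-z_{inj}}}{1 + e^{-z_{inj}}}$
3. Yk - kth o/p unit
4. Net i/p to Yk
$y_{ink} = w_{0k} + \sum_{j=1}^{p} z_j w_{jk}$
5. o/p of Yk
$y_k = f(y_{ink}) = \frac{1}{1 + e^{-y_{ink}}}$ or $\frac{1 - e^{-y_{ink}}}{1 + e^{-y_{ink}}}$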
Backpropagation Networks
Training Algorithm
1.Feed forward of input training pattern – calculate the net input of each
neuron.
2. Calculation of back propagation of error – calculate errors for weight
updation
3. Updation of weights – weights updation of all units (including bias)

Testing Algorithm
Computation of feedforward phase only – calculation of output value
Backpropagation Networks : Training Algorithm

1. Initialize weights and learning parameter (to some small value).
2. Perform steps 3-10 while the stopping condition is false.
3. Perform steps 4-9 for each training pair.
Phase 1 Feedforward phase
4. Each input unit receives input signal xi and sends it to the hidden units (i = 1
to n).
5. Each hidden unit Zj (j = 1 to p) calculates its net input.
$z_{inj} = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}$
Backpropagation Networks : Training Algorithm
5. Also calculate the output of the hidden units
$z_j = f(z_{inj}) = \frac{1}{1 + e^{-z_{inj}}}$ or $\frac{1 - e^{-z_{inj}}}{1 + e^{-z_{inj}}}$

6. For each output unit Yk calculate the net input
$y_{ink} = w_{0k} + \sum_{j=1}^{p} z_j w_{jk}$

Also calculate the output of the output units
$y_k = f(y_{ink}) = \frac{1}{1 + e^{-y_{ink}}}$ or $\frac{1 - e^{-y_{ink}}}{1 + e^{-y_{ink}}}$
Backpropagation Networks : Training Algorithm
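Phase II Backpropagating Error
7. Calculate the error at the output layer and send this error back to the hidden layer.
Error at output node k is denoted by δk or Errk.
For the binary sigmoid activation function: $\delta_k = (t_k - y_k)\, y_k (1 - y_k)$
For the bipolar sigmoid activation function: $\delta_k = (t_k - y_k)\, 0.5\,(1 + y_k)(1 - y_k)$
8. Calculate the error at the hidden layer.
Error at the jth hidden layer neuron is denoted by δj or Errj, where $\delta_{inj} = \sum_{k=1}^{m} \delta_k w_{jk}$.
For the binary sigmoid activation function: $\delta_j = z_j (1 - z_j)\, \delta_{inj} = z_j (1 - z_j) \sum_{k=1}^{m} \delta_k w_{jk}$
For the bipolar sigmoid activation function: $\delta_j = 0.5\,(1 + z_j)(1 - z_j) \sum_{k=1}^{m} \delta_k w_{jk}$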
Backpropagation Networks : Training Algorithm
Phase III : Weights and Bias updation
9. For each neuron in the Hidden Layer
vij(new) = vij(old) + α δj xi
v0j(new) = v0j(old) + α δj
For each neuron in the Output Layer
wjk(new) = wjk(old) + α δk zj
w0k(new) = w0k(old) + α δk
10. Check for the stopping condition. The stopping condition may be a
certain number of epochs or calculated output = target output.
Equations (based on the following n/w)
[Figure: network with inputs X1, X2, hidden units Z1, Z2 and output units Y1, Y2; weights v11, v12, v21, v22 and biases v01, v02 into the hidden layer; weights w11, w12, w21, w22 and biases w01, w02 into the output layer; outputs y1, y2]

Net i/p hidden layer
zin1 = v01 + x1v11 + x2v21
zin2 = v02 + x1v12 + x2v22

Output Hidden layer
$z_1 = \frac{1}{1 + e^{-z_{in1}}}$, $z_2 = \frac{1}{1 + e^{-z_{in2}}}$

Net i/p output layer
yin1 = w01 + z1w11 + z2w21
yin2 = w02 + z1w12 + z2w22

Output of output layer
$y_1 = \frac{1}{1 + e^{-y_{in1}}}$, $y_2 = \frac{1}{1 + e^{-y_{in2}}}$

Error output layer
δ1 = (t1 - y1) y1 (1 - y1)
δ2 = (t2 - y2) y2 (1 - y2)

Error hidden layer
δz1 = z1(1 - z1) [δ1w11 + δ2w12]
δz2 = z2(1 - z2) [δ1w21 + δ2w22]
Backpropagation Networks : Training Algorithm
Phase III :Weights and Bias updation
9. For neurons in the Hidden Layer
v11 (new)= v11 (old)+αδz1x1
v12 (new)= v12 (old)+αδz2x1
v21 (new)= v21 (old)+αδz1x2
v22 (new)= v22 (old)+αδz2x2
v01(new) = v01(old) +αδz1
v02(new) = v02(old) +αδz2
For neurons in the Output Layer
w11 (new) = w11 (old)+αδ1z1
w12 (new) = w12 (old)+αδ2z1
w21 (new) = w21 (old)+αδ1z2
w22 (new) = w22 (old)+αδ2z2
w01(new) = w01(old) +αδ1
w02(new) = w02(old) +αδ2
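Use BPN to find the new weights for the following network.
The input pattern is [0, 1], the target output is 1 and the learning rate is 0.25. Use the binary
sigmoid activation function.

Given
[x1 x2] = [0 1], t = 1, α = 0.25
$\begin{bmatrix} v_{01} & v_{02} \\ v_{11} & v_{12} \\ v_{21} & v_{22} \end{bmatrix} = \begin{bmatrix} 0.3 & 0.5 \\ 0.6 & -0.3 \\ -0.1 & 0.4 \end{bmatrix}$, $\begin{bmatrix} w_{01} \\ w_{11} \\ w_{21} \end{bmatrix} = \begin{bmatrix} -0.2 \\ 0.4 \\ 0.1 \end{bmatrix}$
[Figure: network with inputs X1, X2, hidden units Z1, Z2 and output unit Y1 (output y), labelled with the weights and biases given above]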
Phase 1 Forward Phase

Net i/p hidden layer
zin1 = v01 + x1v11 + x2v21 = 0.3 + 0 * 0.6 + 1 * (-0.1) = 0.2
zin2 = v02 + x1v12 + x2v22 = 0.5 + 0 * (-0.3) + 1 * 0.4 = 0.9

Output Hidden layer
$z_1 = \frac{1}{1 + e^{-z_{in1}}} = \frac{1}{1 + e^{-0.2}} = 0.5498$
$z_2 = \frac{1}{1 + e^{-z_{in2}}} = \frac{1}{1 + e^{-0.9}} = 0.711$

Net i/p output layer
yin1 = w01 + z1w11 + z2w21 = -0.2 + 0.5498 * 0.4 + 0.711 * 0.1 = 0.0910

Output of output layer
$y_1 = \frac{1}{1 + e^{-y_{in1}}} = \frac{1}{1 + e^{-0.0910}} = 0.5227$
Phase 2. Error Calculation
(y1 = 0.5227, z1 = 0.5498, z2 = 0.711)

Error output layer
δ1 = (t1 - y1) y1 (1 - y1)
= (1 - 0.5227) * 0.5227 * (1 - 0.5227) = 0.11908

Error hidden layer
δz1 = z1(1 - z1) [δ1w11] = 0.5498 (1 - 0.5498) [0.11908 * 0.4] = 0.01178
δz2 = z2(1 - z2) [δ1w21] = 0.711 (1 - 0.711) [0.11908 * 0.1] = 0.00245
Phase 3. Weight Updation
(y1 = 0.5227, z1 = 0.5498, z2 = 0.711, δz1 = 0.01178, δz2 = 0.00245, δ1 = 0.11908)

For neurons in the Hidden Layer
v11(new) = v11(old) + α δz1 x1 = 0.6 + 0.25 * 0.01178 * 0 = 0.6
v12(new) = v12(old) + α δz2 x1 = -0.3 + 0.25 * 0.00245 * 0 = -0.3
v21(new) = v21(old) + α δz1 x2 = -0.1 + 0.25 * 0.01178 * 1 = -0.09706
v22(new) = v22(old) + α δz2 x2 = 0.4 + 0.25 * 0.00245 * 1 = 0.4006125
v01(new) = v01(old) + α δz1 = 0.3 + 0.25 * 0.01178 = 0.302945
v02(new) = v02(old) + α δz2 = 0.5 + 0.25 * 0.00245 = 0.5006125

For neurons in the Output Layer
w11(new) = w11(old) + α δ1 z1 = 0.4 + 0.25 * 0.11908 * 0.5498 = 0.41637
w21(new) = w21(old) + α δ1 z2 = 0.1 + 0.25 * 0.11908 * 0.711 = 0.12117
w01(new) = w01(old) + α δ1 = -0.2 + 0.25 * 0.11908 = -0.17023
Updated BPN after Epoch 1.
v01 = 0.302945, v02 = 0.5006125
v11 = 0.6, v12 = -0.3
v21 = -0.09706, v22 = 0.4006125
w01 = -0.17023, w11 = 0.41637, w21 = 0.12117
[Figure: the network redrawn with the updated weights]
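The three phases of this worked example can be checked with a short script. The sketch below assumes one hidden layer, a single output unit and the binary sigmoid; the function and variable names are chosen only for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bpn_step(x, t, v, v0, w, w0, alpha):
    """One training step. v[i][j]: input i -> hidden j, v0[j]: hidden biases,
    w[j]: hidden j -> output, w0: output bias."""
    n, p = len(x), len(v0)
    # Phase 1: feed forward
    z = [sigmoid(v0[j] + sum(x[i] * v[i][j] for i in range(n))) for j in range(p)]
    y = sigmoid(w0 + sum(z[j] * w[j] for j in range(p)))
    # Phase 2: backpropagation of error (binary sigmoid derivative)
    delta_k = (t - y) * y * (1 - y)
    delta_j = [z[j] * (1 - z[j]) * delta_k * w[j] for j in range(p)]
    # Phase 3: weight and bias update
    new_v = [[v[i][j] + alpha * delta_j[j] * x[i] for j in range(p)] for i in range(n)]
    new_v0 = [v0[j] + alpha * delta_j[j] for j in range(p)]
    new_w = [w[j] + alpha * delta_k * z[j] for j in range(p)]
    new_w0 = w0 + alpha * delta_k
    return y, new_v, new_v0, new_w, new_w0

y, v, v0, w, w0 = bpn_step(
    x=[0, 1], t=1,
    v=[[0.6, -0.3], [-0.1, 0.4]], v0=[0.3, 0.5],
    w=[0.4, 0.1], w0=-0.2, alpha=0.25,
)
print(round(y, 4), v, v0, w, round(w0, 5))
```

The script keeps full floating-point precision, so its results can differ from the hand-rounded values above in the last displayed digit.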
Back propagation

• The goal of training is to minimize the cost function.

• The back propagation algorithm propagates the gradients back
through the network; these are then used to
adjust the weights and biases so that the solution moves in
the direction that reduces the cost function.
Loss function and cost function

• During training, we predict the output of a model for different inputs
and compare the predicted output with the actual output in our training
set.
• The difference between the actual and predicted output is termed the loss
for that input.
• The sum of squares of the losses across all inputs is termed the
cost function.
• The selection of loss and cost functions depends on the kind of output
we are targeting.
Eg: For classification we use the cross-entropy cost function.
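A small sketch of the two cost functions mentioned above follows. The sum-of-squares cost follows the definition in the bullets. The binary cross-entropy formula is the standard one and is not spelled out in the slides, and the target and prediction values are hypothetical.

```python
import math

def squared_error_cost(targets, predictions):
    """Sum of squared losses (target - prediction) over all inputs."""
    return sum((t - y) ** 2 for t, y in zip(targets, predictions))

def binary_cross_entropy_cost(targets, predictions):
    """Standard binary cross-entropy; predictions must lie strictly in (0, 1)."""
    return -sum(t * math.log(y) + (1 - t) * math.log(1 - y)
                for t, y in zip(targets, predictions))

# Hypothetical targets and model outputs, for illustration only.
targets = [1, 0, 1]
predictions = [0.9, 0.2, 0.6]

sq_cost = squared_error_cost(targets, predictions)
ce_cost = binary_cross_entropy_cost(targets, predictions)
print(sq_cost, ce_cost)
```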
Gradient Descent

• The goal of all supervised machine learning algorithms is to
best estimate a target function (f) that maps input data (x) onto
output variables (y).
(This describes all classification and regression problems.)
• Machine learning algorithms require a process of optimization to
find the set of coefficients that result in the best estimate of the
target function.
• The gradient descent method can be used to optimize the coefficients.
• Gradient descent is best used when the parameters cannot be
calculated analytically (e.g. using linear algebra) and must be
searched for by an optimization algorithm.
Gradient Descent

• Gradient descent is a fundamental optimization algorithm used
to minimize the cost or loss function of a neural network or any
other model.
• The goal of gradient descent is to find the parameters (weights
and biases) of the model that minimize the error between the
predicted output and the actual target values.
Gradient Descent

• Gradient descent is one of the most widely used ways to find a local
minimum.
• By changing the weights, we move towards the minimum value
of the error function.
• The weights are changed by taking steps in the negative direction of
the function gradient (derivative).
Gradient Descent Algorithm

• GD is an optimization algorithm to find the minimum of a function.
• Start at a random point on the function and move in the negative direction
of the gradient of the function to reach a local/global minimum.
• The gradient descent algorithm iteratively calculates the next point
using the gradient at the current position: it scales the gradient (by a
learning rate) and subtracts the obtained value from the current
position (makes a step).
• This process can be written as:
$w_{i+1} = w_i - \alpha \left. \frac{\partial y}{\partial w} \right|_{w = w_i}$
where y is the function being minimized.
Gradient Descent Algorithm

• There's an important parameter α which scales the gradient and
thus controls the step size. α is called the learning rate.
• The smaller the learning rate, the longer GD takes to converge; it may
reach the maximum number of iterations before reaching the optimum point.
• If the learning rate is too big, the algorithm may not converge to the
optimal point (it jumps around) or may even diverge completely.
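The effect of the learning rate can be seen on a simple example. The sketch below minimizes the hypothetical function y = w^2 (gradient 2w, minimum at w = 0) with three different learning rates. The function, the starting point and the α values are assumptions chosen for illustration.

```python
def gradient_descent(grad, w0, alpha, steps):
    """Repeatedly apply w <- w - alpha * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

def grad(w):
    return 2 * w   # derivative of y = w**2

results = {alpha: gradient_descent(grad, w0=5.0, alpha=alpha, steps=50)
           for alpha in (0.01, 0.1, 1.1)}

for alpha, w in results.items():
    print(alpha, w)
```

After 50 steps the smallest rate is still far from the minimum, the middle rate is very close to it, and the largest rate has moved further away than the starting point.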
Gradient Descent Procedure

• The goal is to continue to try different values for the coefficients,
evaluate their cost and select new coefficients that have a slightly
better (lower) cost.
• Repeating this process enough times will lead to the values of the
coefficients that result in the minimum cost.
Gradient Descent Procedure

1. Initialization: Initially, the model's parameters (weights and biases) are
set to random or small values.

2. Forward Pass: For a given set of input data, the model computes an output
using the current parameter values. This output is compared to the target
values using a cost or loss function, which quantifies how far off the
prediction is from the truth.

3. Backward Pass (Backpropagation): Gradient descent gets its name
from the way it updates the model parameters. It calculates the gradient
(partial derivative) of the loss function with respect to each parameter. This
gradient tells us how much the loss would change if we made small
adjustments to each parameter.
Gradient Descent Procedure

4. Update Parameters: The parameters are updated in the opposite
direction of the gradient to minimize the loss function.
The learning rate, which is a hyperparameter, determines the size of
each step taken during this update. A smaller learning rate results in
smaller steps, which can help the algorithm converge more accurately
but may take longer to converge. A larger learning rate can lead to
faster convergence but may risk overshooting the optimal parameter
values.

5. Repeat: Steps 2-4 are repeated for a specified number of iterations
(epochs) or until the loss converges to a satisfactory level.
Gradient Descent Procedure

Steps
1. Initialize the values for the coefficient or coefficients for the function. These could be 0.0
or a small random value.
coefficient = 0.0
2. The cost of the coefficients is evaluated by plugging them into the function and
calculating the cost.
cost = f(coefficient)
3. The derivative of the cost is calculated.
(The derivative refers to the slope of the function at a given point)
delta = derivative(cost)
(We need to know the slope so that we know the direction (sign) to move the
coefficient values in order to get a lower cost on the next iteration)
Gradient Descent Procedure

4. Update the coefficient values. A learning rate parameter (alpha) must be
specified that controls how much the coefficients can change on each
update.
coefficient = coefficient - (alpha * delta)

5. This process is repeated until the cost of the coefficients (cost) is 0.0 or
close enough to zero to be good enough.
Gradient Descent Algorithm

• The steps of the gradient descent method are:
1. choose a starting point (initialization)
2. calculate the gradient at this point
3. make a scaled step in the opposite (negative) direction to the
gradient (objective: minimize)
4. repeat steps 2 and 3 until one of the criteria is met:
• maximum number of iterations reached
• step size is smaller than the tolerance.
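The four steps, including both stopping criteria, can be sketched as follows. The example function y = (w - 3)^2, the default tolerance and the parameter names are assumptions made for this illustration.

```python
def gradient_descent_with_stopping(grad, start, alpha, max_iter=1000, tol=1e-6):
    """Returns (final point, number of iterations used)."""
    w = start                           # 1. starting point
    for iteration in range(1, max_iter + 1):
        step = alpha * grad(w)          # 2. gradient at this point, scaled
        if abs(step) < tol:             # 4. step size below the tolerance
            return w, iteration
        w = w - step                    # 3. step against the gradient
    return w, max_iter                  # 4. maximum number of iterations reached

def grad(w):
    return 2 * (w - 3)   # derivative of y = (w - 3)**2, minimum at w = 3

w_min, iterations = gradient_descent_with_stopping(grad, start=0.0, alpha=0.1)
print(w_min, iterations)
```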
Local vs. Global Minimum
• The neural network might give different results with
different starting weights.
• The algorithm converges to a local minimum, which is not
necessarily the global minimum.
• There can be many local minima, which means
there can be many solutions to a neural network
problem.
• We need to perform validation checks before
choosing the final model.
Gradient Descent Algorithm

• The gradient descent algorithm is not guaranteed to work for all functions.
To be sure of reaching the global minimum, a function has to be:
• differentiable
• convex
• Differentiable : If a function is differentiable, it has a derivative
at each point in its domain.
Gradient Descent Algorithm

The function has to be convex.
• For a univariate function, this means that the line segment
connecting any two points of the function lies on or above its curve (it
does not cross it).
• If the segment crosses the curve, the function is not convex and may have
a local minimum which is not a global one.
Gradient Descent Algorithm
Gradient Descent Algorithm

• To check mathematically whether a twice-differentiable univariate function is convex,
calculate the second derivative and check that its value is never
negative (if it is always bigger than 0, the function is strictly convex).
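As a worked illustration of this second-derivative test, consider two functions that are chosen here only as examples and do not appear in the slides:

```latex
% Convex example
f(x) = x^2 - x + 3, \qquad f'(x) = 2x - 1, \qquad f''(x) = 2 > 0 \ \text{for all } x
% so f is (strictly) convex and its only minimum, at x = 1/2, is global.

% Non-convex example
g(x) = x^4 - 2x^3 + 2, \qquad g'(x) = 4x^3 - 6x^2, \qquad g''(x) = 12x^2 - 12x = 12x(x - 1)
% g''(x) < 0 for 0 < x < 1, so g is not convex.
```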
Gradient Descent
1.Batch Gradient Descent:
• Batch gradient descent, also called vanilla gradient descent,
calculates the error for each example within the training dataset,
but only after all training examples have been evaluated does
the model get updated.
• This whole process is like a cycle and it's called a training
epoch.
• But if the number of training examples is large, then batch
gradient descent is computationally very expensive and is not
preferred.
Gradient Descent
2. Stochastic Gradient Descent:
• Updates the parameters for each training example, one by one.
• The parameters are updated after every iteration, in
which only a single example has been processed.
• The frequent updates give a fairly detailed picture of the rate of
improvement.
• Each update is much cheaper than in batch gradient descent, so progress can be quite a bit faster.
• However, when the number of training examples is large, the frequent updates cause additional
overhead for the system.
Gradient Descent
2. Stochastic Gradient Descent algorithm
Initialize w_1
for k = 1 to K do
    Sample an observation i uniformly at random
    Update w_{k+1} = w_k − α∇f_i(w_k)
end for
Return w_{K+1}
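A runnable version of this loop might look like the sketch below. It fits a one-parameter model y = w * x with the per-example loss f_i(w) = 0.5 * (w * x_i - y_i)^2. The data set, the loss, the learning rate and the fixed random seed are all assumptions made for illustration.

```python
import random

def sgd(grad_i, n, w1, alpha, K, seed=0):
    """Stochastic gradient descent over n observations for K iterations."""
    rng = random.Random(seed)       # fixed seed so the run is repeatable
    w = w1
    for _ in range(K):
        i = rng.randrange(n)        # sample an observation uniformly at random
        w = w - alpha * grad_i(w, i)
    return w

# Hypothetical data generated by y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def grad_i(w, i):
    # Gradient of f_i(w) = 0.5 * (w * x_i - y_i)**2 with respect to w
    return (w * xs[i] - ys[i]) * xs[i]

w_final = sgd(grad_i, n=len(xs), w1=0.0, alpha=0.05, K=200)
print(w_final)
```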
Gradient Descent
3. Mini Batch gradient descent:
This type of gradient descent often works faster in practice than both
batch gradient descent and stochastic gradient descent.
Mini-batch gradient descent is the go-to method since it
combines the concepts of SGD and batch gradient descent.
It simply splits the training dataset into small batches and
performs an update for each of those batches.
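The sketch below shows one way to split a data set into mini-batches and update once per batch. The model y = w * x, the data, the batch size of 2 and the per-epoch shuffling are assumptions made for illustration.

```python
import random

def minibatch_gd(xs, ys, w, alpha, batch_size, epochs, seed=0):
    """Mini-batch gradient descent for the one-parameter model y = w * x."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                       # new batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of 0.5 * (w * x - y)**2 over the batch
            grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w = w - alpha * grad                   # one update per batch
    return w

# Hypothetical data generated by y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_final = minibatch_gd(xs, ys, w=0.0, alpha=0.05, batch_size=2, epochs=50)
print(w_final)
```

Setting `batch_size` to 1 gives stochastic gradient descent, and setting it to the size of the data set gives batch gradient descent.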
Sigmoid Neuron
• A sigmoid neuron is a type of artificial neuron that was commonly used in the
early days of neural network research.

• It is named after the sigmoid function, which is an S-shaped curve, and it's also
known as the logistic sigmoid or logistic function.

• In sigmoid neurons, a small change in the weights and bias causes only a small
change in the output.

• Output function (sigmoid activation function)is much smoother than the step
function.
Sigmoid Neuron

• The output y is not binary but a real value between 0 and 1 which can
be interpreted as a probability
Sigmoid Neuron
• Sigmoid neurons have some limitations and drawbacks, which have led to
their decreased use in deep learning in favour of other activation functions
like ReLU (Rectified Linear Unit). These limitations include:
1. Vanishing Gradient Problem: Sigmoid neurons suffer from the
vanishing gradient problem, especially in deep networks. When gradients
become very small during backpropagation, weight updates can become
insignificant, which can slow down or even halt the learning process in
deep networks.

2. Output Centering: The output of the sigmoid function is not centered
around zero, which can lead to slower convergence when training neural
networks.
Sigmoid Neuron
3. Saturation: Sigmoid neurons saturate when the input is far from zero,
causing the gradient to be close to zero. This slows down the learning
process because weight updates become small.

• However, sigmoid neurons are still used in specific cases, such as the
output layer of binary classification models, where the sigmoid function's
output range (0, 1) is desirable for modeling probabilities.
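Saturation and the vanishing gradient can be made concrete with the sigmoid derivative f'(x) = f(x)(1 - f(x)), which is at most 0.25. The sketch below looks only at this activation-derivative factor and ignores the weights; the depth of 10 layers is a hypothetical choice.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_derivative(0.0)        # largest possible value of the derivative
saturated = sigmoid_derivative(10.0)  # far from zero the derivative is tiny

# Upper bound on the product of sigmoid derivatives across 10 layers
# (hypothetical depth; weights are ignored in this sketch).
layers = 10
bound = peak ** layers

print(peak, saturated, bound)
```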
