
Machine Learning Techniques
KCS 055
Artificial Neural Network (ANN)
• Inspired by the information processing model of the human brain.
• The human brain consists of billions of neurons that link with each other.
• Every neuron receives information from other neurons.
Artificial Neural Network (ANN)
• ANNs are computational algorithms.
• They simulate the biological behavior of the nervous system of the human brain.
• They are modeled on the neuron connection pattern of the human brain.
• Used in Deep Learning for classification.
Artificial Neural Network (ANN)
Applications of ANN

• Stock Price Prediction
• Character Recognition
• Fingerprint Recognition
• Classification problems, e.g. Loan Application Approval
• Autonomous Vehicle Driving Using ANN
• Classification and Regression Tasks
Basic Terminology in ANN

• Artificial Neurons: Interconnected Nodes in ANN.


• Interconnections: Several processing units are interconnected to each other. In the biological brain, these interconnections are called synapses. The general model of a processing unit consists of a summing part with N inputs, an activation function, and an output.
• Processing Unit: Consists of several units which are interconnected to each other.
• Weight Update: To learn a target function, the weight on each link of the neural network is updated repeatedly until the target function is obtained.
Basic Terminology in ANN

• Activation Function: The function which decides the output after computing the weighted sum of inputs.
• Input Layer: Receives the initial data for the neural network.
• Hidden Layer: Intermediate layer between the input and output layers. All intermediate computations take place in the hidden layer.
• Output Layer: Produces the output result from the given inputs.
Model of Artificial Neurons

• McCulloch–Pitts Model
• Perceptron Model
• ADALINE Model
Perceptron
• Basic unit used to build an ANN.
• Takes real-valued inputs.
• Calculates a linear combination of these inputs and generates an output.
• If the result > threshold, output = 1; otherwise, output = 0.
Perceptron Training Rule

Linear Combination:
Σ wi·xi = w0 + w1x1 + w2x2 + w3x3 + w4x4 + … + wnxn
Perceptron Training Rule
o = actual output
t = target output
If actual = target ➔ the weights are left unchanged.
Otherwise, the weights need to be changed:
wi = wi + Δwi
Δwi = n(t − o)xi
where,
n = learning rate
t = target output
o = actual output
xi = input associated with the weight wi
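The following is a minimal Python sketch of the perceptron training rule described above; the function names and the fixed-threshold, no-bias setup are illustrative assumptions, not from the slides.

```python
# Minimal sketch of the perceptron training rule: wi = wi + n*(t - o)*xi.
def perceptron_output(weights, inputs, threshold):
    """Return 1 if the weighted sum exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

def train_perceptron(samples, weights, threshold, lr, epochs=10):
    """samples: list of (inputs, target) pairs. Updates only when o != t."""
    for _ in range(epochs):
        for inputs, target in samples:
            o = perceptron_output(weights, inputs, threshold)
            if o != target:
                weights = [w + lr * (target - o) * x
                           for w, x in zip(weights, inputs)]
    return weights
```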
Designing AND gate using Perceptron
Training Rule
w1 = 1.2, w2 =0.6, Threshold = 1 and Learning Rate n = 0.5
A B A^B
0 0 0
0 1 0
1 0 0
1 1 1

• Example 1: A = 0, B = 0 and Target = 0
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 0*1.2 + 0*0.6 = 0
• This is not greater than the threshold of 1, so the output = 0
Designing AND gate using Perceptron
Training Rule
w1 = 1.2, w2 =0.6, Threshold = 1 and Learning Rate n = 0.5
• Example 2: A = 0, B = 1 and Target = 0
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 0*1.2 + 1*0.6 = 0.6
• This is not greater than the threshold of 1, so the output = 0
Designing AND gate using Perceptron
Training Rule
w1 = 1.2, w2 =0.6, Threshold = 1 and Learning Rate n = 0.5
• Example 3: A = 1, B = 0 and Target = 0
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 1*1.2 + 0*0.6 = 1.2
• This is greater than the threshold of 1, so the output = 1
• Actual output (o) ≠ Target output (t)
Designing AND gate using Perceptron
Training Rule
w1 = 1.2, w2 =0.6, Threshold = 1 and Learning Rate n = 0.5
• Example 3: A = 1, B = 0 and Target = 0
• Actual output (o) ≠ Target output (t)
• wi = wi + Δwi = wi + n(t − o)xi
• w1 = 1.2 + 0.5 * (0 − 1) * 1 = 1.2 + (−0.5) = 0.7
• w2 = 0.6 + 0.5 * (0 − 1) * 0 = 0.6 + 0 = 0.6
Designing AND gate using Perceptron
Training Rule
w1 = 0.7, w2 =0.6, Threshold = 1 and Learning Rate n = 0.5
• Example 1: A = 0, B = 0 and Target = 0
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 0*0.7 + 0*0.6 = 0
• This is not greater than the threshold of 1, so the output = 0
Designing AND gate using Perceptron
Training Rule
w1 = 0.7, w2 =0.6, Threshold = 1 and Learning Rate n = 0.5
• Example 2: A = 0, B = 1 and Target = 0
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 0*0.7 + 1*0.6 = 0.6
• This is not greater than the threshold of 1, so the output = 0
Designing AND gate using Perceptron
Training Rule
w1 = 0.7, w2 =0.6, Threshold = 1 and Learning Rate n = 0.5
• Example 3: A = 1, B = 0 and Target = 0
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 1*0.7 + 0*0.6 = 0.7
• This is not greater than the threshold of 1, so the output = 0
Designing AND gate using Perceptron
Training Rule
w1 = 0.7, w2 =0.6, Threshold = 1 and Learning Rate n = 0.5
• Example 4: A = 1, B = 1 and Target = 1
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 1*0.7 + 1*0.6 = 1.3
• This is greater than the threshold of 1, so the output = 1
Designing AND gate using Perceptron
Training Rule
• Hence, the final weights to design the logical AND gate using the Perceptron Model are:
• w1 = 0.7
• w2 = 0.6

[Perceptron diagram: inputs x1 (w1 = 0.7) and x2 (w2 = 0.6) feed the summing unit Σ wi·xi, followed by the activation f, producing output O]
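As a quick illustrative check (not part of the slides), the final weights can be verified against the AND truth table with threshold = 1:

```python
# Verify the trained AND-gate perceptron: w1 = 0.7, w2 = 0.6, threshold = 1.
w1, w2, threshold = 0.7, 0.6, 1.0
for a, b, target in [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]:
    weighted_sum = w1 * a + w2 * b
    output = 1 if weighted_sum > threshold else 0
    print(a, b, "->", output, "target:", target)
```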
Designing OR gate using Perceptron
Training Rule
w1 = 0.6, w2 =0.6, Threshold = 1 and Learning Rate n = 0.5
A B A|B
0 0 0
0 1 1
1 0 1
1 1 1

• Example 1: A = 0, B = 0 and Target = 0
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 0*0.6 + 0*0.6 = 0
• This is not greater than the threshold of 1, so the output = 0
Designing OR gate using Perceptron
Training Rule
w1 = 0.6, w2 =0.6, Threshold = 1 and Learning Rate n = 0.5
• Example 2: A = 0, B = 1 and Target = 1
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 0*0.6 + 1*0.6 = 0.6
• This is not greater than the threshold of 1, so the output = 0
• Actual output (o) ≠ Target output (t)
Designing OR gate using Perceptron
Training Rule
w1 = 0.6, w2 =0.6, Threshold = 1 and Learning Rate n = 0.5
• Example 2: A = 0, B = 1 and Target = 1
• Actual output (o) ≠ Target output (t)
• wi = wi + Δwi = wi + n(t − o)xi
• w1 = 0.6 + 0.5 * (1 − 0) * 0 = 0.6 + 0 = 0.6
• w2 = 0.6 + 0.5 * (1 − 0) * 1 = 0.6 + 0.5 = 1.1
Designing OR gate using Perceptron
Training Rule
w1 = 0.6, w2 =1.1, Threshold = 1 and Learning Rate n = 0.5
• Example 1: A = 0, B = 0 and Target = 0
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 0*0.6 + 0*1.1 = 0
• This is not greater than the threshold of 1, so the output = 0
• Actual output (o) = Target output (t)
Designing OR gate using Perceptron
Training Rule
w1 = 0.6, w2 =1.1, Threshold = 1 and Learning Rate n = 0.5
• Example 2: A = 0, B = 1 and Target = 1
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 0*0.6 + 1*1.1 = 1.1
• This is greater than the threshold of 1, so the output = 1
• Actual output (o) = Target output (t)
Designing OR gate using Perceptron
Training Rule
w1 = 0.6, w2 =1.1, Threshold = 1 and Learning Rate n = 0.5
• Example 3: A = 1, B = 0 and Target = 1
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 1*0.6 + 0*1.1 = 0.6
• This is not greater than the threshold of 1, so the output = 0
• Actual output (o) ≠ Target output (t)
Designing OR gate using Perceptron
Training Rule
w1 = 0.6, w2 =1.1, Threshold = 1 and Learning Rate n = 0.5
• Example 3: A = 1, B = 0 and Target = 1
• Actual output (o) ≠ Target output (t)
• wi = wi + Δwi = wi + n(t − o)xi
• w1 = 0.6 + 0.5 * (1 − 0) * 1 = 0.6 + 0.5 = 1.1
• w2 = 1.1 + 0.5 * (1 − 0) * 0 = 1.1 + 0 = 1.1
Designing OR gate using Perceptron
Training Rule
w1 = 1.1, w2 =1.1, Threshold = 1 and Learning Rate n = 0.5
• Example 1: A = 0, B = 0 and Target = 0
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 0*1.1 + 0*1.1 = 0
• This is not greater than the threshold of 1, so the output = 0
• Actual output (o) = Target output (t)
Designing OR gate using Perceptron
Training Rule
w1 = 1.1, w2 = 1.1, Threshold = 1 and Learning Rate n = 0.5

• Example 2: A = 0, B = 1 and Target = 1
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 0*1.1 + 1*1.1 = 1.1
• This is greater than the threshold of 1, so the output = 1
• Actual output (o) = Target output (t)
Designing OR gate using Perceptron
Training Rule
w1 = 1.1, w2 = 1.1, Threshold = 1 and Learning Rate n = 0.5

• Example 3: A = 1, B = 0 and Target = 1
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 1*1.1 + 0*1.1 = 1.1
• This is greater than the threshold of 1, so the output = 1
• Actual output (o) = Target output (t)
Designing OR gate using Perceptron
Training Rule
w1 = 1.1, w2 =1.1, Threshold = 1 and Learning Rate n = 0.5
• Example 4: A = 1, B = 1 and Target = 1
• Σ wi·xi = w1x1 + w2x2
• Σ wi·xi = 1*1.1 + 1*1.1 = 2.2
• This is greater than the threshold of 1, so the output = 1
• Actual output (o) = Target output (t)
Designing OR gate using Perceptron
Training Rule
• Hence, the final weights to design the logical OR gate using the Perceptron Model are:
• w1 = 1.1
• w2 = 1.1

[Perceptron diagram: inputs x1 (w1 = 1.1) and x2 (w2 = 1.1) feed the summing unit Σ wi·xi, followed by the activation f, producing output O]
Delta Rule

• The perceptron rule is used when the training examples are linearly separable.
• But when the training examples are not linearly separable, the perceptron rule fails to converge to the target concept.
• The Delta Rule is used in ANNs when the training examples are not linearly separable.
Main Idea Of Delta rule

• Uses the gradient descent rule to find the best weights.
Gradient Descent Rule

• How to modify the weights?

wi = wi + Δwi
Δwi = −η ∇ε(w)
∇ε(w) → derivative of the error w.r.t. the weights
         » Also called the gradient
Derivation of Gradient Descent Rule

∇ε(w) = [ ∂ε/∂w0, ∂ε/∂w1, ∂ε/∂w2, ∂ε/∂w3, … , ∂ε/∂wn ]

∂ε/∂wi = ∂/∂wi [ (1/2) Σd∈D (td − od)² ]

∂ε/∂wi = (1/2) × ∂/∂wi Σd∈D (td − od)²

∂ε/∂wi = (1/2) × 2 Σd∈D (td − od) × ∂/∂wi (td − od)
Derivation of Gradient Descent Rule

∂ε/∂wi = (1/2) × 2 × Σd∈D (td − od) × ∂/∂wi (td − od)

∂ε/∂wi = Σd∈D (td − od) × ∂/∂wi (td − od)

∂ε/∂wi = Σd∈D (td − od) × ∂/∂wi (td − w · xd)      [since od = w · xd for a linear unit]

∂ε/∂wi = Σd∈D (td − od) (0 − xid)
Derivation of Gradient Descent Rule

∂ε/∂wi = Σd∈D (td − od) (0 − xid)

∂ε/∂wi = Σd∈D (td − od) (−xid)

Therefore, Δwi = −η ∇ε(w)
Δwi = −η Σd∈D (td − od) (−xid)
Δwi = η Σd∈D (td − od) xid
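Below is a minimal Python/NumPy sketch of the batch gradient-descent (delta rule) update just derived; the function name and the small random toy data are illustrative assumptions.

```python
# Delta rule / gradient descent for a single linear unit: Δwi = η Σd (td - od) xid
import numpy as np

def delta_rule_epoch(X, t, w, lr):
    """One batch gradient-descent step.

    X: (num_examples, num_inputs) array, t: target outputs, w: weights."""
    o = X @ w                   # linear unit outputs od = w · xd
    grad_term = (t - o) @ X     # Σd (td - od) * xid for every weight i
    return w + lr * grad_term

# Illustrative usage on random data (not from the slides).
rng = np.random.default_rng(0)
X = rng.random((4, 3))
t = rng.random(4)
w = np.zeros(3)
for _ in range(100):
    w = delta_rule_epoch(X, t, w, lr=0.1)
```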
Backpropagation Algorithm
• Backward propagation of errors.
• When an error occurs, we go in the backward direction: Output Layer → Hidden Layer → Input Layer.
Example
Part 1: Forward Pass
1) Calculate h1 (in and out)
• h1(in) = w1·i1 + w2·i2 + b1
• h1(in) = 0.15*0.05 + 0.20*0.10 + 0.35
• h1(in) = 0.3775

Example (Forward Pass)
• h1(out) = 1 / (1 + e^(−h1(in)))
• h1(out) = 1 / (1 + e^(−0.3775))
• h1(out) = 0.5932
Example
(Forward Pass)
2) Calculate h2 (in and out)
• h2(in) = w3·i1 + w4·i2 + b1
• h2(in) = 0.25*0.05 + 0.30*0.10 + 0.35
• h2(in) = 0.3925

Example (Forward Pass)
• h2(out) = 1 / (1 + e^(−h2(in)))
• h2(out) = 1 / (1 + e^(−0.3925))
• h2(out) = 0.5968
Example
(Forward Pass)
3) Calculate o1 (in and out)
• o1(in) = w5·h1(out) + w6·h2(out) + b2
• o1(in) = 0.40*0.593 + 0.45*0.596 + 0.60
• o1(in) = 1.105

Example (Forward Pass)
• o1(out) = 1 / (1 + e^(−o1(in)))
• o1(out) = 1 / (1 + e^(−1.105))
• o1(out) = 0.7513
Example
(Forward Pass)
4) Calculate o2 (in and out)
• o2(in) = w7·h1(out) + w8·h2(out) + b2
• o2(in) = 0.5932*0.5 + 0.5968*0.55 + 0.60
• o2(in) = 1.22484

Example (Forward Pass)
• o2(out) = 1 / (1 + e^(−o2(in)))
• o2(out) = 1 / (1 + e^(−1.22484))
• o2(out) = 0.7729
Example
(Forward Pass)
5) Calculate Ɛtotal
• Ɛtotal = (1/2) Σ (t − o)²
• Ɛtotal = (1/2)(0.01 − 0.7513)² + (1/2)(0.99 − 0.7729)²
• Ɛtotal = 0.29837 (approx.)
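The forward pass above can be reproduced with a short Python sketch; the inputs, weights, biases and targets (i1 = 0.05, i2 = 0.10, w1…w8, b1 = 0.35, b2 = 0.60, targets 0.01 and 0.99) are the values quoted in the slides and the linked Mazur example.

```python
# Forward pass of the 2-2-2 network from the worked example.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

h1 = sigmoid(w1 * i1 + w2 * i2 + b1)                  # ≈ 0.5933
h2 = sigmoid(w3 * i1 + w4 * i2 + b1)                  # ≈ 0.5969
o1 = sigmoid(w5 * h1 + w6 * h2 + b2)                  # ≈ 0.7514
o2 = sigmoid(w7 * h1 + w8 * h2 + b2)                  # ≈ 0.7729
error = 0.5 * (t1 - o1) ** 2 + 0.5 * (t2 - o2) ** 2   # ≈ 0.2984
print(h1, h2, o1, o2, error)
```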
Example
Part 2: Backward Pass
1) For Output Layer
w5+ = w5 + Δw5
• Δw5 = −η ∂Ɛtotal/∂w5
• ∂Ɛtotal/∂w5 = ∂Ɛtotal/∂outo1 * ∂outo1/∂neto1 * ∂neto1/∂w5
Example (Backward Pass)
Output → Hidden Layer
• Ɛtotal = (1/2)(targeto1 − outo1)² + (1/2)(targeto2 − outo2)²

• ∂Ɛtotal/∂outo1 = (1/2) × ∂(targeto1 − outo1)²/∂outo1 + 0
• ∂Ɛtotal/∂outo1 = 2 × (1/2) × (targeto1 − outo1)^(2−1) × (−1) + 0
• ∂Ɛtotal/∂outo1 = −targeto1 + outo1

Example (Backward Pass)
Output → Hidden Layer
• ∂Ɛtotal/∂outo1 = outo1 − targeto1
• ∂Ɛtotal/∂outo1 = 0.751365 − 0.01
• ∂Ɛtotal/∂outo1 = 0.741365
Example (Backward Pass)
Output → Hidden Layer

• Now, we will find how much the output (outo1) changes with respect to the net input of o1.
• outo1 = 1 / (1 + e^(−neto1))
• ∂outo1/∂neto1 = ∂/∂neto1 [1 / (1 + e^(−neto1))] = outo1 (1 − outo1)

Example (Backward Pass)
Output → Hidden Layer
• ∂outo1/∂neto1 = outo1 (1 − outo1)
• ∂outo1/∂neto1 = 0.751365 * (1 − 0.751365)
• ∂outo1/∂neto1 = 0.186815602
Example (Backward Pass)
Output → Hidden Layer

• Finally, how much does the total net input of o1 change with respect to w5?
• neto1 = w5*outh1 + w6*outh2 + b2
• ∂neto1/∂w5 = ∂/∂w5 (w5*outh1 + w6*outh2 + b2)
• ∂neto1/∂w5 = 1 * outh1 + 0 + 0 = outh1

Example (Backward Pass)
Output → Hidden Layer
• ∂neto1/∂w5 = outh1
• ∂neto1/∂w5 = 0.59326992
Example (Backward Pass)
Output → Hidden Layer
• ∂Ɛtotal/∂w5 = ∂Ɛtotal/∂outo1 * ∂outo1/∂neto1 * ∂neto1/∂w5
• ∂Ɛtotal/∂w5 = 0.741365 * 0.186815602 * 0.59326992
• ∂Ɛtotal/∂w5 = 0.08216704
Example (Backward Pass)
Output → Hidden Layer
• ∂Ɛtotal/∂w5 = ∂Ɛtotal/∂outo1 * ∂outo1/∂neto1 * ∂neto1/∂w5
• ∂Ɛtotal/∂w5 = ∂Ɛtotal/∂neto1 * outh1
• ∂Ɛtotal/∂neto1 can be represented with the Greek letter delta, δ
• δo1 = ∂Ɛtotal/∂neto1 = ∂Ɛtotal/∂outo1 * ∂outo1/∂neto1
Example (Backward Pass)
Output → Hidden Layer

• δo1 = ∂Ɛtotal/∂outo1 * ∂outo1/∂neto1
• δo1 = (outo1 − targeto1) * outo1 (1 − outo1)
• ∂Ɛtotal/∂w5 = ∂Ɛtotal/∂neto1 * outh1
• ∂Ɛtotal/∂w5 = δo1 * outh1

Example (Backward Pass)
Output → Hidden Layer
• Δw5 = −η ∂Ɛtotal/∂w5
• Let's take η = 0.6
• Δw5 = −0.6 * 0.08216704 = −0.049300224
• w5+ = w5 + Δw5
• w5+ = 0.4 + (−0.6 * 0.08216704)
• w5+ = 0.350699776

Example (Backward Pass)
Output → Hidden Layer
Now, let's calculate w6, w7 and w8.
w6+ = w6 + Δw6
• w6+ = 0.408666186
• w7+ = 0.511301270
• w8+ = 0.561370121
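A minimal sketch of the w5 update above (learning rate η = 0.6 as in the slides); the variable names are illustrative.

```python
# Output-layer update for w5 via the chain rule:
# dE/dw5 = (out_o1 - target_o1) * out_o1 * (1 - out_o1) * out_h1
out_h1 = 0.59326992
out_o1, target_o1 = 0.75136507, 0.01
w5, lr = 0.40, 0.6

delta_o1 = (out_o1 - target_o1) * out_o1 * (1 - out_o1)
dE_dw5 = delta_o1 * out_h1           # ≈ 0.08216704
w5_new = w5 - lr * dE_dw5            # ≈ 0.35069978
print(dE_dw5, w5_new)
```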
Example (Backward Pass)
Hidden → Input Layer
2) Hidden Layer → Input Layer
w1+ = w1 + Δw1
• Δw1 = −η ∂Ɛtotal/∂w1
• ∂Ɛtotal/∂w1 = ∂Ɛtotal/∂outh1 * ∂outh1/∂neth1 * ∂neth1/∂w1
Example (Backward Pass)
Hidden → Input Layer
• ∂Ɛtotal/∂outh1 = ∂Ɛo1/∂outh1 + ∂Ɛo2/∂outh1
• Starting with ∂Ɛo1/∂outh1:
• ∂Ɛo1/∂outh1 = ∂Ɛo1/∂neto1 * ∂neto1/∂outh1
Example (Backward Pass)
Hidden → Input Layer
• Calculating ∂Ɛo1/∂neto1:
• ∂Ɛo1/∂neto1 = ∂Ɛo1/∂outo1 * ∂outo1/∂neto1
• ∂Ɛo1/∂neto1 = (−targeto1 + outo1) * outo1 (1 − outo1)
• ∂Ɛo1/∂neto1 = 0.741365 * 0.186815602 = 0.138498562
Example (Backward Pass)
Hidden → Input Layer
• Calculating ∂neto1/∂outh1:
• neto1 = w5*outh1 + w6*outh2 + b2
• ∂neto1/∂outh1 = ∂/∂outh1 (w5*outh1 + w6*outh2 + b2)
• ∂neto1/∂outh1 = w5
Example (Backward Pass)
Hidden → Input Layer
• ∂neto1/∂outh1 = w5 = 0.40
• Plugging them in:
• ∂Ɛo1/∂outh1 = ∂Ɛo1/∂neto1 * ∂neto1/∂outh1
• ∂Ɛo1/∂outh1 = 0.138498562 * 0.40
• ∂Ɛo1/∂outh1 = 0.055399425
Example (Backward Pass)
Hidden → Input Layer
• Similarly, we will calculate ∂Ɛo2/∂outh1:
• ∂Ɛo2/∂outh1 = −0.019049119
• Therefore,
• ∂Ɛtotal/∂outh1 = ∂Ɛo1/∂outh1 + ∂Ɛo2/∂outh1
• ∂Ɛtotal/∂outh1 = 0.055399425 − 0.019049119 = 0.036350306
Example (Backward Pass)
Hidden → Input Layer
• Let's calculate ∂outh1/∂neth1:
• outh1 = 1 / (1 + e^(−neth1))
• ∂outh1/∂neth1 = ∂/∂neth1 [1 / (1 + e^(−neth1))]
• ∂outh1/∂neth1 = outh1 (1 − outh1)
Example (Backward Pass)
Hidden → Input Layer
• ∂outh1/∂neth1 = outh1 (1 − outh1)
• ∂outh1/∂neth1 = 0.59326992 * (1 − 0.59326992)
• ∂outh1/∂neth1 = 0.241300709
Example (Backward Pass)
Hidden → Input Layer
• Now let's derive ∂neth1/∂w1:
• neth1 = w1*i1 + w2*i2 + b1
• ∂neth1/∂w1 = ∂/∂w1 (w1*i1 + w2*i2 + b1)
• ∂neth1/∂w1 = i1 + 0 + 0
• ∂neth1/∂w1 = i1
Example (Backward Pass)
Hidden → Input Layer
• ∂neth1/∂w1 = i1
• ∂neth1/∂w1 = 0.05
Example (Backward Pass)
Hidden → Input Layer

• Putting it all together in a chain rule:

• ∂Ɛtotal/∂w1 = ∂Ɛtotal/∂outh1 * ∂outh1/∂neth1 * ∂neth1/∂w1
• ∂Ɛtotal/∂w1 = 0.036350306 * 0.241300709 * 0.05
• ∂Ɛtotal/∂w1 = 0.000438568
Example (Backward Pass)
Hidden → Input Layer
• ∂Ɛtotal/∂w1 = ∂Ɛtotal/∂outh1 * ∂outh1/∂neth1 * ∂neth1/∂w1
• ∂Ɛtotal/∂w1 = ∂Ɛtotal/∂neth1 * i1
• ∂Ɛtotal/∂neth1 can be represented with the Greek letter delta, δ
• ∂Ɛtotal/∂w1 = δh1 * i1
Example (Backward Pass)
Hidden → Input Layer
• Δw1 = −η ∂Ɛtotal/∂w1
• Let's take η = 0.6
• Δw1 = −0.6 * 0.000438568
• w1+ = w1 + Δw1
• w1+ = 0.15 + (−0.6 * 0.000438568)
• w1+ = 0.15 − 0.0002631408
• w1+ = 0.1497368592
Example (Backward Pass)
Hidden → Input Layer
• Similarly, update w2, w3,
w4.
• w+2 = 0.199…..
• w+3 = 0.249…..
• w+4 = 0.299…..
Refer the following link for this –
https://mattmazur.com/2015/03/17/a-step-by-step-
backpropagation-example/
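A minimal sketch of the w1 update above (η = 0.6); the ∂Ɛo2/∂outh1 term is taken directly from the slides, the rest recomputes the chain rule. Variable names are illustrative.

```python
# Hidden-layer update for w1 via the chain rule.
out_h1 = 0.59326992
i1 = 0.05
w5 = 0.40
lr = 0.6

delta_o1 = 0.138498562           # (out_o1 - target_o1) * out_o1 * (1 - out_o1)
dEo1_douth1 = delta_o1 * w5      # ≈ 0.055399425
dEo2_douth1 = -0.019049119       # analogous term through o2 (value from the slides)
dEtotal_douth1 = dEo1_douth1 + dEo2_douth1           # ≈ 0.036350306
douth1_dneth1 = out_h1 * (1 - out_h1)                # ≈ 0.241300709
dEtotal_dw1 = dEtotal_douth1 * douth1_dneth1 * i1    # ≈ 0.000438568
w1_new = 0.15 - lr * dEtotal_dw1                     # ≈ 0.14973686
print(dEtotal_dw1, w1_new)
```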
Advantages and Disadvantages of ANN
Advantages:
• A neural network can implement tasks that a linear program cannot.
• When an element of the neural network fails, it can continue without issue because of its parallel nature.
• A neural network learns and does not need to be reprogrammed.
• It can be executed in any application.

Disadvantages:
• The neural network requires training to operate.
• Neural networks are black boxes, meaning we cannot know how much each independent variable is influencing the dependent variables.
• Large complexity of the network structure.
• Big neural networks need high processing time.
Self Organizing Maps (SOM)
• A Self-Organizing Map (SOM) is a neural network used for unsupervised learning.
• It has only 2 layers: an Input Layer and an Output Layer.
• The Self-Organizing Map (SOM), proposed by Teuvo Kohonen, is a data visualization technique.
• It is also known as a Kohonen Map.
• It helps to understand high-dimensional data by reducing the dimensions of the data to a map.
• It showcases clustering by grouping similar data together.
Self Organizing Maps (SOM)
Kohonen Self-Organizing Maps

• Step 1: Initialize the weights wij; random values may be assumed.
• Step 2: Initialize the learning rate.
• Step 3: Calculate the square of the Euclidean distance for each j = 1 to m:
D(j) = Σi=1..n (xi − wij)²
• Step 4: Find the winning unit index J, so that D(J) is minimum.
Kohonen Self-Organizing Maps

• Step 5: For all units j within a specific neighborhood of J, and for all i, calculate the new weights:
wij(new) = wij(old) + η (xi − wij(old))
Example

Construct KSOM to cluster four vectors. The four input vectors


are: [(0,0,1,1),(1,0,0,0),(0,1,1,0),(0,0,0,1)]. Number of clusters
to be formed is 2. Assume an initial learning rate of 0.5 and
random weights associated with each input are as follows:
0.2 0.9
𝑤𝑖𝑗 = 0.4 0.7
0.6 0.5
0.8 0.3
Solution
[Network diagram: inputs X1, X2, X3, X4 fully connected to output units Y1 and Y2 through weights w11 … w42]

        w11 w12     0.2 0.9
wij  =  w21 w22  =  0.4 0.7
        w31 w32     0.6 0.5
        w41 w42     0.8 0.3

• First Input Vector: (0, 0, 1, 1)
• Calculating Euclidean Distance D(1):
D(j) = Σi (wij − xi)²
D(1) = Σi (wi1 − xi)²
D(1) = (0.2 − 0)² + (0.4 − 0)² + (0.6 − 1)² + (0.8 − 1)²
D(1) = 0.04 + 0.16 + 0.16 + 0.04 = 0.4


𝑤11 𝑤12 0.2 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.4 0.7
𝑤31 𝑤32 0.6 0.5
𝑤41 𝑤42 0.8 0.3

• First Input Vector: (0, 0, 1, 1)
• Calculating Euclidean Distance D(2):
D(2) = Σi (wi2 − xi)²
D(2) = (0.9 − 0)² + (0.7 − 0)² + (0.5 − 1)² + (0.3 − 1)²
D(2) = 0.81 + 0.49 + 0.25 + 0.49 = 2.04


𝑤11 𝑤12 0.2 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.4 0.7
𝑤31 𝑤32 0.6 0.5
𝑤41 𝑤42 0.8 0.3

• First Input Vector: (0,0,1,1)


• D(1) = 0.4 and D(2) = 2.04
• As, D(1) < D(2), so, Y1 is a winning cluster.
• Thus, First Input belongs to Y1.
𝑤11 𝑤12 0.2 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.4 0.7
𝑤31 𝑤32 0.6 0.5
𝑤41 𝑤42 0.8 0.3

• Next step: update the initial weights on the winning cluster unit J = 1.
wij(new) = wij(old) + η (xi − wij(old))
wi1(new) = wi1(old) + η (xi − wi1(old))


𝑤11 𝑤12 0.2 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.4 0.7
𝑤31 𝑤32 0.6 0.5
𝑤41 𝑤42 0.8 0.3

• First Input Vector: (0, 0, 1, 1)
• Given, Learning Rate η = 0.5
• w11(new) = w11(old) + η (x1 − w11(old)) = 0.2 + 0.5 (0 − 0.2) = 0.1
• w21(new) = w21(old) + η (x2 − w21(old)) = 0.4 + 0.5 (0 − 0.4) = 0.2
𝑤11 𝑤12 0.2 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.4 0.7
𝑤31 𝑤32 0.6 0.5
𝑤41 𝑤42 0.8 0.3

• First Input Vector: (0, 0, 1, 1)
• Given, Learning Rate η = 0.5
• w31(new) = w31(old) + η (x3 − w31(old)) = 0.6 + 0.5 (1 − 0.6) = 0.8
• w41(new) = w41(old) + η (x4 − w41(old)) = 0.8 + 0.5 (1 − 0.8) = 0.9
𝑤11 𝑤12 0.2 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.4 0.7
𝑤31 𝑤32 0.6 0.5
𝑤41 𝑤42 0.8 0.3

• Updated Weights:

𝑤11 𝑤12 0.1 0.9


• 𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.2 0.7
𝑤31 𝑤32 0.8 0.5
𝑤41 𝑤42 0.9 0.3
𝑤11 𝑤12 0.1 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.2 0.7
𝑤31 𝑤32 0.8 0.5
𝑤41 𝑤42 0.9 0.3

• Second Input Vector: (1, 0, 0, 0)
• Calculating Euclidean Distance D(1):
D(1) = Σi (wi1 − xi)²
D(1) = (0.1 − 1)² + (0.2 − 0)² + (0.8 − 0)² + (0.9 − 0)²
D(1) = 0.81 + 0.04 + 0.64 + 0.81 = 2.3


𝑤11 𝑤12 0.1 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.2 0.7
𝑤31 𝑤32 0.8 0.5
𝑤41 𝑤42 0.9 0.3

• Second Input Vector: (1, 0, 0, 0)
• Calculating Euclidean Distance D(2):
D(2) = Σi (wi2 − xi)²
D(2) = (0.9 − 1)² + (0.7 − 0)² + (0.5 − 0)² + (0.3 − 0)²
D(2) = 0.01 + 0.49 + 0.25 + 0.09 = 0.84


𝑤11 𝑤12 0.1 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.2 0.7
𝑤31 𝑤32 0.8 0.5
𝑤41 𝑤42 0.9 0.3

• Second Input Vector: (1,0,0,0)


• D(1) = 2.3 and D(2) = 0.84
• As, D(2) < D(1), so, Y2 is a winning cluster.
• Thus, Second Input belongs to Y2.
𝑤11 𝑤12 0.1 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.2 0.7
𝑤31 𝑤32 0.8 0.5
𝑤41 𝑤42 0.9 0.3

• Next step: update the initial weights on the winning cluster unit J = 2.
wij(new) = wij(old) + η (xi − wij(old))
wi2(new) = wi2(old) + η (xi − wi2(old))


𝑤11 𝑤12 0.1 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.2 0.7
𝑤31 𝑤32 0.8 0.5
𝑤41 𝑤42 0.9 0.3

• Second Input Vector: (1, 0, 0, 0)
• Given, Learning Rate η = 0.5
• w12(new) = w12(old) + η (x1 − w12(old)) = 0.9 + 0.5 (1 − 0.9) = 0.95
• w22(new) = w22(old) + η (x2 − w22(old)) = 0.7 + 0.5 (0 − 0.7) = 0.35
𝑤11 𝑤12 0.1 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.2 0.7
𝑤31 𝑤32 0.8 0.5
𝑤41 𝑤42 0.9 0.3

• Second Input Vector: (1, 0, 0, 0)
• Given, Learning Rate η = 0.5
• w32(new) = w32(old) + η (x3 − w32(old)) = 0.5 + 0.5 (0 − 0.5) = 0.25
• w42(new) = w42(old) + η (x4 − w42(old)) = 0.3 + 0.5 (0 − 0.3) = 0.15
𝑤11 𝑤12 0.1 0.9
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.2 0.7
𝑤31 𝑤32 0.8 0.5
𝑤41 𝑤42 0.9 0.3

• Updated Weights:

𝑤11 𝑤12 0.1 0.95


• 𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.2 0.35
𝑤31 𝑤32 0.8 0.25
𝑤41 𝑤42 0.9 0.15
𝑤11 𝑤12 0.1 0.95
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.2 0.35
𝑤31 𝑤32 0.8 0.25
𝑤41 𝑤42 0.9 0.15

• Third Input Vector: (0, 1, 1, 0)
• D(1) = 1.5 and D(2) = 1.91
• As D(1) < D(2), Y1 is the winning cluster.
• Thus, the Third Input belongs to Y1.
𝑤11 𝑤12 0.1 0.95
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.2 0.35
𝑤31 𝑤32 0.8 0.25
𝑤41 𝑤42 0.9 0.15

• Third Input Vector: (0,1,1,0)


• w11 = 0.05
• w21 = 0.6
• w31 = 0.9
• w41 = 0.45
𝑤11 𝑤12 0.1 0.95
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.2 0.35
𝑤31 𝑤32 0.8 0.25
𝑤41 𝑤42 0.9 0.15

• Updated Weights:

𝑤11 𝑤12 0.05 0.95


• 𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.6 0.35
𝑤31 𝑤32 0.9 0.25
𝑤41 𝑤42 0.45 0.15
𝑤11 𝑤12 0.05 0.95
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.6 0.35
𝑤31 𝑤32 0.9 0.25
𝑤41 𝑤42 0.45 0.15

• Fourth Input Vector: (0, 0, 0, 1)
• D(1) = 1.475 and D(2) = 1.81
• As D(1) < D(2), Y1 is the winning cluster.
• Thus, the Fourth Input belongs to Y1.
𝑤11 𝑤12 0.05 0.95
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.6 0.35
𝑤31 𝑤32 0.9 0.25
𝑤41 𝑤42 0.45 0.15

• Fourth Input Vector: (0, 0, 0, 1)
• w11 = 0.05 + 0.5 (0 − 0.05) = 0.025
• w21 = 0.6 + 0.5 (0 − 0.6) = 0.3
• w31 = 0.9 + 0.5 (0 − 0.9) = 0.45
• w41 = 0.45 + 0.5 (1 − 0.45) = 0.725
𝑤11 𝑤12 0.05 0.95
𝑤𝑖𝑗 = 𝑤21 𝑤22 = 0.6 0.35
𝑤31 𝑤32 0.9 0.25
𝑤41 𝑤42 0.45 0.15

• Updated Weights:

        w11 w12     0.025 0.95
wij  =  w21 w22  =  0.3   0.35
        w31 w32     0.45  0.25
        w41 w42     0.725 0.15
Final Architecture
[Network diagram: inputs X1, X2, X3, X4 connected to output units Y1 and Y2 with the final weights — to Y1: 0.025, 0.3, 0.45, 0.725; to Y2: 0.95, 0.35, 0.25, 0.15]
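The worked example can be reproduced with a short NumPy sketch of the winner-take-all Kohonen update (one pass over the four vectors, fixed learning rate 0.5, no neighborhood shrinking); variable names are illustrative.

```python
# Kohonen SOM winner-take-all update for the 4-input, 2-cluster example.
import numpy as np

X = np.array([[0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # the four input vectors
W = np.array([[0.2, 0.9],
              [0.4, 0.7],
              [0.6, 0.5],
              [0.8, 0.3]])                  # initial weights (4 inputs x 2 clusters)
lr = 0.5

for x in X:
    # Squared Euclidean distance D(j) = sum_i (x_i - w_ij)^2 for each cluster j.
    D = ((W - x[:, None]) ** 2).sum(axis=0)
    j = int(np.argmin(D))                   # winning cluster
    W[:, j] += lr * (x - W[:, j])           # w_ij(new) = w_ij(old) + lr*(x_i - w_ij(old))
    print("winner:", "Y1" if j == 0 else "Y2", "distances:", D)

print(W)   # final weight matrix after one pass
```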
Advantages and Disadvantages of SOM
Advantages:
• Data mapping is easily interpreted.
• Projects high-dimensional data onto a lower-dimensional map.
• Capable of organizing large, complex data sets.
• Useful for visualization.

Disadvantages:
• Difficult to determine what input weights to use.
• The clustering result depends on the initial weight vector.
• Mapping can result in divided clusters.
• Requires that nearby points behave similarly.
• A heuristic algorithm.
Deep learning is a subset of machine learning,
which is essentially a neural network with
three or more layers.
Convolutional Neural Network (CNN)

• CNN is a type of artificial neural network, which is


widely used for image/object recognition and
classification.
• It is made up of multiple layers, including convolutional
layers, pooling layers, and fully connected layers.
Architecture Of CNN

Input Image → Convolutional Layer → ReLU Layer → Pooling → Flatten Layer → Fully Connected Layer → Output
Convolutional Layer

• A “filter” passes over the image, scanning a few


pixels at a time and creating a feature map that
predicts the class to which each feature belongs.
• 3 main components are:
– Kernel/Filter
– Stride
– Padding
Convolutional Layer

• Kernel/Filter → A kernel is a feature extractor that


extracts features in the image.
• Stride → It refers to the number of pixels by which
we move the filter across the input image.
• Padding → It is the addition of extra pixels around
the borders of the input images or feature map.
Convolutional Layer (Kernel)

5x5 Input Image:
3 3 2 1 0
0 0 1 3 1
3 1 2 2 3
2 0 0 2 2
2 0 0 0 1

3x3 Kernel:
0 1 2
2 2 0
0 1 2

Sliding the kernel over the input (stride 1), each output value is the sum of the elementwise products of the kernel and the current 3x3 patch:
• 3x0 + 3x1 + 2x2 + 0x2 + 0x2 + 1x0 + 3x0 + 1x1 + 2x2 = 12
• 3x0 + 2x1 + 1x2 + 0x2 + 1x2 + 3x0 + 1x0 + 2x1 + 2x2 = 12
• 2x0 + 1x1 + 0x2 + 1x2 + 3x2 + 1x0 + 2x0 + 2x1 + 3x2 = 17
• 0x0 + 0x1 + 1x2 + 3x2 + 1x2 + 2x0 + 2x0 + 0x1 + 0x2 = 10
• 0x0 + 1x1 + 3x2 + 1x2 + 2x2 + 2x0 + 0x0 + 0x1 + 2x2 = 17
• 1x0 + 3x1 + 1x2 + 2x2 + 2x2 + 3x0 + 0x0 + 2x1 + 2x2 = 19
• 3x0 + 1x1 + 2x2 + 2x2 + 0x2 + 0x0 + 2x0 + 0x1 + 0x2 = 9
• 1x0 + 2x1 + 2x2 + 0x2 + 0x2 + 2x0 + 0x0 + 0x1 + 0x2 = 6
• 2x0 + 2x1 + 3x2 + 0x2 + 2x2 + 2x0 + 0x0 + 0x1 + 1x2 = 14

3x3 Output:
12 12 17
10 17 19
9  6  14

Size of Output = [size of Input − size of kernel] + 1
O = [z − k] + 1
O = [5 − 3] + 1 = 2 + 1 = 3
Convolutional Layer (Stride)
Stride (S) = 2

With the same 5x5 input and 3x3 kernel, the kernel now moves 2 pixels at a time:
• 3x0 + 3x1 + 2x2 + 0x2 + 0x2 + 1x0 + 3x0 + 1x1 + 2x2 = 12
• 2x0 + 1x1 + 0x2 + 1x2 + 3x2 + 1x0 + 2x0 + 2x1 + 3x2 = 17
• 3x0 + 1x1 + 2x2 + 2x2 + 0x2 + 0x0 + 2x0 + 0x1 + 0x2 = 9
• 2x0 + 2x1 + 3x2 + 0x2 + 2x2 + 2x0 + 0x0 + 0x1 + 1x2 = 14

2x2 Output:
12 17
9  14

o = [(z − k)/s] + 1
o = [(5 − 3)/2] + 1 = 1 + 1 = 2
Convolutional Layer (Padding)
Padding (p) = 1, Stride = 1

Padded 7x7 Input Image (a border of zeros around the 5x5 input):
0 0 0 0 0 0 0
0 3 3 2 1 0 0
0 0 0 1 3 1 0
0 3 1 2 2 3 0
0 2 0 0 2 2 0
0 2 0 0 0 1 0
0 0 0 0 0 0 0

First output value: 0x0 + 0x1 + 0x2 + 0x2 + 3x2 + 3x0 + 0x0 + 0x1 + 0x2 = 6

What is the size of the output?
o = [(z − k + 2p)/s] + 1
o = [(5 − 3 + 2*1)/1] + 1
o = [2 + 2]/1 + 1
o = 4 + 1 = 5

5x5 Output / Feature Map:
6  14 17 11 3
14 12 12 17 11
8  10 17 19 13
11 9  6  14 12
6  4  4  6  4
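The kernel, stride and padding examples above can all be reproduced with one short NumPy sketch of the sliding-window operation (cross-correlation, as in the slides); the function name and signature are illustrative assumptions.

```python
# 2-D convolution/cross-correlation with configurable stride and zero padding.
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide the kernel over the (zero-padded) image and sum elementwise products."""
    if padding:
        image = np.pad(image, padding)           # zero padding on all sides
    k = kernel.shape[0]
    # output size: (z - k + 2p)/s + 1 (the image here is already padded)
    out_size = (image.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size))
    for r in range(out_size):
        for c in range(out_size):
            patch = image[r * stride:r * stride + k, c * stride:c * stride + k]
            out[r, c] = (patch * kernel).sum()
    return out

image = np.array([[3, 3, 2, 1, 0],
                  [0, 0, 1, 3, 1],
                  [3, 1, 2, 2, 3],
                  [2, 0, 0, 2, 2],
                  [2, 0, 0, 0, 1]])
kernel = np.array([[0, 1, 2],
                   [2, 2, 0],
                   [0, 1, 2]])

print(conv2d(image, kernel))             # 3x3 map: [[12,12,17],[10,17,19],[9,6,14]]
print(conv2d(image, kernel, stride=2))   # 2x2 map: [[12,17],[9,14]]
print(conv2d(image, kernel, padding=1))  # 5x5 feature map shown above
```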
ReLU Layer

• The rectified linear activation function, or ReLU for short, is a piecewise linear function that outputs the input directly if it is positive; otherwise, it outputs zero.
f(x) = max(x, 0)
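A one-line NumPy sketch of ReLU applied elementwise to a feature map (illustrative):

```python
# ReLU: negative values become 0, positive values pass through unchanged.
import numpy as np

def relu(feature_map):
    return np.maximum(feature_map, 0)   # f(x) = max(x, 0)

print(relu(np.array([[-2.0, 3.5], [0.0, -0.7]])))
```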
Pooling

• Pooling reduces the dimensions of the hidden layer by combining the outputs of neuron clusters at the previous layer into a single neuron in the next layer.
• There are mainly 2 types of Pooling:
– Max Pooling
– Average Pooling
Pooling
5x5 Feature Map:
6  14 17 11 3
14 12 12 17 11
8  10 17 19 13
11 9  6  14 12
6  4  4  6  4

Max pooling with a 2x2 pool size gives the 2x2 output:
14 17
11 19

Pooling
Average pooling with a 2x2 pool size gives the 2x2 output:
11.5 14.25
9.5  14.0
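A minimal NumPy sketch of 2x2 max and average pooling (stride 2) that reproduces the 2x2 outputs above; the function name is an illustrative assumption.

```python
# 2x2 pooling with stride 2; the trailing row/column of the 5x5 map that does
# not fill a full 2x2 window is dropped, matching the 2x2 outputs above.
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    rows, cols = feature_map.shape[0] // size, feature_map.shape[1] // size
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            window = feature_map[r * size:(r + 1) * size, c * size:(c + 1) * size]
            out[r, c] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[6, 14, 17, 11, 3],
                 [14, 12, 12, 17, 11],
                 [8, 10, 17, 19, 13],
                 [11, 9, 6, 14, 12],
                 [6, 4, 4, 6, 4]], dtype=float)

print(pool2d(fmap, mode="max"))       # [[14, 17], [11, 19]]
print(pool2d(fmap, mode="average"))   # [[11.5, 14.25], [9.5, 14.0]]
```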
Flatten Layer
Converts the multi-dimensional arrays into a flattened one-dimensional (single-dimensional) array.
Fully- Connected Layer
(Dense Layer)
• Every neuron in this layer is connected to every neuron in the previous layer.
• Takes the inputs from the feature analysis and applies weights to predict the correct label.
Training Of CNN

• Initialize all filter weights with random values.
• Forward Propagation (Input → Convolutional → ReLU → Pooling → Flatten → Fully Connected → Output).
• Calculate the total error using the sum squared error: (1/2) Σ (t − o)².
• Backpropagation (using gradient descent): update the weights and parameters.
• Repeat till the error/loss is minimum.
• Repeat for all input images (training set).
Disadvantage Of CNN
• High computational requirements.
• Needs large amount of labeled data.
• Large memory footprint.
• Interpretability challenges.
• Limited effectiveness for sequential data.
• Tend to be much slower.
• Training takes a long time.
Reference Books

• Tom M. Mitchell, "Machine Learning", McGraw-Hill Education (India) Private Limited, 2013.
• Ethem Alpaydin, "Introduction to Machine Learning (Adaptive Computation and Machine Learning)", The MIT Press, 2004.
• Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press, 2009.
• Bishop, C., "Pattern Recognition and Machine Learning", Berlin: Springer-Verlag.
Text Books

• Saikat Dutt, Subramanian Chandramouli, Amit Kumar Das, "Machine Learning", Pearson.
• Andreas C. Müller and Sarah Guido, "Introduction to Machine Learning with Python".
• John Paul Mueller and Luca Massaron, "Machine Learning for Dummies".
• Dr. Himanshu Sharma, "Machine Learning", S.K. Kataria & Sons, 2022.