
Convolutional Neural Network

(A Deep Neural Network)

Abhishek Mukhopadhyay
Instructor: Dr. Pradipta Biswas
I3D Laboratory, CPDM, IISc
Image Source: deeplearning.ai

Computer Vision Problems


Classical Computer Vision Pipeline

CV experts:
1. Select / develop features: SURF, HoG, SIFT, RIFT, ...
2. Add machine learning on top of this for multi-class recognition and train a classifier

Pipeline: feature detection and extraction (SIFT, HoG, ...) -> classification / recognition

Classical CV feature definition is domain-specific and time-consuming.

Neural Network

Warren McCulloch and Walter Pitts,
"A Logical Calculus of the Ideas Immanent in Nervous Activity,"
Bulletin of Mathematical Biophysics, Vol. 5, pp. 115-133 (1943).
Neural Network
Here x1 and x2 are normalized attribute values of the data.
y is the output of the neuron, i.e. the class label.
x1 and x2, multiplied by the weight values w1 and w2, are the inputs to the neuron.
Given that

◦ w1 = 0.5 and w2 = 0.5

◦ and say the value of x1 is 0.3 and the value of x2 is 0.8,

◦ the weighted sum is:

sum = w1 * x1 + w2 * x2 = 0.5 * 0.3 + 0.5 * 0.8 = 0.55
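As a quick illustration, here is a minimal sketch of this weighted-sum neuron in Python (the step activation and its 0.5 threshold are illustrative assumptions, not part of the slide):

```python
# Minimal sketch of a single neuron computing a weighted sum.
# The step activation with threshold 0.5 is an illustrative assumption.
def neuron(x1, x2, w1=0.5, w2=0.5, threshold=0.5):
    s = w1 * x1 + w2 * x2               # weighted sum of the inputs
    return 1 if s >= threshold else 0   # step activation -> class label

print(neuron(0.3, 0.8))  # weighted sum is 0.55 -> outputs 1
```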
Why Do We Need Multiple Layers?

Linearly separable: a single-layer network can separate the classes with a line (e.g. AND, OR).
Linearly inseparable: no single line separates the classes (e.g. XOR), so hidden layers are needed.
Edge Detection

Vertical edges and horizontal edges: how do we detect these edges?
Image Source: deeplearning.ai
Neural Network?
Suppose an image is of size 64 x 64 x 3
◦ The input feature dimension then becomes 12,288
If the image size is 720 x 720 x 3
◦ The input feature dimension becomes 1,555,200
The number of parameters will swell up to a HUGE number, resulting in greater computational and memory requirements.
Another Application: Digit Recognition

A classifier takes the pixel values as input and outputs the digit (e.g. 5).

X1, ..., Xn ∈ {0,1} (black vs. white pixels)
Y ∈ {5,6} (predict whether a digit is a 5 or a 6)
The Bayes Classifier
In class, we saw that a good strategy is to predict the class with the highest posterior probability:

ŷ = argmax_y P(Y = y | X1, ..., Xn)

◦ (for example: what is the probability that the image represents a 5, given its pixels?)

So ... how do we compute that?

The Bayes Classifier
Use Bayes' rule!

P(Y | X1, ..., Xn) = P(X1, ..., Xn | Y) P(Y) / P(X1, ..., Xn)

(likelihood times prior, divided by the normalization constant)

Why did this help? Well, we think that we might be able to specify how features are "generated" by the class label.
The Bayes Classifier
Let's expand this for our digit recognition task:

P(Y = 5 | X1, ..., Xn) ∝ P(X1, ..., Xn | Y = 5) P(Y = 5)
P(Y = 6 | X1, ..., Xn) ∝ P(X1, ..., Xn | Y = 6) P(Y = 6)

To classify, we simply compute these two probabilities and predict based on which one is greater.
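As a minimal sketch of that comparison (the likelihood and prior functions are hypothetical placeholders, assumed to be learned already; they are not defined in the lecture):

```python
# Hypothetical helpers: likelihood(x, y) = P(X = x | Y = y),
# prior(y) = P(Y = y). Both are assumed to be learned already.
def bayes_predict(x, likelihood, prior):
    score_5 = likelihood(x, 5) * prior(5)  # proportional to P(Y=5 | x)
    score_6 = likelihood(x, 6) * prior(6)  # proportional to P(Y=6 | x)
    return 5 if score_5 >= score_6 else 6
```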
Model Parameters
For the Bayes classifier, we need to "learn" two functions: the likelihood and the prior.

How many parameters are required to specify the prior for our digit recognition example?
◦ Just one: P(Y = 5), since P(Y = 6) = 1 - P(Y = 5).

How many parameters are required to specify the likelihood?
◦ (Supposing that each image is 30 x 30 pixels)
◦ A full joint distribution over 900 binary pixels needs 2^900 - 1 parameters per class, i.e. 2 x (2^900 - 1) in total: far too many to learn directly.
CNN
Dive into CNN
In a convolutional network (ConvNet), there are basically three types of layers:
1. Convolution layer
2. Pooling layer
3. Fully connected layer

A convolutional layer
A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation.

Edge detector: a filter

Convolution: the filter values are the network parameters to be learned.

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:        Filter 2:
 1 -1 -1         -1  1 -1
-1  1 -1         -1  1 -1
-1 -1  1         -1  1 -1

Each filter detects a small pattern (3 x 3).
Image Source: internet
Convolution, stride = 1
Placing Filter 1 on the top-left 3 x 3 patch of the image and taking the dot product gives 3; sliding one column to the right gives -1.
Image Source: internet
Convolution, stride = 2
With stride 2 the filter jumps two columns at a time, so the first row of outputs is 3, -3.
Image Source: internet
Convolution, stride = 1
Sliding Filter 1 over the whole 6 x 6 image with stride 1 gives a 4 x 4 output:

 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

Image Source: internet
Convolution: repeat this for each filter
Filter 2 (stride = 1) gives a second 4 x 4 feature map:

-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

The two 4 x 4 images together form a 4 x 4 x 2 feature map.
Image Source: internet
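To make the arithmetic concrete, here is a minimal NumPy sketch (ours, not from the slides) that reproduces the two feature maps above by plain sliding-window cross-correlation:

```python
import numpy as np

# The 6 x 6 input image and the two 3 x 3 filters from the slides.
img = np.array([[1,0,0,0,0,1],
                [0,1,0,0,1,0],
                [0,0,1,1,0,0],
                [1,0,0,0,1,0],
                [0,1,0,0,1,0],
                [0,0,1,0,1,0]])
f1 = np.array([[ 1,-1,-1],
               [-1, 1,-1],
               [-1,-1, 1]])
f2 = np.array([[-1, 1,-1],
               [-1, 1,-1],
               [-1, 1,-1]])

def convolve(image, filt, stride=1):
    n, f = image.shape[0], filt.shape[0]
    out = (n - f) // stride + 1
    result = np.zeros((out, out), dtype=int)
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            result[i, j] = np.sum(patch * filt)  # element-wise product, then sum
    return result

print(convolve(img, f1))  # 4 x 4 map; top-left entry is 3
print(convolve(img, f2))  # the second 4 x 4 feature map
```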
Convolution over Volume
For a colour image the input is a volume, e.g. 6 x 6 x 3 (three channels), and each filter is 3-D as well (3 x 3 x 3, one 3 x 3 slice per input channel). Each filter position still produces a single number, so each filter again yields one 2-D feature map.
Image Source: internet


Convolution vs. Fully Connected

Convolution applies the 3 x 3 filters directly to the 6 x 6 image. A fully connected layer would instead flatten the image into a 36-dimensional vector x1, ..., x36 and connect every input to every neuron.
Image Source: internet
Viewing convolution as a sparse layer: each output value (e.g. the 3 produced by Filter 1 in the top-left corner) is connected to only the 9 inputs under the filter, not to all 36 inputs. Fewer parameters!
Image Source: internet
Moreover, the outputs at different positions are computed with the same 9 filter weights (shared weights), so there are even fewer parameters.
Image Source: internet
Suppose we have 10 filters applied to an input of size 6 x 6 x 3, each filter of shape 3 x 3 x 3. What will be the number of parameters in that layer?

• Number of parameters for each filter = 3 * 3 * 3 = 27
• There will be a bias term for each filter, so total parameters per filter = 28
• As there are 10 filters, the total parameters for that layer = 28 * 10 = 280

Image Source: internet
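A quick way to sanity-check this count is a minimal tf.keras sketch (the framework choice here is ours, not the slide's):

```python
import tensorflow as tf

# 10 filters of shape 3 x 3 (x 3 input channels), on a 6 x 6 x 3 input.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(6, 6, 3)),
    tf.keras.layers.Conv2D(filters=10, kernel_size=(3, 3)),
])
print(model.count_params())  # (3*3*3 + 1) * 10 = 280
```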


Simple Convolutional Neural Network

• Size of the output feature map: (n + 2p - f)/s + 1 (rounded down), along each spatial dimension
• n : dimension of the input
• p : size of padding
• f : size of filter
• s : size of stride

For example, the 6 x 6 image with a 3 x 3 filter, no padding and stride 1 gives (6 + 0 - 3)/1 + 1 = 4, matching the 4 x 4 maps above.
Image Source: deeplearning.ai
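As a one-line sketch of the formula (our helper, not from the slides):

```python
def conv_output_size(n, p, f, s):
    """Output size along one dimension: floor((n + 2p - f)/s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 0, 3, 1))  # 4, the stride-1 example
print(conv_output_size(6, 0, 3, 2))  # 2, the stride-2 example
```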
The whole CNN

Input -> Convolution -> Max Pooling -> Convolution -> Max Pooling (the Convolution + Max Pooling block can repeat many times) -> Flattened -> Fully Connected Feedforward network -> output (e.g. cat / dog)
Image Source: internet
Max Pooling
Recall the two 4 x 4 feature maps produced by Filter 1 and Filter 2:

Filter 1:          Filter 2:
 3 -1 -3 -1        -1 -1 -1 -1
-3  1  0 -3        -1 -1 -2  1
-3 -3  0  1        -1 -1 -2  1
 3 -2 -2 -1        -1  0 -4  3
Image Source: internet
Why Pooling
• Subsampling pixels does not change the object: a subsampled bird is still a bird
• We can subsample the pixels to make the image smaller: fewer parameters are needed to characterize the image
Image Source: internet
A CNN compresses a fully connected network in two ways:
• Reducing the number of connections
• Sharing weights on the edges
Max pooling further reduces the complexity.
Max Pooling

Applying 2 x 2 max pooling to each 4 x 4 feature map keeps the maximum of each 2 x 2 block, producing a new but smaller image:

Filter 1 map:   Filter 2 map:
3 0             -1 1
3 1              0 3

Each filter is a channel, so the 6 x 6 image has become a 2 x 2 x 2 image.
Image Source: internet
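A minimal NumPy sketch of 2 x 2 max pooling (reusing the Filter 1 feature map computed earlier; the helper name is ours):

```python
import numpy as np

def max_pool(fmap, size=2):
    # Keep the maximum of each non-overlapping size x size block.
    n = fmap.shape[0] // size
    out = np.zeros((n, n), dtype=fmap.dtype)
    for i in range(n):
        for j in range(n):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

fmap1 = np.array([[ 3,-1,-3,-1],
                  [-3, 1, 0,-3],
                  [-3,-3, 0, 1],
                  [ 3,-2,-2,-1]])
print(max_pool(fmap1))  # [[3 0] [3 1]]
```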
The whole CNN

After Convolution + Max Pooling we get a new image, smaller than the original. The number of channels equals the number of filters. This block can be repeated many times.
Image Source: internet


The whole CNN

After the last Convolution + Max Pooling block, the new (small) image is flattened and fed to a fully connected feedforward network, which produces the final classification (e.g. cat / dog).
Image Source: internet
Flattening

The 2 x 2 x 2 pooled output is flattened into a single 8-dimensional vector, which is then fed to the fully connected feedforward network.
Image Source: internet
Classic Networks
1. LeNet-5
2. AlexNet
3. VGG

LeNet-5
• Parameters: 60k
• Layer flow: Conv -> Pool -> Conv -> Pool -> FC -> FC -> Output
• Activation functions: sigmoid/tanh (ReLU in modern variants)
AlexNet

•Parameters: 60 million
•Activation functions: ReLU

Image Source: deeplearning.ai


VGG-16

• Parameters: 138 million
• Pool: MAX with stride 2
• CONV layers: stride 1

Image Source: deeplearning.ai


CNN in Keras
Only the network structure and the input format (vector -> 3-D array) are modified.

Input: 1 x 28 x 28
Convolution (25 filters of 3 x 3; 9 parameters per filter) -> 25 x 26 x 26
Max Pooling -> 25 x 13 x 13
Convolution (50 filters of 3 x 3 over 25 channels; 225 = 25 x 9 parameters per filter) -> 50 x 11 x 11
Max Pooling -> 50 x 5 x 5
Image Source: internet


CNN in Keras
After the second Max Pooling, the 50 x 5 x 5 output is flattened into a 1250-dimensional vector and passed through a fully connected feedforward network to the output.
Image Source: internet
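A minimal tf.keras sketch of this architecture (the ReLU activations and the 10-way softmax output are our assumptions; the slide does not specify them):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Channels-last equivalent of the 1 x 28 x 28 input above.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(25, (3, 3), activation='relu'),  # -> 26 x 26 x 25 (9 weights per filter)
    layers.MaxPooling2D((2, 2)),                   # -> 13 x 13 x 25
    layers.Conv2D(50, (3, 3), activation='relu'),  # -> 11 x 11 x 50 (225 = 25 x 9 per filter)
    layers.MaxPooling2D((2, 2)),                   # -> 5 x 5 x 50
    layers.Flatten(),                              # -> 1250
    layers.Dense(10, activation='softmax'),        # assumed 10-class (digit) output
])
model.summary()
```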


Object Detection using CNN
Classification + Localization = Detection

Object detection is modeled as a classification problem:
• We take windows of fixed sizes
• Run them over the input image at all possible locations
• Feed these patches to an image classifier
• It predicts the class of the object in the window (or background if none is present)

Problem: objects appear at different scales. Solution:
• Resize the image at multiple scales
• Most commonly, the image is downsampled (its size is reduced)
• On each of these images, a fixed-size window detector is run
• All these windows are fed to a classifier to detect the object of interest
Region-based Convolutional Neural Networks (R-CNN)
• Run Selective Search to generate probable objects (~2k regions)
• Feed these patches to a CNN, followed by an SVM, to predict the class of each patch
• Optimize patches by training bounding box regression separately
Spatial Pyramid Pooling (SPP-net)
• Calculate the CNN representation for the entire image only once
• Uses spatial pooling after the last convolutional layer
• The SPP layer divides a region of any arbitrary size into a constant number of bins, and max pooling is performed on each of the bins
• Since the number of bins remains the same, a constant-size vector is produced
Fast R-CNN
• Fast R-CNN uses the ideas from SPP-net and R-CNN
• Applies the RoI pooling layer on the extracted regions of interest to make sure all the regions are of the same size
• These regions are passed on to a fully connected network, which classifies them and also returns the bounding boxes, using softmax and linear regression layers simultaneously
Faster R-CNN
• We take an image as input and pass it to the ConvNet, which returns the feature map for that image
• A Region Proposal Network (a lightweight CNN) is applied on these feature maps; it returns the object proposals along with their objectness scores
• A RoI pooling layer is applied on these proposals to bring all the proposals down to the same size
• Finally, the proposals are passed to a fully connected layer, which has a softmax layer and a linear regression layer at its top, to classify and output the bounding boxes for objects
Region Proposal Network (RPN)
• The RPN uses a sliding window over the feature maps
• At each window, it generates k anchor boxes of different shapes and sizes
• For each anchor, the RPN predicts two things:
  • first, the probability that the anchor contains an object
  • second, the bounding box regressor for adjusting the anchor to better fit the object
Region Proposal Network (RPN), continued
• We now have bounding boxes of different shapes and sizes, which are passed on to the RoI pooling layer
• It extracts fixed-size feature maps for each anchor
• These feature maps are passed to a fully connected layer
• The fully connected layer has a softmax and a linear regression layer:
  • the softmax classifies the object
  • the regressor predicts the bounding boxes for the identified objects
Summary of the object detection models

CNN
• Features: divides the image into multiple regions and then classifies each region into various classes
• Prediction time / image: (not given)
• Limitations: needs a lot of regions to predict accurately, and hence high computation time

RCNN
• Features: uses Selective Search to generate regions; extracts around 2000 regions from each image
• Prediction time / image: 40-50 seconds
• Limitations: high computation time, as each region is passed to the CNN separately; it also uses three different models for making predictions

Fast RCNN
• Features: each image is passed only once to the CNN and feature maps are extracted; Selective Search is used on these maps to generate predictions; combines all three models used in RCNN together
• Prediction time / image: 2 seconds
• Limitations: Selective Search is slow, and hence computation time is still high

Faster RCNN
• Features: replaces the Selective Search method with a Region Proposal Network, which makes the algorithm much faster
• Prediction time / image: 0.2 seconds
• Limitations: object proposal takes time, and as there are different systems working one after the other, the performance depends on how the previous system has performed
Two-stage and One-stage Object Detectors

Two-stage detectors
• First generate so-called region proposals: areas of the image that potentially contain an object
• Then make a separate prediction for each of these regions
• Examples: R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN

One-stage detectors
• These models skip the explicit region proposal stage and apply detection directly on densely sampled areas
• Examples: Single Shot Detector (SSD), YOLO family
How does the YOLO Framework Function?
The input image is divided into a 3 x 3 grid.
• Image classification and localization are applied on each grid cell
• Suppose we have 3 classes: Pedestrian, Car, and Motorcycle. Then for each grid cell, the label y will be an eight-dimensional vector: y = [pc, bx, by, bh, bw, c1, c2, c3]
Image Source: deeplearning.ai


Bounding box in detail

YOLO assigns coordinates to all the grid cells. For the grid cell containing the bounding box:
• bx, by are the x and y coordinates of the midpoint of the object with respect to this grid cell
• bh = height of the bounding box / height of the grid cell
• bw = width of the bounding box / width of the grid cell
Image Source: deeplearning.ai


Intersection over Union and Non-Max Suppression

How can we decide whether the predicted bounding box is giving us a good outcome?
• IoU = Area of the intersection / Area of the union
• If IoU > 0.5, we accept the predicted bounding box

Rather than detecting an object just once, the network might detect it multiple times; Non-Max Suppression keeps only the best of the overlapping detections.
Image Source: deeplearning.ai
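A minimal sketch of IoU, and of how NMS would use it (the (x1, y1, x2, y2) corner format and the helper names are our assumptions):

```python
def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2) with x1 < x2, y1 < y2 (assumed format).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```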


Anchor Box
What if there are multiple objects in a single grid cell, i.e. the midpoints of both objects lie in the same grid cell?
Image Source: deeplearning.ai


Anchor Box
What if there are multiple objects in a single grid cell? Use multiple anchor boxes (here, anchor box 1 and anchor box 2).
• Since the shape of anchor box 1 is similar to the bounding box for the person, the person will be assigned to anchor box 1 and the car will be assigned to anchor box 2
• The output in this case, instead of 3 x 3 x 8 (using a 3 x 3 grid and 3 classes), will be 3 x 3 x 16 (since we are using 2 anchors)
Image Source: deeplearning.ai
You Only Look Once
• Training
  • 3 x 3 grid with two anchors per grid cell
  • 3 different object classes
  • y labels will have a shape of 3 x 3 x 16
• Suppose we use 5 anchor boxes per grid cell and the number of classes is increased to 5
  • each anchor then needs 5 + 5 = 10 numbers, so the target will be 3 x 3 x 10 x 5 = 3 x 3 x 50
• An input image of shape (608, 608, 3) gives an output volume of (19, 19, 425)
  • 5 is the number of anchor boxes per grid cell
  • How many classes are there? 425 / 5 = 85 = 5 + 80, so there are 80 classes
