
Deep Learning CNN 4th Unit

Deep learning notes for Unit 4, explaining the Convolutional Neural Network

Uploaded by

refineiq
Copyright
© © All Rights Reserved

UNIT 4

Convolutional Neural Network (CNN)

 A Convolutional Neural Network (CNN) is a type of deep learning
neural network architecture commonly used in computer vision.
Computer vision is a field of artificial intelligence that enables a
computer to understand and interpret images and other visual data.
 In deep learning, a convolutional neural network (CNN/ConvNet) is a
class of deep neural networks most commonly applied to analysing visual
imagery.
 A CNN is an extension of the artificial neural network (ANN) that is
predominantly used to extract features from grid-like data, for example
visual datasets such as images or videos, where spatial patterns play an
important role.
 Convolutional Neural Network consists of multiple layers like the input
layer, Convolutional layer, Pooling layer, and fully connected layers.
 The convolutional layer is the first layer of a convolutional network.
While convolutional layers can be followed by additional convolutional
layers or pooling layers, the fully-connected layer is the final layer.

 The convolutional layer applies filters to the input image to extract
features, the pooling layer downsamples the image to reduce
computation, and the fully connected layer makes the final prediction.
The network learns the optimal filters through backpropagation and
gradient descent.
 A complete convolutional neural network architecture is also known as
a ConvNet. A ConvNet is a sequence of layers, and every layer transforms
one volume to another through a differentiable function.
Types of layers:
Let’s take an example by running a ConvNet on an image of dimension 32x32x3.

 Input Layer: This is the layer through which we give input to the model.
In a CNN, the input is generally an image or a sequence of images. This
layer holds the raw input of the image with width 32, height 32, and depth
3.
 Convolutional Layers: These layers extract features from the input.
A convolutional layer applies a set of learnable filters, known as kernels,
to the input image. The filters/kernels are small matrices, usually of
shape 2×2, 3×3, or 5×5. Each filter slides over the input image and
computes the dot product between the kernel weights and the corresponding
input image patch. The output of this layer is referred to as feature maps.
Suppose we use a total of 12 filters for this layer; with padding that
preserves the spatial size, we get an output volume of dimension
32 x 32 x 12.
 Activation Layer: Activation layers add nonlinearity to the network by
applying an element-wise activation function to the output of the
convolution layer. Some common activation functions are ReLU: max(0, x),
Tanh, Leaky ReLU, etc. The volume remains unchanged, hence the output
volume will have dimensions 32 x 32 x 12.
 Pooling Layer: This layer is periodically inserted in the ConvNet. Its
main function is to reduce the size of the volume, which makes
computation faster, reduces memory use, and helps prevent overfitting.
Two common types of pooling layers are max pooling and average pooling.
If we use a max pool with 2 x 2 filters and stride 2, the resultant volume
will be of dimension 16x16x12.
 Flattening: After the convolution and pooling layers, the resulting
feature maps are flattened into a one-dimensional vector so they can be
passed into a fully connected layer for classification or regression.
 Fully Connected Layers: These take the input from the previous layer and
compute the final classification or regression output.

 Output Layer: For classification tasks, the output from the fully
connected layers is then fed into a function such as sigmoid or softmax,
which converts the output for each class into a probability score for that
class.
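
The layer-by-layer volumes quoted in the list above can be checked with a small shape trace. This is only a sketch: the helper names are hypothetical, and the convolution is assumed to use padding that preserves width and height, which is what the 32 x 32 x 12 figure implies.

```python
def conv_same(shape, num_filters):
    """Convolution assumed to be padded so height and width are preserved."""
    h, w, _ = shape
    return (h, w, num_filters)


def activation(shape):
    """An element-wise activation such as ReLU leaves the volume unchanged."""
    return shape


def max_pool(shape, size=2, stride=2):
    """Pooling shrinks height and width but keeps the depth."""
    h, w, d = shape
    return ((h - size) // stride + 1, (w - size) // stride + 1, d)


def flatten(shape):
    """Flattening turns the volume into a vector of h * w * d values."""
    h, w, d = shape
    return h * w * d


def trace(input_shape=(32, 32, 3), num_filters=12):
    shapes = {"input": input_shape}
    shapes["conv"] = conv_same(shapes["input"], num_filters)
    shapes["activation"] = activation(shapes["conv"])
    shapes["pool"] = max_pool(shapes["activation"])
    shapes["flatten"] = flatten(shapes["pool"])
    return shapes


for name, shape in trace().items():
    print(name, shape)
```

The flattened length (16 x 16 x 12 values) is what the fully connected layer would receive as input in this example.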

Convolution Layers
 The convolution layers are the initial layers that pull out features from the image.
Convolution maintains the relationship between pixels by learning features from small
squares of input data. It is a mathematical operation that takes two inputs, an image
matrix and a kernel or filter. The result is calculated by:

In the above image,

The image matrix is h x w x d

The dimensions of the filter are fh x fw x d

The output is calculated as (h - fh + 1) x (w - fw + 1) x 1

Now, let us take an example: a 5x5 image matrix whose pixel values are 0 or 1,
and a 3x3 filter matrix:
The element-wise multiplication and summation will work as follows

The final output matrix of the convolution of a 5x5 image with a 3x3 filter
will be:

Convolving the image with different filter values can produce effects such as
blurring or sharpening. For an m x m image and an n x n filter, the size of the
output image is calculated by:

(m - n + 1) x (m - n + 1)
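
The sliding dot product described above can be sketched in a few lines. The matrices below are illustrative values chosen for this sketch (the figures from the notes are not available), and the function name is hypothetical. As is usual in CNNs, the kernel is not flipped.

```python
def conv2d_valid(image, kernel):
    """Slide an n x n kernel over an m x m image with stride 1 and no padding."""
    m, n = len(image), len(kernel)
    out = m - n + 1  # output size from the formula (m - n + 1)
    result = []
    for i in range(out):
        row = []
        for j in range(out):
            total = 0
            for a in range(n):
                for b in range(n):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        result.append(row)
    return result


# Assumed example values: a 5x5 binary image and a 3x3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

for row in conv2d_valid(image, kernel):
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

The 3x3 result matches the size formula: 5 - 3 + 1 = 3 in each dimension.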

Strides
The stride is the number of pixels by which the filter shifts over the input
matrix at each step. When the stride is 1, we move the filter 1 pixel at a
time. Similarly, when the stride is 2, we move the filter 2 pixels at a time,
and so on. Strides are essential because they control how the filter is
convolved against the input: a larger stride skips positions, which gives a
smaller output and can miss fine features. They denote the number of steps we
move in each convolution. The following figure shows how the convolution would
work.

The figure shows three cases. A stride must be at least 1, so the stride = 0
label on the first matrix cannot be an actual stride value; the second and
third images use stride = 2. The size of the output image is calculated by:

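[(n + 2p - f)/s + 1] x [(n + 2p - f)/s + 1]

where n is the input size, p the padding, f the filter size, and s the stride;
the division is rounded down.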
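
A quick way to check output sizes for different strides is to code the standard formula floor((n + 2p - f)/s) + 1 directly. The function name and the sample values are assumptions made for this sketch.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size along one dimension: floor((n + 2p - f) / s) + 1."""
    if s < 1:
        raise ValueError("stride must be at least 1")
    if n + 2 * p < f:
        raise ValueError("filter is larger than the padded input")
    return (n + 2 * p - f) // s + 1


print(conv_output_size(5, 3))            # stride 1, no padding -> 3
print(conv_output_size(5, 3, s=2))       # stride 2, no padding -> 2
print(conv_output_size(7, 3, s=2))       # stride 2, no padding -> 3
print(conv_output_size(5, 3, p=1, s=2))  # stride 2, padding 1 -> 3
```

With s = 1 and p = 0 the expression reduces to n - f + 1, the no-padding formula given earlier in the notes.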

Pooling Technique in CNN


 In convolutional neural networks (CNNs), the pooling layer is a
common type of layer that is typically added after convolutional
layers. The pooling layer is used to reduce the spatial dimensions (i.e.,
the width and height) of the feature maps, while preserving the depth
(i.e., the number of channels).
 Padding plays a vital role in building a CNN. After each convolution
operation the image shrinks, and an image classification network has
multiple convolution layers, so the original image would shrink at every
step, which we don’t want.
 Secondly, when the kernel moves over the original image, it passes
over the middle pixels more times than the edge pixels, so the border
pixels contribute less to the output.
 To overcome these problems, padding was introduced. It adds extra rows
and columns (usually zeros) around the borders of the image so that the
output can preserve the size of the original picture. For example:
So, if an n x n matrix is convolved with an f x f matrix with a padding p, then
the size of the output image will be:

(n+2p-f+1) x (n+2p-f+1)
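
As a minimal sketch of the idea, the helper below surrounds a matrix with p rings of zeros and then applies the size formula above. Zero padding and the function names are assumptions for illustration; the notes do not say which padding value is used.

```python
def zero_pad(image, p):
    """Return a copy of a 2D list surrounded by p rows/columns of zeros."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]             # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)     # left/right border
    padded.extend([0] * width for _ in range(p))         # bottom border
    return padded


def padded_output_size(n, f, p):
    """Stride-1 output size: n + 2p - f + 1."""
    return n + 2 * p - f + 1


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
for row in zero_pad(image, 1):
    print(row)

# A 5x5 input with a 3x3 filter and padding 1 keeps its size.
print(padded_output_size(5, 3, 1))
```

With f = 3, a padding of p = 1 gives n + 2 - 3 + 1 = n, which is why that combination preserves the input size.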

Pooling
 The pooling layer is another building block of a CNN. When the feature
maps are too large, pooling shrinks them, which reduces the number of
parameters and the amount of computation in the layers that follow.
 When the picture is shrunk, the pixel density is also reduced; the
downscaled image is obtained from the output of the previous layers.
 Basically, its function is to progressively reduce the spatial size of the
image to reduce the network complexity and computational cost. Spatial
pooling is also known as downsampling or subsampling; it reduces the
dimensionality of each map but retains the essential features.
 A rectified linear activation function, or ReLU, is applied to each value in
the feature map before pooling. ReLU is a simple and effective nonlinearity:
it sets negative values to zero and leaves positive values unchanged, so the
size of the feature map does not change.
 Pooling is added after the nonlinearity is applied to the feature maps. There
are three types of spatial pooling (max, average, and sum pooling); the two
most common are described below:

1. Max Pooling

Max pooling takes the maximum of each region, which helps carry forward the
most important features from the image. It is a sample-based discretization
process. Its primary objective is to downscale an input by reducing its
dimensionality, allowing assumptions to be made about the features contained
in the sub-regions that were pooled.
2. Average Pooling

Unlike max pooling, average pooling also retains information about the less
important features. It downscales by dividing the input matrix into rectangular
regions and calculating the average value of each region.
OR

 The pooling operation involves sliding a two-dimensional filter over each
channel of the feature map and summarising the features lying within the
region covered by the filter.
For a feature map having dimensions nh x nw x nc, the dimensions of the output
obtained after a pooling layer are
((nh - f) / s + 1) x ((nw - f) / s + 1) x nc
where,

-> nh - height of feature map
-> nw - width of feature map
-> nc - number of channels in the feature map
-> f - size of filter
-> s - stride length

The divisions are rounded down. For example, a 32 x 32 x 12 volume pooled with
f = 2 and s = 2 gives 16 x 16 x 12.
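
Max and average pooling over a single channel can be sketched as below. The function name, the `mode` argument, and the 4x4 sample values are hypothetical; the output size follows the standard (n - f)/s + 1 rule with the division rounded down.

```python
def pool2d(channel, f, s, mode="max"):
    """Pool one channel (a 2D list) with an f x f window and stride s."""
    nh, nw = len(channel), len(channel[0])
    out_h = (nh - f) // s + 1
    out_w = (nw - f) // s + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the values covered by the window at this position.
            window = [channel[i * s + a][j * s + b]
                      for a in range(f) for b in range(f)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


channel = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [3, 4, 1, 8],
]
print(pool2d(channel, f=2, s=2, mode="max"))      # [[6, 4], [7, 9]]
print(pool2d(channel, f=2, s=2, mode="average"))  # [[3.75, 2.25], [4.0, 4.5]]
```

Applying the same function to every channel independently leaves the channel count nc unchanged, as the formula states.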

LeNet-5 Architecture
The network has 5 layers with learnable parameters and is hence named LeNet-5.
It has three convolution layers combined with average pooling layers.
After the convolution and average pooling layers, we have two fully connected
layers. At last, a Softmax classifier classifies the images into their
respective classes.

The input to this model is a 32 X 32 grayscale image, hence the number of
channels is one.

We then apply the first convolution operation with six filters of size 5X5. As
a result, we get a feature map of size 28X28X6. Here the number of channels is
equal to the number of filters applied.
After the first convolution operation, we apply average pooling, and the size
of the feature map is reduced by half to 14X14X6. Note that the number of
channels is intact.

Next, we have a convolution layer with sixteen filters of size 5X5. The
feature map changes again, to 10X10X16. The output size is calculated in a
similar manner. After this, we again apply an average pooling or subsampling
layer, which again reduces the size of the feature map by half, i.e. 5X5X16.

Then we have a final convolution layer with 120 filters of size 5X5, leaving
a feature map of size 1X1X120. Flattening this gives 120 values.

After these convolution layers, we have a fully connected layer with eighty-four
neurons. At last, we have an output layer with ten neurons since the data have
ten classes.

Here is the final architecture of the LeNet-5 model.


Fourth layer

Subsampling takes place, and the feature map size in this step is reduced to 5x5x16. In this
layer, the input to the last feature map comes from all the preceding feature maps.

Architecture Details
Let’s understand the architecture in more detail.

The first layer is the input layer with feature map size 32X32X1.

Then we have the first convolution layer with 6 filters of size 5X5 and stride 1. The
activation function used at this layer is tanh. The output feature map is 28X28X6.

Next, we have an average pooling layer with filter size 2X2 and stride 2. The resulting feature
map is 14X14X6, since the pooling layer doesn’t affect the number of channels.

After this comes the second convolution layer with 16 filters of 5X5 and stride 1. Also, the
activation function is tanh. Now the output size is 10X10X16.

Again comes another average pooling layer of 2X2 with stride 2. As a result, the size of the
feature map is reduced to 5X5X16.

The final convolution layer has 120 filters of 5X5 with stride 1 and activation function tanh.
Now the output size is 120.

Next is a fully connected layer with 84 neurons, which produces 84 output values; the
activation function used here is again tanh.

The last layer is the output layer with 10 neurons and a Softmax function. The Softmax gives
the probability that a data point belongs to a particular class. The class with the highest
probability is then predicted.

This is the entire architecture of the LeNet-5 model. The number of trainable parameters of
this architecture is around sixty thousand.
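
The feature map sizes and the parameter count quoted above can be reproduced with a short trace. This is a sketch under stated assumptions, not the original LeNet-5 implementation: every filter is assumed to see all input channels, each filter and neuron has one bias, and the pooling layers are assumed to have no trainable parameters. The helper names are hypothetical.

```python
def conv(shape, filters, f=5, s=1):
    """Unpadded convolution: returns the output shape and parameter count."""
    h, w, c = shape
    out = ((h - f) // s + 1, (w - f) // s + 1, filters)
    params = (f * f * c + 1) * filters  # weights plus one bias per filter
    return out, params


def avg_pool(shape, f=2, s=2):
    """2x2 average pooling with stride 2; assumed to have no parameters."""
    h, w, c = shape
    return ((h - f) // s + 1, (w - f) // s + 1, c), 0


def dense(inputs, units):
    """Fully connected layer: one weight per input plus a bias, per unit."""
    return (units,), (inputs + 1) * units


def lenet5_summary():
    rows = []
    shape = (32, 32, 1)
    for name, layer in [("conv1", lambda s: conv(s, 6)),
                        ("pool1", avg_pool),
                        ("conv2", lambda s: conv(s, 16)),
                        ("pool2", avg_pool),
                        ("conv3", lambda s: conv(s, 120))]:
        shape, params = layer(shape)
        rows.append((name, shape, params))
    size = shape[0] * shape[1] * shape[2]  # flatten 1x1x120 into 120 values
    for name, units in [("fc", 84), ("output", 10)]:
        shape, params = dense(size, units)
        rows.append((name, shape, params))
        size = units
    return rows


rows = lenet5_summary()
for name, shape, params in rows:
    print(name, shape, params)
print("total parameters:", sum(p for _, _, p in rows))  # 61706
```

Under these assumptions the total is 61,706, consistent with the "around sixty thousand" figure in the notes.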

AlexNet
The structure of AlexNet is similar to LeNet-5, but the main difference is that
it is much larger and deeper. It was the first convolutional neural network
that stacked convolutional layers directly on top of each other rather than
placing a pooling layer on top of each convolutional layer.

Before going on to the architecture of the AlexNet, we will get to know some
terms that will be useful in understanding the structure of AlexNet.

Stride

Stride basically denotes how far the filter will move over a convolution layer in
each step along one direction. In other words, if the value of stride is 1, then we
move the filter 1 pixel each time.

Let us understand stride using an example.


The above figure has a convolution layer of 5×5. A padding of 1 is applied
to the layer (the border of zeros surrounding it is the padding). A filter of
size 3×3 is applied to the layer. Now, let S be the stride; the dimension of
the next layer after applying the filter will be (W - F + 2×P)/S + 1. Here W
is the layer's width, F is the size of the filter that is to be applied, P is
the size of the padding, and S is the size of the stride.

If the value of S is 2, then the resultant layer will be of size
(5-3+2)/2+1, i.e., 3×3.

Kernels and filters

A 2D matrix consisting of weights is called a kernel. A filter can be referred
to as multiple kernels stacked together. In other words, a filter is the 3D
structure of multiple kernels stacked on top of each other.
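
To make the kernel/filter distinction concrete, the sketch below applies one filter (a stack of three 2×2 kernels) to a three-channel patch: each kernel is multiplied with its own channel and the results are summed into a single output value. The values and the function name are made up for illustration.

```python
def apply_filter(patch, filt):
    """patch and filt are lists of 2D matrices, one matrix per channel."""
    total = 0
    for channel, kernel in zip(patch, filt):
        for patch_row, kernel_row in zip(channel, kernel):
            for x, w in zip(patch_row, kernel_row):
                total += x * w
    return total


# A 2x2 patch with 3 channels (assumed values).
patch = [
    [[1, 2], [3, 4]],
    [[0, 1], [1, 0]],
    [[2, 2], [2, 2]],
]
# One filter = three 2x2 kernels, one per input channel.
filt = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
    [[0, 1], [0, 0]],
]
print(apply_filter(patch, filt))  # 5 + 2 + 2 = 9
```

One filter therefore produces one output channel, which is why the output depth of a convolution layer equals the number of filters.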

Dropout regularization

Dropout is a regularization mechanism that improves the training of neural
networks by randomly omitting hidden units, which reduces overfitting. A
dropped neuron does not contribute to the forward pass or to backpropagation
for that training step. Dropout can increase the number of iterations needed
to converge.
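
A minimal sketch of dropout on a list of activations follows. It uses the "inverted dropout" convention, where the kept activations are scaled up during training so that nothing needs to change at test time; that scaling choice, the function name, and the parameters are assumptions of this sketch and are not stated in the notes.

```python
import random


def dropout(activations, drop_prob, rng=None, training=True):
    """Zero each activation with probability drop_prob during training."""
    if not 0 <= drop_prob < 1:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0:
        return list(activations)  # no units are dropped at test time
    rng = rng or random.Random()
    keep_prob = 1 - drop_prob
    # Kept units are scaled by 1 / keep_prob (inverted dropout, an assumption).
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


hidden = [0.5, 1.0, 1.5, 2.0]
print(dropout(hidden, drop_prob=0.5, rng=random.Random(0)))
print(dropout(hidden, drop_prob=0.5, training=False))
```

Because a different random subset of units is dropped at every step, the network cannot rely on any single hidden unit.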

Max Pooling

Max pooling is an operation where the maximum value is calculated for the
patches of the feature map. This method is used to make a downsampled feature
map. It is generally used after the convolutional layer.

The architecture of AlexNet


AlexNet consists of a total of 8 learned layers (five convolutional and three
fully connected), not counting the input and pooling layers. Let us discuss
the layers of AlexNet briefly.

Input Layer

The input layer of AlexNet accepts an image of size 227×227×3. Here 227×227
defines the height and width of the input image, and the factor of 3 is for the
RGB channels of the image.

Output Layer
The output layer consists of 1000 fully connected neurons. The size of the
output layer will be 1000×1×1. The size of this layer is 1000 because the
ImageNet classification task has 1000 classes.
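
Earlier in these notes the output layer is described as passing the class scores through softmax to obtain probabilities. A minimal sketch of that step is shown below; the function names and the sample scores are hypothetical, and subtracting the maximum score is a common numerical-stability trick rather than something the notes specify.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Return the index of the class with the highest probability."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


scores = [2.0, 1.0, 0.1]  # assumed scores for a 3-class toy example
print(softmax(scores))
print(predict(scores))    # index 0 has the largest score

# The same function works for a 1000-way output layer.
print(len(softmax([0.0] * 1000)))
```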

Implementation of AlexNet
Let us see the diagram for the AlexNet and then we will implement it
accordingly.
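
Since the diagram is not reproduced here, the sketch below only traces feature map sizes through AlexNet using the formula (W - F + 2×P)/S + 1 from the stride section. The filter counts, filter sizes, strides, and paddings are taken from commonly cited descriptions of AlexNet, not from these notes, so treat them as assumptions; the layer names are hypothetical labels.

```python
# (name, kind, filters, filter size F, stride S, padding P)
LAYERS = [
    ("conv1", "conv", 96, 11, 4, 0),
    ("pool1", "pool", None, 3, 2, 0),
    ("conv2", "conv", 256, 5, 1, 2),
    ("pool2", "pool", None, 3, 2, 0),
    ("conv3", "conv", 384, 3, 1, 1),
    ("conv4", "conv", 384, 3, 1, 1),
    ("conv5", "conv", 256, 3, 1, 1),
    ("pool3", "pool", None, 3, 2, 0),
]
FC_UNITS = [("fc6", 4096), ("fc7", 4096), ("fc8", 1000)]


def out_size(w, f, s, p):
    """(W - F + 2P) / S + 1, with the division rounded down."""
    return (w - f + 2 * p) // s + 1


def alexnet_shapes(input_shape=(227, 227, 3)):
    h, w, c = input_shape
    shapes = [("input", input_shape)]
    for name, kind, filters, f, s, p in LAYERS:
        h, w = out_size(h, f, s, p), out_size(w, f, s, p)
        if kind == "conv":
            c = filters  # a conv layer outputs one channel per filter
        shapes.append((name, (h, w, c)))
    shapes.append(("flatten", (h * w * c,)))
    for name, units in FC_UNITS:
        shapes.append((name, (units,)))
    return shapes


for name, shape in alexnet_shapes():
    print(name, shape)
```

The five convolutional and three fully connected entries correspond to the 8 learned layers mentioned above, and the last entry has the 1000 outputs of the output layer.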

OR
Working of Alex-Net
