CNN Notes

This document discusses CNN models and their components. A CNN consists of feature extraction layers such as convolution and pooling layers, as well as classification layers such as fully connected layers. Convolution layers automatically learn features from images like edges, shapes and objects. Pooling layers reduce the dimensionality of feature maps. ReLU layers introduce non-linearity, while flattening layers convert feature maps to vectors for the fully connected layers to perform classification.

A CNN model can be thought of as a combination of two components: a feature extraction part and a classification part. The convolution + pooling layers perform feature extraction. For example, given an image, the convolution layers detect features such as two eyes, long ears, four legs, a short tail and so on. The fully connected layers then act as a classifier on top of these features and assign a probability that the input image is a dog.
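As a rough illustration (a sketch, not taken from these notes), the two parts map directly onto the layer stack of a small Keras model; the 128 x 128 x 3 input size, the filter counts and the dog/not-dog output below are arbitrary assumptions:

from tensorflow.keras import layers, models

model = models.Sequential([
    # Feature extraction part: convolution + pooling layers
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Classification part: fully connected layers on top of the extracted features
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability that the image is a dog
])
model.summary()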

The convolution layers are the main powerhouse of a CNN model. Automatically detecting meaningful features given only an image and a label is not an easy task. The convolution layers learn such complex features by building on top of each other: the first layers detect edges, the next layers combine them to detect shapes, and the following layers merge this information to infer, for instance, that this is a nose. To be clear, the CNN doesn't know what a nose is; by seeing a lot of them in images, it learns to detect that as a feature. The fully connected layers learn how to use the features produced by the convolutions in order to correctly classify the images.

Why do we prefer Convolutional Neural Networks (CNN) over Artificial Neural Networks (ANN) for image data as input?

1. Feedforward neural networks can learn a single feature representation of the image, but in the case of complex images an ANN will fail to give better predictions because it cannot learn the pixel dependencies present in the images.

2. A CNN can learn multiple layers of feature representations of an image by applying filters, or transformations.

3. In a CNN, the number of parameters the network has to learn is significantly lower than in a multilayer fully connected network, since the number of units in the network decreases, therefore reducing the chance of overfitting (see the rough comparison after this list).

4. A CNN also considers the context information in a small neighborhood of each pixel, and due to this it is very important for achieving better predictions on data like images. Since digital images are large grids of pixel values, it makes sense to use a CNN to analyze them: the network progressively reduces the size of these representations, which keeps the training phase cheaper in computational power with little information loss.
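As a back-of-the-envelope comparison for point 3 (illustrative numbers, not from the notes), a small convolution layer has a few hundred parameters regardless of the image size, while a single fully connected layer on the flattened image already has millions:

h, w, c = 224, 224, 3   # an assumed RGB input image

# Convolution layer: 32 filters of size 3x3 over 3 input channels, plus 32 biases.
# The count is independent of the image size because the weights are shared.
conv_params = 3 * 3 * c * 32 + 32          # = 896

# Fully connected layer: every flattened pixel connected to 100 hidden units, plus biases.
dense_params = h * w * c * 100 + 100       # = 15,052,900

print(conv_params, dense_params)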

Explain the different layers in CNN.

The different layers involved in the architecture of CNN are as follows:

1. Input Layer: The input layer in a CNN should contain the image data. Image data is represented by a three-dimensional matrix. We have to reshape the image into a single column.

For example, suppose we have the MNIST dataset and an image of dimension 28 x 28 = 784; we need to convert it into a 784 x 1 column before feeding it to the input. If we have k training examples in the dataset, the dimension of the input will be (784, k).

2. Convolutional Layer: This layer performs the convolution operation, creating several smaller picture windows that slide over the data.

3. ReLU Layer: This layer introduces non-linearity into the network and converts all the negative pixel values to zero. The output is a rectified feature map.

4. Pooling Layer: Pooling is a down-sampling operation that reduces the dimensionality of the feature map.

5. Fully Connected Layer: This layer identifies and classifies the objects in the image.

6. Softmax / Logistic Layer: The softmax or logistic layer is the last layer of the CNN and resides at the end of the fully connected layers. Logistic is used for binary classification problems and softmax for multi-class classification problems.

7. Output Layer: This layer contains the label in the form of a one-hot encoded vector.
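One way to line these layers up with code (a sketch, not from the notes) is a small Keras model on the 28 x 28 MNIST images mentioned above; note that the Keras convolution layers take the 2-D image directly, and the flattening happens just before the fully connected part:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),           # 1. Input layer (MNIST image)
    layers.Conv2D(32, (3, 3)),                # 2. Convolutional layer
    layers.Activation("relu"),                # 3. ReLU layer
    layers.MaxPooling2D((2, 2)),              # 4. Pooling layer
    layers.Flatten(),
    layers.Dense(128, activation="relu"),     # 5. Fully connected layer
    layers.Dense(10, activation="softmax"),   # 6./7. Softmax + output layer (10 classes)
])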

Explain the significance of the ReLU activation function in Convolutional Neural Networks.

ReLU Layer – After each convolution operation, the ReLU operation is applied. ReLU is a non-linear activation function. The operation is applied to each pixel and replaces all the negative pixel values in the feature map with zero.

An image is usually highly non-linear, i.e. its pixel values vary in ways that are very difficult for a purely linear model to predict correctly. The convolution itself is a linear operation, so the ReLU activation is applied after it to supply the non-linearity the network needs and make the job easier.

Therefore this layer helps in the detection of features: by converting negative pixel values to zero it introduces non-linearity into the network and allows it to detect the variations of features.

In short, non-linearity is introduced into the convolution (a linear operation) by using a non-linear activation function like ReLU.
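A minimal sketch of the operation itself (invented numbers): every negative value in the feature map becomes zero and positive values pass through unchanged, i.e. f(x) = max(0, x).

import numpy as np

feature_map = np.array([[-3.0, 1.5],
                        [ 0.2, -0.7]])

rectified = np.maximum(feature_map, 0)   # ReLU applied element-wise
print(rectified)
# [[0.  1.5]
#  [0.2 0. ]]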

Why do we use a Pooling Layer in a CNN?

A CNN uses pooling layers to reduce the size of the representation of the input image, which speeds up the computation of the network.

Pooling (or spatial pooling) layers are also called subsampling or downsampling layers.

• Pooling is applied after the convolution and ReLU operations.
• It reduces the dimensionality of each feature map while retaining the most important information.
• Without it, the number of hidden layers required to learn the complex relations present in the image would be large.

As a result of pooling, even if the picture were a little tilted, the largest number in a certain region of the feature map would still be recorded and hence the feature would be preserved. As another benefit, reducing the size by a significant amount requires less computational power. Pooling is therefore also useful for extracting dominant features.
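A small max-pooling sketch (made-up numbers): a 2 x 2 window with stride 2 keeps only the largest value in each region, halving the height and width of the feature map.

import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [0, 1, 8, 5],
                 [2, 3, 4, 7]])

# Group the 4x4 map into 2x2 blocks and take the maximum of each block.
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 2]
#  [3 8]]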

Explain the role of the flattening layer in CNN.

After a series of convolution and pooling operations on the feature representation of the image, we flatten the output of the final pooling layer into a single long continuous linear array, or vector.

The process of converting all the resultant 2-D arrays into a vector is called flattening.

The flattened output is fed as input to the fully connected neural network, which has a varying number of hidden layers, to learn the non-linear complexities present in the feature representation.
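For example (assumed shapes, not from the notes), a final pooling output of 5 x 5 with 16 channels flattens into a 400-element vector:

import numpy as np

pool_output = np.random.rand(5, 5, 16)   # (height, width, channels) of the last pooling layer
flat = pool_output.reshape(-1)           # what a Flatten layer does
print(flat.shape)                        # (400,)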

What are the hyperparameters of a Pooling Layer?

The hyperparameters for a pooling layer are:

• Filter size
• Stride
• Max or average pooling
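In Keras these hyperparameters appear directly as layer arguments (the values below are just illustrative):

from tensorflow.keras import layers

max_pool = layers.MaxPooling2D(pool_size=(2, 2), strides=2)      # filter size 2x2, stride 2, max pooling
avg_pool = layers.AveragePooling2D(pool_size=(3, 3), strides=1)  # average pooling instead of max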

What is the role of the Fully Connected (FC) Layer in CNN?

The aim of the fully connected layer is to use the high-level features of the input image, produced by the convolutional and pooling layers, to classify the input image into various classes based on the training dataset.

Fully connected means that every neuron in the previous layer is connected to each and every neuron in the next layer. The sum of the output probabilities from the fully connected part is 1, which is achieved by using a softmax activation function in the output layer.

The softmax function takes a vector of arbitrary real-valued scores and transforms it into a vector of values between 0 and 1 that sum to 1.
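A plain-NumPy sketch of that transformation (not from the notes):

import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, -1.0]))
print(probs)          # roughly [0.705 0.259 0.035]
print(probs.sum())    # 1.0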

Working

It works like an ANN: random weights are assigned to each synapse, the weighted input is passed through an activation function, and the output is compared with the true values. The resulting error is back-propagated, i.e. the weights are re-calculated, and the whole process is repeated. This is done until the error, or cost function, is minimized.
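In Keras, this loop of forward pass, error computation and weight updates is what compile and fit run; a sketch, reusing the model defined earlier, where x_train and y_train are placeholders for your training data:

model.compile(optimizer="sgd",                   # weight updates via gradient descent
              loss="categorical_crossentropy",   # the error / cost function
              metrics=["accuracy"])

# Each epoch: forward pass, compare with the true labels, back-propagate, update the weights.
model.fit(x_train, y_train, epochs=10)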

Briefly explain the two major steps of CNN, i.e., Feature Learning and Classification.

Feature Learning is the stage in which the algorithm learns features from the dataset. Components like Convolution, ReLU, and Pooling work for that, with numerous iterations between them. Once the features are known, classification happens using the Flattening and Full Connection components.

VGG is a convolutional neural network from researchers at Oxford's Visual Geometry Group, hence the name VGG. It was the runner-up of the 2014 ImageNet classification challenge with a 7.3% error rate. ImageNet is the most comprehensive hand-annotated visual dataset, and competitions are held every year where researchers from all around the world compete. All the famous CNN architectures make their debut at that competition.

VGG is a very fundamental CNN model. It's the first one that comes to mind if you need to use an off-the-shelf model for a particular task. The paper is also very well written. There are much more complicated models which perform better; for example, Microsoft's ResNet model was the winner of the 2015 ImageNet challenge with a 3.6% error rate, but that model has 152 layers! We will cover all these CNN architectures in depth in another article.
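The pretrained VGG16 model ships with keras.applications, so a quick way to get the model examined below is (a sketch):

from tensorflow.keras.applications import VGG16

vgg = VGG16(weights="imagenet", include_top=True)
vgg.summary()   # lists the layer names block1_conv1 ... block5_conv3 used below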

We will visualize the 3 most crucial components of the VGG model:

• Feature maps
• Convnet filters
• Class output
We will visualize the feature maps to see how the input is
transformed passing through the convolution layers. The feature
maps are also called intermediate activations since the output of a
layer is called the activation.

Remember that the output of a convolution layer is a 3D volume. As we discussed above, the height and width correspond to the dimensions of the feature map, and each depth channel is a distinct feature map encoding independent features. So we will visualize individual feature maps by plotting each channel as a 2D image.

How to visualize the feature maps is actually pretty simple. We pass an input image through the CNN and record the intermediate activations. We then randomly select some of the feature maps and plot them.
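A sketch of that procedure in Keras (assuming the vgg model loaded above; img is a placeholder for a preprocessed input image of shape (1, 224, 224, 3)):

import matplotlib.pyplot as plt
from tensorflow.keras.models import Model

# Build a model that returns the output (activation) of every convolution layer.
conv_layers = [l for l in vgg.layers if "conv" in l.name]
activation_model = Model(inputs=vgg.input,
                         outputs=[l.output for l in conv_layers])

activations = activation_model.predict(img)   # one activation volume per conv layer
first = activations[0]                        # block1_conv1, shape (1, 224, 224, 64)

# Plot the first 8 feature maps (channels) of block1_conv1.
for i in range(8):
    plt.subplot(1, 8, i + 1)
    plt.imshow(first[0, :, :, i], cmap="viridis")
    plt.axis("off")
plt.show()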

VGG convolutional layers are named as follows: blockX_convY. For example, the second convolution layer in the third convolution block is called block3_conv2. In the architecture diagram above it corresponds to the second purple filter.

For example, one of the feature maps from the output of the very first layer (block1_conv1) looks as follows. Bright areas are the "activated" regions, meaning the filter detected the pattern it was looking for. This filter seems to encode an eye and nose detector.

The following figure displays 8 feature maps per layer. block1_conv1 actually contains 64 feature maps, since we have 64 filters in that layer, but we are only visualizing the first 8 per layer in this figure.
• The first-layer feature maps (block1_conv1) retain most of the information present in the image. In CNN architectures the first layers usually act as edge detectors.
• As we go deeper into the network, the feature maps look less like the original image and more like an abstract representation of it. As you can see in block3_conv1 the cat is somewhat visible, but after that it becomes unrecognizable. The reason is that deeper feature maps encode high-level concepts like "cat nose" or "dog ear" while lower-level feature maps detect simple edges and shapes. That's why deeper feature maps contain less information about the image and more about the class of the image. They still encode useful features, but they are less visually interpretable by us.
• The feature maps become sparser as we go deeper, meaning the filters detect fewer features. This makes sense because the filters in the first layers detect simple shapes, and every image contains those; but as we go deeper we start looking for more complex stuff like "dog tail", and such features don't appear in every image. That's why, in the first figure with 8 filters per layer, we see more of the feature maps go blank as we go deeper (block4_conv1 and block5_conv1).
