
Chapter 4

Convolutional Neural Networks


4.1. Introduction to CNN
• Have you ever wondered
– how face recognition works on social media?
– how object detection helps in building self-driving cars?
– how disease detection is done using images in healthcare?
• It is all possible thanks to convolutional neural networks (CNN).
• They have been used in image recognition since the 1980s.
• CNN emerged from the study of the brain’s visual cortex.
• The architecture of a CNN is analogous to that of the connectivity
pattern of neurons in the human brain and was inspired by the
organization of the Visual Cortex.
• Individual neurons respond to stimuli only in a restricted region of
the visual field known as the Receptive Field.
• A collection of such fields overlap to cover the entire visual area.
4.1. Introduction to CNN…
• Yann LeCun is the pioneer of convolutional neural networks.
• He built a successful CNN called LeNet-5 in 1998.
• LeNet-5 was used for OCR tasks like reading zip codes and digits.
• Yann LeCun et al. created the initial form of LeNet called
LeNet-1 in 1989.
• In recent years, thanks to the increase in computational power, the
amount of available training data, and the tricks for training deep
nets, CNNs have managed to achieve superhuman performance on
some complex visual tasks.
• They power image search services, self-driving cars, automatic
video classification systems, and more.
• Moreover, CNNs are not restricted to visual perception:
– they are also successful at many other tasks, such as voice recognition
and natural language processing.
4.1. Introduction to CNN…

Figure important milestones in CNN development


• Today, digital images are everywhere because of the pervasive
presence of digital cameras, webcams, and mobile phones.
• Because capturing images has become so easy, a new, huge stream of
data is provided by images.
• Being able to process images opens the doors to new applications in
fields such as robotics, autonomous driving, medicine, security, and
surveillance.
• A CNN is comprised of one or more convolutional layers and then
followed by one or more fully connected layers as in a standard
multilayer neural network.
• The architecture of a CNN is designed to take advantage of the 2D
structure of images (or other 2D inputs such as a speech signal).
• This is achieved with local connections and tied weights followed by
some form of pooling which results in translation invariant features.
• Another benefit of CNN:
– easier to train and have many fewer parameters than fully connected
networks with the same number of hidden units.
• There are some more fundamental difficulties in computer vision
tasks.
• For example, you may be able to hard code a detector that finds
triangles in an image, but the logic may fail if the triangle is
shifted, rotated, or skewed somehow.
• As humans we know that the triangle is still a triangle despite
being moved to another part of the image, but capturing the
essence of “triangleness” may be very difficult in code.
• This is one reason for the desire to create models that can learn
how to determine “triangleness” on their own by looking at lots of
examples of triangles.
• If the fundamental patterns of these objects and their common
variations can be learned, a much more robust model can be built.
Figure image transformations
• The image recognition process is generally considered to be a
representation learning process.
• Starting from the original pixel features received, it gradually
extracts
– low-level features such as edges and corners,
– then mid-level features such as textures, and
– then high-level features such as object parts.
• The last network layer learns classification logic based on these
learned abstract feature representations.
• The higher the layer, the more abstract and discriminative the learned
features become, and the easier the classifier's task is.
• From the perspective of representation learning, convolutional
neural networks extract features layer by layer, and the process of
network training can be considered as a feature learning process.
• Based on the learned high-level abstract features, classification
tasks can be conveniently performed.
4.1. Introduction to CNN…
• Applying the idea of representation learning, a well-trained CNN
can often learn better features.
• This feature extraction method is generally universal.
• For example, learning the representation of head, foot, body, and
other characteristics on cat and dog tasks can also be used to some
extent on other animals.
• Based on this idea, the first few feature extraction layers of the
deep neural network trained on task A can be migrated to task B,
and only the classification logic of task B (represented as the last
layer of the network) needs to be trained.
• This method is a type of transfer learning, also known as
fine-tuning.
4.1. Introduction to CNN…
• The fundamental difference between a densely connected layer
and a convolution layer is this:
– dense layers learn global patterns in their input feature space
(patterns involving all pixels)
– convolution layers learn local patterns—in the case of images,
patterns found in small 2D windows of the inputs.

Figure Images can be broken into local patterns such as edges, textures, and so on
• This key characteristic gives convnets two interesting properties:
• 1). The patterns they learn are translation-invariant.
• After learning a certain pattern in the lower-right corner of a
picture, a convnet can recognize it anywhere, for example, in the
upper-left corner.
• A densely connected model would have to learn the pattern anew
if it appeared at a new location.
• This makes CNNs data-efficient when processing images
(because the visual world is fundamentally translation-invariant):
– they need fewer training samples to learn representations that have
generalization power.
• 2). They can learn spatial hierarchies of patterns.
• A first convolution layer will learn small local patterns such as
edges, a second convolution layer will learn larger patterns made
of the features of the first layers, and so on.
• This allows convnets to efficiently learn increasingly complex and
abstract visual concepts, because the visual world is
fundamentally spatially hierarchical.
4.1. Introduction to CNN…
• The objective of the convolution operation is to extract features,
such as edges, from the input image.
• Conventionally, the first convolutional layer is responsible for
capturing low-level features such as edges, color, gradient
orientation, etc.
• With added layers, the architecture adapts to the high-level
features as well, giving us a network that has a full understanding
of images in the dataset, similar to how we would.
• Convolutional layers:
– can learn to extract features on their own
– can detect these features regardless of location in the image
– when stacked together, they can take advantage of the hierarchical
nature of most image data.
Figure Hierarchical feature extraction in CNNs

• Convolutional neural networks effectively learn hierarchical
feature representations from the training data.
• This means that they typically learn the most basic shapes early in
the network and then more and more complex shapes deeper in
the network.

Figure The visual world forms a spatial hierarchy of visual modules
4.1. Introduction to CNN…
• Imagine that you want to detect an object in an image.
• It is reasonable that whatever method we use to recognize objects
should not be overly concerned with the precise location of the
object in the image.
• Ideally, the system should exploit this knowledge.
• Cats usually do not fly and planes usually do not swim.
• Nonetheless, we should still recognize a cat were one to appear at
the top of the image.
• CNNs systematize this idea of spatial invariance, exploiting it to
learn useful representations with fewer parameters.
• We can now make these intuitions more concrete by enumerating
a few desiderata to guide our design of a neural network
architecture suitable for computer vision:
– In the earliest layers, our network should respond similarly to
the same patch, regardless of where it appears in the image.
This principle is called translation invariance.
– The earliest layers of the network should focus on local
regions, without regard for the contents of the image in distant
regions. This is the locality principle. Eventually, these local
representations can be aggregated to make predictions at the
whole image level.
• Instead of connecting every neuron to every pixel in the input
image, CNNs use small, localized regions of the input image
called receptive fields.
• Each neuron in a convolutional layer is connected only to a small
region of the input, which allows the network to focus on local
patterns like edges, textures, or shapes.
4.2. The Basic Structure of CNN
• CNNs are distinguished from other neural networks by their
superior performance with image, speech, and audio signal inputs.
• The following figure depicts the three groups:
– 1. Input layer
– 2. Feature-extraction (learning) layers
– 3. Classification layers
• The input layer accepts three-dimensional input generally in the
form of the size (width × height) of the image and a depth
representing the color channels (three for RGB color channels).
• Hence, the input layer has three dimensions: width, height, and
depth.
4.2. The Basic Structure of CNN…

Figure CNN architecture


• CNNs have three main types of layers:
1. Convolutional layer
2. Pooling layer
3. Fully-connected (FC) layer

1. Convolutional Layer
• The convolutional layer is the core building block of a CNN, and
it is where the majority of computation occurs.
• It requires a few components, which are input data, a filter, and a
feature map.
• Let us assume that the input will be a color image, which is made
up of a tensor of pixels in 3D.
• This means that the input will have three dimensions—a height,
width, and depth—which correspond to RGB in an image.
• We also have a feature detector, also known as a kernel or a filter,
which will move across the receptive fields of the image,
checking if the feature is present.
• This process is known as a convolution.
4.2. The Basic Structure of CNN…
• The feature detector is a two-dimensional (2-D) array of weights,
which represents part of the image.
• While they can vary in size, the filter size is typically a 3x3
matrix; this also determines the size of the receptive field.
• The filter is then applied to an area of the image, and a dot
product is calculated between the input pixels and the filter.
• This dot product is then fed into an output array.
• Afterwards, the filter shifts by a stride, repeating the process until
the kernel has swept across the entire image.
• The final output from the series of dot products from the input and
the filter is known as a feature map, activation map, or a
convolved feature.
4.2. The Basic Structure of CNN…
• After each convolution operation, a CNN applies a ReLU transformation
to the feature map, introducing nonlinearity to the model.
• Another convolution layer can follow the initial convolution layer.
• When this happens, the structure of the CNN can become hierarchical as
the later layers can see the pixels within the receptive fields of prior
layers.
• As an example, let us assume that we are trying to determine if an image
contains a bicycle.
• You can think of the bicycle as a sum of parts.
• It is comprised of a frame, handlebars, wheels, pedals, etc.
• Each individual part of the bicycle makes up a lower-level pattern in the
neural net, and the combination of its parts represents a higher-level
pattern, creating a feature hierarchy within the CNN.
4.2. The Basic Structure of CNN…
• Convolution is a mathematical operation that allows the merging
of two sets of information.
• In the case of CNN, convolution is applied to the input data to
filter the information and produce a feature map.
• This filter is also called a kernel, or feature detector, and its
dimensions can be, for example, 3x3.
• To perform convolution, the kernel slides over the input image,
performing element-wise multiplication at each position.
• The result for each receptive field (the area where convolution
takes place) is written down in the feature map.
• We continue sliding the filter until the feature map is complete.
• A kernel is a 2-dimensional array of weights.
• The weights associated with the convolutional layers in a CNN
are what make up the kernels.
• Until the weights are trained, none of the kernels know which
“features” they should detect.
• So, if each kernel is just an array of weights, how do these
weights operate on the input image during convolution?
• The network performs an element-wise multiplication between the
kernel (weight) and the input pixels within its receptive field, then
sums everything up and sends that value to the output array.
• In the two-dimensional cross-correlation operation, we begin with
the convolution window positioned at the upper-left corner of the
input tensor.
• And then slide it across the input tensor, both from left to right
and top to bottom.
• When the convolution window slides to a certain position, the
input subtensor contained in that window and the kernel tensor are
multiplied elementwise and the resulting tensor is summed up
yielding a single scalar value.
• This result gives the value of the output tensor at the
corresponding location.

4.2. The Basic Structure of CNN…
• The same filter array of weights is multiplied with each different
window region of the image.
• The question is “what are the best weight values in the filter
array?”.
• The weights in the filter arrays are learned from the training data.
• Hence, we don’t have to manually set the weights of the filter
array.
• Recall that strictly speaking, convolutional layers are a misnomer.
• The operations they express are more accurately described as
cross-correlations.
• In such a layer, an input tensor and a kernel tensor are combined
to produce an output tensor through a cross-correlation operation.

4.2. The Basic Structure of CNN…
• Another example of a convolution operation: a 3×3 kernel slides over a
5×5 binary input, and each output value is the sum of the element-wise
products within the current window, for example:
1×1 + 0×1 + 1×1 + 0×0 + 1×1 + 0×1 + 1×0 + 0×0 + 1×1 = 4
1×1 + 0×1 + 1×0 + 0×1 + 1×1 + 0×1 + 1×0 + 0×1 + 1×1 = 3
(and so on for the remaining output positions)
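• A minimal NumPy sketch of this sliding-window computation (the 5×5 binary
input and 3×3 kernel below are assumed values, chosen to be consistent with
the sums above):

import numpy as np

def cross_correlate2d(X, K):
    """Slide kernel K over input X (stride 1, no padding) and sum the element-wise products."""
    out_h = X.shape[0] - K.shape[0] + 1
    out_w = X.shape[1] - K.shape[1] + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(X[i:i + K.shape[0], j:j + K.shape[1]] * K)
    return out

# Assumed 5x5 binary input and 3x3 kernel.
X = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0]], dtype=float)
K = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]], dtype=float)
print(cross_correlate2d(X, K))   # 3x3 feature map; the top-left entry is 4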
4.2. The Basic Structure of CNN…
i. Stride
• When working with a convolutional layer, you may need to get an
output that is smaller than the input.
• Ways to achieve this:
– One way is to use a pooling layer
– Another way is using stride
• The idea behind stride is to skip some areas when the kernel slides
over: for example, skipping every 2 or 3 pixels.
• It reduces spatial resolution and makes the network more
computationally efficient.
• Stride defines the step by which the kernel moves.
• For example, a stride of 1 makes the kernel slide by one row/column
at a time, and a stride of 2 moves the kernel by 2 rows/columns.
4.2. The Basic Structure of CNN…
• For a convolutional or a pooling operation, the stride S denotes
the number of pixels by which the window moves after each
convolution operation.
• A normal filter would be moved 1 pixel over, having a stride
length of 1.
• A stride length of 3 would move the filter over by 3 pixels each
time.
• Since the number of outputs depends on how many positions the
filter is applied to, a larger stride will produce fewer outputs.
• This reduces the resolution of the resulting array from the
convolution.
• Selecting stride length depends on the desired effect and/or
computing resource constraints or efficiency goals and can be
treated as a hyperparameter.
4.2. The Basic Structure of CNN…

Figure stride size of 2


4.2. The Basic Structure of CNN…
• With stride s=2, the output shrinks a lot faster.
• Take the following example.
• Compared with the previous situation (s = 1), the output height
and width are reduced from 3 × 3 to 2 × 2.
• The number of receptive fields is also reduced to only 4.

4.2. The Basic Structure of CNN…
ii. Padding
• When designing a network model, it is sometimes desired that the
height and width of the output can be the same as the height and
width of the input.
• In order to make the height and width of the output equal to that
of the input, it is common to increase the input by padding
elements on the height and width of the original input.
• Padding expands the input matrix by adding fake pixels to the
borders of the matrix.
• This is done because convolution operation reduces the size of the
matrix.
• For example, a 5x5 matrix turns into a 3x3 matrix when a filter
goes over it.
• To preserve the size of the matrix, padding should be added to the
matrix i.e. image.
• Padding means adding extra data (pixels) to the edges of your
image.
• By carefully designing the number of filling units, the height and
width of the output after the convolution operation can be equal to
the original input, or even larger.
• Unless your filter is of size 1x1 and stride 1, the result of the
convolution will be a matrix of smaller size (i.e. lower resolution)
than the original image.
• Additionally, the pixels at the edges of an image are seen less by
the convolution operation compared with pixels farther in.
• Padding can alleviate these issues.
• The amount of padding depends on the goal, e.g. maintaining the
same resolution, which is termed “same padding”.
• The most common data value to use for padding is 0.
4.2. The Basic Structure of CNN…
• In the figure, one row is filled in the upper and lower directions,
and two columns are filled in the left and right directions.

Figure matrix before and after padding


• So how do we calculate the convolutional layer output after padding?
• We can simply replace the input X with the new tensor X'
obtained after padding.
• As shown in the figure below, the initial position of the receptive
field is at the upper left of X′.
• Similar as before, the output 1 is obtained and written to the
corresponding position of the output tensor.

Figure Convolution operation after padding


4.2. The Basic Structure of CNN…
• After the convolution is done, we get a matrix that has the same
dimension as the original matrix.

Figure convolution after padding



• CNNs commonly use convolution kernels with odd height and
width values, such as 1, 3, 5, or 7.
• Choosing odd kernel sizes has the benefit that we can preserve the
spatial dimensionality while padding with the same number of
rows on top and bottom, and the same number of columns on left
and right.
• Conventionally, there are two types of padding:
– Valid padding: no padding is added to the image. Hence, after the
convolution operation, the size of the output tensor decreases.
– Same padding: padding is added in such a way that the output image
is the same size as the input image. To find the amount of padding p
needed for an f × f filter (with stride 1), compare the padded input
size with the output size:
(nw + 2p) × (nh + 2p) => padded input size
(nw + 2p − f + 1) × (nh + 2p − f + 1) => output size
Setting nw + 2p − f + 1 = nw (and likewise for nh) gives p = (f − 1)/2.
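• The relationship between input size, padding, filter size, stride, and
output size can be checked with a small helper function (a sketch; n, f, p,
and s follow the notation above):

def conv_output_size(n, f, p=0, s=1):
    """Output height/width for input dimension n, filter size f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(5, 3))            # valid padding: 5x5 input, 3x3 filter -> 3
print(conv_output_size(5, 3, p=1))       # same padding with p = (f - 1)/2 -> 5
print(conv_output_size(7, 3, p=1, s=2))  # a stride of 2 also shrinks the output -> 4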
4.2. The Basic Structure of CNN…
Pooling Layer
• Pooling layers, also known as downsampling, conducts
dimensionality reduction, reducing the number of parameters in
the input.
• Similar to convolutional layer, the pooling operation sweeps a
filter across the entire input, but the difference is that this filter
does not have any weights.
• Instead, the kernel applies an aggregation function to the values
within the receptive field, populating the output array.
• There are two main types of pooling:
– Max pooling: As the filter moves across the input, it selects the pixel
with the maximum value to send to the output array. Max pooling is
used more often compared to average pooling.
– Average pooling: As the filter moves across the input, it calculates the
average value within the receptive field to send to the output array.
4.2. The Basic Structure of CNN…
• While a lot of information is lost in the pooling layer, it also has a
number of benefits to the CNN.
• They help to
– reduce complexity, improve efficiency, and limit risk of
overfitting.

Figure a max pooling layer operating on chunks of a reduced image


• Pooling layers are another kind of layer commonly used in
convolutional neural networks.
• They simply downsample each of the feature maps created by a
convolution operation.
• For the most typically used pooling size of 2, this involves mapping
each 2 x 2 section of each feature map either
– to the maximum value of that section, in the case of max-pooling,
– or to the average value of that section, in the case of
average-pooling.
• For an n x n image, then, this would map the entire image to one of
size n/2 x n/2.
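• A minimal NumPy sketch of 2 x 2 max pooling and average pooling with
stride 2 (assuming the height and width of the feature map are divisible
by 2):

import numpy as np

def pool2d(x, size=2, mode="max"):
    """Downsample a 2D feature map by taking the max or mean of each size x size block."""
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.array([[1., 3., 2., 0.],
                 [5., 6., 1., 2.],
                 [7., 2., 4., 8.],
                 [0., 1., 3., 5.]])
print(pool2d(fmap, mode="max"))   # [[6. 2.] [7. 8.]]
print(pool2d(fmap, mode="avg"))   # [[3.75 1.25] [2.5 5.]]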

Figure max pooling and average pooling
4.2. The Basic Structure of CNN…
• The main advantage of pooling is computational:
– by downsampling the image to contain one-fourth as many
pixels as the prior layer, pooling decreases both the number of
weights and the number of computations needed to train the
network by a factor of 4.
• This can be further compounded if multiple pooling layers are
used in the network, as they are in many architectures of CNNs.
• The downside of pooling is that only one-fourth as much
information can be extracted from the downsampled image.
• However, the fact that architectures showed very strong
performance on benchmarks in image recognition despite the use
of pooling suggested that, even though pooling was causing the
networks to lose information about the images, the trade-offs in
terms of increased computational speed were worth it.
4.2. The Basic Structure of CNN…
Global Average Pooling
• Another pooling operation is a global average pooling.
• A global average pooling performs an extreme type of
downsampling.
• Here, the feature map with size of h × w is downsampled into a 1
× 1 array by simply taking the average of all the elements in each
channel, whereas the depth of feature maps is retained.
• This operation is typically applied only once before the fully
connected layers.
• The advantages of applying global average pooling are as follows:
i. reduces the number of learnable parameters
ii. enables the CNN to accept inputs of variable size
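• A sketch of global average pooling on a batch of feature maps shaped
(batch, height, width, channels), averaging over the spatial dimensions only:

import numpy as np

def global_average_pool(feature_maps):
    """Collapse each h x w feature map to a single value per channel by averaging."""
    return feature_maps.mean(axis=(1, 2))

x = np.random.rand(8, 7, 7, 512)       # e.g. 8 images with 7x7 feature maps and 512 channels
print(global_average_pool(x).shape)    # (8, 512): spatial size removed, depth retained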
4.2. The Basic Structure of CNN…

Figure global average pooling


4.2. The Basic Structure of CNN…
Fully-Connected Layer
• As mentioned earlier, the pixel values of the input image are not
directly connected to the output layer in partially connected
layers.
• However, in the fully-connected layer, each node in the output
layer connects directly to a node in the previous layer.
• This layer performs the task of classification based on the features
extracted through the previous convolutional layers and their
different filters.
• While convolutional and pooling layers tend to use ReLU
functions, FC layers usually leverage a softmax activation
function to classify inputs appropriately, producing a probability
from 0 to 1.
4.2. The Basic Structure of CNN…
• Neurons in CNNs share weights unlike in MLPs where each
neuron has a separate weight vector.
• This sharing of weights ends up reducing the overall number of
trainable weights hence introducing sparsity.
• Utilizing the weights sharing strategy, neurons are able to perform
convolutions on the data with the filter being formed by the
weights.
• This is then followed by a pooling operation, which acts as a form of
non-linear down-sampling.
• This progressively reduces the spatial size of the representation
thus reducing the amount of computation and parameters in the
network.
4.2. The Basic Structure of CNN…
• After several convolutional and pooling layers, the image size is
reduced and more complex features are extracted.
• Eventually, with a small enough feature map, the contents are
squashed into a one-dimensional vector and fed into a
fully-connected MLP for processing.
• The last layer of this fully-connected MLP is seen as the output.
Convolution over Volume
• Color images have multiple channels and these channels are the
standard RGB channels to indicate the amount of red, green and blue.
• When we add channels into the mix, our inputs and hidden
representations both become three-dimensional tensors.
• For example, each RGB input image has shape 3×h×w.
• We refer to this axis, with a size of 3, as the channel dimension.

Figure RGB image with multiple input channels


4.2. The Basic Structure of CNN…
• When the input data contain multiple channels, we need to
construct a kernel with the same number of channels as the input
data, so that it can perform convolution with the input data.
• Assuming that the number of channels for the input data is ci, the
number of channels of the kernel also needs to be ci.
• If the kernel's window shape is kh × kw, then when ci = 1, we can
think of our convolution kernel as just a two-dimensional tensor
of shape kh × kw.
• When ci > 1, we need a kernel that contains a tensor of shape kh ×
kw for every input channel.
• Concatenating these ci tensors together yields a convolution
kernel of shape ci × kh × kw.
4.2. The Basic Structure of CNN…
• Since the input and the kernel each have ci channels, for each
channel, we can perform a cross-correlation operation on the
two-dimensional tensor of the input and the two-dimensional
tensor of the kernel.
• We sum over the ci results together to yield a two-dimensional
tensor result.
• This is the result of a two-dimensional cross-correlation between a
multi-channel input and a multi-input-channel kernel.
• In the example below, we demonstrate an example of a
two-dimensional cross-correlation with two input channels.
• The shaded portions are the first output element as well as the
input and kernel tensor elements used for the output computation:
(1×1+2×2+4×3+5×4) + (0×0+1×1+3×2+4×3) = 56.
Figure convolving multichannel input
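• A sketch that reproduces this two-channel computation in NumPy (the full
input and kernel values are assumptions consistent with the first output
element shown above):

import numpy as np

def cross_correlate2d(X, K):
    """Single-channel valid cross-correlation with stride 1."""
    out = np.zeros((X.shape[0] - K.shape[0] + 1, X.shape[1] - K.shape[1] + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + K.shape[0], j:j + K.shape[1]] * K)
    return out

def multi_in_cross_correlate(X, K):
    """Cross-correlate each input channel with its kernel channel, then sum over channels."""
    return sum(cross_correlate2d(x, k) for x, k in zip(X, K))

# Channel-first input (2 x 3 x 3) and a matching two-channel kernel (2 x 2 x 2).
X = np.array([[[0., 1., 2.], [3., 4., 5.], [6., 7., 8.]],
              [[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]]])
K = np.array([[[0., 1.], [2., 3.]],
              [[1., 2.], [3., 4.]]])
print(multi_in_cross_correlate(X, K))   # [[ 56.  72.] [104. 120.]]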

• In the case of images with multiple channels (e.g. RGB), the
kernel has the same depth as that of the input image.
• Element-wise matrix multiplication (Hadamard product) is
performed between Kn and In stack ([K1, I1]; [K2, I2]; [K3, I3]) and
all the results are summed to give a squashed one-depth channel
convoluted feature output.
Figure Convolution operation on a w×h×3 image matrix with a 3×3×3 kernel
4.2. The Basic Structure of CNN…
• There is a high chance that you may need to extract a lot of different
features from an image, for which you will use multiple filters.
• If you apply each filter to the input separately, you would need to
perform the convolution operation multiple times, once for each filter.
• If individual filters are convolved separately, it will increase the
computation time.
• Hence, it is more convenient to apply all of the required filters at once.
• Modern CNN convolutions apply all filters simultaneously in a single
operation.
• Below is a simple example of convolution over volume, of an image
having dimension 6 x 6 x 3 with 3 denoting the 3 channels R, G & B.
• Similarly, the filter is of dimension 3 x 3 x 3.
Figure Convolution over volume
4.2. The Basic Structure of CNN…

4.2. The Basic Structure of CNN…
Multiple Output Channels
• Regardless of the number of input channels, so far we always
ended up with one output channel.
• It is essential to have multiple output channels at each layer.
• The most popular CNN architectures will actually increase the
channel dimension as we go higher up in the network.
• But they typically downsample to trade off spatial resolution for
greater channel depth.
• Intuitively, you could think of each channel as responding to some
different set of features.
• Reality is a bit more complicated than that, since representations
are not learned independently but are rather optimized to be jointly
useful.
4.2. The Basic Structure of CNN…
• So, it may not be that a single channel learns an edge detector but
rather that some direction in channel space corresponds to
detecting edges.
• Let us denote the number of input and output channels by ci and co
respectively, and let kh and kw be the height and width of the
kernel.
• To get an output with multiple channels, we can create a kernel
tensor of shape ci × kh × kw for every output channel.
• We concatenate them on the output channel dimension, so that the
shape of the kernel is co × ci × kh × kw.
• In cross-correlation operations, the result on each output channel
is calculated from the kernel corresponding to that output channel
and takes input from all channels in the input tensor.
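• A short tf.keras sketch (assuming TensorFlow is installed) showing how the
number of filters sets the number of output channels, and how the kernel
tensor has shape kh × kw × ci × co:

import numpy as np
from tensorflow.keras import layers

x = np.random.rand(1, 32, 32, 3).astype("float32")        # one RGB image, channels-last
conv = layers.Conv2D(filters=16, kernel_size=3, padding="same")
y = conv(x)
print(y.shape)             # (1, 32, 32, 16): one feature map per filter
print(conv.kernel.shape)   # (3, 3, 3, 16): kh x kw x ci x co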
4.2. The Basic Structure of CNN…

• The two filters learn different things.
• Filter 1 may be a vertical edge detector, whereas filter 2 may be a curve detector.
4.2. The Basic Structure of CNN…
• Multiple filters can be used in a convolution layer to detect
multiple features.
• The output of the layer will have the same number of channels as
the number of filters in the layer.
• In CNNs, using multiple kernels is a common technique to
extract multiple features from an input image.
• By using multiple kernels in a CNN layer, we can extract
multiple types of features from the same input image.
• For example, one kernel might be designed to detect edges, while
another might be designed to detect curves or corners.
• By having multiple kernels, the CNN can learn to recognize
more complex patterns in the input image, which can improve
the accuracy of the network.
4.2. The Basic Structure of CNN…
• Another advantage of using multiple kernels is that it allows the
network to learn features that are invariant to certain
transformations, such as changes in scale, rotation, or
illumination.
• By training the network on a variety of images with different
transformations, the kernels can learn to recognize patterns that
are invariant to those transformations, making the network more
robust to variations in the input.
• Overall, using multiple kernels in CNN networks is
– an effective way to extract multiple features from an input
image, and
– learn more complex representations that can improve the
accuracy and robustness of the network
4.2. The Basic Structure of CNN…
• In conclusion, CNNs do not learn a single filter.
• They learn multiple features in parallel for a given input.
• For example, it is common for a convolutional layer to learn
from 32 to 512 filters in parallel for a given input.
• This gives the model 32, or even 512, different ways of
extracting features from the input, or many different ways of
both “learning to see” and after training, many different ways of
“seeing” the input data.
• This diversity allows specialization, e.g. not just lines, but the
specific lines seen in your specific training data.
4.2. The Basic Structure of CNN…
1 × 1 Convolutional Filter
• At first, a 1 × 1 convolution, i.e., kh = kw = 1, does not seem to
make much sense.
• After all, a convolution correlates adjacent pixels.
• A 1 × 1 convolution obviously does not do that.
• Nonetheless, they are popular operations that are sometimes
included in the designs of complex deep networks.
• Because the minimum window is used, the 1 × 1 convolution
loses the ability of larger convolutional layers to recognize
patterns consisting of interactions among adjacent elements in the
height and width dimensions.
• The only computation of the 1 × 1 convolution occurs on the
channel dimension.
4.2. The Basic Structure of CNN…
• The figure on next slide shows the convolution computation using
1 × 1 convolution kernel with 3 input channels and 2 output
channels.
• Note that the inputs and outputs have the same height and width.
• Each element in the output is derived from a linear combination of
elements at the same position in the input image.
• You could think of the 1 × 1 convolutional layer as constituting a
fully-connected layer applied at every single pixel location to
transform the ci corresponding input values into co output values.
• Because this is still a convolutional layer, the weights are tied
across pixel locations.
• Thus the 1 × 1 convolutional layer requires co × ci weights (plus
the bias).
4.2. The Basic Structure of CNN…
• 1 × 1 convolutions perform channel reduction in a useful manner.
• With 1x1 filter, you can cheaply reduce (or increase) the number
of channels without losing information.
• This could be useful especially in bottleneck layers.
• For example, for an input of 64 x 64 x 3, if we use 1 x 1 x 3 filter,
the output will have the same width and height as the input but
only one channel – 64 x 64 x 1.
• 1x1 filters are actually used to adapt depths by merging them,
without changing the spatial information.
• We use this type of kernel when we need to transform a volume
depth into another (called squeezing or expanding) without losing
spatial information.
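• A sketch of channel squeezing with a 1 × 1 convolution in tf.keras (the
channel counts are illustrative): each output channel is a linear combination
of the input channels at that pixel, so the spatial size is unchanged while
the depth is reduced:

import numpy as np
from tensorflow.keras import layers

x = np.random.rand(1, 28, 28, 192).astype("float32")   # 28x28 feature maps with 192 channels
squeeze = layers.Conv2D(filters=32, kernel_size=1)      # 1x1 kernel: mixes channels only
print(squeeze(x).shape)                                 # (1, 28, 28, 32)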
4.2. The Basic Structure of CNN…

Figure changing number of channels using 1x1 kernel


4.2. The Basic Structure of CNN…
One Convolutional Layer
• Finally, to make up a convolution layer, a bias (b ∈ R) is added and
an activation function such as ReLU or tanh is applied.
• This forms one convolutional layer.

z[1] = w[1] ∗ a[0] + b[1]   (convolve the input a[0] with the filter w[1], then add the bias b[1])
a[1] = g(z[1])              (apply the activation function g, e.g. ReLU)

Figure one convolutional layer over volume


4.2. The Basic Structure of CNN…
Shorthand Representation
• A simpler representation can be used to represent one
convolutional layer showing all information about the layer.
• It shows kernel size (f), stride (s), padding (p), & number of
filters.
4.2. The Basic Structure of CNN…
• The following is a sample network with three convolution layers.
• At the end of the network, the output of the convolution layer is
flattened and is connected to a softmax output layer.
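• A sketch of such a network in tf.keras (the input size, filter counts, and
strides are illustrative assumptions, not the exact values in the figure):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(16, kernel_size=3, strides=1, padding="same", activation="relu"),
    layers.Conv2D(32, kernel_size=3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(64, kernel_size=3, strides=2, padding="same", activation="relu"),
    layers.Flatten(),                        # flatten the final feature maps
    layers.Dense(10, activation="softmax"),  # softmax output layer
])
model.summary()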
4.3. Training CNN
• We can train convolutional networks with the traditional
stochastic gradient descent algorithm.
• Training is based on backpropagating the error from the output layer
to all previous layers and updating the weights accordingly.

Figure An overview of CNN training process


• Now, let’s assume the function f is a convolution between input X
and a filter F.
• Input X is a 3x3 matrix and filter F is a 2x2 matrix, as shown
below:

• Convolution between input X and filter F, gives us an output O.


• This can be represented as O = X ∗ F, where, for example, the first
output element is O11 = X11·F11 + X12·F12 + X21·F21 + X22·F22.
4.3. Training CNN…
• Let’s see how to the do the backward pass.
• We get the loss gradient with respect to the output O from the
next layer as ∂L/∂O, during the backward pass.
• Combining this with the local gradients using the chain rule, we get:
– the gradient with respect to the filter, ∂L/∂F, which turns out to be the
cross-correlation of the input X with ∂L/∂O, and
– the gradient with respect to the input, ∂L/∂X, which is a "full"
cross-correlation of ∂L/∂O with the 180°-rotated filter F.
• What about pooling layers?
• Pooling layers don’t have learnable parameters.
4.3. Training CNN…
Data Augmentation
• Large datasets are a prerequisite for the success of deep neural
networks in various applications.
• One way of expanding the training set in CNN is through data
augmentation
• Data augmentation artificially increases the size of the training set
by generating many realistic variants of each training instance.
• This reduces overfitting, making this a regularization technique.
• The generated instances should be as realistic as possible: ideally,
given an image from the augmented training set, a human should
not be able to tell whether it was augmented or not.
• By increasing the samples with different random changes that
produce realistic-looking images, data augmentation uses the
existing training samples to generate more training data.
• Some of the most popular data augmentation methods are the
following:
1. Position augmentation
• i. Center crop: crops the given image at the center. Size is the
parameter given by the user.
• ii. Random crop: crop the given image at a random location.
• iii. Vertical flip: vertically flips the given image randomly.
• iv. Horizontal flip: horizontally flip the given image randomly.
• v. Rotation: rotate the image by some angle.
• vi. Resize: resize the size of the input image to a given size.
• vii. Random affine: random affine transformation of the image
keeping center invariant.
Figure Some common image augmentations
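• A sketch of a position-augmentation pipeline using torchvision (assuming
torchvision is installed; all parameter values are illustrative):

from torchvision import transforms

augment = transforms.Compose([
    transforms.Resize(256),                  # resize the input image
    transforms.RandomCrop(224),              # crop at a random location
    transforms.RandomHorizontalFlip(p=0.5),  # random horizontal flip
    transforms.RandomVerticalFlip(p=0.5),    # random vertical flip
    transforms.RandomRotation(degrees=15),   # rotate by a random angle of up to 15 degrees
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # random affine, center kept invariant
])
# augmented = augment(pil_image)   # applied to each PIL image during training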
2. Color Augmentation:
• Another augmentation method is changing colors.
• We can change four aspects of the image color: brightness,
contrast, saturation, and hue.
• i. Brightness: change the brightness of the image.
• The resultant image becomes darker or lighter compared to the
original one.
• ii. Contrast: the contrast is defined as the degree of separation
between the darkest and brightest areas of an image.
• The contrast of the image can also be changed.
• iii. Saturation: saturation pertains to the amount of white light mixed
with a color.
• Saturation in images is the intensity or purity of a color.
• It can be measured on a scale from 0 to 100.
• 0 indicates completely desaturated (black and white)
• 100 indicates completely saturated (full color photo).
• A grayscale photo has no color saturation.
• iv. Hue: hue is the degree to which a color can be described as
similar to or different from colors that are described as red, orange,
yellow, green, blue, or violet.
• It is the dominant wavelength of light that the human eye
interprets as color.
• Hue is determined by the dominant wavelength of the visible
spectrum.
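• In torchvision, these four color properties map onto a single transform (a
sketch; the jitter ranges are illustrative):

from torchvision import transforms

color_augment = transforms.ColorJitter(
    brightness=0.3,   # darker or lighter than the original
    contrast=0.3,     # separation between darkest and brightest areas
    saturation=0.3,   # intensity/purity of color
    hue=0.05,         # small shift of the dominant wavelength (must be <= 0.5)
)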
4.3. Training CNN…

Figure Color augmentations on image of a tiger


4.4. Transfer Learning
• An abundance of labeled data is desirable to train CNNs but such
data is rarely available due to the cost and workload of image
labeling.
• There are a couple of techniques available to train a model
efficiently on a smaller dataset:
– data augmentation and
– transfer learning.
• Transfer learning is a common and effective strategy to train a
network on a small dataset, where a network is pretrained on an
extremely large dataset, such as ImageNet, which contains 1.4
million images with 1000 classes.
• This pretrained convolutional network is then reused and applied
to a given task of interest.
• The underlying assumption of transfer learning is that generic
features learned on a large enough dataset can be shared among
seemingly disparate datasets.
• This portability of learned generic features is a unique advantage of
deep learning that makes itself useful in various domain tasks with
small datasets.
• At present, many models pretrained on the ImageNet dataset are open
to the public and readily accessible, along with their learned kernels
and weights, such as AlexNet, VGG, ResNet, Inception, DenseNet,
etc.

Figure Common
transfer learning
approaches
• In general, there are two ways to utilize a pretrained network:
– feature extraction approach
– fine-tuning approach
• Feature extraction consists of using the representations learned by
a previous network to extract interesting features from new
samples.
• These features are then run through a new classifier, which is
trained from scratch.
• A fixed feature extraction method is a process to remove fully
connected layers from the pretrained network while maintaining
the remaining network, referred to as the convolutional base, as a
feature extractor.
• In this scenario, any machine learning classifier, such as random
forests and support vector machines, as well as the usual fully
connected layers, can be added on top of the fixed feature
extractor.
• This results in training limited to the added classifier on a given
dataset of interest.
• Convnets used for image classification comprise two parts:
– they start with a series of convolution and pooling layers, and
– they end with a densely connected classifier.
• The first part is called the convolutional base of the model.
• Feature extraction consists of taking the convolutional base of a
previously trained network, running the new data through it, and
training a new classifier on top of the output.

Figure feature extraction


• The level of generality (and therefore reusability) of the
representations extracted by specific convolution layers depends
on the depth of the layer in the model.
• Layers that come earlier in the model extract local, highly generic
feature maps (such as visual edges, colors, and textures), whereas
layers that are higher up extract more-abstract concepts (such as
“cat ear” or “dog eye”).
• So, if the new dataset differs a lot from the dataset on which the
original model was trained, you may be better off using only the
first few layers of the model to do feature extraction, rather than
using the entire convolutional base.
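• A sketch of the feature-extraction approach in tf.keras (the input size,
classifier size, and number of classes are placeholders for a given task of
interest):

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Convolutional base pretrained on ImageNet, without the dense classifier on top.
conv_base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
conv_base.trainable = False   # freeze the base: only the new classifier is trained

model = models.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(5, activation="softmax"),   # e.g. 5 classes in the new task (placeholder)
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])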
• Another widely used technique for model reuse is fine-tuning.
• Fine-tuning consists of unfreezing a few of the top layers of a
convolutional base used for feature extraction, and jointly training
both the newly added part of the model (the fully connected
classifier) and these top layers.
• This is called fine-tuning because it slightly adjusts the more
abstract representations of the model being reused, in order to
make them more relevant for the problem at hand.

Figure fine-tuning the last convolutional block of the VGG16 network

• A fine-tuning method:
– replaces the fully connected layers of the pretrained model with a new set
of fully connected layers to retrain on a given dataset, and
– fine-tunes all or part of the kernels in the pretrained convolutional
base.
• All the layers in the convolutional base can be fine-tuned or,
alternatively, some earlier layers can be fixed while fine-tuning
the rest of the deeper layers.
• This is motivated by the observation that the early-layer features
appear more generic, including features such as edges applicable
to a variety of datasets and tasks.
• However, later features progressively become more specific to a
particular dataset or task.
• Why not fine-tune the entire convolutional base? You could.
• But you need to consider the following:
– Earlier layers in the convolutional base encode more-generic, reusable
features, whereas layers higher up encode more-specialized features.
It is more useful to fine-tune the more specialized features, because
these are the ones that need to be repurposed on the new problem.
There would be fast-decreasing returns in fine-tuning lower layers.
– The more parameters you are training, the more you are at risk of
overfitting. The convolutional base has tens of millions of parameters,
so it would be risky to attempt to train it on your small dataset.
4.4. Transfer Learning…
• Thus, in this situation, it is a good strategy to fine-tune only the
top two or three layers in the convolutional base.
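• A sketch of fine-tuning, continuing the feature-extraction example above:
unfreeze only the last convolutional block of the VGG16 base (its layers are
named "block5_..." in tf.keras) and retrain with a small learning rate:

from tensorflow.keras import optimizers

conv_base.trainable = True
for layer in conv_base.layers:
    # Keep the generic early layers frozen; fine-tune only the last block.
    layer.trainable = layer.name.startswith("block5")

model.compile(optimizer=optimizers.Adam(learning_rate=1e-5),  # small LR to avoid destroying pretrained weights
              loss="categorical_crossentropy",
              metrics=["accuracy"])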
• One drawback of transfer learning is its constraints on input
dimensions.
• The input image has to be 2D with three channels because the
ImageNet dataset consists of 2D color images that have three
channels (RGB: red, green, and blue).
• On the other hand, the height and width of an input image can be
arbitrary, but not too small.
4.5. Architecture of CNN
• There are several architectures in the field of Convolutional
Networks that were introduced by different companies or research
institutions.
• The most common are:
– LeNet
– AlexNet
– ZF Net
– GoogleNet (Inception)
– VGGNet
– ResNet,
– EfficientNet, etc.
4.5. Architecture of CNN…
i. LeNet
• The first successful applications of CNN
were developed by Yann LeCun in
1990s.
• Of these, the best known is the LeNet-5
architecture that was used to read zip
codes, digits, etc.
• Each version of LeNet played a critical
role in demonstrating the power of
CNNs, laying the foundation for the rapid
development of deep learning.
4.5. Architecture of CNN…
ii. AlexNet
• The first work that popularized CNN in computer vision was the
AlexNet, developed by Alex Krizhevsky, Ilya Sutskever and
Geoff Hinton.
• The AlexNet was submitted to the ImageNet ILSVRC challenge
in 2012 and won the challenge.
• It significantly outperformed the second runner-up (top 5 error of
16% compared to runner-up with 26% error).
• The Network had a very similar architecture to LeNet.
• But it was deeper, bigger, and featured Convolutional Layers
stacked on top of each other (previously it was common to only
have a single CONV layer always immediately followed by a
POOL layer).
• The innovations of AlexNet are:
– The number of layers has reached eight.
– Uses the ReLU activation function.
Most previous neural networks used
the sigmoid activation function, which
is relatively expensive to compute
and prone to vanishing gradients.
– Introduced the Dropout layer. Dropout
improves the generalization ability of
the model and prevents overfitting
4.5. Architecture of CNN…
iii. Inception (GoogLeNet)
• The ILSVRC 2014 winner was a Convolutional Network from
Szegedy et al. from Google.
• Its main contribution was the development of an Inception
Module that dramatically reduced the number of parameters in the
network (4M, compared to AlexNet with 60M).
• Additionally, it uses Average Pooling instead of Fully Connected
layers at the top of the ConvNet, eliminating a large amount of
parameters that do not seem to matter much.
• There are also several follow-up versions of the Inception architecture, most
recently Inception-v4.
• The core building block of the architecture is the inception
module.
• The module uses 1x1, 3x3, and 5x5 convolutions in parallel to
capture features of varying sizes.
• Instead of using a single filter size (e.g., 3x3 or 5x5), the Inception
module applies multiple filter sizes (1x1, 3x3, 5x5) and pooling
operations in parallel.
• A 1x1 filter reduces the number of input channels before applying
more computationally expensive operations (like 3x3 and 5x5
convolutions).
• The outputs of these operations are concatenated along the depth
dimension, allowing the network to capture fine-grained and
coarse-grained features simultaneously.
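• A sketch of an Inception-style module in tf.keras (the branch filter counts
are parameters chosen per module; names are illustrative):

from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    # Branch 1: 1x1 convolution
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    # Branch 2: 1x1 reduction followed by 3x3 convolution
    b2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b2)
    # Branch 3: 1x1 reduction followed by 5x5 convolution
    b3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b3)
    # Branch 4: 3x3 max pooling followed by 1x1 projection
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(pool_proj, 1, padding="same", activation="relu")(b4)
    # Concatenate all branches along the channel (depth) dimension
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])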
Figure Inception module and the output of the layer

iv. VGGNet
• The runner-up in ILSVRC 2014 was the network from Karen
Simonyan and Andrew Zisserman that became known as the
VGGNet.
• It was developed by the Visual Geometry Group (VGG) at the
University of Oxford.
• Its main contribution was in showing that the depth of the network
is a critical component for good performance.
• Their final best network contains 16 CONV/FC layers and,
appealingly, features an extremely homogeneous architecture that
only performs 3x3 convolutions and 2x2 pooling from the
beginning to the end.
• A downside of the VGGNet is that it is more expensive to
evaluate and uses a lot more memory and parameters (140M).
• Most of these parameters are in the first fully connected layer, and
it was since found that these FC layers can be removed with no
performance downgrade, significantly reducing the number of
parameters.
• The superior performance of the AlexNet model has inspired the
industry to move in the direction of deeper network models.
• In 2014, the VGG Lab of the University of Oxford, proposed a
series of network models such as VGG11, VGG13, VGG16, and
VGG19, and increased the network depth to up to 19 layers.
• Take VGG16 as an example:
– it accepts color picture data with size of 224 × 224
– It then passes through 2 Conv-Conv-Pooling units and 3
Conv-Conv-Conv-Pooling units
– finally outputs the probability of the current picture belonging to
1000 categories through 3 fully connected layers.
Figure VGG16

• The two most common variants are:


– VGG-16: 16 weight layers (13 convolutional + 3 fully connected).
– VGG-19: 19 weight layers (16 convolutional + 3 fully connected).
• The innovations of the VGG series network are:
– The number of layers is increased to 19.
– Uses a smaller 3x3 convolution kernel, which has fewer
parameters and lower computational cost compared to the 7x7
convolution kernel in AlexNet.
– Uses a smaller pooling layer window 2 × 2 and stride size s=2,
while s=2 and pooling window is 3 x 3 in AlexNet.
v. ResNet
• Residual Network developed by Kaiming He et al. was the winner
of ILSVRC 2015.
• It features special skip connections and a heavy use of batch
normalization.
• The architecture is also missing fully connected layers at the end
of the network.
• Residual blocks are an important part of the ResNet architecture.
• In older architectures such as VGG16, convolutional layers are
stacked with batch normalization and activation layers such as
ReLU between them.
• This method works with a small number of convolutional layers
— the maximum for VGG models is around 19 layers.
• However, subsequent research discovered that increasing the
number of layers could significantly improve CNN performance.
• The ResNet architecture introduces the simple concept of adding an
intermediate input to the output of a series of convolution blocks.
• This technique smooths out the gradient flow during
backpropagation, enabling the network to scale to 50, 100, or even
150 layers.
• Skip connections add no additional computational load to
the network.
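• A sketch of a basic residual block in tf.keras (filter counts and layer
ordering follow the common two-convolution variant; exact details vary between
ResNet versions):

from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    shortcut = x
    # Two stacked 3x3 convolutions with batch normalization
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut with a 1x1 convolution when the shapes differ
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    # Skip connection: add the shortcut to the block output
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)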
4.6. Applications of CNN
• CNNs provide an optimal architecture for uncovering and learning
key features in image and time-series data.
• CNNs are a key technology in applications such as:
Image Classification
• Image classification is a major business application of CNNs
because it enables computers to automatically categorize and
understand visual content.
• This has numerous applications in a wide range of industries.
• Image classification assigns a single label or category to an image
from a predefined set of classes.
• Example: Classifying an image as "cat", "dog", "car", etc.
Image tagging
• Tagging assigns multiple labels or tags to an image, describing various
objects, attributes, or scenes present in the image.
• The image tag is a term or a phrase that describes the images and makes
them easier to find.
• It tags the image with multiple terms based on the content.
• Example: Tagging an image with "beach", "sunset", "people", and "ocean".
• This method is used by big companies like Facebook, Google, & Amazon.
• It is also one of the fundamental elements of visual search.
• Tagging involves recognition of objects and even sentiment analysis of the
image tone.
Recommender Systems
• Another application of CNN is recommendation engines, which use image
data to recommend products or services to customers.
• A recommendation system is designed to suggest relevant items to users
based on their preferences, behavior, or historical data.
• For example, an ecommerce platform can use image classification to
recommend clothing items that match a customer’s style or preferences.
Image Retrieval
• This is another application of image classification, which allows users to
search for images based on their visual content rather than text-based
search terms.
• This is particularly useful in industries such as fashion, where users may
be looking for items that match a particular style or color scheme.
• Other applications of image classification include object detection,
where the goal is to identify and locate objects within an image, and
semantic segmentation, where the goal is to assign a label to each pixel
in an image.
• These applications have many use cases, such as in self-driving cars,
security and surveillance, and medical imaging.
Face Recognition
• Face recognition is a subset of image recognition that specifically
focuses on detecting and identifying human faces within images or
videos.
• In social media, face recognition is often used for features such as
tagging friends in photos or videos, as well as for security and account
verification purposes.
Optical Character Recognition
• Optical character recognition (OCR) is a technology that enables the
digital recognition and interpretation of printed or handwritten text from
images, scanned documents, or other sources.
• OCR algorithms typically use machine learning techniques, such as
convolutional neural networks, to analyze the shape and structure of
individual characters, and then use these patterns to recognize and
transcribe the text.
• OCR has numerous applications in fields such as document
management, digital archiving, and data entry, where it can be used to
automate the process of converting paper documents into searchable,
editable digital text.
• OCR is also commonly used in automated systems for processing forms
and invoices, as well as in the creation of ebooks and digital libraries.
Object Detection
• Automated driving relies on CNNs to accurately detect the presence of a
sign or other object and make decisions based on the output.
• CNNs can be used to detect objects in images and videos.
• CNNs are used in a wide range of object detection applications, such as
self-driving cars, facial recognition systems, and security systems.
Synthetic Data Generation
• Using Generative Adversarial Networks (GANs), new images can
be produced for use in deep learning applications including face
recognition and automated driving.
Medical Imaging
• CNNs can examine thousands of pathology reports to visually
detect the presence or absence of cancer cells in images.
• Medical imaging technology has revolutionized health care,
allowing doctors to detect cancer earlier and improve patient
outcomes.
• Procedures such as X-rays, computed tomography (CT),
magnetic resonance imaging (MRI), positron emission
tomography (PET) and single-photon emission computed
tomography (SPECT) are important in clinical decision-making,
including therapy and follow-up.
Exercise
1. Train a CNN on the CIFAR-10 dataset to classify images into
10 categories (e.g., cars, animals, planes).
• Cifar-10 consists of 60,000 32x32 color images in 10 different
classes, with 6,000 images per class.
• The classes include airplanes, cars, birds, cats, deer, dogs, frogs,
horses, ships, and trucks.
• The dataset is divided into 50,000 training images and 10,000
testing images.
• You can download the data from
https://www.cs.toronto.edu/~kriz/cifar.html
• Create an AI system that trains using Cifar-10 and can classify
images into one of the 10 classes.
Exercise
2. Train a CNN model to classify flowers using the Oxford
Flowers 102 dataset.
• The Oxford Flowers 102 dataset contains 102 categories of
flowers, with each category consisting of 40 to 258 images.
• Resize all the images to 224x224 before feeding them to the
convolutional neural network.
• You can download the data from:
https://www.robots.ox.ac.uk/~vgg/data/flowers/102/
• Create a CNN model that can classify flowers into 102 classes
using this dataset.
