Chapter 4 - CNN
Figure Images can be broken into local patterns such as edges, textures, and so on
• This key characteristic gives convnets two interesting properties:
• 1). The patterns they learn are translation-invariant.
• After learning a certain pattern in the lower-right corner of a
picture, a convnet can recognize it anywhere, for example, in the
upper-left corner.
• A densely connected model would have to learn the pattern anew
if it appeared at a new location.
• This makes CNNs data-efficient when processing images
(because the visual world is fundamentally translation-invariant):
– they need fewer training samples to learn representations that have
generalization power.
• 2). They can learn spatial hierarchies of patterns.
• A first convolution layer will learn small local patterns such as
edges, a second convolution layer will learn larger patterns made
of the features of the first layers, and so on.
• This allows convnets to efficiently learn increasingly complex and
abstract visual concepts, because the visual world is
fundamentally spatially hierarchical.
4.1. Introduction to CNN…
• The objective of the convolution operation is to extract features,
such as edges, from the input image.
• Conventionally, the first convolutional layer is responsible for
capturing the low-Level features such as edges, color, gradient
orientation, etc.
• With added layers, the architecture adapts to the high-level
features as well, giving us a network that has a full understanding
of images in the dataset, similar to how we would.
• Convolutional layers:
– can learn to extract features on their own
– can detect these features regardless of location in the image
– when stacked together, they can take advantage of the hierarchical
nature of most image data.
Figure Hierarchical feature extraction in CNNs
1. Convolutional Layer
• The convolutional layer is the core building block of a CNN, and
it is where the majority of computation occurs.
• It requires a few components, which are input data, a filter, and a
feature map.
• Let us assume that the input is a color image, which is represented
as a 3D tensor of pixels.
• This means that the input has three dimensions (height, width, and
depth), where the depth corresponds to the RGB channels.
• We also have a feature detector, also known as a kernel or a filter,
which will move across the receptive fields of the image,
checking if the feature is present.
• This process is known as a convolution.
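• As a concrete illustration of these shapes, here is a minimal NumPy sketch of an RGB input tensor and a single filter (the 224x224 size is an arbitrary example choice, not a requirement):

```python
import numpy as np

# A color image is a 3D tensor: height x width x depth (RGB channels).
image = np.random.rand(224, 224, 3)

# A feature detector (kernel/filter) that slides across the image.
# For a 3-channel input, each filter also has depth 3.
kernel = np.random.rand(3, 3, 3)

print(image.shape, kernel.shape)  # (224, 224, 3) (3, 3, 3)
```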
4.2. The Basic Structure of CNN…
• The feature detector is a two-dimensional (2-D) array of weights,
which represents part of the image.
• While filters can vary in size, a typical filter is a 3x3 matrix;
this also determines the size of the receptive field.
• The filter is then applied to an area of the image, and a dot
product is calculated between the input pixels and the filter.
• This dot product is then fed into an output array.
• Afterwards, the filter shifts by a stride, repeating the process until
the kernel has swept across the entire image.
• The final output from the series of dot products from the input and
the filter is known as a feature map, activation map, or a
convolved feature.
4.2. The Basic Structure of CNN…
• After each convolution operation, a CNN applies a ReLU transformation
to the feature map, introducing nonlinearity to the model.
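• As a minimal sketch of that step, ReLU simply clamps negative feature-map values to zero, which is a one-liner in NumPy:

```python
import numpy as np

feature_map = np.array([[ 2.0, -1.5],
                        [-0.3,  4.0]])

# ReLU: max(0, x) applied elementwise, introducing nonlinearity.
activated = np.maximum(0.0, feature_map)
print(activated)  # [[2. 0.]
                  #  [0. 4.]]
```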
• Another convolution layer can follow the initial convolution layer.
• When this happens, the structure of the CNN can become hierarchical as
the later layers can see the pixels within the receptive fields of prior
layers.
• As an example, let us assume that we are trying to determine if an image
contains a bicycle.
• You can think of the bicycle as a sum of parts.
• It is composed of a frame, handlebars, wheels, pedals, etc.
• Each individual part of the bicycle makes up a lower-level pattern in the
neural net, and the combination of its parts represents a higher-level
pattern, creating a feature hierarchy within the CNN.
4.2. The Basic Structure of CNN…
• Convolution is a mathematical operation that allows the merging
of two sets of information.
• In the case of CNN, convolution is applied to the input data to
filter the information and produce a feature map.
• This filter is also called a kernel, or feature detector, and its
dimensions can be, for example, 3x3.
• To perform convolution, the kernel slides over the input image,
performing an element-wise multiplication at each position.
• The result for each receptive field (the area where convolution
takes place) is written down in the feature map.
• We continue sliding the filter until the feature map is complete.
• A kernel is a 2-dimensional array of weights.
• The weights associated with the convolutional layers in a CNN
are what make up the kernels.
• Until the weights are trained, none of the kernels know which
“features” they should detect.
• So, if each kernel is just an array of weights, how do these
weights operate on the input image during convolution?
• The network performs an element-wise multiplication between the
kernel (weight) and the input pixels within its receptive field, then
sums everything up and sends that value to the output array.
• In the two-dimensional cross-correlation operation, we begin with
the convolution window positioned at the upper-left corner of the
input tensor.
• And then slide it across the input tensor, both from left to right
and top to bottom.
• When the convolution window slides to a certain position, the
input subtensor contained in that window and the kernel tensor are
multiplied elementwise and the resulting tensor is summed up
yielding a single scalar value.
• This result gives the value of the output tensor at the
corresponding location.
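• The sliding-window computation described above can be written in a few lines of NumPy; this is a minimal single-channel sketch (stride 1, no padding), not an optimized implementation:

```python
import numpy as np

def corr2d(X, K):
    """2D cross-correlation of input X with kernel K (stride 1, no padding)."""
    h, w = K.shape
    out = np.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the receptive field elementwise by the kernel, then
            # sum to a single scalar: the output value at this location.
            out[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return out
```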
4.2. The Basic Structure of CNN…
• The same filter array of weights is multiplied with each different
window region of the image.
• The question is “what are the best weight values in the filter
array?”.
• The weights in the filter arrays are learned from the training data.
• Hence, we don’t have to manually set the weights of the filter
array.
• Recall that, strictly speaking, the name convolutional layer is a misnomer.
• The operations they express are more accurately described as
cross-correlations.
• In such a layer, an input tensor and a kernel tensor are combined
to produce an output tensor through a cross-correlation operation.
4.2. The Basic Structure of CNN…
• Another example of a convolution operation: a 5x5 input convolved
with a 3x3 kernel yields a 3x3 feature map.

  Input            Kernel        Feature map
  1 1 1 0 0
  0 1 1 1 0        1 0 1         4 3 4
  0 0 1 1 1   *    0 1 0    =    2 4 3
  0 0 1 1 0        1 0 1         2 3 4
  0 1 1 0 0

1*1+0*1+1*1+0*0+1*1+0*1+1*0+0*0+1*1 = 4   (top-left window)
1*1+0*1+1*0+0*1+1*1+0*1+1*0+0*1+1*1 = 3   (one column to the right)
1*1+0*0+1*0+0*1+1*1+0*0+1*1+0*1+1*1 = 4
1*0+0*1+1*1+0*0+1*0+0*1+1*0+0*0+1*1 = 2
…
1*1+0*1+1*1+0*1+1*1+0*0+1*1+0*0+1*0 = 4   (bottom-right window)
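• The feature map above can be checked directly; a quick sketch using SciPy's correlate2d (mode="valid" keeps only the windows that fit entirely inside the input, which is exactly the sliding-window computation above):

```python
import numpy as np
from scipy.signal import correlate2d

X = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0]])

K = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])

print(correlate2d(X, K, mode="valid"))
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```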
4.2. The Basic Structure of CNN…
i. Stride
• When working with a convolutional layer, you may need to get an
output that is smaller than the input.
• Ways to achieve this:
– One way is to use a pooling layer
– Another way is using stride
• The idea behind stride is to skip some positions as the kernel slides
over the input: for example, moving 2 or 3 pixels at a time.
• This reduces spatial resolution and makes the network more
computationally efficient.
• Stride defines the step by which the kernel moves: a stride of 1
slides the kernel one row/column at a time, while a stride of 2
moves it by two rows/columns.
4.2. The Basic Structure of CNN…
• For a convolutional or a pooling operation, the stride S denotes
the number of pixels by which the window moves after each
convolution operation.
• A normal filter would be moved 1 pixel over, having a stride
length of 1.
• A stride length of 3 would move the filter over by 3 pixels each
time.
• Since the number of outputs depends on how many positions the
filter is applied to, a larger stride will produce fewer outputs.
• This reduces the resolution of the resulting feature map.
• The choice of stride length depends on the desired effect and on
computing resource constraints or efficiency goals, and can be
treated as a hyperparameter.
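• The effect of stride on output size can be captured in a small helper; a minimal sketch assuming no padding:

```python
def conv_output_size(n, k, s=1):
    """Output length along one dimension for input size n,
    kernel size k, and stride s (no padding)."""
    return (n - k) // s + 1

# A 7x7 input with a 3x3 filter:
print(conv_output_size(7, 3, s=1))  # 5 -> 5x5 feature map
print(conv_output_size(7, 3, s=2))  # 3 -> 3x3 feature map (fewer outputs)
```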
4.4. Transfer Learning
Figure Common transfer learning approaches
• In general, there are two ways to utilize a pretrained network:
– feature extraction approach
– fine-tuning approach
• Feature extraction consists of using the representations learned by
a previous network to extract interesting features from new
samples.
• These features are then run through a new classifier, which is
trained from scratch.
• The fixed feature extraction method removes the fully connected
layers from the pretrained network and keeps the remaining network,
referred to as the convolutional base, as a feature extractor.
• In this scenario, any machine learning classifier, such as random
forests and support vector machines, as well as the usual fully
connected layers, can be added on top of the fixed feature
extractor.
• Training is then limited to the added classifier on the given
dataset of interest.
• Convnets used for image classification comprise two parts:
– they start with a series of convolution and pooling layers, and
– they end with a densely connected classifier.
• The first part is called the convolutional base of the model.
• Feature extraction consists of taking the convolutional base of a
previously trained network, running the new data through it, and
training a new classifier on top of the output.
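• A minimal Keras sketch of this feature extraction setup, assuming VGG16 as the pretrained network and a 180x180x3 input with a binary classification task (all example choices, not requirements):

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16

# Pretrained convolutional base, with the original dense classifier removed.
conv_base = VGG16(weights="imagenet", include_top=False,
                  input_shape=(180, 180, 3))
conv_base.trainable = False  # fixed feature extractor: weights stay frozen

# New classifier, trained from scratch on top of the frozen base.
model = keras.Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])
```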
• A fine-tuning method:
– replaces the fully connected layers of the pretrained model with a new
set of fully connected layers to retrain on a given dataset, and
– fine-tunes all or part of the kernels in the pretrained convolutional
base.
• All the layers in the convolutional base can be fine-tuned or,
alternatively, some earlier layers can be fixed while fine-tuning
the rest of the deeper layers.
• This is motivated by the observation that the early-layer features
appear more generic, including features such as edges applicable
to a variety of datasets and tasks.
• However, later features progressively become more specific to a
particular dataset or task.
• Why not fine-tune the entire convolutional base? You could.
• But you need to consider the following:
– Earlier layers in the convolutional base encode more-generic, reusable
features, whereas layers higher up encode more-specialized features.
It is more useful to fine-tune the more specialized features, because
these are the ones that need to be repurposed on the new problem.
There would be fast-decreasing returns in fine-tuning lower layers.
– The more parameters you are training, the more you are at risk of
overfitting. The convolutional base has tens of millions of parameters,
so it would be risky to attempt to train it on your small dataset.
4.4. Transfer Learning…
• Thus, in this situation, it is a good strategy to fine-tune only the
top two or three layers in the convolutional base.
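• A sketch of that strategy in Keras, again assuming a VGG16 base; the number of unfrozen layers and the learning rate are illustrative values, not prescriptions:

```python
from tensorflow import keras
from tensorflow.keras.applications import VGG16

conv_base = VGG16(weights="imagenet", include_top=False,
                  input_shape=(180, 180, 3))

# Freeze everything, then unfreeze only the last few layers: the most
# specialized features, per the discussion above.
conv_base.trainable = True
for layer in conv_base.layers[:-4]:
    layer.trainable = False

# Add a classifier on top as in feature extraction, then compile with a
# small learning rate so fine-tuning does not destroy pretrained weights:
# model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=1e-5), ...)
```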
• One drawback of transfer learning is its constraints on input
dimensions.
• The input image has to be 2D with three channels because the
ImageNet dataset consists of 2D color images that have three
channels (RGB: red, green, and blue).
• On the other hand, the height and width of an input image can be
arbitrary, but not too small.
4.5. Architecture of CNN
• There are several architectures in the field of Convolutional
Networks that were introduced by different companies or research
institutions.
• The most common are:
– LeNet
– AlexNet
– ZF Net
– GoogLeNet (Inception)
– VGGNet
– ResNet
– EfficientNet, etc.
4.5. Architecture of CNN…
i. LeNet
• The first successful applications of CNNs were developed by
Yann LeCun in the 1990s.
• Of these, the best known is the LeNet-5 architecture, which was
used to read zip codes, digits, etc.
• Each version of LeNet played a critical role in demonstrating the
power of CNNs, laying the foundation for the rapid development of
deep learning.
4.5. Architecture of CNN…
ii. AlexNet
• The first work that popularized CNNs in computer vision was AlexNet,
developed by Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton.
• AlexNet was submitted to the ImageNet ILSVRC challenge in 2012
and won.
• It significantly outperformed the runner-up (top-5 error of 16%,
compared to the runner-up's 26%).
• The Network had a very similar architecture to LeNet.
• But it was deeper, bigger, and featured Convolutional Layers
stacked on top of each other (previously it was common to only
have a single CONV layer always immediately followed by a
POOL layer).
• The innovations of AlexNet are:
– The number of layers has reached eight.
– Uses the ReLU activation function. Most previous neural networks
used the Sigmoid activation function, which is relatively expensive
to compute and prone to vanishing gradients (gradient dispersion).
– Introduced the Dropout layer. Dropout improves the generalization
ability of the model and prevents overfitting.
4.5. Architecture of CNN…
iii. Inception (GoogLeNet)
• The ILSVRC 2014 winner was a Convolutional Network from
Szegedy et al. from Google.
• Its main contribution was the development of an Inception
Module that dramatically reduced the number of parameters in the
network (4M, compared to AlexNet with 60M).
• Additionally, it uses Average Pooling instead of Fully Connected
layers at the top of the ConvNet, eliminating a large number of
parameters that do not seem to matter much.
• There are also several follow-up versions of the Inception
architecture, most recently Inception-v4.
• The core building block of the architecture is the inception
module.
• The module uses 1x1, 3x3, and 5x5 convolutions in parallel to
capture features of varying sizes.
• Instead of using a single filter size (e.g., 3x3 or 5x5), the Inception
module applies multiple filter sizes (1x1, 3x3, 5x5) and pooling
operations in parallel.
• A 1x1 filter reduces the number of input channels before applying
more computationally expensive operations (like 3x3 and 5x5
convolutions).
• The outputs of these operations are concatenated along the depth
dimension, allowing the network to capture fine-grained and
coarse-grained features simultaneously.
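• A minimal Keras functional-API sketch of such a module (the filter counts and input shape are arbitrary example values):

```python
from tensorflow.keras import Input, Model, layers

def inception_module(x, f1=64, f3=128, f5=32, fp=32):
    # 1x1 branch.
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    # 1x1 bottleneck to reduce channels before the expensive 3x3 convolution.
    b3 = layers.Conv2D(f3 // 2, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    # 1x1 bottleneck before the even more expensive 5x5 convolution.
    b5 = layers.Conv2D(f5 // 2, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    # Pooling branch with a 1x1 projection.
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)
    # Concatenate all branches along the channel (depth) dimension.
    return layers.Concatenate(axis=-1)([b1, b3, b5, bp])

inputs = Input(shape=(28, 28, 192))
outputs = inception_module(inputs)
model = Model(inputs, outputs)  # output depth: 64 + 128 + 32 + 32 = 256
```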
Figure Inception module