
Convolutional Neural Networks (CNNs)

Neural Nets for Image Recognition

• ImageNet Large Scale Visual Recognition Competition (ILSVRC):


• 1.2M images, 1,000 different classes.
• Yearly challenge to find the “State of the Art” in image recognition.
• ILSVRC 2012: Won by the only CNN-based solution.
• ILSVRC 2013: Best 5 participants were all CNNs (9 of the top 10 were CNNs).
• ILSVRC 2014: Everyone uses CNNs.

CNNs are useful whenever there is “local structure” in the data (e.g., faces we want to detect, edges, corners):
• Pixel data
• Audio data
• Voxel data
• ...

Properties of Image(-like) Data
A regular feed-forward network doesn’t really work well in the context of image data because:
• Images are extremely high-dimensional.
Example: 250 x 250 pixels · 3 color channels = 187.5k input dimensions. We would have to
learn a lot of parameters and this would be quite challenging.
• Pixels that are near each other are highly correlated.
• Same basic patches (e.g., edges, corners) appear on all positions of the image.
• Often, invariances to certain variations are desired (e.g., translation invariance).

MAIN CONCEPTS OF CNNS


Receptive Field
Pixels that are near each other are highly correlated.
Same basic patches (e.g., edges, corners) appear on all positions of the image.
We can detect those basic patches by only viewing a small part of the image.
→ We can use a network with a small receptive field!
The assumption is: if the model can detect an edge in one small portion of the image, it should be able to recognize that same edge in other parts of the image as well.
Receptive field: connect the network to a patch of the image using a weight matrix (= kernel or filter).
FNN: In a feed-forward neural network, each hidden neuron is connected to all neurons of the
previous layer.
CNN: In a convolutional neural network, a hidden neuron is only connected to a few neurons in the
previous layer.

What Is a Convolution?
We apply one signal (our kernel) to a second signal (our input image) at every position of that signal. The result is the convolution.
A convolution is a mathematical operation on two functions:
(I ∗ W)(i, j) = Σ_m Σ_n I(i + m, j + n) · W(m, n)
Technically, it is a cross-correlation or sliding dot product (the visual example when talking about
weight sharing later on should convey this more clearly). By convention, we refer to it as
convolution.
Also, the formula only shows the 2D case. While this is the typical scenario (and in this course, we
keep it that way), we are not limited to 2D.

In a neural network, the convolution is performed on the input (image) matrix.
Applying the kernel W to all image positions (weight sharing) can be viewed as convolution.
We get an output value for each time we apply the kernel.
We apply an activation function to those outputs afterwards (usually ReLU).
Applying the kernel and the activation function is one layer in a Convolutional Neural Network
(CNN).
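To make this concrete, here is a minimal NumPy sketch of one such layer; the function names and toy sizes are our own illustrative choices, not from the lecture:

```python
import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image and take a dot product at every
    # position (technically a cross-correlation, as noted above).
    # No padding, stride 1: the output shrinks by (kernel size - 1).
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

image = np.random.rand(6, 6)       # toy 6x6 grayscale input
kernel = np.random.randn(3, 3)     # 3x3 weight matrix (learned in practice)
feature_map = relu(conv2d(image, kernel))  # one CNN layer: convolution + ReLU
print(feature_map.shape)           # (4, 4)
```

Note how the 6 × 6 input shrinks to a 4 × 4 feature map; padding (see below) is how we avoid that.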

Weight Sharing
This is a really important point, because it allows us to achieve so-called translation invariance, with which we can detect features or characteristics of our image at several positions within the image.
Same basic patches (e.g., edges, corners) appear on all positions of the image.
Often, invariances to certain variations are desired (e.g., translation invariance).
→ We can reuse the receptive field at all positions of the image to produce the new output
(=feature/activation map).
Reusing the kernel weight matrix is called weight sharing.
We apply our kernel to all image positions while keeping the weights the same.
This significantly reduces the number of model parameters.
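A rough back-of-the-envelope comparison, using the 250 × 250 × 3 example from above (the hidden size of 100 is an arbitrary choice for illustration):

```python
# Fully connected: every hidden neuron sees all 187,500 inputs.
input_dim = 250 * 250 * 3      # 187,500 input dimensions
fc_params = input_dim * 100    # 18,750,000 weights for 100 hidden neurons

# Convolutional: one shared 3x3 kernel over 3 channels, reused everywhere.
conv_params = 3 * 3 * 3        # 27 weights, regardless of image size

print(fc_params, conv_params)  # 18750000 27
```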

Kernels
We saw how the convolution with kernels can be applied.
But how do we know which kernels/kernel values to choose?
We can pick existing kernels. Examples:
• Sobel filter/operator for detecting edges (see the sketch below)
• Gaussian blur filter for blurring
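For instance, the horizontal Sobel kernel is just a fixed 3 × 3 matrix; it can be applied with the conv2d sketch from above:

```python
# Hand-crafted Sobel kernel that responds to vertical edges
# (i.e., horizontal intensity changes).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
edges = conv2d(image, sobel_x)  # reuses the conv2d sketch from above
```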

In convolutional neural networks, however, we actually want to learn the kernels ourselves!
The kernels are just small weight matrices W which we can learn the same way as in regular neural
networks.
Padding (2nd component of CNNs)
When applying a convolution with a kernel of size > 1, the output will be smaller than the input (see
examples before).
Padding is used to keep the input and output size the same:
• Zero-Padding: Add zeros at borders of input
• Repeat-Padding: Duplicate the border values
• Other padding methods: Mean, weighted sum, . . .
Padding can be applied to the input image or between layers to keep the original input size.
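A minimal sketch of zero-padding and repeat-padding with NumPy, continuing the toy example from above:

```python
# Zero-padding: add a 1-pixel border of zeros so that a 3x3 convolution
# keeps the spatial size (6x6 in, 6x6 out).
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
same_size = relu(conv2d(padded, kernel))
print(same_size.shape)  # (6, 6)

# Repeat-padding: duplicate the border values instead.
repeat_padded = np.pad(image, pad_width=1, mode="edge")
```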

Striding
Striding controls how much the kernels/filters are moved.
The smaller the stride, the more the receptive fields overlap.
Striding is one way of downsampling images.
A stride > 1 will lead to loss of information (no problem if we keep the essential information) but
will also reduce computational load and memory requirements.
A stride > 1 will increase the receptive field through the depth of the network.
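The standard output-size formula makes this concrete: for input size n, kernel size k, padding p, and stride s, the output size is floor((n + 2p − k) / s) + 1. A quick sketch:

```python
def conv_output_size(n, k, p=0, s=1):
    # floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

print(conv_output_size(6, 3, s=1))  # 4: stride 1 only trims the borders
print(conv_output_size(6, 3, s=2))  # 2: stride 2 roughly halves the resolution
```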
Pooling
Another way of downsampling images is pooling.
There are different ways to perform pooling. Most popular:
• Average Pooling: take the average value in a k x k field
• Max Pooling: take the maximum value in a k x k field
• N-Max Pooling: take the mean over the n maximum values in a k x k field
Pooling will lead to loss of information (no problem if we keep the essential information) but will
also reduce computational load and memory requirements.

Pooling will increase the receptive field through the depth of the network.


Pooling is a fixed operation compared to “strided” convolutions, i.e., there are no parameters to
learn.
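A minimal NumPy sketch of non-overlapping max pooling (assuming, for simplicity, that the input size is divisible by k):

```python
def max_pool(x, k=2):
    # Non-overlapping k x k max pooling (stride = k): keep only the
    # maximum value in each k x k field.
    h, w = x.shape
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

pooled = max_pool(feature_map, k=2)  # (4, 4) -> (2, 2)
```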

Inputs in CNNs
Until now, we assumed grayscale images with 1 channel.
Usually RGB images have 3 channels for red, green, blue, typically stored in a shape of (width,
height, 3).
After the convolutional operations, channels are also called feature maps or activation maps.
We need to make sure our kernel matches the number of channels (e.g., if the input is 3D, the kernel
must be 3D as well).
Regardless of the number of channels, a single feature map/channel will be produced (it just
computes the sum of the channel-wise convolutions).
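As a sketch of the multi-channel case (a channels-first layout is chosen here for convenience): the kernel has one 2D slice per input channel, and the channel-wise convolutions are summed into a single feature map.

```python
rgb = np.random.rand(3, 6, 6)        # toy RGB input as (channels, height, width)
kernel3d = np.random.randn(3, 3, 3)  # one 3x3 slice per input channel
# Convolve each channel with its kernel slice and sum the results:
feature = sum(conv2d(rgb[c], kernel3d[c]) for c in range(3))
print(feature.shape)                 # (4, 4): a single feature map
```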

Outputs in CNNs
We usually want to apply multiple kernels to the image, because we want to detect multiple characteristics of the image.
For each (multi-dimensional) kernel, we create a feature map/channel in the CNN output.
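Extending the sketch above: each kernel produces its own feature map (16 kernels is an arbitrary choice):

```python
num_kernels = 16                                 # arbitrary example
kernels = np.random.randn(num_kernels, 3, 3, 3)  # (out_channels, in_channels, kh, kw)
maps = np.stack([sum(conv2d(rgb[c], k[c]) for c in range(3)) for k in kernels])
print(maps.shape)                                # (16, 4, 4): one map per kernel
```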
CREATING A COMPLETE CNN

Multiple Levels of Convolutions


A typical CNN architecture has several layers of:
1. Convolution
2. Non-linearity
3. Pooling (optional) – it downsamples the image before it is passed to the next layer.

Each kernel produces a new feature map for the next layer.

We perform a convolution, apply a non-linearity, use pooling to downsample the resulting output, and then feed it into the next convolution, and so on, repeatedly.

The complexity of detected features tends to increase layer by layer (e.g., first feature map does
edge detection, later ones combine it to complex shapes).
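A minimal sketch of such a stack using PyTorch (all channel counts and kernel sizes here are illustrative choices, not taken from the lecture):

```python
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 1. convolution
    nn.ReLU(),                                    # 2. non-linearity
    nn.MaxPool2d(2),                              # 3. pooling (downsampling)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # next level of convolutions
    nn.ReLU(),
    nn.MaxPool2d(2),
)
```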

Final Output
Depending on the task, we might need to perform some additional operations after the convolutions.
Example: image classification:
• We want to employ the same strategy we already know from regular neural networks.
• This means that we would like to have a flat vector of size K with all the class probabilities.
• We know that the softmax function can produce the probabilities, but the current output of
our CNN is a multi-dimensional feature map, which is incompatible.
• We somehow need to transform this output to a flat vector of size K.
Solution: Reshape the multi-dimensional output into a vector (a.k.a. flatten), and then apply a
regular fully connected layer that maps the flattened size to K.
Flatten Example

Assume our CNN ultimately produces 2 feature maps m^i, each of size 3 × 3 → 2 · 3 · 3 = 18 flat elements.
For each map, run through all elements from left to right and top to bottom and put the corresponding element into our flat vector of size 18 → ready for regular NN input!
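A minimal PyTorch sketch of this flatten-then-classify step (K = 10 classes is an arbitrary choice):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 2, 3, 3)  # batch of 1: 2 feature maps, each 3x3
head = nn.Sequential(
    nn.Flatten(),            # (1, 2, 3, 3) -> (1, 18): row-major flattening
    nn.Linear(18, 10),       # fully connected layer: 18 flat elements -> K = 10
    nn.Softmax(dim=1),       # class probabilities
)
print(head(x).shape)         # torch.Size([1, 10])
```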
