Convolutional Neural Network
Neural Nets for Image Recognition
CNNs are useful whenever there is "local structure" in the data (e.g., faces we want to detect, edges, corners):
• Pixel data
• Audio data
• Voxel data
• ...
Properties of Image(-like) Data
The reason why a regular feed-forward network doesn't work well in the context of image data is that:
• Images are extremely high-dimensional.
Example: 250 x 250 pixels · 3 color channels = 187.5k input dimensions. We would have to learn a huge number of parameters, which is quite challenging.
• Pixels that are near each other are highly correlated.
• Same basic patches (e.g., edges, corners) appear on all positions of the image.
• Often, invariances to certain variations are desired (e.g., translation invariance).
What Is a Convolution?
We apply one signal, our kernel, to a second signal, our input image, at every position of the input. The result is our convolution.
A convolution is a mathematical operation on two functions:

(I ∗ W)(i, j) = Σ_m Σ_n I(i + m, j + n) · W(m, n)

where m and n run over the kernel positions.
Technically, it is a cross-correlation or sliding dot product (the visual example when talking about
weight sharing later on should convey this more clearly). By convention, we refer to it as
convolution.
Also, the formula only shows the 2D case. While this is the typical scenario (and in this course, we
keep it that way), we are not limited to 2D.
In a neural network, the convolution is performed on the input (image) matrix.
Applying the kernel W to all image positions (weight sharing) can be viewed as convolution.
We get an output value for each time we apply the kernel.
We apply an activation function to those outputs afterwards (usually ReLU).
Applying the kernel and the activation function is one layer in a Convolutional Neural Network
(CNN).
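To make this concrete, here is a minimal NumPy sketch of one such layer (the conv2d/relu helpers and the toy sizes are our own illustration, not from the course):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over all valid positions (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Sliding dot product: element-wise multiply, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(0, x)

image = np.random.rand(8, 8)                # toy grayscale input
kernel = np.random.randn(3, 3)              # learnable weight matrix W
feature_map = relu(conv2d(image, kernel))   # one CNN layer: convolution + ReLU
print(feature_map.shape)                    # (6, 6) -- smaller than the input
```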
Weight Sharing
This is a really important point, because it allows us to achieve so-called translation invariance, with which we can detect features or characteristics of our image at several positions within the image.
• Same basic patches (e.g., edges, corners) appear on all positions of the image.
• Often, invariances to certain variations are desired (e.g., translation invariance).
→ We can reuse the receptive field at all positions of the image to produce the new output
(=feature/activation map).
Reusing the kernel weight matrix is called weight sharing.
We apply our kernel to all image positions while keeping the weights the same.
This significantly reduces the number of model parameters.
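A rough illustration of the savings (the hidden-layer size and kernel count below are hypothetical numbers, not from the course):

```python
# Compare parameters of a dense layer vs. shared convolutional kernels
h, w, c = 250, 250, 3            # input image from the earlier example
hidden = 1000                    # hypothetical hidden units in a dense layer

dense_params = h * w * c * hidden    # fully connected: every pixel -> every unit
conv_params = 3 * 3 * c * 16         # 16 shared 3x3 kernels over 3 channels

print(f"dense: {dense_params:,} weights")   # 187,500,000
print(f"conv:  {conv_params:,} weights")    # 432
```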
Kernels
We saw how the convolution with kernels can be applied.
But how do we know which kernels/kernel values to choose?
We can pick existing kernels. Examples:
• Sobel filter/operator for detecting edges
• Gaussian blur filter for blurring
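For instance, the standard Sobel kernels can be written down directly and applied with SciPy (a sketch; the image here is random toy data):

```python
import numpy as np
from scipy.signal import convolve2d

# Standard Sobel kernels for edge detection
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T

image = np.random.rand(8, 8)                       # toy grayscale image
edges_x = convolve2d(image, sobel_x, mode='same')  # horizontal gradients
edges_y = convolve2d(image, sobel_y, mode='same')  # vertical gradients
magnitude = np.sqrt(edges_x**2 + edges_y**2)       # overall edge strength
```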
In convolutional neural networks, however, we actually want to learn the kernels ourselves!
The kernels are just small weight matrices W which we can learn the same way as in regular neural
networks.
Padding (2nd component of CNNs)
When applying a convolution with a kernel of size > 1, the output will be smaller than the input (see
examples before).
Padding is used to keep the input and output size the same:
• Zero-Padding: Add zeros at borders of input
• Repeat-Padding: Duplicate the border values
• Other padding methods: Mean, weighted sum, . . .
Padding can be applied to the input image or between layers to keep the original input size.
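A quick sketch of zero- and repeat-padding using NumPy's np.pad:

```python
import numpy as np

image = np.arange(9, dtype=float).reshape(3, 3)

# Zero-padding: add a border of zeros (width 1 keeps a 3x3 output for a 3x3 kernel)
zero_padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)

# Repeat-padding: duplicate the border values instead
repeat_padded = np.pad(image, pad_width=1, mode='edge')

print(zero_padded.shape)   # (5, 5) -- a 3x3 kernel now yields a 3x3 output again
```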
Striding
Striding controls how much the kernels/filters are moved.
The smaller the stride, the more the receptive fields overlap.
Striding is one way of downsampling images.
A stride > 1 will lead to loss of information (no problem if we keep the essential information) but
will also reduce computational load and memory requirements.
A stride > 1 will also increase the receptive field through the depth of the network.
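The standard output-size formula makes the effect of stride and padding explicit (the formula is not stated in the notes above, but it is the usual one):

```python
def conv_output_size(n, k, p=0, s=1):
    """floor((n + 2p - k) / s) + 1, with n = input size, k = kernel size,
    p = padding, s = stride."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(8, 3, p=0, s=1))  # 6 -- stride 1, no padding
print(conv_output_size(8, 3, p=0, s=2))  # 3 -- stride 2 roughly halves the output
print(conv_output_size(8, 3, p=1, s=1))  # 8 -- "same" padding keeps the size
```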
Pooling
Another way of downsampling images is pooling.
There are different ways to perform pooling. Most popular:
• Average Pooling: take the average value in a k x k field
• Max Pooling: take the maximum value in a k x k field
• N-Max Pooling: take the mean over the n maximum values in a k x k field
Pooling will lead to loss of information (no problem if we keep the essential information) but will
also reduce computational load and memory requirements.
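A minimal NumPy sketch of non-overlapping max pooling (the reshape trick is our implementation choice; swapping .max for .mean gives average pooling):

```python
import numpy as np

def max_pool(x, k=2):
    """Non-overlapping k x k max pooling."""
    h, w = x.shape
    x = x[:h - h % k, :w - w % k]   # crop to a multiple of k
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

feature_map = np.random.rand(6, 6)
pooled = max_pool(feature_map, k=2)
print(pooled.shape)   # (3, 3) -- each 2x2 field reduced to its maximum
```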
Inputs in CNNs
Until now, we assumed grayscale images with 1 channel.
Usually RGB images have 3 channels for red, green, blue, typically stored in a shape of (width,
height, 3).
After the convolutional operations, channels are also called feature maps or activation maps.
We need to make sure our kernel matches the number of channels (e.g., if the input is 3D, the kernel
must be 3D as well).
Regardless of the number of input channels, a single kernel produces a single feature map/channel (it simply sums the channel-wise convolutions).
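A sketch of this in NumPy, using a channels-first layout for simplicity (the function name is ours):

```python
import numpy as np

def conv2d_multichannel(image, kernel):
    """3D kernel over a 3D input: channel-wise convolutions summed into ONE map."""
    c, kh, kw = kernel.shape
    oh, ow = image.shape[1] - kh + 1, image.shape[2] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Sum over all channels and kernel positions at once
            out[i, j] = np.sum(image[:, i:i + kh, j:j + kw] * kernel)
    return out

rgb = np.random.rand(3, 8, 8)       # channels-first toy RGB image
kernel = np.random.randn(3, 3, 3)   # kernel depth must match the 3 input channels
print(conv2d_multichannel(rgb, kernel).shape)   # (6, 6) -- a single feature map
```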
Outputs in CNNs
We usually want to apply multiple kernels to the image, because we want to detect multiple characteristics of the image.
For each (multi-dimensional) kernel, we create a feature map/channel in the CNN output.
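In PyTorch, for example, the number of kernels is just the out_channels argument (the sizes below are illustrative):

```python
import torch
import torch.nn as nn

# 16 kernels, each of shape (3, 5, 5): one output feature map per kernel
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)

x = torch.randn(1, 3, 32, 32)   # batch of one RGB image
y = conv(x)
print(y.shape)                  # torch.Size([1, 16, 28, 28]) -- 16 feature maps
```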
Creating a Complete CNN
Each kernel produces a new feature map for the next layer.
We do a convolution, apply a non-linearity, and do pooling to downsample the output; then we put the result through a convolution again, and so on, repeatedly.
The complexity of detected features tends to increase layer by layer (e.g., early feature maps do edge detection, later ones combine edges into complex shapes).
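Putting the pieces together, a minimal PyTorch sketch of such a stack might look like this (all layer sizes are illustrative choices, not prescribed here):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution (padding keeps size)
    nn.ReLU(),                                    # non-linearity
    nn.MaxPool2d(2),                              # pooling: 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # and again: conv, ReLU, pool
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 16x16 -> 8x8
    nn.Flatten(),                                 # 32 * 8 * 8 = 2048 flat elements
    nn.Linear(32 * 8 * 8, 10),                    # fully connected -> K=10 classes
)

x = torch.randn(1, 3, 32, 32)
logits = model(x)
print(logits.shape)   # torch.Size([1, 10]) -- softmax would give probabilities
```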
Final Output
Depending on the task, we might need to perform some additional operations after the convolutions.
Example: image classification:
• We want to employ the same strategy we already know from regular neural networks.
• This means that we would like to have a flat vector of size K with all the class probabilities.
• We know that the softmax function can produce the probabilities, but the current output of
our CNN is a multi-dimensional feature map, which is incompatible.
• We somehow need to transform this output to a flat vector of size K.
Solution: Reshape the multi-dimensional output into a vector (a.k.a. flatten), and then apply a
regular fully connected layer that maps the flattened size to K.
Flatten Example
Assume our CNN ultimately produces 2 feature maps m^i, each of size 3 × 3 → 2 · 3 · 3 = 18 flat elements.
For each map, run through all elements from left to right and top to bottom and put the corresponding element into our flat vector of size 18 → ready for regular NN input!
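In NumPy, row-major reshaping does exactly this left-to-right, top-to-bottom traversal:

```python
import numpy as np

# Two 3x3 feature maps, as in the example above
maps = np.arange(18).reshape(2, 3, 3)

# Row-major flattening reads each map left to right, top to bottom
flat = maps.reshape(-1)   # equivalently: maps.flatten()
print(flat.shape)         # (18,) -- ready for a regular fully connected layer
```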