21CS743 Module4 Notes
Module 4: The Convolution Operation, Pooling, Convolution and Pooling as an Infinitely Strong Prior,
Variants of the Basic Convolution Function, Structured Outputs, Data Types, Efficient Convolution
Algorithms, Random or Unsupervised Features- LeNet, AlexNet.
Text Book: Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2016.
Convolutional networks (LeCun, 1989), also known as convolutional neural networks or CNNs, are a
specialized kind of neural network for processing data that has a known, grid-like topology. Examples include
time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image
data, which can be thought of as a 2D grid of pixels. Convolutional networks have been tremendously
successful in practical applications. The name “convolutional neural network” indicates that the network
employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation.
Convolutional networks are simply neural networks that use convolution in place of general matrix
multiplication in at least one of their layers.
• Consider tracking the position of a spaceship with a laser sensor that provides a noisy measurement x(t) at every instant t. To obtain a less noisy estimate, we can average several measurements, giving more weight to recent ones. We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:

s(t) = ∫ x(a) w(t − a) da
• This operation is called convolution. The convolution operation is typically denoted with an asterisk:

s(t) = (x ∗ w)(t)

• In this equation, x represents the input, ∗ denotes the convolution operation, and w
denotes the filter that is applied.
• In the above example, w needs to be a valid probability density function, or the output is not a
weighted average. Also, w needs to be 0 for all negative arguments, or it will look into the future,
which is presumably beyond our capabilities. These limitations are particular to the example
discussed above though.
• In general, convolution is defined for any functions for which the above integral is defined, and
may be used for other purposes besides taking weighted averages.
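The weighted-average smoothing described above can be sketched in NumPy for the discrete case (the measurement values and the kernel weights here are made up for illustration):

```python
import numpy as np

# Noisy 1D measurements (e.g., sensor readings over time)
x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])

# Weighting function w: a valid probability density (non-negative, sums to 1)
w = np.array([0.25, 0.5, 0.25])

# s[t] = sum_a x[a] * w[t - a]  -- the discrete convolution
s = np.convolve(x, w, mode="valid")
print(s)  # one smoothed value per fully-overlapping position
```

Note that `mode="valid"` keeps only positions where the kernel fully overlaps the input, so the output is shorter than the input.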
• In convolutional network terminology, the first argument (x) to the convolution is often referred to
as the input and the second argument (the function w) as the kernel. The output is sometimes
referred to as the feature map.
• Similarly, for 2D input, re-estimation of each pixel is done by taking a weighted average of all its
neighbours.
➢ Convolutional Neural Network Architecture:
• A CNN typically has three kinds of layers: a convolutional layer, a pooling layer, and a fully connected
layer.
• Convolution Layer:
• The convolution layer is the core building block of the CNN. It carries the main portion of the
network’s computational load. The CNN architecture is as shown in Figure 1.
• This layer performs a dot product between two matrices, where one matrix is the set of learnable
parameters otherwise known as a kernel, and the other matrix is the restricted portion of the
receptive field.
• The kernel is spatially smaller than the image but extends through the full depth of the input.
• During the forward pass, the kernel slides across the height and width of the image, producing a
representation of each receptive region.
• This produces a two-dimensional representation of the image known as an activation map that
gives the response of the kernel at each spatial position of the image. The step size with which the
kernel slides is called the stride.
• If we have an input of size W × W × D and Dout kernels, each with spatial size F, stride S, and
amount of padding P, then the spatial size of the output volume is given by:

Wout = (W − F + 2P) / S + 1

so the output volume is of size Wout × Wout × Dout.
• Number of Filters (Dout): this is the number of filters (kernels) used in the convolution
layer, and it determines the depth of the output volume.
• Each filter produces one output channel, so the output will have Dout channels. The figure 2 depicts
the process of how the activation map is obtained.
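The output-size formula Wout = (W − F + 2P)/S + 1 can be checked with a few lines of Python (the example sizes below are illustrative):

```python
def conv_output_size(W, F, S, P):
    """Spatial size of a conv layer's output: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

# e.g. a 32x32 input, 5x5 kernels, stride 1, no padding -> 28x28 output
print(conv_output_size(32, 5, 1, 0))  # 28

# with padding: a 7x7 input, 3x3 kernels, stride 2, padding 1 -> 4x4 output
print(conv_output_size(7, 3, 2, 1))   # 4
```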
• Before we go on to the next layer let us try and understand Cross-Correlation and its Role in
CNNs.
• In convolutional networks, the operation commonly referred to as "convolution" is, in fact, cross-
correlation.
• Cross-correlation computes the similarity between the input signal and the kernel as the kernel
slides over the input. Mathematically, this can be written as:
(f ⋆ g)[n] = Σ_m f[m] g[n + m]
• Here, f represents the input, g the kernel, and n the spatial or temporal shift. Cross-correlation
captures localized patterns in the data, which is essential for feature extraction in CNNs.
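The distinction can be seen directly in NumPy, whose `correlate` and `convolve` functions implement the two operations; the only difference is whether the kernel is flipped (values here are illustrative):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0])   # input
g = np.array([1.0, 0.0, -1.0])       # kernel

# Cross-correlation: (f ⋆ g)[n] = sum_m f[m] g[n + m]  (no kernel flip)
xcorr = np.correlate(f, g, mode="valid")

# True convolution flips the kernel before sliding it over the input
conv = np.convolve(f, g, mode="valid")

# The two agree exactly when the kernel is reversed
assert np.allclose(xcorr, np.convolve(f, g[::-1], mode="valid"))
print(xcorr, conv)
```

Since the kernel is learned, the flip makes no practical difference, which is why deep learning libraries implement cross-correlation and call it convolution.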
• Toeplitz Matrix in Convolution: The convolution operation can be expressed in matrix form
using a Toeplitz matrix, where each diagonal contains the same elements.
• For a 1D convolution, the input vector can be expanded into a Toeplitz matrix, enabling the
convolution to be represented as:
Output = Toeplitz(Input) × Kernel
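A small NumPy sketch of this matrix view: each row of the expanded matrix is a reversed sliding window of the input (so each diagonal repeats the same input element, the Toeplitz property), and convolution reduces to a matrix–vector product. The input and kernel values are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input
w = np.array([1.0, 0.5])                   # kernel
k, n = len(w), len(x)

# Row i holds the reversed window x[i:i+k]; entry T[i, j] = x[i - j + k - 1]
# depends only on i - j, so every diagonal of T is constant (Toeplitz).
T = np.array([x[i:i + k][::-1] for i in range(n - k + 1)])

out = T @ w                                # Output = Toeplitz(Input) x Kernel
assert np.allclose(out, np.convolve(x, w, mode="valid"))
print(out)
```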
➢ Pooling Layer:
• A typical layer of a convolutional network consists of three stages. In the first stage, the layer
performs several convolutions in parallel to produce a set of linear activations. In the second stage,
each linear activation is run through a nonlinear activation function, such as the rectified linear
activation function. This stage is sometimes called the detector stage.
• In the third stage, we use a pooling function to modify the output of the layer further. A pooling
function replaces the output of the net at a certain location with a summary statistic of the nearby
outputs. For example, the max pooling operation reports the maximum output within a rectangular
neighbourhood.
• Other popular pooling functions include the average of a rectangular neighbourhood, the L2 norm
of a rectangular neighbourhood, or a weighted average based on the distance from the central pixel.
• In all cases, pooling helps to make the representation become approximately invariant to small
translations of the input. Invariance to translation means that if we translate the input by a small
amount, the values of most of the pooled outputs do not change.
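Max pooling over 2×2 neighbourhoods with stride 2 can be sketched in a few lines of NumPy (the activation values are illustrative):

```python
import numpy as np

def max_pool_2x2(a):
    """Max pooling with a 2x2 window and stride 2 (assumes even H and W)."""
    H, W = a.shape
    # Reshape so each 2x2 block occupies axes 1 and 3, then take block maxima
    return a.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

act = np.array([[1, 3, 2, 0],
                [4, 2, 1, 5],
                [0, 1, 3, 2],
                [2, 6, 0, 1]])
print(max_pool_2x2(act))
# [[4 5]
#  [6 3]]
```

Shifting `act` by one pixel leaves many of the pooled maxima unchanged, which is the approximate translation invariance described above.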
• Translation Invariance: Invariance to local translation can be a very useful property if we care more
about whether some feature is present than exactly where it is.
• For example, when determining whether an image contains a face, we need not know the location
of the eyes with pixel-perfect accuracy, we just need to know that there is an eye on the left side of
the face and an eye on the right side of the face.
• In other contexts, it is more important to preserve the location of a feature. For example, if we want
to find a corner defined by two edges meeting at a specific orientation, we need to preserve the
location of the edges well enough to test whether they meet.
➢ Variants of the Basic Convolution Function:
• In practice, networks use several variants of the basic convolution function: full/strided
convolution, unshared (locally connected) convolution, and tiled convolution, which reuses filters
at intervals instead of applying them at every position, balancing parameter sharing and diversity.
• These variations provide trade-offs between computational cost, memory, and the ability to capture
features. So, let’s discuss each variant in detail.
• Full Convolution: To understand full convolution better, let us first define the terms:
• Kernel K with element Ki,j,k,l giving the connection strength between a unit in channel i of the output
and a unit in channel j of the input, with an offset of k rows and l columns between the output unit
and the input unit.
• Input: Vi,j,k denotes the input value in channel i, row j and column k.
• Output: Z has the same format as V.
• Indices start at 1 (we use 1 as the first entry).
• With zero padding and a stride of 1, the equation is:

Zi,j,k = Σl,m,n Vl, j+m−1, k+n−1 Ki,l,m,n
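The unit-stride, multichannel convolution Z[i,j,k] = Σ over l,m,n of V[l, j+m, k+n]·K[i,l,m,n] can be written directly as nested loops in NumPy (0-based indices here, rather than the 1-based convention of the notes; a naive sketch for clarity, not an efficient implementation):

```python
import numpy as np

def conv_unit_stride(V, K):
    """Z[i,j,k] = sum over l,m,n of V[l, j+m, k+n] * K[i, l, m, n]."""
    C_out, C_in, kh, kw = K.shape
    _, H, W = V.shape
    Z = np.zeros((C_out, H - kh + 1, W - kw + 1))
    for i in range(C_out):                      # each output channel
        for j in range(Z.shape[1]):             # each output row
            for k in range(Z.shape[2]):         # each output column
                # multiply the receptive field by kernel i and sum
                Z[i, j, k] = np.sum(V[:, j:j + kh, k:k + kw] * K[i])
    return Z

V = np.ones((1, 3, 3))        # 1 input channel, 3x3 spatial
K = np.ones((2, 1, 2, 2))     # 2 output channels, 2x2 kernels
print(conv_unit_stride(V, K).shape)  # (2, 2, 2)
```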
• Convolution with a stride greater than 1 pixel is equivalent to convolution with unit stride followed
by downsampling:

Zi,j,k = c(K, V, s)i,j,k = Σl,m,n [Vl, (j−1)×s+m, (k−1)×s+n Ki,l,m,n]

• Figure 3 depicts convolution with stride. In the top part of the figure, a stride of 2 is applied in a
single operation.
• In the bottom part of the figure, the same result is obtained by convolution with unit stride
followed by downsampling.
• The two-step approach involving downsampling is computationally wasteful, because it computes
many values that are later discarded.
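This equivalence is easy to verify in 1D with NumPy (illustrative values): the strided convolution is exactly the unit-stride convolution with every s-th output kept.

```python
import numpy as np

x = np.arange(10, dtype=float)       # input
w = np.array([1.0, 2.0, 1.0])        # kernel
s = 2                                 # stride

unit = np.convolve(x, w, mode="valid")   # unit-stride convolution
strided = unit[::s]                       # downsample: keep every s-th value

# Computing only every s-th output directly gives the same result, but the
# two-step version wastefully computed the discarded values as well.
direct = np.array([np.dot(x[j:j + 3][::-1], w)
                   for j in range(0, len(x) - 2, s)])
assert np.allclose(strided, direct)
print(strided)
```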
• Unshared (Locally Connected) Convolution – Working:
• The input is divided into small overlapping or non-overlapping patches.
• Each patch is processed by a separate kernel, which allows fine-tuned learning specific to
localized features.
• The outputs of these convolutions are combined to form the final feature map.
• Use Case:
• Tasks like image style transfer, where local patterns in different spatial regions vary
significantly.
• Applications in medical imaging to analyse highly localized features in scans or images.
• Drawbacks:
• High computational cost due to the large number of parameters.
• Increased memory usage, as unique filters must be stored for each spatial location.
• In some cases, when we do not want to use convolution but instead want a locally connected layer,
we use unshared convolution. The indices into the weight tensor W are:
• i: the output channel
• j: the output row
• k: the output column
• l: the input channel
• m: the row offset within the input
• n: the column offset within the input
• The locally connected layer then computes:

Zi,j,k = Σl,m,n [Vl, j+m−1, k+n−1 wi,j,k,l,m,n]
• This is useful when we know that each feature should be a function of a small part of space, but there
is no reason to think that the same feature should occur across all of space, e.g., looking for a mouth
only in the bottom half of an image.
• It can also be useful to make versions of convolution or locally connected layers in which the
connectivity is further restricted, e.g., constraining each output channel i to be a function of only a
subset of the input channels.
• Advantages:
• Reduce memory consumption.
• Increase statistical efficiency.
• Reduce computation for both forward and backward prop.
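A minimal 1D sketch of an unshared (locally connected) layer: unlike convolution, each output position j has its own weight vector W[j]. The shapes and values below are illustrative.

```python
import numpy as np

def locally_connected_1d(x, W):
    """out[j] = sum over m of x[j + m] * W[j, m] -- no weight sharing across j."""
    n_out, k = W.shape
    return np.array([np.dot(x[j:j + k], W[j]) for j in range(n_out)])

x = np.arange(6, dtype=float)
W = np.random.randn(4, 3)        # a separate 3-tap filter for each position
out = locally_connected_1d(x, W)
print(out.shape)  # (4,)
```

Note the parameter count: a shared convolution would need only 3 weights here, while the unshared version needs 4 × 3 = 12, which is exactly the memory cost the advantages above refer to.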
• Tiled Convolution:
• Definition: A hybrid approach between standard convolution and unshared convolution, where
kernels are shared across tiles or groups of spatial regions instead of globally.
• Properties:
▪ Provides a middle ground by allowing partial sharing of parameters within specific spatial
regions (tiles).
▪ Retains some flexibility of unshared convolution while reducing the computational burden.
• Working:
▪ Divide the input into tiles, which are typically rectangular or square regions of the input.
▪ Assign a unique kernel to each tile. Kernels are shared within a tile but not across tiles.
▪ Perform convolution independently on each tile.
▪ Concatenate the results from all tiles to produce the final output feature map.
• Use Case:
▪ Tasks involving repetitive but spatially localized patterns, such as texture synthesis or
segmentation.
▪ Useful in scenarios where there is a trade-off between the diversity of spatial features and
computational efficiency.
• Design Parameters:
▪ Tile Size: Determines the spatial granularity of kernel sharing. Larger tiles lead to more
sharing, reducing parameter count.
▪ Number of Kernels: Affects model capacity and complexity. Too few kernels can underfit,
while too many may overfit or become computationally expensive.
• Advantages:
▪ Offers flexibility to capture diverse spatial patterns without the full parameter cost of
unshared convolution.
▪ Scales well for moderately complex tasks.
• Challenges:
▪ Requires careful selection of tile size and kernel sharing strategy to balance performance
and efficiency.
▪ May introduce boundary artifacts between tiles if not handled properly.
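A 1D sketch of tiled convolution with t kernels cycled across output positions (t = 1 recovers standard convolution; t equal to the number of outputs recovers unshared convolution). The input and kernel values are illustrative.

```python
import numpy as np

def tiled_conv_1d(x, kernels):
    """Cycle through t kernels: output position j uses kernel j % t."""
    t, k = kernels.shape
    n_out = len(x) - k + 1
    return np.array([np.dot(x[j:j + k], kernels[j % t]) for j in range(n_out)])

x = np.arange(8, dtype=float)
kernels = np.array([[1.0, 0.0, 0.0],    # kernel 0: picks the first element
                    [0.0, 0.0, 1.0]])   # kernel 1: picks the last element
print(tiled_conv_1d(x, kernels))
```

Only t kernels are stored regardless of the input size, which is the partial parameter sharing described above.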
➢ Structured Outputs:
• As studied in the previous chapters, a convolutional network works with multidimensional tensors. This
can be used, not only to solve a classification or regression task, but to output a structured object in form
of a tensor.
• For example, with a convolutional layer, we can output a tensor, where each pixel has a vector containing
probabilities for belonging to a certain class. This could be used to label every pixel in an image and use
it for segmentation.
• A problem which has to be considered is the shrinking size of the input. Such reductions of the size are
mainly a result of pooling.
• To avoid such a shrinkage, we can first of all avoid pooling at all or using pooling layers with a stride of
one.
• One also may work with a lower resolution when it is possible. Another possibility is to upscale the image
again with multiple convolutions. This can be done by producing an initial guess about the missing pixels.
• From there, we can use this to create a recurrent neural network with the same kernels in each step. In
each step, our initial guess gets refined.
• Strategy for size reduction issue:
▪ avoid pooling altogether
▪ emit a lower-resolution grid of labels
▪ pooling operator with unit stride
• One strategy for pixel-wise labelling of images is to produce an initial guess of the image labels, then
refine this initial guess using the interactions between neighbouring pixels. Repeating this refinement
step several times corresponds to using the same convolution at each stage, sharing weights between
the last layers of the deep net. Figure 4 depicts such a recurrent convolutional network.
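The refinement loop can be sketched as repeatedly applying the same kernel (shared weights) to the current label guess; this toy 1D version only shows the weight-sharing structure, not a trained network, and its values are illustrative.

```python
import numpy as np

def refine(labels, kernel, steps=3):
    """Refine a label map by smoothing it with the SAME kernel at every step."""
    for _ in range(steps):
        labels = np.convolve(labels, kernel, mode="same")
    return labels

guess = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # initial per-pixel guess
kernel = np.array([0.25, 0.5, 0.25])          # one kernel, shared across steps
print(refine(guess, kernel))
```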
➢ Data Types:
➢ Convolutional networks are capable of processing a wide variety of data, often with multiple channels.
Typically, datasets contain examples that share the same spatial dimensions, a requirement also common
for traditional multilayer perceptrons.
➢ However, CNNs offer the flexibility to handle datasets with examples of varying sizes. This is achieved
by applying one or more kernels across the data, with the number of kernel applications depending on the
data's dimensions, which can lead to variable output sizes.
➢ While this variability can sometimes pose challenges, maintaining consistent output sizes can be addressed
using pooling layers.
➢ These layers adapt their pooling regions based on the input size, enabling the network to process inputs of
arbitrary dimensions while producing outputs with uniform size.
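This size-adaptive pooling idea can be sketched by dividing whatever input arrives into a fixed number of regions and max-pooling each, so the output size is constant regardless of the input length (a simplified 1D version with illustrative inputs):

```python
import numpy as np

def adaptive_max_pool_1d(x, n_out):
    """Pool a variable-length input down to exactly n_out values."""
    # Region boundaries scale with the input length (assumes len(x) >= n_out)
    edges = np.linspace(0, len(x), n_out + 1).astype(int)
    return np.array([x[a:b].max() for a, b in zip(edges[:-1], edges[1:])])

print(adaptive_max_pool_1d(np.arange(10.0), 4))   # input of length 10
print(adaptive_max_pool_1d(np.arange(7.0), 4))    # input of length 7
```

Both calls produce exactly 4 outputs, which is what lets later fully connected layers work with inputs of arbitrary size.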
➢ Examples of Data Formats for Convolutional Networks:
• 1-Dimensional Data
▪ Single Channel:
▪ Example: Audio waveform.
▪ The convolution is performed over the time axis.
▪ Time is discretized into steps, and the amplitude of the waveform is measured at each step.
▪ Multi-Channel:
▪ Example: Skeleton animation data.
▪ Represents animations of 3D characters. Each character’s pose at a given time is described by the
angles of its skeletal joints.
▪ Each channel corresponds to the angle for one joint’s axis over time.
• 2-Dimensional Data
▪ Single Channel:
▪ Example: Pre-processed audio data (Fourier transform).
▪ Audio is converted into a 2D tensor:
▪ Rows: Represent different frequencies.
▪ Columns: Represent different time points.
▪ Convolution along the time axis makes the model invariant to time shifts.
▪ Convolution along the frequency axis makes the model invariant to pitch/octave shifts, ensuring
melodies at different octaves are represented consistently but at different output heights.
▪ Multi-Channel:
▪ Example: Color image data.