
Deep Learning Module 4: Convolutional Networks

Module 4: The Convolution Operation, Pooling, Convolution and Pooling as an Infinitely Strong Prior,
Variants of the Basic Convolution Function, Structured Outputs, Data Types, Efficient Convolution
Algorithms, Random or Unsupervised Features- LeNet, AlexNet.

Text Book: Ian Goodfellow, Yoshua Bengio, Aaron Courville, “Deep Learning”, MIT Press, 2016.

(Chapter 9, Sections 9.1 – 9.9)

➢ Convolutional Neural Networks:

Convolutional networks (LeCun, 1989), also known as convolutional neural networks or CNNs, are a
specialized kind of neural network for processing data that has a known, grid-like topology. Examples include
time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image
data, which can be thought of as a 2D grid of pixels. Convolutional networks have been tremendously
successful in practical applications. The name “convolutional neural network” indicates that the network
employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation.
Convolutional networks are simply neural networks that use convolution in place of general matrix
multiplication in at least one of their layers.

➢ The Convolution Operation:


• Convolution Operation: Convolution is a mathematical operation on two functions that produces a third function, expressing how the shape of one function is modified by the other.
• A CNN relies on the same idea: when combining several measurements, it relies more on the nearby ones for a better result.
• The revised measurement is therefore a weighted average of the measurements taken, in which recent measurements are assigned more weight than older ones.
• Let us assume we are tracking the location of a spaceship with a laser sensor. Our laser sensor
provides a single output x(t), the position of the spaceship at time t. Both x and t are real-valued,
i.e., we can get a different reading from the laser sensor at any instant in time.
• Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the
spaceship’s position, we would like to average together several measurements.
• Of course, more recent measurements are more relevant, so we will want this to be a weighted
average that gives more weight to recent measurements.

• We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:

s(t) = ∫ x(a) w(t − a) da

• This operation is called convolution. The convolution operation is typically denoted with an asterisk:

s(t) = (x ∗ w)(t)

• In the above equation the x represents the input, * represents the convolution operation and w
denotes the filter that is applied.
• In the above example, w needs to be a valid probability density function, or the output is not a
weighted average. Also, w needs to be 0 for all negative arguments, or it will look into the future,
which is presumably beyond our capabilities. These limitations are particular to the example
discussed above though.
• In general, convolution is defined for any functions for which the above integral is defined, and
may be used for other purposes besides taking weighted averages.
• In convolutional network terminology, the first argument (x) to the convolution is often referred to
as the input and the second argument (the function w) as the kernel. The output is sometimes
referred to as the feature map.
• Similarly, for 2D input, re-estimation of each pixel is done by taking a weighted average of all its
neighbours.
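
• As an illustration, below is a minimal NumPy sketch of the discrete analogue of this smoothing convolution; the readings x and the weights w are made-up examples.

import numpy as np

# Simulated noisy laser readings x(t) of the spaceship's position
t = np.linspace(0.0, 10.0, 200)
x = t + 0.5 * np.random.randn(t.size)

# w[a] is the weight for a measurement of age a: recent ages get more
# weight, and the weights sum to 1 (a valid probability mass function)
w = np.array([0.4, 0.3, 0.15, 0.1, 0.05])

# Discrete convolution: s[n] = sum over a of x[n - a] * w[a]
s = np.convolve(x, w, mode="valid")   # smoothed estimate of the position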
➢ Convolutional Neural Network Architecture:
• A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully connected
layer.
• Convolution Layer:
• The convolution layer is the core building block of the CNN. It carries the main portion of the
network’s computational load. The CNN architecture is as shown in Figure 1.
• This layer performs a dot product between two matrices, where one matrix is the set of learnable
parameters otherwise known as a kernel, and the other matrix is the restricted portion of the
receptive field.
• The kernel is spatially smaller than the image but extends through its full depth, i.e., across all input channels.

Fig 1. The CNN Architecture

• During the forward pass, the kernel slides across the height and width of the image, producing a representation of each receptive region.
• This produces a two-dimensional representation of the image known as an activation map, which gives the response of the kernel at each spatial position of the image. The step size with which the kernel slides is called the stride.
• If we have an input of size W x W x D and Dout kernels with a spatial size F, stride S, and amount of zero padding P, then the size of the output volume can be determined by the following formula:

Wout = (W − F + 2P) / S + 1

• This will yield an output volume of size Wout x Wout x Dout.


• Input Volume Dimensions: W x W x D: The input volume has a width and height of 𝑊, and a
depth of 𝐷.
• In CNNs, depth 𝐷 typically represents the number of channels (for example, 𝐷=3 for RGB images
with three color channels).
• Kernel Parameters F: This is the spatial size of the filter (kernel). If F=3, it means the kernel is
3×3.
• S: The stride S is the number of pixels by which the filter slides over the input each time. If S=1, the filter moves one pixel at a time. If S=2, it moves two pixels, and so on.
• P: Padding refers to the number of pixels added around the input's border. Padding can help
preserve spatial dimensions after convolution. For example, if padding P=1, one pixel layer is
added to all four sides of the input.


• Number of Filters Dout: This represents the number of filters (kernels) used in the convolution layer, determining the depth of the output volume.
• Each filter produces one output channel, so the output will have Dout channels. Figure 2 depicts how the activation map / feature map is obtained.

Fig 2. Representation of how the activation map/ feature map is obtained.
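
• As a quick check of the formula, here is a small Python sketch; the example sizes are illustrative (the 227 × 227, stride-4 case is the one commonly quoted for AlexNet's first layer).

def conv_output_size(W, F, S, P):
    # Wout = (W - F + 2P) / S + 1; the kernel must tile the padded input evenly
    assert (W - F + 2 * P) % S == 0, "kernel does not fit the padded input evenly"
    return (W - F + 2 * P) // S + 1

print(conv_output_size(32, 5, 1, 2))    # 32: 'same'-style padding keeps the size
print(conv_output_size(227, 11, 4, 0))  # 55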

• Before we go on to the next layer let us try and understand Cross-Correlation and its Role in
CNNs.
• In convolutional networks, the operation commonly referred to as "convolution" is, in fact, cross-
correlation.
• Cross-correlation computes the similarity between the input signal and the kernel as the kernel
slides over the input. Mathematically, this can be written as:

(f ⋆ g)[n] = Σm f[m] g[n + m]

• Here, f represents the input, g the kernel, and n the spatial or temporal shift. Cross-correlation
captures localized patterns in the data, which is essential for feature extraction in CNNs.
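
• A minimal NumPy sketch of this 1D cross-correlation; the input f and kernel g are arbitrary examples.

import numpy as np

def cross_correlate_1d(f, g):
    # (f star g)[n] = sum over m of f[n + m] * g[m]  (no kernel flip)
    n_out = len(f) - len(g) + 1
    return np.array([np.dot(f[n:n + len(g)], g) for n in range(n_out)])

f = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
g = np.array([-1.0, 1.0])                # finite-difference kernel
print(cross_correlate_1d(f, g))          # [1. 2. 3. 4.]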
• Toeplitz Matrix in Convolution: The convolution operation can be expressed in matrix form
using a Toeplitz matrix, where each diagonal contains the same elements.
• For a 1D convolution, the input vector can be expanded into a Toeplitz matrix, enabling the
convolution to be represented as:

Output = Toeplitz(Input) × Kernel

• This matrix-based representation helps in understanding the linear transformations performed by convolutional layers.
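
• A sketch of this idea in NumPy; here the kernel, rather than the input, is expanded into the Toeplitz matrix (either factor can be expanded), and the result is checked against np.convolve.

import numpy as np

def conv_as_matrix(kernel, n_in):
    # Toeplitz matrix T with each diagonal holding the same element, so that
    # T @ x equals the 'valid' convolution of x with the kernel
    k = len(kernel)
    n_out = n_in - k + 1
    T = np.zeros((n_out, n_in))
    for i in range(n_out):
        T[i, i:i + k] = kernel[::-1]     # true convolution flips the kernel
    return T

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, -1.0])
print(conv_as_matrix(w, len(x)) @ x)     # [1. 1. 1.]
print(np.convolve(x, w, mode="valid"))   # matches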


• Block-Circulant Matrices for Efficiency: In advanced implementations, the Toeplitz matrix is often extended into a block-circulant matrix to optimize memory usage and computation. Block-circulant structures allow the convolution operation to be decomposed into efficient computations using the FFT, reducing the computational cost significantly.
• Circulant Matrices and FFT:
• A circulant matrix is a specific type of Toeplitz matrix where each row is a circular shift of the
previous row. Circulant matrices are fundamental to the FFT-based acceleration of convolutions,
making them integral to efficient deep learning frameworks.
• By leveraging these mathematical structures, modern CNNs achieve high efficiency in both
training and inference.
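
• A short NumPy sketch: multiplying by a circulant matrix is a circular convolution, which the FFT reduces to an element-wise product in the frequency domain (the input and kernel here are arbitrary).

import numpy as np

def circular_conv_fft(x, w):
    # Convolution theorem: circular convolution == IFFT(FFT(x) * FFT(w))
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w, n=len(x))))

x = np.random.randn(8)
w = np.array([0.5, 0.25, 0.25])
direct = np.array([sum(w[m] * x[(n - m) % len(x)] for m in range(len(w)))
                   for n in range(len(x))])
print(np.allclose(circular_conv_fft(x, w), direct))   # True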
➢ Motivation behind Convolution:
• Convolution leverages three important ideas that help improve a machine learning system: sparse interactions, parameter sharing, and equivariant representations. Let us describe each of them in detail.
• Traditional neural network layers use matrix multiplication by a matrix of parameters describing the interaction between the input and output units. This means that every output unit interacts with every input unit. Convolutional networks, however, have sparse interactions.
• This is achieved by making the kernel smaller than the input: an image may have thousands or millions of pixels, but while processing it with a kernel, we can detect meaningful features that occupy only tens or hundreds of pixels.
• This means that we need to store fewer parameters, which not only reduces the memory requirements of the model but also improves its statistical efficiency.
• If computing one feature at a spatial point (x1, y1) is useful, then it should also be useful at some other spatial point, say (x2, y2). This means that for a single two-dimensional slice, i.e., for creating one activation map, neurons are constrained to use the same set of weights.
• In a traditional neural network, each element of the weight matrix is used once and then never revisited, while a convolutional network has shared parameters: to compute the output, the weights applied to one input location are the same as those applied elsewhere.
• Due to parameter sharing, the layers of a convolutional network have the property of equivariance to translation: if we change the input in a certain way (translate it), the output changes in the same way.
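
• To see the savings from sparse interactions and parameter sharing, here is a back-of-the-envelope comparison (illustrative sizes, biases omitted):

# One layer mapping a 224 x 224 input map to a same-sized output map
H = W = 224
fc_params = (H * W) ** 2     # dense: every output unit sees every input unit
conv_params = 3 * 3          # one shared 3x3 kernel, reused at every position
print(fc_params, conv_params)   # 2517630976 vs 9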


➢ Pooling Layer:
• A typical layer of a convolutional network consists of three stages. In the first stage, the layer
performs several convolutions in parallel to produce a set of linear activations. In the second stage,
each linear activation is run through a nonlinear activation function, such as the rectified linear
activation function. This stage is sometimes called the detector stage.
• In the third stage, we use a pooling function to modify the output of the layer further. A pooling
function replaces the output of the net at a certain location with a summary statistic of the nearby
outputs. For example, the max pooling operation reports the maximum output within a rectangular
neighbourhood.
• Other popular pooling functions include the average of a rectangular neighbourhood, the L2 norm
of a rectangular neighbourhood, or a weighted average based on the distance from the central pixel.
• In all cases, pooling helps make the representation approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.
• Translation Invariance: Invariance to local translation can be a very useful property if we care more
about whether some feature is present than exactly where it is.
• For example, when determining whether an image contains a face, we need not know the location
of the eyes with pixel-perfect accuracy, we just need to know that there is an eye on the left side of
the face and an eye on the right side of the face.
• In other contexts, it is more important to preserve the location of a feature. For example, if we want
to find a corner defined by two edges meeting at a specific orientation, we need to preserve the
location of the edges well enough to test whether they meet.
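
• A minimal NumPy sketch of max pooling over non-overlapping 2 × 2 windows; the input matrix is an arbitrary example.

import numpy as np

def max_pool_2d(x, size=2, stride=2):
    # Replace each (size x size) neighbourhood with its maximum value
    H, W = x.shape
    out = np.empty((1 + (H - size) // stride, 1 + (W - size) // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

a = np.array([[1., 2., 3., 0.],
              [4., 5., 6., 1.],
              [0., 1., 2., 3.],
              [7., 8., 9., 4.]])
print(max_pool_2d(a))   # [[5. 6.] [8. 9.]]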

➢ Fully Connected Layer:


• Neurons in this layer have full connectivity with all neurons in the preceding and succeeding layers, as in a regular fully connected neural network (FCNN).
• Its output can therefore be computed as usual by a matrix multiplication followed by a bias offset.
• The FC layer helps to map the representation between the input and the output.
• Non-linearity is introduced in this fully connected layer through the activation functions.


➢ Convolution and Pooling as an Infinitely Strong Prior:


• A prior is something we already assume to be true about a problem.
• Convolution and pooling assume that:
• Patterns matter more than exact positions (e.g., a "tail" is a tail no matter where it is in the photo).
• Small details add up to the big picture (e.g., spotting "fur" and "paws" helps identify a tiger).
• These assumptions (or priors) are so strong that they work incredibly well for many tasks, especially
image recognition.
• They save time, computation, and generalize well to real-world images without needing extra data.
• An infinitely strong prior means having an assumption about how something works that is so powerful
and rigid that it dominates how we process or interpret data no matter what the actual data says.
• In CNNs, the assumptions made by convolution and pooling are that patterns matter more than their exact location.
• A cat’s ears will look the same whether they are in the top-left corner or the bottom-right corner of
the image. This is called translation invariance.
• Local relationships are key: Convolution assumes the meaningful information in an image (e.g., an
eye, a whisker) can be found by looking at small patches of the image at a time.
• These assumptions are so strong that the model focuses entirely on patterns and ignores other
possibilities.
• For example: If an image’s pattern looks slightly like an ear, the convolutional model might still say,
“This must be part of a cat!” even if it’s just a random shape.
• In other words, convolution and pooling force the model to think in terms of patterns and local
information no matter what the data might suggest otherwise.
➢ Variants of the Basic Convolution Function:
• Convolutional layers in deep learning have different variants depending on how the convolution
operation is applied:
• Full Convolution: Expands the input by padding it with zeros, ensuring the output is larger than
the input. This helps in capturing more context around the edges.
• Valid Convolution: No padding is added, so the output is smaller than the input because only fully
overlapping parts are computed.
• Same Convolution: Pads the input to maintain the same output size as the input.
• Unshared Convolution: Uses different filters for each location in the input, rather than sharing
the same filter across all positions. This increases flexibility but requires more parameters.


• Tiled Convolution: Reuses filters at intervals instead of applying them at every position, balancing
parameter sharing and diversity.
• These variations provide trade-offs between computational cost, memory, and the ability to capture
features. So, let’s discuss each variant in detail.
• Full Convolution: To understand full convolution better, let us first define these terms (writing A[i, j, k] for the element Ai,j,k):
• Kernel K, with element K[i, j, k, l] giving the connection strength between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit.
• Input V, with element V[i, j, k] for channel i, row j and column k.
• Output Z, with the same format as V.
• Indexing starts at 1 (we use 1 as the first entry).
• With zero padding and unit stride, the equation is:

Z[i, j, k] = Σ over l,m,n of V[l, j+m−1, k+n−1] · K[i, l, m, n]

• With zero padding and stride s, the equation is:

Z[i, j, k] = c(K, V, s)[i, j, k] = Σ over l,m,n of V[l, (j−1)·s+m, (k−1)·s+n] · K[i, l, m, n]

• Convolution with a stride greater than 1 pixel is equivalent to conv with 1 stride followed by
downsampling.
• Figure 3 below depicts convolution with stride. In the top part of the figure, a stride of 2 is applied in a single operation.
• In the bottom part, convolution with unit stride is followed by downsampling by a factor of 2, which yields the same result.
• The two-step approach involving downsampling is computationally wasteful, because it computes many values that are later discarded.


Fig 3. Convolution with stride.
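
• The equivalence (and the wasted work) can be checked with a small NumPy sketch; conv1d_valid is a hypothetical helper for 'valid' 1D convolution.

import numpy as np

def conv1d_valid(x, w, stride=1):
    # 'Valid' 1D convolution, sampling outputs every `stride` positions
    k = len(w)
    return np.array([np.dot(x[i:i + k], w[::-1])
                     for i in range(0, len(x) - k + 1, stride)])

x = np.random.randn(11)
w = np.random.randn(3)
strided = conv1d_valid(x, w, stride=2)        # one pass with stride 2
wasteful = conv1d_valid(x, w, stride=1)[::2]  # unit stride, then downsample
print(np.allclose(strided, wasteful))         # True, but half the work is discarded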

• Zero padding with unit stride:
• Without zero padding, the width of the representation shrinks by one pixel less than the kernel width at each layer. We are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels. Zero padding allows us to control the kernel width and the size of the output independently.
• Special cases of zero padding:
• Valid: no zero padding is used, which limits the number of layers.
• Same: enough zero padding is added to keep the size of the output equal to the size of the input, allowing an unlimited number of layers. However, pixels near the border influence fewer output pixels than pixels near the center.
• Full: enough zeros are added for every pixel to be visited k (the kernel width) times in each direction, resulting in an output of width m + k − 1. It is difficult to learn a single kernel that performs well at all positions in the convolutional feature map.
• Usually, the optimal amount of zero padding lies somewhere between 'Valid' and 'Same'.
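
• NumPy's np.convolve exposes exactly these three cases as modes, which makes the output widths easy to verify:

import numpy as np

x = np.ones(10)   # input width m = 10
k = np.ones(3)    # kernel width k = 3
print(np.convolve(x, k, mode="valid").size)  # 8  = m - k + 1 (no padding)
print(np.convolve(x, k, mode="same").size)   # 10 = m (output size preserved)
print(np.convolve(x, k, mode="full").size)   # 12 = m + k - 1 (maximal padding)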
• Unshared Convolution: Unlike standard convolution, where the same kernel is shared across all
spatial regions, unshared convolution uses a unique kernel for each spatial location.
• Properties:
• Each input region has its own filter.
• Captures highly localized, diverse spatial features.
• Requires significantly more parameters and computational resources.


• Working:
• The input is divided into small overlapping or non-overlapping patches.
• Each patch is processed by a separate kernel, which allows fine-tuned learning specific to
localized features.
• The outputs of these convolutions are combined to form the final feature map.
• Use Case:
• Tasks like image style transfer, where local patterns in different spatial regions vary
significantly.
• Applications in medical imaging to analyse highly localized features in scans or images.
• Drawbacks:
• High computational cost due to the large number of parameters.
• Increased memory usage, as unique filters must be stored for each spatial location.
• In some cases, we do not want to use convolution but rather a locally connected layer; for this we use unshared convolution. The indices into the weight tensor W are:
• i: the output channel
• j: the output row;
• k: the output column
• l: the input channel
• m: row offset within input
• n: column offset within input

The linear part of the locally connected layer is then:

Z[i, j, k] = Σ over l,m,n of V[l, j+m−1, k+n−1] · w[i, j, k, l, m, n]

• This is useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space, e.g., looking for a mouth only in the bottom half of an image.
• It can also be useful to make versions of convolutional or locally connected layers in which the connectivity is further restricted, e.g., constraining each output channel i to be a function of only a subset of the input channels.
• Advantages:
• Reduce memory consumption.
• Increase statistical efficiency.
• Reduce computation for both forward and backward prop.
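
• A minimal NumPy sketch of a 1D locally connected (unshared) layer; the weight matrix W, holding one separate kernel per output position, is an illustrative choice.

import numpy as np

def locally_connected_1d(x, W):
    # Unshared convolution: row W[j] is a distinct kernel used only at output j
    n_out, k = W.shape
    return np.array([np.dot(x[j:j + k], W[j]) for j in range(n_out)])

x = np.random.randn(8)
W = np.random.randn(6, 3)   # 6 output positions, each with its own 3-tap kernel
print(locally_connected_1d(x, W).shape)   # (6,)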

• Tiled Convolution:
• Definition: A hybrid approach between standard convolution and unshared convolution, where
kernels are shared across tiles or groups of spatial regions instead of globally.
• Properties:
▪ Provides a middle ground by allowing partial sharing of parameters within specific spatial
regions (tiles).
▪ Retains some flexibility of unshared convolution while reducing the computational burden.
• Working:
▪ Divide the input into tiles, which are typically rectangular or square regions of the input.
▪ Assign a unique kernel to each tile. Kernels are shared within a tile but not across tiles.
▪ Perform convolution independently on each tile.
▪ Concatenate the results from all tiles to produce the final output feature map.
• Use Case:
▪ Tasks involving repetitive but spatially localized patterns, such as texture synthesis or
segmentation.
▪ Useful in scenarios where there is a trade-off between the diversity of spatial features and
computational efficiency.
• Design Parameters:
▪ Tile Size: Determines the spatial granularity of kernel sharing. Larger tiles lead to more
sharing, reducing parameter count.
▪ Number of Kernels: Affects model capacity and complexity. Too few kernels can underfit,
while too many may overfit or become computationally expensive.
• Advantages:
▪ Offers flexibility to capture diverse spatial patterns without the full parameter cost of
unshared convolution.
▪ Scales well for moderately complex tasks.
• Challenges:
▪ Requires careful selection of tile size and kernel sharing strategy to balance performance
and efficiency.
▪ May introduce boundary artifacts between tiles if not handled properly.
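
• One common formulation (as in the textbook) cycles through a set of t kernels as the output position advances, so the same kernel recurs every t positions. A minimal 1D NumPy sketch with arbitrary example sizes:

import numpy as np

def tiled_conv_1d(x, kernels):
    # Neighbouring outputs use different kernels, but the same kernel
    # recurs every t positions (t = number of kernels in the cycle)
    t, k = len(kernels), len(kernels[0])
    return np.array([np.dot(x[j:j + k], kernels[j % t])
                     for j in range(len(x) - k + 1)])

x = np.random.randn(10)
kernels = [np.random.randn(3) for _ in range(2)]   # t = 2 kernels to cycle through
print(tiled_conv_1d(x, kernels).shape)             # (8,)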


➢ Structured Outputs:
• As studied in the previous chapters, a convolutional network works with multidimensional tensors. This can be used not only to solve a classification or regression task, but also to output a structured object in the form of a tensor.
• For example, with a convolutional layer, we can output a tensor, where each pixel has a vector containing
probabilities for belonging to a certain class. This could be used to label every pixel in an image and use
it for segmentation.
• A problem which has to be considered is the shrinking spatial size of the representation. Such reductions in size are mainly a result of pooling.
• To avoid this shrinkage, we can avoid pooling altogether or use pooling layers with a stride of one.
• One may also work with a lower-resolution grid of labels when that is acceptable. Another possibility is to upscale the image again with multiple convolutions, by producing an initial guess about the missing pixels.
• From there, we can create a recurrent neural network with the same kernels in each step; in each step, our initial guess gets refined.
• Strategies for the size-reduction issue:
▪ avoid pooling altogether
▪ emit a lower-resolution grid of labels
▪ pooling operator with unit stride
• One strategy for pixel-wise labelling of images is to produce an initial guess of the image labels, then refine this initial guess using the interactions between neighbouring pixels.
• Repeating this refinement step several times corresponds to using the same convolution at each stage, sharing weights between the last layers of the deep net. Figure 4 shows such a recurrent convolutional network.

Fig 4. Recurrent Convolutional network for pixel labelling.
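
• A toy NumPy/SciPy sketch of this refinement loop; the neighbourhood-averaging kernel and the sigmoid squashing step are illustrative choices, not the textbook's exact model.

import numpy as np
from scipy.signal import convolve2d

def refine_labels(y, kernel, steps=3):
    # Recurrent refinement: the SAME kernel (shared weights) is applied at
    # every step, mixing each pixel's guess with its neighbours' guesses
    for _ in range(steps):
        y = convolve2d(y, kernel, mode="same")
        y = 1.0 / (1.0 + np.exp(-y))    # squash back to (0, 1) label scores
    return y

guess = np.random.rand(8, 8)            # initial per-pixel label guess
smooth = np.full((3, 3), 1.0 / 9.0)     # neighbourhood-averaging kernel
print(refine_labels(guess, smooth).shape)   # (8, 8): one refined score per pixel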


➢ Data Types:
➢ Convolutional networks are capable of processing a wide variety of data, often with multiple channels. Typically, datasets contain examples that all share the same spatial dimensions, a requirement also common for traditional multilayer perceptrons.
➢ However, CNNs offer the flexibility to handle datasets with examples of varying sizes. This is achieved
by applying one or more kernels across the data, with the number of kernel applications depending on the
data's dimensions, which can lead to variable output sizes.
➢ While this variability can sometimes pose challenges, maintaining consistent output sizes can be addressed
using pooling layers.
➢ These layers adapt their pooling regions based on the input size, enabling the network to process inputs of
arbitrary dimensions while producing outputs with uniform size.
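
• A minimal NumPy sketch of such size-adaptive pooling, in its extreme form of pooling over all spatial positions (global pooling):

import numpy as np

def global_max_pool(feature_map):
    # Pool over ALL spatial positions: the pooling region grows with the
    # input, so the output size is fixed no matter the input size
    return feature_map.max(axis=(0, 1))   # (H, W, C) -> (C,)

small = np.random.randn(32, 32, 16)   # two inputs of different spatial sizes
large = np.random.randn(96, 64, 16)
print(global_max_pool(small).shape)   # (16,)
print(global_max_pool(large).shape)   # (16,) -- same output size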
➢ Examples of Data Formats for Convolutional Networks:
➢ 1-Dimensional Data
▪ Single Channel:
▪ Example: Audio waveform.
▪ The convolution is performed over the time axis.
▪ Time is discretized into steps, and the amplitude of the waveform is measured at each step.
▪ Multi-Channel:
▪ Example: Skeleton animation data.
▪ Represents animations of 3D characters. Each character’s pose at a given time is described by the
angles of its skeletal joints.
▪ Each channel corresponds to the angle for one joint’s axis over time.
➢ 2-Dimensional Data
▪ Single Channel:
▪ Example: Pre-processed audio data (Fourier transform).
▪ Audio is converted into a 2D tensor:
▪ Rows: Represent different frequencies.
▪ Columns: Represent different time points.
▪ Convolution along the time axis makes the model invariant to time shifts.
▪ Convolution along the frequency axis makes the model invariant to pitch/octave shifts, ensuring
melodies at different octaves are represented consistently but at different output heights.
▪ Multi-Channel:
▪ Example: Color image data.


▪ Each channel represents one of the three primary colors: Red, Green, and Blue.
▪ The convolution kernel operates across both the horizontal and vertical axes of the image, ensuring
translation equivariance in both directions.
➢ 3-Dimensional Data
▪ Single Channel:
▪ Example: Volumetric data (e.g., CT scans).
▪ Represents data in three spatial dimensions, such as depth, height, and width.
▪ Multi-Channel:
▪ Example: Color video data.
▪ Dimensions include:
▪ Time: The sequence of frames.
▪ Height: Vertical resolution of each frame.
▪ Width: Horizontal resolution of each frame.
➢ Efficient Convolutional Algorithms:
• Efficient convolutional algorithms are designed to reduce computational overhead while maintaining the
accuracy of convolutional operations. These methods optimize performance in CNNs, especially for large
datasets and complex architectures.
• As already discussed under the variants of the basic convolution function, one option to decrease the computational effort is strided convolution. But even with strides, the convolution is quite expensive to compute, since we perform a pointwise multiplication for each element.
• To overcome this problem, we may regard the input as a signal. Using the Fourier transform, the
convolution becomes a simple multiplication.
• FFT-Based Convolutions: Uses the Fast Fourier Transform (FFT) to convert spatial domain convolutions
into element-wise multiplications in the frequency domain.
• Significant computational savings for large kernel sizes.
• Drawback: Overhead of transforming between spatial and frequency domains.
• Another approach can be used when the filters are separable. A d-dimensional filter is separable if it can be expressed as an outer product of d vectors.
• When a filter is separable, convolution with the d-dimensional filter is equivalent to d one-dimensional convolutions.
• In this case, for a kernel that is n elements wide in each dimension, the runtime and memory consumption decrease from O(n^d) to O(d × n).
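
• A NumPy/SciPy sketch verifying separability for an illustrative rank-1 smoothing kernel:

import numpy as np
from scipy.signal import convolve2d

v = np.array([1.0, 2.0, 1.0])   # 1D smoothing filter
K = np.outer(v, v)              # separable 3x3 kernel (outer product of v with v)

img = np.random.randn(64, 64)
full_2d = convolve2d(img, K, mode="valid")              # one 2D pass
rows = convolve2d(img, v[None, :], mode="valid")        # 1D pass along rows
separable = convolve2d(rows, v[:, None], mode="valid")  # then along columns
print(np.allclose(full_2d, separable))   # True: same result for ~2n vs n^2 work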


➢ Random or Unsupervised Features:


• The most resource-intensive aspect of training convolutional networks is often learning the features.
• To mitigate this, there are three primary approaches for obtaining convolution kernels without relying
on supervised training:
• Random Initialization:
▪ Initializing convolution filters randomly has been found to perform surprisingly well in
convolutional networks. This approach is a cost-effective way to explore and finalize a network
architecture:
▪ Evaluate several architectures by training only the final layer.
▪ Select the best-performing architecture and train the entire network more thoroughly using a
resource-intensive approach.
▪ Alternatively, filters can also be manually designed for specific tasks.
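
• A toy NumPy sketch of the idea: the random kernels below are fixed (never trained); in practice only a classifier on top of the resulting features would be trained. The ReLU-plus-average-pool feature summary is an illustrative choice.

import numpy as np

rng = np.random.default_rng(0)

def random_conv_features(img, n_kernels=8, k=3):
    # Each fixed random kernel yields one feature map, which is summarized
    # into a single scalar feature by a ReLU detector and average pooling
    H, W = img.shape
    kernels = rng.standard_normal((n_kernels, k, k))
    feats = []
    for Kf in kernels:
        fmap = np.array([[(img[i:i + k, j:j + k] * Kf).sum()
                          for j in range(W - k + 1)]
                         for i in range(H - k + 1)])
        feats.append(np.maximum(fmap, 0.0).mean())
    return np.array(feats)

print(random_conv_features(rng.standard_normal((16, 16))).shape)   # (8,)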
• Unsupervised Feature Learning:
▪ By employing unsupervised learning criteria, convolutional features can be learned independently
of the classifier layer at the top of the architecture. This decouples feature learning from supervised
training, making the process more efficient.
▪ An intermediate method involves greedy layer-wise pretraining, such as in a Convolutional Deep
Belief Network (CDBN). Instead of training the entire convolutional layer simultaneously, small
patches of data can be modelled, and the parameters learned from these patches are then extended
to define the kernels for a full convolutional layer.
• Modern Trends:
• Despite these alternative methods, most contemporary convolutional networks are trained in a purely supervised fashion, using full forward and back-propagation through the entire network on each training iteration.
