Computer Vision
Field of AI that enables computers to derive meaningful information from
images, videos, or other visual inputs.
Human Vision vs. Computer Vision
• Human vision:
– Benefits from a lifetime of context that trains it to tell objects apart, judge how far away they are,
tell whether they are moving, and notice whether something is wrong in an image.
• Computer vision:
– Trains machines to perform these same functions. A trained system can analyze thousands of products in less than a
minute.
Computer Vision Applications
• Tesla: Autonomous cars for hands-free driving
• Google Translate: Translates road signs from one language into another
• Photo scan: Optical Character Recognition (OCR) or QR Code Reader
• Facebook: Face recognition for automatic tagging
• Boston Dynamics: Designing intelligent robots
Basic Components of an Image Processing Task
What is an Image?
Color Channel
[Figure: example images with 1 channel, 1 channel, and 3 channels]
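As a minimal sketch of what "channels" means in practice, the snippet below represents images as NumPy arrays. The array sizes, the 8-bit value range (0 to 255), and the height × width × channels layout are assumptions chosen for illustration, not something the slides specify.

```python
import numpy as np

# Grayscale image: one channel, stored as a 2-D grid of intensities.
gray = np.zeros((4, 4), dtype=np.uint8)

# Color image: three channels (red, green, blue) stacked along the last axis.
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[..., 0] = 255  # set the red channel to its maximum everywhere

# Selecting one channel of a color image gives back a 2-D grid.
red_channel = rgb[..., 0]

print(gray.shape, rgb.shape, red_channel.shape)
```

Each entry is a pixel value, so a 3-channel image simply carries three numbers per pixel instead of one.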
How does computer vision work?
• How to solve computer vision tasks?
– Computer vision needs lots of data.
– It runs analyses of the data over and over until it discerns distinctions and ultimately recognizes
images.
• Machine learning based models can enable a computer to teach itself about the
context of visual data.
– One popular ML algorithm for computer vision tasks is the convolutional neural network (CNN).
Convolutional Neural Network
• Convolutional Neural Networks (CNNs) are distinguished from other neural networks
by their superior performance with image or visual input signals.
• A CNN has three main types of layers:
1) Convolutional layer
2) Pooling layer
3) Fully-connected layer
Why CNN
• A fully connected ANN has a high computation cost on images
• Fewer parameters help reduce overfitting
• Successfully captures the spatial and temporal dependencies in an
image
• The number of trainable parameters depends on the filter size rather than the image size
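The last point can be made concrete with a little arithmetic. This is a rough sketch: the helper names, the 100 units/filters, and the 3×3 filter size are illustrative assumptions, not values from the slides.

```python
def dense_params(height, width, channels, units):
    """Weights plus biases of a fully connected layer on a flattened image."""
    return height * width * channels * units + units


def conv_params(filter_size, in_channels, num_filters):
    """Weights plus biases of a convolutional layer (image size does not appear)."""
    return filter_size * filter_size * in_channels * num_filters + num_filters


for side in (32, 224):
    print(
        f"{side}x{side} RGB image:",
        dense_params(side, side, 3, 100), "dense parameters vs.",
        conv_params(3, 3, 100), "convolutional parameters",
    )
```

The dense count grows with the number of pixels, while the convolutional count stays fixed because the same filter weights are reused at every image position.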
Convolutional Layer
• Convolutional layers are the core building block of a CNN.
• They focus on detecting features, such as edges, in an image.
• A convolutional layer requires a few components: input data, a filter, and a feature map.
Convolution
• Convolution: expresses how the shape of one function is modified by another
• Below, a matrix is convolved with a filter to obtain an output matrix
[Figure: input image]
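A minimal sketch of the operation, assuming no padding and a step of one pixel: the filter slides over the input, and each output entry is the sum of the elementwise products under the filter. Note that, as in most deep learning code, the filter is not flipped, so this is strictly a cross-correlation. The function name `conv2d` and the 4×4 input are illustrative choices.

```python
import numpy as np


def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, step of 1) and sum the products."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + f_h, j:j + f_w]
            out[i, j] = np.sum(window * kernel)
    return out


image = np.arange(16, dtype=float).reshape(4, 4)  # values 0..15
kernel = np.ones((3, 3))                          # sums each 3x3 window
result = conv2d(image, kernel)
print(result)
# [[45. 54.]
#  [81. 90.]]
```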
Convolutional Layer (Intuition)
How can we detect these edges in an image?
Convolutional Layer (Intuition)
• To illustrate this, we use a simplified picture
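One possible simplified picture, sketched in code: a 6×6 image whose left half is bright and right half is dark, convolved with a vertical-edge filter. The pixel values (10 and 0) and the filter weights are assumptions chosen for illustration, and `sliding_window_view` requires NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Bright left half, dark right half: one vertical edge down the middle.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# Responds strongly where brightness drops from left to right.
vertical_edge = np.array([[1, 0, -1]] * 3, dtype=float)

# All 3x3 windows of the image, shape (4, 4, 3, 3).
windows = sliding_window_view(image, vertical_edge.shape)
edges = (windows * vertical_edge).sum(axis=(2, 3))
print(edges)
# [[ 0. 30. 30.  0.]
#  [ 0. 30. 30.  0.]
#  [ 0. 30. 30.  0.]
#  [ 0. 30. 30.  0.]]
```

The output is zero over the flat regions and large in the middle columns, which is where the edge sits.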
Convolutional Layer
• We have seen that convolving an input with a filter produces a smaller output.
• In general:
– Input: n × n
– Filter size: f × f
– Output: (n - f + 1) × (n - f + 1)
• Disadvantages:
– Every time we apply a convolution operation, the image shrinks
– Pixels in the corners of the image are used far fewer times during convolution
than the central pixels
– Hence, the corners receive less attention, which can lead to information loss
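Both disadvantages can be checked numerically. The sketch below assumes the standard result that an n × n input and an f × f filter give an (n - f + 1) × (n - f + 1) output, and it counts how many filter positions cover each pixel. The 6×6 input and 3×3 filter are illustrative values.

```python
import numpy as np


def output_size(n, f):
    """Side length after an unpadded convolution with a step of 1."""
    return n - f + 1


def pixel_usage(n, f):
    """Count how many filter positions include each pixel of an n x n image."""
    counts = np.zeros((n, n), dtype=int)
    for i in range(output_size(n, f)):
        for j in range(output_size(n, f)):
            counts[i:i + f, j:j + f] += 1
    return counts


# Shrinking: two successive 3x3 convolutions on a 6x6 input.
sizes = [6]
for _ in range(2):
    sizes.append(output_size(sizes[-1], 3))
print(sizes)  # [6, 4, 2]

# Corner pixels are seen once, central pixels nine times.
usage = pixel_usage(6, 3)
print(usage[0, 0], usage[2, 2])  # 1 9
```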
Convolutional Layer (Padding)
• We can pad the image with an additional border (add pixels around the border)
• In general:
– Input: n × n
– Padding: p
– Filter size: f × f
– Output: (n + 2p - f + 1) × (n + 2p - f + 1)
Convolutional Layer (Padding)
• Hence, we have two choices for padding
• Valid padding:
– It means no padding.
– With valid padding, the output will be (n - f + 1) × (n - f + 1)
• Same padding:
– Here, we apply padding so that the output size is the same as the input size
– We need to set p = (f - 1) / 2
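The two padding choices can be compared with a short sketch. It assumes zero-valued padding, an odd filter size, and the standard output-size formula n + 2p - f + 1. The helper names and the 6×6 input with a 3×3 filter are illustrative.

```python
import numpy as np


def conv_output_size(n, f, p=0):
    """Side length after convolving an n x n input padded by p with an f x f filter."""
    return n + 2 * p - f + 1


def same_padding(f):
    """Padding that keeps the output the same size as the input (odd f only)."""
    if f % 2 == 0:
        raise ValueError("same padding as defined here needs an odd filter size")
    return (f - 1) // 2


image = np.ones((6, 6))
p = same_padding(3)

# Add a border of p zero-valued pixels on every side.
padded = np.pad(image, p, mode="constant", constant_values=0)

print(padded.shape)                  # (8, 8)
print(conv_output_size(6, 3))        # valid padding: 4
print(conv_output_size(6, 3, p=p))   # same padding: 6
```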
Convolutional Layer (Stride)
• Stride is how far the filter moves at each step along one direction.
• If we select a stride of 2, the filter moves two pixels at a time, both in the horizontal and
vertical directions.
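A minimal sketch of a strided convolution, assuming a square filter and the standard output-size formula floor((n + 2p - f) / s) + 1, which the slides do not state. The 7×7 input and 3×3 filter are illustrative values.

```python
import numpy as np


def strided_output_size(n, f, p=0, s=1):
    """Side length for input n, filter f, padding p, and stride s."""
    return (n + 2 * p - f) // s + 1


def conv2d_strided(image, kernel, stride):
    """Unpadded convolution where the filter jumps `stride` pixels per step."""
    f = kernel.shape[0]  # assumes a square filter
    out_h = strided_output_size(image.shape[0], f, s=stride)
    out_w = strided_output_size(image.shape[1], f, s=stride)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            out[i, j] = np.sum(image[top:top + f, left:left + f] * kernel)
    return out


image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))
out = conv2d_strided(image, kernel, stride=2)
print(out.shape)  # (3, 3)
```

With a stride of 1 the same input would give a 5×5 output, so a larger stride trades spatial detail for a smaller feature map.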
Convolution Over Multiple Channel
• A color image contains three channels
• We can apply a three-dimensional filter to the image
• The last dimension of the filter should be the same as the number of input channels
• We can use multiple filters to capture multiple features
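The points above can be sketched as follows. The image and filter sizes, the random values, and the height × width × channels layout are assumptions for illustration. Each filter spans all input channels and produces one 2-D feature map, and the maps are stacked along the last axis.

```python
import numpy as np


def conv_multi(image, filters):
    """image: (H, W, C); filters: (K, f, f, C); returns (H - f + 1, W - f + 1, K)."""
    h, w, c = image.shape
    k, f, _, filter_channels = filters.shape
    if filter_channels != c:
        raise ValueError("filter depth must match the number of input channels")
    out = np.zeros((h - f + 1, w - f + 1, k))
    for n in range(k):
        for i in range(h - f + 1):
            for j in range(w - f + 1):
                # Sum over height, width, and all channels at once.
                out[i, j, n] = np.sum(image[i:i + f, j:j + f, :] * filters[n])
    return out


rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))       # 3-channel input
filters = rng.random((2, 3, 3, 3))  # two filters, each 3 x 3 x 3
features = conv_multi(image, filters)
print(features.shape)  # (4, 4, 2)
```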
Pooling Layer
• Pooling layers are generally used to reduce the size of the inputs and hence speed up
the computation.
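As an illustration, here is a sketch of max pooling, one common pooling type (the slides do not say which type is meant). It assumes non-overlapping 2×2 windows, and the input values are made up.

```python
import numpy as np


def max_pool2d(x, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    h_out, w_out = x.shape[0] // size, x.shape[1] // size
    x = x[:h_out * size, :w_out * size]  # drop rows/columns that do not fit
    blocks = x.reshape(h_out, size, w_out, size)
    return blocks.max(axis=(1, 3))


x = np.array([
    [1, 3, 2, 1],
    [2, 9, 1, 1],
    [1, 3, 2, 3],
    [5, 6, 1, 2],
])
pooled = max_pool2d(x)
print(pooled)
# [[9 2]
#  [6 3]]
```

The 4×4 input becomes 2×2, so later layers process a quarter as many values.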
CNN Example
• There is a combination of convolution and pooling layers at the beginning
• A few fully connected layers follow at the end
• Finally, a softmax classifier assigns the input to one of several categories
• The network has many hyperparameters that we also have to specify
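The softmax step at the end can be sketched as below. The three scores are made-up values standing in for the output of the last fully connected layer.

```python
import numpy as np


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores for three categories
probs = softmax(scores)
predicted = int(np.argmax(probs))
print(probs, predicted)
```

The predicted category is the one with the highest probability, here the first one.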
Classical Networks (LeNet-5)
• Parameters: 60k
• Layer flow: Conv -> Pool -> Conv -> Pool -> FC -> FC -> Output
• Activation functions: Sigmoid/tanh and ReLU
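The layer flow and the rough parameter count can be traced with a short sketch. The slides give neither the layer sizes nor the input size, so everything here is an assumption based on the commonly taught variant of LeNet-5: a 32×32 grayscale input, 5×5 filters (6, then 16), 2×2 pooling with stride 2, fully connected layers of 120 and 84 units, and 10 output classes.

```python
def conv_out(n, f, stride=1, pad=0):
    """Side length after a convolution or pooling window of size f."""
    return (n + 2 * pad - f) // stride + 1


def lenet5_summary(input_size=32, in_channels=1, num_classes=10):
    """Return (layer name, output shape, parameter count) for each layer."""
    summary = []
    size, channels = input_size, in_channels
    for index, num_filters in enumerate((6, 16), start=1):
        # 5x5 convolution: one weight per filter entry per input channel, plus biases.
        params = 5 * 5 * channels * num_filters + num_filters
        size, channels = conv_out(size, 5), num_filters
        summary.append((f"conv{index}", (size, size, channels), params))
        # 2x2 pooling with stride 2 has no trainable parameters.
        size = conv_out(size, 2, stride=2)
        summary.append((f"pool{index}", (size, size, channels), 0))
    features = size * size * channels
    for name, units in (("fc1", 120), ("fc2", 84), ("output", num_classes)):
        summary.append((name, (units,), features * units + units))
        features = units
    return summary


summary = lenet5_summary()
for name, shape, params in summary:
    print(f"{name:7s} {str(shape):13s} {params}")
total = sum(params for _, _, params in summary)
print("total parameters:", total)  # 61706
```

Under these assumptions the total comes to 61,706, in line with the roughly 60k quoted above, and most of it sits in the first fully connected layer.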
Classical Networks (AlexNet)
More Classical Networks
• VGG-16
• ResNet
• Inception
• And many more…
Some Notes
• Building your own model from scratch can be a tedious and cumbersome
process.
• In many cases, we also face issues such as a lack of available data.
• Steps to follow:
– Use an open-source implementation
– Transfer learning: take a pre-trained network and adapt it to the new
task we are working on.
– Data augmentation: deep learning models perform well when we have a large
amount of data, so we enlarge the training set with transformed copies of existing images.
– E.g., mirroring, random cropping, rotating
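The three augmentations named above can be sketched with plain NumPy. The function names, the image size, and the restriction to 90-degree rotations are simplifying assumptions. Real pipelines usually rely on a library and allow arbitrary angles.

```python
import numpy as np


def mirror(image):
    """Flip the image left to right."""
    return image[:, ::-1]


def random_crop(image, crop_h, crop_w, rng):
    """Cut out a randomly positioned crop_h x crop_w window."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]


def rotate90(image, k=1):
    """Rotate by k * 90 degrees counterclockwise in the height/width plane."""
    return np.rot90(image, k)


rng = np.random.default_rng(0)
image = rng.random((4, 6, 3))  # height 4, width 6, 3 channels

augmented = [mirror(image), random_crop(image, 3, 3, rng), rotate90(image)]
print([a.shape for a in augmented])  # [(4, 6, 3), (3, 3, 3), (6, 4, 3)]
```

Each transformed copy keeps the original label, so one labeled image yields several training examples.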
Resources
• https://www.youtube.com/playlist?list=PLGP2q2bIgaNzhSv4yMX6yPxwQ0mk4CPwS