Image Features and Descriptors
CONTENTS
Feature detectors
Feature descriptors
Harris Corner
SIFT, ORB, BRIEF features
Blob detectors (LoG, DoG, DoH, HoG)
Image matching and object detection
Haar-like features – Face detection
Feature Extraction
After an image has been segmented into regions or their boundaries, the resulting sets of
segmented pixels usually have to be converted into a form suitable for further computer
processing
Typically, the step after segmentation is feature extraction, which consists of feature
detection and feature description
Feature detection refers to finding the features in an image, region, or boundary
Feature description assigns quantitative attributes to the detected features
For example, we might detect corners in a region boundary, and describe those corners by their
orientation and location, both of which are quantitative attributes
Feature processing methods are subdivided into three principal categories, depending on
whether they are applicable to boundaries, regions, or whole images. Some features are
applicable to more than one category
Feature descriptors should be as insensitive as possible to variations in parameters such as
scale, translation, rotation, illumination, and viewpoint
Ideally, descriptors are either insensitive to, or can be normalized to compensate for, variations in one or more of these parameters
Image feature
A feature is a distinctive attribute or description of “something” that we want to label or differentiate; the key words here are label and differentiate
“Something” of interest refers either to individual image objects, or even to entire images or sets of images
Features are attributes that help assign unique labels to objects in an image or, more generally, are of value in differentiating between entire images or families of images
There are two principal aspects of image feature extraction: feature detection and feature description
Feature extraction refers to both detecting the features and then describing them; the extraction process must encompass both
Example:
Consider object corners as features
Description refers to assigning quantitative (or qualitative) attributes to the detected features, such as corner orientation, and location with respect to other corners
Knowing that there are corners in an image has limited use without additional information that can help us differentiate between objects in an image, or between images, based on corners and their attributes
A feature descriptor is invariant with respect to a set of transformations if its value remains
unchanged after the application of any transformation from the family
A feature descriptor is covariant with respect to a set of transformations if applying any transformation from the set to the entity produces the same transformation in the value of the descriptor
For example, area is an invariant feature descriptor with respect to a family of transformations consisting of translations and rotations
If we add the affine transformation scaling to the family, the descriptor area ceases to be invariant with respect to the extended family
The descriptor is now covariant with respect to the family, because scaling the area of the region by any factor scales the value of the descriptor by the same factor
Similarly, the descriptor direction (of the principal axis of the region) is covariant, because rotating the region by any angle has the same effect on the value of the descriptor
We can compensate for changes in direction of a region by computing its actual direction and rotating the region so that its principal axis points in a predefined direction
Corner Detector
The measure R = det(M) − k·trace(M)² = λ₁λ₂ − k(λ₁ + λ₂)², where λ₁ and λ₂ are the eigenvalues of the local structure tensor M, has large positive values when both eigenvalues are large, indicating the presence of a corner;
it has large negative values when one eigenvalue is large and the other small, indicating an edge;
its absolute value is small when both eigenvalues are small, indicating a flat region
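As an illustration, a minimal sketch of Harris corner detection with scikit-image; corner_harris and corner_peaks are real skimage.feature functions, while the k and threshold values here are illustrative choices, not values from the slides:

import numpy as np
from skimage import data
from skimage.feature import corner_harris, corner_peaks

image = data.camera()                        # built-in grayscale test image

# Harris response R = det(M) - k * trace(M)^2 at every pixel
response = corner_harris(image, method='k', k=0.05, sigma=1)

# keep local maxima of R above a relative threshold -> corner coordinates
corners = corner_peaks(response, min_distance=5, threshold_rel=0.01)
print(corners[:5])                           # (row, col) pairs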
Harris Detector
Example
Image Matching
First, we compute the interest points, or Harris corners, in both images
A small window around each point is considered, and the correspondences between the points are then computed using a weighted sum of squared differences. This measure is not very robust, and it is only usable with slight viewpoint changes
Once the correspondences are found, a set of source and corresponding destination coordinates is obtained; these are used to estimate the geometric transformation between the two images
A simple estimation of the parameters from the coordinates is not enough, as many of the correspondences are likely to be faulty
The RANdom SAmple Consensus (RANSAC) algorithm is used to estimate the parameters robustly, first by classifying the points into inliers and outliers, and then by fitting the model to the inliers while ignoring the outliers, in order to find matches consistent with an affine transformation
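A condensed sketch of the corner-matching step with scikit-image; for simplicity it uses an unweighted sum of squared differences rather than the weighted version described above, and the window size is an illustrative choice (the robust RANSAC estimation is sketched after the next slide):

import numpy as np
from skimage.feature import corner_harris, corner_peaks
from skimage.util import img_as_float

def match_corners(img1, img2, window=9):
    # Match Harris corners between two grayscale images using the
    # (unweighted) sum of squared differences over a small window.
    img1, img2 = img_as_float(img1), img_as_float(img2)
    half = window // 2
    c1 = corner_peaks(corner_harris(img1), min_distance=5)
    c2 = corner_peaks(corner_harris(img2), min_distance=5)

    def inside(img, pts):
        # keep only corners whose full window lies inside the image
        h, w = img.shape
        ok = ((pts[:, 0] >= half) & (pts[:, 0] < h - half) &
              (pts[:, 1] >= half) & (pts[:, 1] < w - half))
        return pts[ok]

    def patch(img, rc):
        r, c = rc
        return img[r - half:r + half + 1, c - half:c + half + 1]

    c1, c2 = inside(img1, c1), inside(img2, c2)
    src, dst = [], []
    for rc1 in c1:
        ssd = [np.sum((patch(img1, rc1) - patch(img2, rc2)) ** 2)
               for rc2 in c2]
        src.append(rc1)
        dst.append(c2[int(np.argmin(ssd))])  # best, not necessarily correct
    return np.array(src), np.array(dst)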
Approach: we want to avoid the impact of outliers, so let’s look for “inliers” and use only those
Intuition: if an outlier is chosen to compute the current fit, then the resulting line won’t have much support from the rest of the points
Keep the transformation with the largest number of inliers
RANSAC loop:
Randomly select a seed group of points on which to base the transformation estimate (e.g., a group of matches)
Compute the transformation from the seed group
Find the inliers to this transformation
If the number of inliers is sufficiently large, re-compute a least-squares estimate of the transformation on all of the inliers
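A sketch of this loop using skimage.measure.ransac on the src/dst correspondences from the previous sketch; min_samples, residual_threshold, and max_trials are illustrative values:

import numpy as np
from skimage.measure import ransac
from skimage.transform import AffineTransform

# src, dst hold matched (row, col) coordinates; skimage geometric
# transforms expect (x, y) = (col, row), so flip the columns
src_xy, dst_xy = src[:, ::-1], dst[:, ::-1]

model, inliers = ransac((src_xy, dst_xy), AffineTransform,
                        min_samples=3,          # an affine fit needs 3 pairs
                        residual_threshold=2,   # max reprojection error (px)
                        max_trials=500)
print('inliers:', inliers.sum(), 'of', len(inliers))
print(model.params)                             # 3 x 3 affine matrix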
Algorithm
After RANSAC
RANSAC divides the data into inliers and outliers and yields an estimate computed from a minimal set of inliers
Improve this initial estimate with estimation over all inliers (e.g., with standard least-squares minimization)
But this may change the set of inliers, so alternate fitting with re-classification as inlier/outlier
Pros:
A general method, suited to a wide range of model-fitting problems
Easy to implement, and easy to calculate its failure rate
Cons:
Only handles a moderate percentage of outliers without the cost blowing up
Many real problems have a high rate of outliers (but sometimes a selective choice of random subsets can help)
A voting strategy, the Hough transform, can handle a high percentage of outliers
Local Features
General Approach
Scale Covariance
Goal: independently detect corresponding regions in scaled versions of the
same image
We need a scale selection mechanism for finding the characteristic region size that is covariant with the image transformation
Edge = ripple
Blob = superposition of two ripples
Spatial selection: the magnitude of the Laplacian response will achieve a
maximum at the center of the blob, provided the scale of the Laplacian is
“matched” to the scale of the blob
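A minimal sketch of multiscale blob detection with scikit-image's blob_log (LoG); blob_dog and blob_doh are drop-in alternatives, and the parameter values here are illustrative:

from skimage import data
from skimage.feature import blob_log

image = data.coins()                   # grayscale image with blob-like objects
# each row of the result is (row, col, sigma); blob radius ~ sigma * sqrt(2)
blobs = blob_log(image, min_sigma=2, max_sigma=30, num_sigma=10, threshold=0.1)
print(blobs[:5])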
Scale selection
Scale normalization
Blob detection in 2D
Scale Selection
Characteristic scale
Efficient Implementation
DoG
HoG
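The HoG slides are figure-based; for reference, a minimal sketch of computing HOG features with skimage.feature.hog (the parameter values are the library defaults, not values from the slides):

from skimage import data
from skimage.feature import hog

image = data.camera()                  # grayscale input
features, hog_image = hog(image, orientations=9, pixels_per_cell=(8, 8),
                          cells_per_block=(3, 3), visualize=True)
print(features.shape)                  # flattened block-normalized histograms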
SIFT
Scale Space
The first stage is to find image locations that are invariant to scale change
This is achieved by searching for stable features across all possible scales, using a function of scale known as scale space, which is a multi-scale representation suitable for handling image structures at different scales in a consistent manner
objects in unconstrained scenes will appear in different ways, depending on the scale at
which images are captured. Because these scales may not be known beforehand, a
reasonable approach is to work with all relevant scales simultaneously
Scale space represents an image as a one-parameter family of smoothed images, with the
objective of simulating the loss of detail that would occur as the scale of an image decreases
The parameter controlling the smoothing is referred to as the scale parameter
Gaussian kernels are used to implement smoothing, so the scale parameter is the standard
deviation
The only smoothing kernel that meets a set of important constraints, such as linearity and shift-invariance, is the Gaussian lowpass kernel
Scale Space
The smoothed images in scale space are used to compute differences of Gaussians; covering a full octave implies that two additional images past the octave image are required
This gives a total of s + 3 images, because the octave image is always the (s + 1)th image in the stack (counting from the bottom)
It follows that the octave image is the third image from the top in the expanded sequence of s + 3 images
Each octave contains five images, indicating that s = 2 was used in this case
The first image in the second octave is formed by downsampling the original image (by skipping every other row and column), and then smoothing it using a kernel whose standard deviation is twice the one used in the first octave (i.e., σ₂ = 2σ₁)
Subsequent images in that octave are smoothed using σ₂ multiplied by the same sequence of values of k as in the first octave
The same basic procedure is then repeated for subsequent octaves. That is, the first image of a new octave is formed by:
(1) downsampling the original image enough times to achieve half the size of the image in the previous octave
(2) smoothing the downsampled image with a new standard deviation that is twice the standard deviation of the previous octave
The rest of the images in the new octave are obtained by smoothing the downsampled image with the new standard deviation multiplied by the same sequence of values of k as before
Scale Space
When k = √2, the first image of a new octave can be obtained without having to smooth the downsampled image
This is because, for this value of k, the kernel used to smooth the first image of every octave is the same as the kernel used to smooth the third image from the top of the previous octave
The first image of a new octave can therefore be obtained directly by downsampling that third image of the previous octave by 2; the result will be the same
The third image from the top of any octave is called the octave image, because the standard deviation used to smooth it is twice (k² = 2) the value of the standard deviation used to smooth the first image in the octave
Because each octave is composed of five images, it follows that we are again using s = 2. We chose σ₁ = √2/2 = 0.707 and k = √2 = 1.414 for this example so that the numbers would result in familiar multiples
The images going up scale space are blurred using Gaussian kernels with progressively larger standard deviations, and the first image of the second and subsequent octaves is obtained by downsampling the octave image from the previous octave by 2
The images become significantly more blurred (and consequently lose more fine detail) as they go up both in scale and in octave
The images in the third octave show significantly fewer details, but their gross appearance is unmistakably that of the same structure
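A sketch of building this scale space, following the description above literally: each octave downsamples the original image by skipping rows and columns, doubles the standard deviation, and holds s + 3 progressively smoothed images (σ₁ = √2/2 and s = 2 as in the example; gaussian is from skimage.filters):

import numpy as np
from skimage import data
from skimage.filters import gaussian

def build_scale_space(image, sigma1=np.sqrt(2) / 2, s=2, n_octaves=3):
    k = 2.0 ** (1.0 / s)                  # k = sqrt(2) when s = 2
    pyramid = []
    for o in range(n_octaves):
        base = image[::2 ** o, ::2 ** o]  # downsample the original image
        sigma = sigma1 * 2 ** o           # standard deviation doubles per octave
        # s + 3 images smoothed with sigma, k*sigma, ..., k^(s+2)*sigma
        pyramid.append([gaussian(base, sigma * k ** i) for i in range(s + 3)])
    return pyramid

pyr = build_scale_space(data.camera().astype(float))
# differences of Gaussians for the first octave (s + 2 DoG images)
dogs = [pyr[0][i + 1] - pyr[0][i] for i in range(len(pyr[0]) - 1)]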
SIFT initially finds the locations of keypoints using the Gaussian-filtered images, and then refines the locations and validity of those keypoints using two processing steps
If the determinant is negative, the curvatures have different signs and the
keypoint in question cannot be an extremum, so it is discarded
Let r denote the ratio of the largest to the smallest eigenvalue, so that a = rb. Then Tr(H)²/Det(H) = (a + b)²/ab = (r + 1)²/r, which depends on the ratio of the eigenvalues rather than on their individual values
The minimum of (r + 1)²/r occurs when the eigenvalues are equal, and it increases with r
Therefore, to check that the ratio of principal curvatures is below some threshold r, we only need to check that Tr(H)²/Det(H) < (r + 1)²/r, which is a simple computation
In the experimental results reported by Lowe [2004], a value of r = 10 was used, meaning that keypoints with ratios of curvature greater than 10 were eliminated
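A sketch of this edge-response test, assuming H is the 2×2 Hessian of D at the keypoint and r = 10 as suggested by Lowe:

import numpy as np

def passes_edge_test(H, r=10.0):
    # keep a keypoint only if the ratio of principal curvatures,
    # derived from the 2x2 Hessian H of D, is below r
    tr = H[0, 0] + H[1, 1]
    det = H[0, 0] * H[1, 1] - H[0, 1] * H[1, 0]
    if det <= 0:                 # curvatures of different signs: discard
        return False
    return tr ** 2 / det < (r + 1) ** 2 / r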
Keypoint Orientation
Keypoint Descriptors
The objective is to compute a descriptor for a local region around each keypoint that is highly distinctive, yet at the same time as invariant as possible to changes in scale, orientation, illumination, and image viewpoint
The idea is to be able to use these descriptors to identify matches
(similarities) between local regions in two or more images
The approach used by SIFT to compute descriptors is based on experimental
results suggesting that local image gradients appear to perform a function
similar to what human vision does for matching and recognizing 3-D objects
from different viewpoints
Procedure
Because there is one gradient computation for each point in the 16 × 16 region surrounding a keypoint, there are 16² = 256 gradient directions to process for each keypoint, with 16 directions in each 4 × 4 subregion
The top-rightmost subregion is shown zoomed to simplify the explanation of the next step, which consists of quantizing all gradient orientations in the 4 × 4 subregion into eight possible directions differing by 45°
Rather than assigning a directional value as a full count to the bin to which it is closest, SIFT
performs interpolation that distributes a histogram entry among all bins proportionally, depending
on the distance from that value to the center of each bin
This is done by multiplying each entry into a bin by a weight of 1 − d, where d is the shortest
distance from the value to the center of a bin, measured in the units of the histogram spacing, so
that the maximum possible distance is 1
For example, the center of the first bin is at 45°/2 = 22.5°, the next center is at 22.5° + 45° = 67.5°, and so on. Suppose that a particular directional value is 22.5°. The distance from that value to the center of the first histogram bin is 0, so we would assign a full entry (i.e., a count of 1) to that bin in the histogram. The distance to the next center would be greater than 0, so we would assign a fraction of a full entry, that is, 1 × (1 − d), to that bin, and so forth for all bins. In this way, every bin gets a proportional fraction of a count, thus avoiding “boundary” effects in which a descriptor changes abruptly as a small change in orientation causes it to be assigned from one bin to another
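A sketch of this proportional (soft) binning for eight 45° bins with centers at 22.5°, 67.5°, and so on; it is illustrative and unweighted by gradient magnitude:

import numpy as np

def soft_bin(angles_deg, n_bins=8):
    # distribute each direction among nearby bins with weight 1 - d,
    # where d is the circular distance in units of the bin spacing
    spacing = 360.0 / n_bins                    # 45 degrees
    centers = spacing / 2 + spacing * np.arange(n_bins)
    hist = np.zeros(n_bins)
    for a in np.atleast_1d(angles_deg):
        d = np.abs((a - centers + 180) % 360 - 180) / spacing
        hist += np.clip(1 - d, 0, None)         # only nearby bins get weight
    return hist

print(soft_bin([22.5]))    # a full count of 1 lands in the first bin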
Procedure
The eight directions of a histogram can be visualized as a small cluster of vectors, with the length of each vector equal to the value of its corresponding bin
Sixteen histograms are computed, one for each 4 × 4 subregion of the 16 × 16 region surrounding a keypoint
A descriptor then consists of a 4 × 4 array, each element of which contains eight directional values
The descriptor data are organized as a 128-dimensional vector (4 × 4 × 8 = 128)
In order to achieve orientation invariance, the coordinates of the descriptor and the gradient orientations are rotated relative to the keypoint orientation
In order to reduce the effects of illumination, a feature vector is normalized in two stages
First, the vector is normalized to unit length by dividing each component by the vector norm. A change in image contrast resulting
from each pixel value being multiplied by a constant will multiply the gradients by the same constant, so the change in contrast will
be cancelled by the first normalization
A brightness change caused by a constant being added to each pixel will not affect the gradient values because they are computed
from pixel differences. Therefore, the descriptor is invariant to affine changes in illumination
However, nonlinear illumination changes resulting, for example, from camera saturation, can also occur. These types of changes can
cause large variations in the relative magnitudes of some of the gradients, but they are less likely to affect gradient orientation
SIFT reduces the influence of large gradient magnitudes by thresholding the values of the normalized feature vector so that all
components are below the experimentally determined value of 0.2
After thresholding, the feature vector is renormalized to unit length
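A sketch of this two-stage normalization (the 0.2 threshold is the experimentally determined value quoted above):

import numpy as np

def normalize_descriptor(v, t=0.2):
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)      # cancels affine contrast changes
    v = np.minimum(v, t)           # damp large magnitudes (nonlinear illumination)
    return v / np.linalg.norm(v)   # renormalize to unit length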
Summary
1. Construct the scale space. The parameters that need to be specified are σ, s (k is computed from s), and the number of octaves. Suggested values are σ = 1.6, s = 2, and three octaves.
2. Obtain the initial keypoints. Compute the differences of Gaussians, D(x, y, σ), from the smoothed images in scale space. Find the extrema in each D(x, y, σ) image. These are the initial keypoints.
3. Improve the accuracy of the location of the keypoints. Interpolate the values of D(x, y, σ) via a Taylor expansion.
4. Delete unsuitable keypoints. Eliminate keypoints that have low contrast and/or are poorly localized. This is done by evaluating D from Step 3 at the improved locations. All keypoints whose values of D are lower than a threshold are deleted. A suggested threshold value is 0.03. Keypoints associated with edges are deleted also, using the ratio test; a value of 10 is suggested for r.
5. Compute keypoint orientations. Compute the magnitude and orientation of each keypoint using the histogram-based procedure.
6. Compute keypoint descriptors. Compute a feature (descriptor) vector for each keypoint. If a region of size 16 × 16 around each keypoint is used, the result will be a 128-dimensional feature vector for each keypoint.
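For reference, all six steps are implemented in OpenCV; a minimal sketch (SIFT_create is available in opencv-python 4.4+, and the file name scene.jpg is a hypothetical input):

import cv2

img = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE)
# parameter names mirror the quantities above (sigma, layers per octave,
# contrast threshold, edge-response ratio r)
sift = cv2.SIFT_create(sigma=1.6, nOctaveLayers=3,
                       contrastThreshold=0.04, edgeThreshold=10)
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)   # N x 128 descriptors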
Example
Observations
We do not always know a priori when images have been acquired under different conditions and geometrical arrangements
A more practical test is to compute features for a prototype image and test them against unknown samples
Rotated subimage:
10 matches were found, of which 2 were incorrect
These are excellent results, considering the relatively small size of the subimage and the fact that it was rotated
Half-sized subimage:
11 matches were found, of which 4 were incorrect
These are good results, considering that significant detail was lost when the subimage was reduced in size
Algorithm
BRIEF
The idea is to convert image patches into binary feature vectors so that, together, they can represent an object
Each keypoint is described by a feature vector that is a 128–512 bit string
Images need to be smoothed before the pairwise intensity comparisons can be meaningful, because single-pixel tests are very sensitive to noise
Each test compares a pair of pixel locations; the (x, y) pair, also called a random pair, is located inside the patch
In total, n tests (random pairs) have to be selected to create the binary feature vector, and these n tests are chosen using one of five approaches (sampling geometries); see the sketch below
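A sketch with scikit-image's BRIEF implementation; its mode='normal' corresponds to the Gaussian (G II) sampling geometry described on the next slides, since both endpoints are drawn from an isotropic Gaussian with spread 0.04 × S²:

from skimage import data
from skimage.feature import BRIEF, corner_harris, corner_peaks

image = data.camera()
keypoints = corner_peaks(corner_harris(image), min_distance=5)

extractor = BRIEF(descriptor_size=256, patch_size=49, mode='normal')
extractor.extract(image, keypoints)
print(extractor.descriptors.shape)   # (n_keypoints_kept, 256) boolean array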
Sampling Strategies
Uniform (G I): Both x and y pixels in the random pair are drawn from a uniform distribution with a spread of S/2 around the keypoint. The pair (test) can lie close to the patch border
Gaussian (G II): Both x and y pixels in the random pair are drawn from a Gaussian distribution with a spread of 0.04 × S² around the keypoint
Gaussian (G III): The first pixel (x) in the random pair is drawn from a Gaussian distribution centered on the keypoint with a standard deviation or spread of 0.04 × S². The second pixel (y) is drawn from a Gaussian distribution centered on the first pixel (x) with a standard deviation or spread of 0.01 × S². This forces the test (pair) to be more local. Test (pair) locations outside the patch are clamped to the edge of the patch
Coarse polar grid (G IV): Both x and y pixels in the random pair are sampled from discrete locations of a coarse polar grid, introducing a spatial quantization
Coarse polar grid (G V): The first pixel (x) in the random pair is at (0, 0) and the second pixel (y) is drawn from discrete locations of a coarse polar grid
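As an illustration, a sketch of drawing n test pairs under G II; both endpoints come from an isotropic Gaussian with variance 0.04 × S² (standard deviation 0.2 × S) and are clamped to the patch:

import numpy as np

def sample_pairs_gii(n_tests=256, S=49, seed=0):
    rng = np.random.default_rng(seed)
    std = 0.2 * S                          # sqrt(0.04 * S^2)
    # shape: (n_tests, 2 endpoints, 2 coordinates), relative to patch center
    pairs = rng.normal(0.0, std, size=(n_tests, 2, 2))
    half = (S - 1) / 2
    return np.clip(np.round(pairs), -half, half).astype(int)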
ORB
ORB Algorithm
Image Matching
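A sketch of ORB-based matching with scikit-image; match_descriptors uses Hamming distance for binary descriptors, and the rotation angle and keypoint count are illustrative:

from skimage import data
from skimage.transform import rotate
from skimage.feature import ORB, match_descriptors

img1 = data.camera()
img2 = rotate(img1, 15)                    # rotated version to match against

orb = ORB(n_keypoints=200)
orb.detect_and_extract(img1)
k1, d1 = orb.keypoints, orb.descriptors
orb.detect_and_extract(img2)
k2, d2 = orb.keypoints, orb.descriptors

matches = match_descriptors(d1, d2, cross_check=True)  # rows: (idx1, idx2)
print(len(matches), 'matches')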
Haar-like Features
Haar-like Features
Detection phase
a window of the target size is moved over the input image
for each subsection of the image the Haar-like feature is calculated
difference is then compared to a learned threshold that separates non-objects from
objects
This is only a weak learner or classifier (its detection quality is slightly better than
random guessing)
a large number of Haar-like features are necessary to describe an object with sufficient
accuracy
organized in something called a classifier cascade to form a strong learner or classifier
key advantage: calculation speed
Due to the use of integral images, a Haar-like feature of any size can be calculated in
constant time (approximately 60 microprocessor instructions for a 2-rectangle feature)
Integral Images
Summed-area tables
These are defined as two-dimensional lookup tables, in the form of a matrix with the same size as the original image
Each element of the integral image contains the sum of all pixels located in the up-left region of the original image (relative to the element's position)
This allows computing the sum of any rectangular area in the image, at any position or scale, using only four lookups
Each Haar-like feature may need more than four lookups, depending on how it is defined:
2-rectangle features need six lookups
3-rectangle features need eight lookups
4-rectangle features need nine lookups
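A sketch of a summed-area table and the four-lookup rectangle sum; the table is zero-padded on the top and left so the lookups stay in bounds:

import numpy as np

def integral_image(img):
    # ii[r, c] holds the sum of img[:r, :c]
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r0, c0, r1, c1):
    # sum of img[r0:r1, c0:c1] using only four lookups
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 3, 3), img[1:3, 1:3].sum())   # both print 30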
Face Detection
Haar-like Features
Each Haar-like feature is only a weak classifier, and hence a large number of Haar-like features are required to detect a face with good accuracy
A huge number of Haar-like features is computed for all possible sizes and locations of each Haar-like kernel, using the integral images
An AdaBoost ensemble classifier is used to select important features from this huge set and combine them into a strong classifier model during the training phase
The learned model is then used to classify a face region with the selected features
In general, most of the regions in an image are non-face regions. So, it is first checked whether a window is a face region at all; if it is not, it is discarded in a single shot and a different region is inspected where a face is likely to be found. This ensures that more time is dedicated to checking possible face regions
To implement this idea, the concept of a cascade of classifiers is introduced. Instead of applying all of the huge number of features to a window, the features are grouped into different stages of classifiers and applied one by one
The first few stages contain very few features. If a window fails at the first stage, it is discarded and the remaining features are not evaluated on it. If it passes, the second stage of features is applied, and so on. A face region corresponds to a window that passes all the stages
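A sketch of running such a cascade with OpenCV's pretrained frontal-face Haar cascade; cv2.data.haarcascades is the standard path shipped with opencv-python, and people.jpg is a hypothetical input:

import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

img = cv2.imread('people.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# scaleFactor and minNeighbors control the multi-scale sliding-window search
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)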
Stages
Algorithm
AdaBoost
Pros
AdaBoost is easy to implement
It iteratively corrects the mistakes of the weak classifiers and improves accuracy by combining weak learners
Many different base classifiers can be used with AdaBoost
AdaBoost is not very prone to overfitting; this has been observed experimentally, but there is no concrete theoretical explanation for it
Cons
AdaBoost is sensitive to noisy data
It is highly affected by outliers, because it tries to fit every point perfectly
AdaBoost is slower compared to XGBoost
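For reference, a sketch with scikit-learn's AdaBoostClassifier, whose default base estimator is a depth-1 decision tree (a stump); the dataset and parameters are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)
clf.fit(X_tr, y_tr)
print('test accuracy:', clf.score(X_te, y_te))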
Summary