Lecture 06
Point Feature Detection and Matching – Part 2
Davide Scaramuzza
http://rpg.ifi.uzh.ch
Lab Exercise 4 – Today
Implement SIFT blob detection and matching
Main questions
• What features are repeatable and distinctive?
• How to describe a feature?
• How to establish correspondences, i.e., compute matches?
Feature Matching
Given a feature point in one image, how do we find its corresponding point in the other image?
• Brute-force matching: compare each feature descriptor of Image 1 against the descriptor of each
feature in Image 2 and assign as correspondence the feature with the closest descriptor (e.g.,
minimum SSD). If each image contains N features, we need to perform N² comparisons.
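A minimal NumPy sketch of brute-force SSD matching (function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def match_brute_force(desc1, desc2):
    """For each descriptor in desc1 (N1 x D), return the index of the
    closest descriptor in desc2 (N2 x D) under SSD (sum of squared
    differences)."""
    # Pairwise SSD via broadcasting: dists[i, j] = ||desc1[i] - desc2[j]||^2
    diff = desc1[:, None, :].astype(float) - desc2[None, :, :].astype(float)
    dists = np.sum(diff ** 2, axis=2)      # N1 x N2 comparisons
    return np.argmin(dists, axis=1)        # index of closest descriptor in Image 2
```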
Recall: Patch and Census Descriptors
• Patch descriptor
(i.e., the patch of intensity values; integer values)
• Census descriptor
(binary string built by comparing each pixel of the patch against the center pixel; binary values)
HOG Descriptor (Histogram of Oriented Gradients)
• The patch is divided into a grid of cells, and for each cell a histogram of gradient directions is compiled.
• The HOG descriptor is the concatenation of these histograms (used in SIFT)
• Unlike the patch and Census descriptors, HOG has float values.
Example of a gradient histogram with 8 orientation bins over [0, 2π).
Each vote is weighted by the gradient magnitude.
[Figure: one orientation histogram per cell, each over [0, 2π); the HOG descriptor concatenates them into a single 1D vector]
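To make the construction concrete, here is a minimal NumPy sketch of a HOG computation over a grid of cells (a simplified illustration; the function name and the absence of block normalization are my assumptions, not from the lecture):

```python
import numpy as np

def hog_descriptor(patch, n_cells=4, n_bins=8):
    """Divide a square patch into n_cells x n_cells cells, compile one
    n_bins-bin gradient-orientation histogram per cell (votes weighted
    by gradient magnitude), and concatenate into a 1D float vector."""
    gy, gx = np.gradient(patch.astype(float))      # image gradients
    mag = np.sqrt(gx**2 + gy**2)
    ang = np.arctan2(gy, gx) % (2 * np.pi)         # orientations in [0, 2*pi)
    cell = patch.shape[0] // n_cells
    hists = []
    for i in range(n_cells):
        for j in range(n_cells):
            sl = (slice(i * cell, (i + 1) * cell), slice(j * cell, (j + 1) * cell))
            h, _ = np.histogram(ang[sl], bins=n_bins, range=(0, 2 * np.pi),
                                weights=mag[sl])   # magnitude-weighted votes
            hists.append(h)
    return np.concatenate(hists)                   # length n_cells^2 * n_bins
```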
Feature Descriptor Invariance
Are feature descriptors invariant (robust) to geometric and photometric changes?
Outline
• How to achieve descriptor invariance to:
• Scale
• Rotation
• Viewpoint
• The SIFT blob detector and descriptor
• Other corner and blob detectors and descriptors
Scale changes
How can we match image patches corresponding to the same feature but belonging to
images taken at different scales? Possible solution: rescale the patch
[Figure: corresponding patches in Image 1 and Image 2 at different scales, progressively rescaled until they match]
Scale changes
• Scale search is time consuming (it needs to be done individually for every patch
in one image)
• Complexity is N²S, assuming N features per image and S rescalings per
feature
• Solution: automatic scale selection, i.e., automatically assign each feature its
own “scale” (i.e., size)
Automatic Scale Selection
• Idea: Design a function on the image patch, which is scale invariant (i.e., it has the same
value for corresponding patches, even if they are at different scales)
[Figure: the function f plotted against scale σ for a patch in Image 1 and against σ′ for the corresponding patch in Image 2 (images related by scale = 1/2); f takes the same value at corresponding scales]
Automatic Scale Selection: Example
[Figure, shown progressively over several slides: f evaluated for growing patch sizes s₁σ in Image 1 and s₂σ′ in Image 2; the extrema of f select the corresponding scales σ and σ′ of the same feature in both images]
Automatic Scale Selection
• A “good” function for scale detection should have a single & sharp peak
[Figure: three candidate functions f plotted against patch size: a flat response (bad), one with several comparable peaks (good or bad?), and one with a single sharp peak (very good!)]
Automatic Scale Selection
• The ideal function for determining the scale is one that highlights sharp discontinuities
• Solution: convolve image with a kernel that highlights edges
f = Kernel ∗ Image
• It has been shown that the Laplacian of Gaussian (LoG) kernel is optimal under certain
assumptions [Lindeberg’94]:

LoG(x, y, σ) = ∇²G_σ(x, y) = ∂²G_σ(x, y)/∂x² + ∂²G_σ(x, y)/∂y²
Lindeberg, “Scale-space theory: A basic tool for analysing structures at different scales”, Journal of Applied Statistics, 1994. PDF.
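A hedged sketch of scale selection with the LoG, assuming SciPy's gaussian_laplace (the function name and the keypoint-wise evaluation are illustrative): evaluate the scale-normalized LoG response at a keypoint over a range of σ values and keep the strongest one.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def select_scale(image, x, y, sigmas):
    """Evaluate the scale-normalized LoG response at keypoint (x, y)
    for a range of sigmas; return the sigma with the strongest response."""
    responses = []
    for s in sigmas:
        # sigma^2 normalization keeps responses comparable across scales
        log = (s ** 2) * gaussian_laplace(image.astype(float), sigma=s)
        responses.append(abs(log[y, x]))
    return sigmas[int(np.argmax(responses))]
```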
Automatic Scale Selection
The correct scale(s) is (are) found as local extrema of the filter response across consecutive smoothed patches, i.e., across the scale σ of the LoG.
[Figure: filter response plotted against scale σ; local extrema mark the selected scales]
Outline
• How to achieve descriptor invariance to:
• Scale
• Rotation
• Viewpoint
• The SIFT blob detector and descriptor
• Other corner and blob detectors and descriptors
How to achieve invariance to Rotation
Derotation:
• Determine patch orientation
e.g., eigenvectors of M matrix of Harris or
dominant gradient direction (see next slide)
• Derotate patch through “patch warping”
This puts the patches into a canonical orientation
How to determine the patch orientation?
1. First, multiply the patch by a Gaussian kernel to make the shape circular rather than square
2. Then, compute gradient vectors at each pixel
3. Build a histogram of gradient orientations, weighted by the gradient magnitudes. This histogram is a particular case of the HOG
descriptor (a grid of 1×1 cells)
4. Extract all local maxima in the histogram: each local maximum above a threshold is a candidate dominant orientation
5. Construct a different keypoint descriptor for each dominant orientation (see the sketch below)
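A minimal sketch of steps 1-4 above (the names and simplifications are mine; in particular it thresholds histogram bins rather than detecting true local maxima; Lowe's SIFT uses 36 bins and a 0.8 peak ratio):

```python
import numpy as np

def dominant_orientations(patch, n_bins=36, peak_ratio=0.8):
    """Gaussian-weight the patch, build a magnitude-weighted orientation
    histogram, and return the orientations of all bins within peak_ratio
    of the strongest bin (a simplification of true local-maxima search)."""
    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    sigma = h / 2.0                                  # step 1: circular weighting
    gauss = np.exp(-((xx - w / 2) ** 2 + (yy - h / 2) ** 2) / (2 * sigma ** 2))
    gy, gx = np.gradient(patch.astype(float))        # step 2: gradient vectors
    mag = np.sqrt(gx**2 + gy**2) * gauss
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    hist, edges = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi),
                               weights=mag)          # step 3: weighted histogram
    peaks = np.where(hist >= peak_ratio * hist.max())[0]   # step 4: candidate peaks
    return [(edges[i] + edges[i + 1]) / 2 for i in peaks]  # bin-center angles
```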
Outline
• How to achieve descriptor invariance to:
• Scale
• Rotation
• Viewpoint
• The SIFT blob detector and descriptor
• Other corner and blob detectors and descriptors
How to achieve invariance to small viewpoint changes?
Affine warping provides invariance to small viewpoint changes
• The second moment matrix M of the Harris detector can be used to identify the two directions of fastest
and slowest change of the SSD around the feature
• From these two directions, an elliptic patch is extracted
• The region inside the ellipse is normalized to a canonical circular patch
Recap:
How to achieve Scale, Rotation, and Affine-invariant patch matching
1. Scale assignment: compute the scale using the LoG operator. If there are multiple local extrema, assign multiple scales
2. Multiply the patch by a Gaussian kernel to make the shape circular rather than square
3. Rotation assignment: use Harris or gradient histogram to find dominant orientation. If multiple local extrema, assign
multiple orientations
4. Affine invariance: use Harris eigenvectors to extract affine transformation parameters
5. Warp the patch into a canonical patch
How to warp a patch?
• Start with an “empty” canonical patch (all pixels set to 0)
• For each pixel (𝑥, 𝑦) in the empty patch, apply the warping function 𝑾(𝒙, 𝒚)
to compute the corresponding position in the source image. It will be in
floating point and will fall between the image pixels.
• Interpolate the intensity values from the closest pixels in the source image
using one of:
• Nearest neighbor interpolation
• Bilinear interpolation
• Bicubic interpolation
Example: Similarity Transformation (rotation, translation, rescaling)
• Warping function W: rotation (θ) plus rescaling (s) and translation (a, b), mapping each pixel (x, y) of the canonical patch to the point (x′, y′) in the source image:

x′ = s (x cos θ − y sin θ) + a
y′ = s (x sin θ + y cos θ) + b
Nearest Neighbor vs Bilinear vs Bicubic Interpolation
Bilinear Interpolation
• It is an extension of linear interpolation for interpolating functions of two variables (e.g., 𝑥 and 𝑦) on a
rectilinear 2D grid.
• The key idea is to perform linear interpolation first in one direction, and then again in the other direction.
• Although each step is linear in the sampled values and in the position, the interpolation as a whole is not
linear but rather quadratic in the sample location.
I(x, y) = I(0,0)(1 − x)(1 − y) + I(0,1)(1 − x)y + I(1,0)x(1 − y) + I(1,1)xy

(This formula won’t be asked at the exam ☺)

In this geometric visualization, the value at the black spot is the sum of the value at each
colored spot multiplied by the area of the rectangle of the same color.
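Putting the warping procedure and the bilinear formula together, a minimal sketch (function names and the similarity-warp usage example are illustrative, not from the lecture):

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinearly interpolate img at the floating-point location (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return (img[y0,     x0]     * (1 - dx) * (1 - dy) +
            img[y0,     x0 + 1] * dx       * (1 - dy) +
            img[y0 + 1, x0]     * (1 - dx) * dy +
            img[y0 + 1, x0 + 1] * dx       * dy)

def warp_patch(img, W, size):
    """Inverse warping: for each pixel (x, y) of an initially empty
    size x size canonical patch, map it through W into the source image
    and interpolate there."""
    patch = np.zeros((size, size))
    for y in range(size):
        for x in range(size):
            xs, ys = W(x, y)                    # position in the source image
            if 0 <= xs < img.shape[1] - 1 and 0 <= ys < img.shape[0] - 1:
                patch[y, x] = bilinear_sample(img, xs, ys)
    return patch

# Example W: the similarity transformation of the previous slide
s, theta, a, b = 1.5, np.pi / 6, 10.0, 20.0
W = lambda x, y: (s * (x * np.cos(theta) - y * np.sin(theta)) + a,
                  s * (x * np.sin(theta) + y * np.cos(theta)) + b)
```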
Disadvantage of Patch Descriptors
• Disadvantage of patch descriptors:
• If the warp is not estimated accurately, very small errors in rotation, scale, and
viewpoint will affect the matching score significantly
• Computationally expensive (need to unwarp every patch)
Outline
• Automatic Scale Selection
• The SIFT blob detector and descriptor
• Other corner and blob detectors and descriptors
SIFT Descriptor
• Scale Invariant Feature Transform
• Proposed by David Lowe (first presented in 1999; journal version published in 2004)
Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, International Journal of Computer Vision, 2004. PDF.
SIFT Descriptor
Descriptor computation:
• Consider a 𝟏𝟔 × 𝟏𝟔 pixel patch
• Multiply the patch by a Gaussian filter, compute dominant orientation, and de-rotate patch
• Compute HOG descriptor
• Divide patch into 4×4 cells
• Use 8-bin histograms (i.e., 8 directions)
• Concatenate all histograms into a single 1D vector
• Resulting SIFT descriptor: 4×4×8 = 128 values
• Descriptor matching: SSD (i.e., Euclidean distance)
• Why 4×4 cells and why 8 bins? See later
• To make the descriptor robust to illumination changes, it is normalized to unit length:

v̄ = v / √(Σᵢ vᵢ²)

• An additive intensity change does not affect the gradients, and a multiplicative change rescales all gradient magnitudes by the same factor, which the normalization cancels. We can conclude that the SIFT descriptor is invariant to affine illumination changes.
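A hedged sketch of the descriptor stage (it omits Lowe's trilinear interpolation between bins and cells and the clamping of histogram entries at 0.2; the function name is illustrative):

```python
import numpy as np

def sift_descriptor(patch16):
    """From a 16x16, de-rotated, Gaussian-weighted patch: a 4x4 grid of
    cells, one 8-bin orientation histogram per cell, concatenated and
    normalized to unit length (4*4*8 = 128 values)."""
    gy, gx = np.gradient(patch16.astype(float))
    mag = np.sqrt(gx**2 + gy**2)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    desc = []
    for i in range(4):
        for j in range(4):
            sl = (slice(4 * i, 4 * i + 4), slice(4 * j, 4 * j + 4))
            h, _ = np.histogram(ang[sl], bins=8, range=(0, 2 * np.pi),
                                weights=mag[sl])
            desc.append(h)
    v = np.concatenate(desc)                  # 128-element 1D vector
    return v / (np.linalg.norm(v) + 1e-12)    # unit norm: illumination invariance
```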
SIFT Matching Robustness
• Can handle severe viewpoint changes (up to 50 degree out-of-plane rotation)
• Can handle even severe non-affine changes in illumination (from low-light to bright scenes)
• Computationally expensive: 10 frames per second (fps) on an i7 processor
• Original SIFT binary files: http://people.cs.ubc.ca/~lowe/keypoints
• OpenCV C/C++ implementation: https://docs.opencv.org/master/da/df5/tutorial_py_sift_intro.html
SIFT Detector
• SIFT uses the Difference of Gaussian (DoG) kernel instead of the Laplacian of Gaussian (LoG) because it is
computationally cheaper
LoG(x, y, σ) ≈ DoG(x, y, σ) = G(x, y, kσ) − G(x, y, σ)

• The proof that the LoG can be approximated by a Difference of Gaussians comes from the heat equation: ∂G_σ/∂σ = σ ∇²G_σ
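A minimal sketch of one octave of DoG images, assuming SciPy's gaussian_filter (σ = 1.6 and s = 3 follow Lowe's paper, which uses s + 3 blurred images per octave):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_images(image, sigma=1.6, s=3):
    """One octave: s + 3 Gaussian-blurred images with sigmas sigma * k^i,
    k = 2^(1/s), and their s + 2 pairwise differences (the DoG images)."""
    k = 2.0 ** (1.0 / s)
    blurred = [gaussian_filter(image.astype(float), sigma * k**i)
               for i in range(s + 3)]
    return [blurred[i + 1] - blurred[i] for i in range(s + 2)]
```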
SIFT Detector (location + scale)
SIFT keypoints: local extrema in both space and scale of the DoG images
• Each pixel is compared to 26 neighbors (below in green): its 8 neighbors in the current image + 9 neighbors
in the adjacent upper scale + 9 neighbors in the adjacent lower scale
• If the pixel is a maximum or minimum (i.e., an extremum) with respect to all of its 26 neighbors, then it is
selected as a SIFT feature
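A sketch of the 26-neighbor test (it assumes 1 ≤ s ≤ len(dogs) − 2 and an interior pixel; ties count as extrema in this simplified check):

```python
import numpy as np

def is_scale_space_extremum(dogs, s, y, x):
    """True if pixel (y, x) of DoG level s is an extremum among its 26
    neighbors: 8 in the same image plus 9 in each adjacent scale.
    `dogs` is a list of same-size DoG images."""
    cube = np.stack([d[y - 1:y + 2, x - 1:x + 2] for d in dogs[s - 1:s + 2]])
    center = dogs[s][y, x]
    return center == cube.max() or center == cube.min()
```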
DoG Images example
Magnitudes of successive DoG images, with s = 4 and σ = 1.6 (higher octaves shown at the input resolution for convenience):
• G(k⁵σ) − G(k⁴σ) (second octave)
• G(k⁶σ) − G(k⁵σ) (second octave)
• G(k⁷σ) − G(k⁶σ) (second octave)
• G(k⁸σ) − G(k⁷σ) (second octave)
• G(k⁹σ) − G(k⁸σ) (third octave)
Local extrema of DoG images across Scale and Space
[Figure: adjacent smoothed images G(kσ) and G(σ); local extrema of their difference are found across both scale and space]
SIFT: Recap
• SIFT: Scale Invariant Feature Transform
• An approach to detect and describe regions of interest in an image.
• SIFT detector = DoG detector
• SIFT features are invariant to 2D rotation, and reasonably invariant to
rescaling, viewpoint changes (up to 50 degrees), and illumination
• It runs in real time but is expensive (10 Hz on an i7 laptop)
• The expensive steps are the scale detection and descriptor extraction
Original SIFT Demo by David Lowe
Download the original SIFT binaries and Matlab function from:
http://people.cs.ubc.ca/~lowe/keypoints
What’s the output of SIFT?
• Descriptor: 4x4x8 = 128-element 1D vector
• Location (pixel coordinates of the center of the patch): 2D vector
• Scale (i.e., size) of the patch: 1 scalar value (high scale corresponds to high blur in the space-scale pyramid)
• Orientation (i.e., angle of the patch): 1 scalar value
SIFT Repeatability with Viewpoint Changes
Repeatability = (# correspondences detected) / (# correspondences present)
Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, International Journal of Computer Vision, 2004. PDF.
SIFT Repeatability with Number of Scales per Octave
Repeatability = (# correspondences detected) / (# correspondences present)
Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, International Journal of Computer Vision, 2004. PDF.
Influence of Number of Orientations and Number of Sub-patches
The graph shows that a single orientation histogram (n = 1) is very poor at discriminating.
The results improve with a 4x4 array of histograms with 8 orientations.
Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, International Journal of Computer Vision, 2004. PDF.
Application of SIFT to Object recognition
• Can be implemented easily by returning the object with the largest number of
correspondences with the template
• For planar objects, 4-point RANSAC can be used to remove outliers (see Lecture 8).
• For rigid 3D objects, 5-point RANSAC (see Lecture 8).
Application of SIFT to Panorama Stitching
AutoStitch: http://matthewalunbrown.com/autostitch/autostitch.html
M. Brown and D. G. Lowe, “Recognising Panoramas”, International Conference on Computer Vision (ICCV), 2003. PDF.
Main questions
• What features are repeatable and distinctive?
• How to describe a feature?
• How to establish correspondences, i.e., compute matches?
Feature Matching
• Given a feature in I₁, how do we find the best match in I₂?
1. Define a distance function that compares two descriptors: (Z)SSD, (Z)SAD, (Z)NCC, or the Hamming distance for binary
descriptors (e.g., Census, ORB, BRIEF, BRISK, FREAK)
2. Brute-force matching:
1. Compare each feature in I₁ against all the features in I₂ (N² comparisons, where N is the number of
features in each image)
2. Take the one at minimum distance, i.e., the closest descriptor
Feature Matching
• Issue with the closest descriptor: it can occasionally return good scores for false matches
• Better approach: compute the ratio of the distances to the 1st and the 2nd closest descriptor:

d₁ / d₂ < threshold (usually 0.8)

where d₁ and d₂ are the distances to the closest and the second-closest descriptor, respectively.
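A minimal sketch of brute-force matching with the ratio test (names are illustrative):

```python
import numpy as np

def match_ratio_test(desc1, desc2, thresh=0.8):
    """Keep a match only if the closest descriptor is significantly
    closer than the second closest (d1 / d2 < thresh)."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)   # Euclidean distances
        j1, j2 = np.argsort(dists)[:2]              # two closest candidates
        if dists[j1] < thresh * dists[j2]:
            matches.append((i, j1))                 # (feature in I1, match in I2)
    return matches
```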
Distance Ratio: Explanation
• In SIFT, the nearest neighbor is defined as the keypoint with minimum Euclidean distance. However, many
features in Image 1 may not have a correct match in Image 2 because they arise from background
clutter or were not detected in Image 2.
• An effective measure is obtained by comparing the distance of the closest neighbor to that of the second-
closest neighbor. This measure performs well because correct matches need to have the closest neighbor
significantly closer than the closest incorrect match to achieve reliable matching.
• For false matches, there will likely be a number of other false matches within similar distances due to the
high dimensionality of the feature space (this problem is known as curse of dimensionality). We can think
of the second-closest match as providing an estimate of the density of false matches within this portion of
the feature space and at the same time identifying specific instances of feature ambiguity.
SIFT Feature Matching: Distance Ratio
The SIFT paper recommends using a threshold of 0.8. Where does this come from?
In Lowe’s paper, the distributions of the distance ratio for correct and incorrect matches show that a threshold of 0.8 eliminates about 90% of the false matches while discarding less than 5% of the correct ones.
Outline
• Automatic Scale Selection
• The SIFT blob detector and descriptor
• Other corner and blob detectors and descriptors
“FAST” Corner Detector
• FAST: Features from Accelerated Segment Test
• Analyses intensities along a ring of 16 pixels centered on
the pixel of interest 𝒑
• 𝒑 is a FAST corner if a set of N contiguous pixels on the
ring are:
• all brighter than the pixel intensity 𝑰(𝒑) + 𝒕𝒉𝒓𝒆𝒔𝒉𝒐𝒍𝒅,
• or all darker than 𝑰 𝒑 − 𝒕𝒉𝒓𝒆𝒔𝒉𝒐𝒍𝒅
• Common value of N: 12
• A simple classifier is used to check the quality of corners and reject the weak ones
• FAST is the fastest corner detector ever made: it can process 100 million pixels per second (<3 ms per image)
• Issue: it is very sensitive to image noise (high in low light). This is why Harris is still more common despite being a bit slower
• In fact, FAST was initially proposed to find candidate corner regions to be verified with the Harris detector (a sketch of the segment test follows the references below)
Rosten, Drummond, Fusing points and lines for high performance tracking, International Conference on Computer Vision (ICCV), 2005. PDF.
Rosten, Porter, Drummond, “Faster and better: a machine learning approach to corner detection”, IEEE Trans. Pattern Analysis and Machine Intelligence, 2010. PDF.
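A plain-Python sketch of the segment test described above (offsets follow the standard radius-3 Bresenham circle; the fast rejection test and the learned decision tree of the real detector are omitted, and the threshold value is illustrative):

```python
import numpy as np

# The 16 pixel offsets (dx, dy) of a radius-3 Bresenham circle, in circular order
RING = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
        (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def is_fast_corner(img, y, x, t=20, n=12):
    """Segment test: (y, x) is a corner if n contiguous ring pixels are
    all brighter than I(p) + t or all darker than I(p) - t."""
    center = int(img[y, x])
    ring = np.array([int(img[y + dy, x + dx]) for dx, dy in RING])
    for sign in (+1, -1):                        # test brighter, then darker
        ok = sign * (ring - center) > t
        run, best = 0, 0
        for v in np.concatenate([ok, ok]):       # doubled ring handles wrap-around
            run = run + 1 if v else 0
            best = max(best, run)
        if best >= n:
            return True
    return False
```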
“SURF” Blob Detector & Descriptor
• SURF: Speeded Up Robust Features
• Similar to SIFT but much faster
• Basic idea: approximate the Gaussian and DoG filters using box filters
• Results comparable with SIFT, plus:
• Faster computation
• Generally shorter descriptors
[Figure: the original second-order partial derivatives of a Gaussian, ∂²G(x, y)/∂y² and ∂²G(x, y)/∂x∂y]
Bay, Tuytelaars, Van Gool, “Speeded Up Robust Features”, European Conference on Computer Vision (ECCV), 2006. PDF.
“BRIEF” Descriptor (can be applied to corners or blobs)
• BRIEF: Binary Robust Independent Elementary Features
• The pattern is generated randomly (or learned) only once; then, the same pattern is
used for all patches
• Pros: binary descriptor: allows very fast Hamming-distance matching (count of the
number of bits that differ between the matched descriptors); see the sketch below
• Cons: not scale/rotation invariant
[Figure: pattern of intensity-pair samples, generated randomly]
Calonder, Lepetit, Strecha, Fua, “BRIEF: Binary Robust Independent Elementary Features”, European Conference on Computer Vision (ECCV), 2010. PDF.
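A hedged sketch of the BRIEF idea (pattern size, patch radius, and names are illustrative, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)
# Sampling pattern: generated randomly once, then reused for all patches
PAIRS = rng.integers(-15, 16, size=(256, 4))     # 256 (dx1, dy1, dx2, dy2) offsets

def brief_descriptor(img, y, x):
    """256 pairwise intensity comparisons around (y, x) -> 256-bit string."""
    return np.array([img[y + dy1, x + dx1] < img[y + dy2, x + dx2]
                     for dx1, dy1, dx2, dy2 in PAIRS], dtype=np.uint8)

def hamming(d1, d2):
    """Hamming distance: the number of bits that differ."""
    return int(np.count_nonzero(d1 != d2))
```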
“ORB” Descriptor (can be applied to corners or blobs)
• ORB: Oriented FAST and Rotated BRIEF
• Keypoint detector originally based on FAST
• Binary descriptor based on BRIEF but adds an
orientation component to make it rotation
invariant
Rublee, Rabaud, Konolige, Bradski, “ORB: An efficient alternative to SIFT or SURF”, International Conference on Computer Vision (ICCV), 2011. PDF.
Leutenegger, Chli, Siegwart, “BRISK: Binary Robust Invariant Scalable Keypoints”, International Conference on Computer Vision (ICCV), 2011. PDF.
“FREAK” Descriptor (can be applied to corners or blobs)
• FREAK: Fast Retina Keypoint
• Rotation and scale invariant
• Binary descriptor
• Sampling pattern similar to BRISK but uses a more pronounced “retinal” (i.e.,
log-polar) sampling pattern inspired by the human retina: higher density of
points near the center
• Pairwise intensity comparisons form binary strings, similar to BRIEF
• Pairs are learned (as in ORB)
• Circles indicate the size of the smoothing kernel
• Coarse-to-fine matching (cascaded approach): first compare the first half of the
bits; if the distance is smaller than a threshold, proceed to compare the next bits, etc.
• Faster to compute and uses less memory than SIFT, SURF, or BRISK
[Figures: the human retina; the FREAK sampling pattern]
Alahi, Ortiz, Vandergheynst, “FREAK: Fast Retina Keypoint”, Conference on Computer Vision and Pattern Recognition (CVPR), 2012. PDF.
“LIFT” Descriptor (can be applied to corners or blobs)
• LIFT: Learned Invariant Feature Transform
• Learning-based descriptor
• Rotation, scale, viewpoint, and illumination invariant
• First, a network predicts the patch orientation, which is used to derotate the patch.
• Then another neural network generates a patch descriptor (128-dimensional) from the derotated patch.
• Illumination invariance is achieved by randomizing illuminations during training.
• The LIFT descriptor beats SIFT in repeatability
[Figures: keypoints with scales and orientations; a CNN predicts the descriptor]
Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, Pascal Fua, “LIFT: Learned Invariant Feature Transform”, European Conference on Computer Vision (ECCV), 2016. PDF.
LIFT vs SIFT
https://youtu.be/hhxAttChmCo
“SuperPoint”: Self-Supervised Interest Point Detection and Description
Detone, Malisiewicz, Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. CVPRW 2018. PDF.
Recap Table
Detector | Localization accuracy of the detector | Descriptors that can be used | Efficiency | Relocalization & loop closing
[Table comparing the detectors and descriptors covered in this lecture; rows shown in the original slide]
Readings
• Ch. 7.1 of Szeliski book, 2nd Edition
• Chapter 4 of Autonomous Mobile Robots book: link
• Ch. 13.3 of Peter Corke book
Understanding Check
Are you able to answer:
• How does automatic scale selection work?
• What are the good and the bad properties that a function for automatic scale selection should have or not
have?
• How can we implement scale invariant detection efficiently? (show that we can do this by resampling the
image vs rescaling the kernel).
• What is a feature descriptor? (patch of intensity value vs histogram of oriented gradients). How do we
match descriptors?
• How is the keypoint detection done in SIFT and how does this differ from Harris?
• How does SIFT achieve orientation invariance?
• How is the SIFT descriptor built?
• What is the repeatability of the SIFT detector after a rescaling of 2? And for a 50 degrees viewpoint change?
• Illustrate the 1st-to-2nd closest ratio test of SIFT matching: what’s the intuitive reasoning behind it? Where does
the 0.8 factor come from?
• How does the FAST detector work? What are its pros and cons compared with Harris?