Chapter 6

TWO DIMENSIONAL MOTION ESTIMATION
Optical flow can be caused not only by object motions, but also by camera movements or illumination condition changes. In this chapter, we start by defining optical flow. We then derive the optical flow equation, which imposes a constraint between image gradients and flow vectors. This is a fundamental equality on which many motion estimation algorithms are based. Next we present the general methodologies for 2D motion estimation. As will be seen, the motion estimation problem is usually converted to an optimization problem that involves three key components: parameterization of the motion field, formulation of the optimization criterion, and finally search for the optimal parameters. Finally, we present motion estimation algorithms developed based on different parameterizations of the motion field and different estimation criteria. Unless specified otherwise, the word "motion" refers to 2D motion in this chapter.
For simplicity, we consider only the luminance component of the video frames, although the same methodology can be applied to the full color information.
Figure 6.1. The optical flow is not always the same as the true motion field. In (a), a sphere is rotating under a constant ambient illumination, but the observed image does not change. In (b), a point light source is rotating around a stationary sphere, causing the highlight point on the sphere to rotate. Adapted from [17, Fig. 12-2].
images of the same object point at different times have the same luminance value. Therefore,

ψ(x + d_x, y + d_y, t + d_t) = ψ(x, y, t).  (6.1.1)

Using Taylor's expansion, when d_x, d_y, d_t are small, we have

ψ(x + d_x, y + d_y, t + d_t) = ψ(x, y, t) + (∂ψ/∂x) d_x + (∂ψ/∂y) d_y + (∂ψ/∂t) d_t.  (6.1.2)

Combining Eqs. (6.1.1) and (6.1.2) yields

(∂ψ/∂x) d_x + (∂ψ/∂y) d_y + (∂ψ/∂t) d_t = 0.  (6.1.3)

The above equation is written in terms of the motion vector (d_x, d_y). Dividing both sides by d_t yields

(∂ψ/∂x) v_x + (∂ψ/∂y) v_y + ∂ψ/∂t = 0,  or  ∇ψᵀ v + ∂ψ/∂t = 0,  (6.1.4)

where v = (v_x, v_y) represents the velocity vector and ∇ψ = [∂ψ/∂x, ∂ψ/∂y]ᵀ is the spatial gradient vector of ψ(x, y, t). In arriving at the above equation, we have assumed that d_t is small, so that v_x = d_x/d_t, v_y = d_y/d_t. The above equation is commonly known as the optical flow equation.² The conditions for this relation to hold are the same as those for the constant intensity assumption, which have been discussed previously in Sec. 5.2.3.
² Another way to derive the optical flow equation is by representing the constant intensity assumption as dψ(x, y, t)/dt = 0. Expanding dψ(x, y, t)/dt in terms of the partials leads to the same equation.
Figure 6.2. Decomposition of the flow vector v at a point into a component v_n e_n along the gradient direction ∇ψ and a component v_t e_t along the tangent line.
As shown in Fig. 6.2, the flow vector v at any point x can be decomposed into two orthogonal components as

v = v_n e_n + v_t e_t,  (6.1.5)

where e_n is the direction vector of the image gradient ∇ψ, to be called the normal direction, and e_t is orthogonal to e_n, to be called the tangent direction. The optical flow equation in Eq. (6.1.4) can be written as

v_n ‖∇ψ‖ + ∂ψ/∂t = 0,  (6.1.6)

where ‖∇ψ‖ is the magnitude of the gradient vector. Three consequences of Eq. (6.1.4) or (6.1.6) are:
1. At any pixel x, one cannot determine the motion vector v based on ∇ψ and ∂ψ/∂t alone. There is only one equation for two unknowns (v_x and v_y, or v_n and v_t). In fact, the undetermined component is v_t. To solve for both unknowns, one needs to impose additional constraints. The most common constraint is that the flow vectors should vary smoothly spatially, so that one can make use of the intensity variation over a small neighborhood surrounding x to estimate the motion at x.
2. Given ∇ψ and ∂ψ/∂t, the projection of the motion vector along the normal direction is fixed, with v_n = −(∂ψ/∂t)/‖∇ψ‖, whereas the projection onto the tangent direction, v_t, is undetermined: any value of v_t would satisfy the optical flow equation. In Fig. 6.2, this means that any point on the tangent line will satisfy the optical flow equation. This ambiguity in estimating the motion vector is known as the aperture problem. The word "aperture" here refers to the small window over which the constant intensity assumption is applied. The motion can be estimated uniquely only if the aperture contains at least two different gradient directions, as illustrated in Fig. 6.3.

Figure 6.3. The aperture problem in motion estimation: to estimate the motion at x_1 using aperture 1, it is impossible to determine whether the motion is upward or perpendicular to the edge, because there is only one spatial gradient direction in this aperture. On the other hand, the motion at x_2 can be determined accurately, because the image has gradients in two different directions in aperture 2. Adapted from [39, Fig. 5.7].
3. In regions with constant brightness, where ‖∇ψ‖ = 0, the flow vector is indeterminate. This is because there are no perceived brightness changes when the underlying surface has a flat pattern. The estimation of motion is reliable only in regions with brightness variation, i.e., regions containing edges or non-flat textures.
The above observations are consistent with the relation between spatial and temporal frequencies discussed in Sec. 2.3.2. There, we have shown that the temporal frequency of a moving object is zero if the spatial frequency is zero, or if the motion direction is orthogonal to the spatial frequency. When the temporal frequency is zero, no changes can be observed in the image pattern, and consequently, the motion is indeterminate.

As will be seen in the following sections, the optical flow equation or, equivalently, the constant intensity assumption plays a key role in all motion estimation algorithms.
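To make these consequences concrete, the following sketch (our illustration, not part of the original text) estimates the normal flow v_n = −(∂ψ/∂t)/‖∇ψ‖ at every pixel from two frames, masking out flat regions where the estimate is unreliable; the function names and the synthetic ramp pattern are our own.

```python
import numpy as np

def normal_flow(psi1, psi2, grad_thresh=1e-3):
    """Estimate the normal-direction flow v_n = -psi_t / ||grad psi||.

    Only the motion component along the spatial gradient is recoverable
    from two frames alone (the aperture problem)."""
    # Central-difference spatial gradients of the anchor frame.
    gy, gx = np.gradient(psi1)
    # Forward temporal difference approximates d(psi)/dt.
    gt = psi2 - psi1
    mag = np.hypot(gx, gy)
    vn = np.full(psi1.shape, np.nan)
    mask = mag > grad_thresh            # flat regions: flow indeterminate
    vn[mask] = -gt[mask] / mag[mask]
    return vn, mask

# Tiny demo: a ramp pattern shifted right by one pixel between frames.
x = np.arange(16, dtype=float)
psi1 = np.tile(x, (16, 1))              # intensity increases to the right
psi2 = np.tile(x - 1.0, (16, 1))        # same pattern moved right by 1 pixel
vn, mask = normal_flow(psi1, psi2)
print(np.nanmean(vn))                   # ~1.0: normal flow along x
```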
Figure 6.4. Forward and backward motion estimation: the motion field d(x; t, t+Δt) or d(x; t, t−Δt) maps the anchor frame at time t onto a tracked frame at time t+Δt or t−Δt. Adapted from [39, Fig. 5.5].
The frame over which the motion field is defined and the frame onto which it is mapped are called the anchor and tracked frames, respectively. In general, we can represent the motion field as d(x; a), where a = [a_1, a_2, ..., a_L]ᵀ is a vector containing all the motion parameters. Similarly, the mapping function can be denoted by w(x; a) = x + d(x; a). The motion estimation problem is to estimate the motion parameter vector a. Methods that have been developed can be categorized into two groups: feature-based and intensity-based. In the feature-based approach, correspondences between pairs of selected feature points in the two video frames are first established. The motion model parameters are then obtained by a least squares fitting of the established correspondences to the chosen motion model. This approach is only applicable to parametric motion models, and can be quite effective in, say, determining global motions. The intensity-based approach applies the constant intensity assumption or the optical flow equation at every pixel and requires the estimated motion to satisfy this constraint as closely as possible. This approach is more appropriate when the underlying motion cannot be characterized by a simple model, and an estimate of a pixel-wise or block-wise motion field is desired.

In this chapter, we only consider intensity-based approaches, which are more widely used in applications requiring motion-compensated prediction and filtering. In general, the intensity-based motion estimation problem can be converted into an optimization problem, and three key questions need to be answered: i) how to parameterize the underlying motion field? ii) what criterion to use to estimate the parameters? and iii) how to search for the optimal parameters? In this section, we first describe several ways to represent a motion field. Then we introduce different types of estimation criteria. Finally, we present search strategies commonly used for motion estimation. Specific motion estimation schemes using different motion representations and estimation criteria will be introduced in subsequent sections.
based" to acknowledge the fact that we are only considering 2D motions, and that a region with
a coherent 2D motion may not always correspond to a physical object.
Figure 6.5. Different motion representations: (a) global, (b) pixel-based, (c) block-based, and (d) region-based. From [38, Fig. 3].
A smooth motion representation suits the interior of an object, which usually undergoes a continuous motion, but it fails to capture motion discontinuities at object boundaries. Adaptive schemes that allow discontinuities where necessary are needed for more accurate motion estimation. Figure 6.5 illustrates the effect of the several motion representations described above for a head-and-shoulder scene. In the next few sections, we will introduce motion estimation methods using different motion representations.
Setting the above gradient to zero yields the least squares solution for d_0:

d_0 = ( Σ_{x∈Λ_0} ∇ψ_1(x) ∇ψ_1(x)ᵀ )⁻¹ ( Σ_{x∈Λ_0} (ψ_1(x) − ψ_2(x)) ∇ψ_1(x) ).  (6.2.7)
When the motion is not a constant but can be related to the model parameters linearly, one can still derive a similar least-squares solution; see Prob. 6.6 in the Problem section.

An advantage of the above method is that the function being minimized is a quadratic function of the MVs when p = 2. If the motion parameters are linearly related to the MVs, then the function has a unique minimum and can be solved for easily. This is not true of the DFD error given in Eq. (6.2.1). However, the optical flow equation is valid only when the motion is small, or when an initial motion estimate d̃(x) that is close to the true motion can be found and one can pre-update ψ_2(x) to ψ_2(x + d̃(x)). When this is not the case, it is better to use the DFD error criterion and find the minimizing solution using a gradient descent or exhaustive search method.
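The solution (6.2.7) is simple to implement. Below is a minimal numpy sketch (ours; psi1 and psi2 stand for the anchor and tracked frames, and the whole frame is treated as the region Λ_0), which is essentially a Lucas-Kanade-type estimator of a single displacement:

```python
import numpy as np

def flow_ls_estimate(psi1, psi2):
    """Least squares displacement for a region with constant motion,
    following Eq. (6.2.7): d0 = (sum g g^T)^-1 sum (psi1 - psi2) g,
    where g is the spatial gradient of the anchor frame."""
    gy, gx = np.gradient(psi1)
    g = np.stack([gx.ravel(), gy.ravel()], axis=1)   # one gradient per pixel
    rhs = (psi1 - psi2).ravel()
    normal = g.T @ g                                 # 2x2 matrix: sum g g^T
    b = g.T @ rhs                                    # sum (psi1 - psi2) g
    return np.linalg.solve(normal, b)                # d0 = (dx, dy)

# Demo on a smooth pattern translated by a small amount.
yy, xx = np.mgrid[0:64, 0:64].astype(float)
psi1 = np.sin(0.2 * xx) + np.cos(0.3 * yy)
psi2 = np.sin(0.2 * (xx - 0.4)) + np.cos(0.3 * (yy - 0.2))  # shift (0.4, 0.2)
print(flow_ls_estimate(psi1, psi2))   # close to [0.4, 0.2]
```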
Regularization

Minimizing the DFD error or solving the optical flow equation does not always give a physically meaningful motion estimate. This is partially because the constant intensity assumption is not always correct: the imaged intensity of the same object point may vary after an object motion, because of various reflectance and shadowing effects. Secondly, in a region with flat texture, many different motion estimates can satisfy the constant intensity assumption or the optical flow equation. Finally, if the motion parameters are the MVs at every pixel, the optical flow equation does not constrain the motion vector completely. These factors make motion estimation an ill-posed problem.

To obtain a physically meaningful solution, one needs to impose additional constraints to regularize the problem. One common regularization approach is to add a penalty term to the error function in (6.2.1) or (6.2.4), which forces the resulting motion estimate to bear the characteristics of common motion fields. One well-known property of a typical motion field is that it usually varies smoothly from pixel to pixel, except at object boundaries. To enforce smoothness, one can use a penalty term that measures the differences between the MVs of adjacent pixels, i.e.,

E_s(a) = Σ_{x∈Λ} Σ_{y∈N_x} ‖d(x; a) − d(y; a)‖²,  (6.2.8)

where N_x denotes a neighborhood of pixel x.
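As an illustration of such a penalty term, the sketch below (our code, using a 4-neighborhood so that each adjacent pair is counted once per direction) evaluates a discrete version of E_s for a dense motion field:

```python
import numpy as np

def smoothness_penalty(d):
    """Smoothness energy of a dense motion field, in the spirit of
    Eq. (6.2.8): sum of squared differences between the MVs of
    horizontally and vertically adjacent pixels.

    d: array of shape (H, W, 2) holding (dx, dy) at every pixel."""
    dh = d[:, 1:, :] - d[:, :-1, :]      # horizontal neighbor differences
    dv = d[1:, :, :] - d[:-1, :, :]      # vertical neighbor differences
    return np.sum(dh ** 2) + np.sum(dv ** 2)

# A perfectly uniform field has zero penalty; a noisy one does not.
uniform = np.ones((8, 8, 2))
noisy = uniform + 0.1 * np.random.default_rng(0).standard_normal((8, 8, 2))
print(smoothness_penalty(uniform), smoothness_penalty(noisy))
```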
Bayesian Criterion

The Bayesian estimator is based on a probabilistic formulation of the motion estimation problem, pioneered by Konrad and Dubois [22, 38]. Under this formulation, given an anchor frame ψ_1, the image function of the tracked frame ψ_2 is considered a realization of a random field Ψ, and the motion field d is a realization of another random field D. The a posteriori probability distribution of the motion field D given a realization of Ψ and ψ_1 can be written, using the Bayes rule, as

P(D = d | Ψ = ψ_2; ψ_1) = P(Ψ = ψ_2 | D = d; ψ_1) P(D = d; ψ_1) / P(Ψ = ψ_2; ψ_1).  (6.2.10)

In the above notation, the semicolon indicates that the subsequent variables are deterministic parameters. An estimator based on the Bayesian criterion attempts to maximize the a posteriori probability. For given ψ_1 and ψ_2, maximizing the above probability is equivalent to maximizing the numerator only. Therefore, the maximum a posteriori (MAP) estimate of d is

d_MAP = argmax_d { P(Ψ = ψ_2 | D = d; ψ_1) P(D = d; ψ_1) }.  (6.2.11)

The first probability denotes the likelihood of an image frame given the motion field and the anchor frame. Let E represent the random field corresponding to the DFD image e(x) = ψ_2(x + d) − ψ_1(x) for given d and ψ_1; then

P(Ψ = ψ_2 | D = d; ψ_1) = P(E = e),

and the above equation becomes

d_MAP = argmax_d { P(E = e) P(D = d; ψ_1) }
      = argmin_d { −log P(E = e) − log P(D = d; ψ_1) }.  (6.2.12)

From source coding theory (Sec. 8.3.1), the minimum coding length for a source X is its entropy, −log P(X = x). We see that the MAP estimate is equivalent to minimizing the sum of the coding length for the DFD image e and that for the motion field d. As will be shown in Sec. 9.3.1, this is precisely what a video coder using motion-compensated prediction needs to code. Therefore, the MAP estimate for d is equivalent to a minimum description length (MDL) estimate [34]. Because the purpose of motion estimation in video coding is to minimize the bit rate, the MAP criterion is a better choice than minimizing the prediction error alone.

The most common model for the DFD image is a zero-mean independently identically distributed (i.i.d.) Gaussian field, with distribution

P(E = e) = (2πσ²)^(−|Λ|/2) exp( −Σ_{x∈Λ} e²(x) / (2σ²) ),  (6.2.13)

where |Λ| denotes the size of Λ (i.e., the number of pixels in Λ). With this model, minimizing the first term in Eq. (6.2.12) is equivalent to minimizing the previously defined DFD error (when p = 2).
For the motion field D, a common model is a Gibbs/Markov random field [11]. Such a model is defined over a neighborhood structure called a clique. Let C represent the set of cliques; the model assumes

P(D = d) = (1/Z) exp( −Σ_{c∈C} V_c(d) ),  (6.2.14)

where Z is a normalization factor. The function V_c(d) is called the potential function, which is usually defined to measure the difference between pixels in the same clique:

V_c(d) = Σ_{(x,y)∈c} |d(x) − d(y)|².  (6.2.15)

Under this model, minimizing the second term in Eq. (6.2.12) is equivalent to minimizing the smoothing function in Eq. (6.2.8). Therefore, the MAP estimate is equivalent to the DFD-based estimator with an appropriate smoothness constraint.
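Combining the Gaussian DFD model with the pairwise Gibbs prior, the negative log-posterior reduces, up to constants, to a squared DFD error plus a weighted smoothness energy. The sketch below (ours; lam stands for the ratio of the model constants, and the warping is nearest-neighbor for brevity) evaluates this combined objective for a candidate motion field:

```python
import numpy as np

def map_objective(psi1, psi2, d, lam=0.5):
    """Negative log-posterior, up to constants, for the MAP criterion
    (6.2.12) with a Gaussian DFD model (6.2.13) and a pairwise Gibbs
    prior (6.2.14)-(6.2.15): squared DFD plus lam times the smoothness
    energy of the motion field d (shape (H, W, 2))."""
    H, W = psi1.shape
    yy, xx = np.mgrid[0:H, 0:W]
    # Warp the tracked frame by the candidate field; nearest-neighbor
    # sampling with border clipping keeps the sketch short.
    xw = np.clip(np.rint(xx + d[..., 0]).astype(int), 0, W - 1)
    yw = np.clip(np.rint(yy + d[..., 1]).astype(int), 0, H - 1)
    dfd = psi2[yw, xw] - psi1
    smooth = (np.sum((d[:, 1:] - d[:, :-1]) ** 2)
              + np.sum((d[1:, :] - d[:-1, :]) ** 2))
    return np.sum(dfd ** 2) + lam * smooth

rng = np.random.default_rng(1)
psi1 = rng.random((16, 16))
psi2 = np.roll(psi1, shift=2, axis=1)        # true motion: d = (2, 0)
d_zero = np.zeros((16, 16, 2))
d_true = np.zeros((16, 16, 2))
d_true[..., 0] = 2.0
# The true (and smooth) field scores a lower energy than the zero field.
print(map_objective(psi1, psi2, d_zero) > map_objective(psi1, psi2, d_true))
```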
6.2.3 Minimization Methods

The error functions presented in Sec. 6.2.2 can be minimized using various optimization methods. Here we only consider exhaustive search and gradient-based search methods. Usually, for the exhaustive search, the MAD is used for reasons of computational simplicity, whereas for the gradient-based search, the MSE is used for its mathematical tractability.

Obviously, the advantage of the exhaustive search method is that it guarantees reaching the global minimum. However, such a search is feasible only if the number of unknown parameters is small and each parameter takes only a finite set of discrete values. To reduce the search time, various fast algorithms can be developed, which achieve sub-optimal solutions.

The most common gradient descent methods include steepest gradient descent and the Newton-Raphson method. A brief review of these methods is provided in Appendix B. A gradient-based method can handle unknown parameters in a high-dimensional continuous space. However, it can only guarantee convergence to a local minimum. The error functions introduced in the previous section are in general not convex and can have many local minima that are far from the global minimum. Therefore, it is important to obtain a good initial solution through the use of prior knowledge, or by adding a penalty term to make the error function convex.

With a gradient-based method, one must calculate the spatio-temporal gradients of the underlying signal. Appendix A reviews methods for computing first and second order gradients from digitally sampled images. Note that the method used for calculating the gradient functions can have a profound impact on the accuracy and robustness of the associated motion estimation methods, as has been shown by Barron et al. [4]. Using a Gaussian pre-filter followed by a central difference generally leads to significantly better results than the simple two-point difference approximation.
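A sketch of the recommended gradient computation, Gaussian pre-filtering followed by central differences (our implementation; the kernel radius of three standard deviations is a common but arbitrary choice):

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """1D Gaussian kernel, normalized to unit sum."""
    if radius is None:
        radius = int(3 * sigma + 0.5)
    t = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-t ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def smoothed_gradients(psi, sigma=1.0):
    """Gaussian pre-filter followed by central differences, the
    combination reported by Barron et al. to outperform simple
    two-point differences."""
    k = gaussian_kernel(sigma)
    # Separable smoothing: filter the rows, then the columns.
    sm = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, psi)
    sm = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, sm)
    # np.gradient uses central differences in the interior.
    gy, gx = np.gradient(sm)
    return gx, gy

psi = np.fromfunction(lambda y, x: np.sin(0.3 * x) * np.cos(0.2 * y), (32, 32))
gx, gy = smoothed_gradients(psi, sigma=1.5)
print(gx.shape, gy.shape)
```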
where w(x) are the weights assigned to the pixels x. Usually, the weight decreases as the distance from x to x_n increases.

The gradient with respect to d_n is

g_n = ∂E_n/∂d_n = 2 Σ_{x∈B(x_n)} w(x) e(x; d_n) (∂ψ_2/∂x)|_{x+d_n},  (6.3.3)

where e(x; d_n) = ψ_2(x + d_n) − ψ_1(x) is the DFD at x under the estimate d_n. Let d_n^l represent the estimate at the l-th iteration; the first order gradient descent method yields the following update algorithm:

d_n^{l+1} = d_n^l − α g_n(d_n^l),  (6.3.4)

where α is the step size. From Eq. (6.3.3), the update at each iteration depends on the sum of the image gradients at the various pixels, scaled by the weighted DFD values at those pixels.
One can also derive an iterative algorithm using the Newton-Raphson method. From Eq. (6.3.3), the Hessian matrix is

H_n = ∂²E_n/∂d_n² = Σ_{x∈B(x_n)} 2 w(x) [ (∂ψ_2/∂x)(∂ψ_2/∂x)ᵀ |_{x+d_n} + e(x; d_n) (∂²ψ_2/∂x²)|_{x+d_n} ]
    ≈ Σ_{x∈B(x_n)} 2 w(x) (∂ψ_2/∂x)(∂ψ_2/∂x)ᵀ |_{x+d_n},

where the approximation drops the term involving the second order image gradient. This algorithm converges faster than the first order gradient descent method, but it requires more computation in each iteration.
Instead of using gradient-based update algorithms, one can also use exhaustive search to find the d_n that yields the minimal error within a defined search range. This leads to the exhaustive block matching algorithm (EBMA) to be presented in Sec. 6.4.1. The difference from the EBMA is that the neighborhood used here is a sliding window, and an MV is determined for each pixel by minimizing the error over its neighborhood. The neighborhood in general does not have to be a rectangular block.
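The update (6.3.4) translates directly into code. The sketch below (ours) iterates Eq. (6.3.4) for a single pixel, with uniform weights over a square window and bilinear interpolation at the non-integer positions x + d_n; the step size and iteration count are ad hoc choices that may need tuning:

```python
import numpy as np

def warp_bilinear(psi, x, y):
    """Sample psi at the real-valued location (x, y) by bilinear interpolation."""
    H, W = psi.shape
    x = np.clip(x, 0, W - 1.001)
    y = np.clip(y, 0, H - 1.001)
    x0, y0 = int(x), int(y)
    ax, ay = x - x0, y - y0
    return ((1 - ax) * (1 - ay) * psi[y0, x0] + ax * (1 - ay) * psi[y0, x0 + 1]
            + (1 - ax) * ay * psi[y0 + 1, x0] + ax * ay * psi[y0 + 1, x0 + 1])

def pixel_gradient_descent(psi1, psi2, xn, yn, radius=2, step=0.1, iters=20):
    """First order gradient descent, Eq. (6.3.4), for the MV of pixel
    (xn, yn), with uniform weights over a (2*radius+1)^2 window."""
    gy2, gx2 = np.gradient(psi2)
    d = np.zeros(2)
    for _ in range(iters):
        g = np.zeros(2)
        for y in range(yn - radius, yn + radius + 1):
            for x in range(xn - radius, xn + radius + 1):
                e = warp_bilinear(psi2, x + d[0], y + d[1]) - psi1[y, x]  # DFD
                g[0] += 2 * e * warp_bilinear(gx2, x + d[0], y + d[1])
                g[1] += 2 * e * warp_bilinear(gy2, x + d[0], y + d[1])
        d -= step * g          # d^{l+1} = d^l - alpha * g(d^l)
    return d

yy, xx = np.mgrid[0:32, 0:32].astype(float)
psi1 = np.sin(0.4 * xx) * np.cos(0.4 * yy)
psi2 = np.sin(0.4 * (xx - 1.0)) * np.cos(0.4 * (yy - 0.5))  # true MV (1.0, 0.5)
print(pixel_gradient_descent(psi1, psi2, 16, 16))           # should approach it
```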
Because the estimated MV for a block only affects the prediction error in that block, one can estimate the MV for each block individually, by minimizing the prediction error accumulated over this block only, which is:

E_m(d_m) = Σ_{x∈B_m} |ψ_2(x + d_m) − ψ_1(x)|^p.  (6.4.2)

One way to determine the d_m that minimizes the above error is by exhaustive search; this method is called the exhaustive block matching algorithm (EBMA). As illustrated in Fig. 6.6, the EBMA determines the optimal d_m for a given block B_m in the anchor frame by comparing it with all candidate blocks B′_m in the tracked frame within a predefined search region and finding the one with the minimum error. The displacement between the two blocks is the estimated motion vector.

To reduce the computational load, the MAD error (p = 1) is often used. The search region is usually symmetric with respect to the current block, up to R_x pixels to the left and right, and up to R_y pixels above and below, as illustrated in Fig. 6.6. If it is known that the dynamic range of the motion is the same in the horizontal and vertical directions, then R_x = R_y = R. The estimation accuracy is determined by the search stepsize, which is the distance between two nearby candidate blocks in the horizontal or vertical direction. Normally, the same stepsize is used along the two directions. In the simplest case, the stepsize is one pixel, which is known as integer-pel accuracy search.
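The complete integer-pel EBMA fits in a few lines. The following sketch (our illustration, using the MAD criterion and skipping candidates that fall outside the frame) returns the MV field for all blocks:

```python
import numpy as np

def ebma(psi1, psi2, N=16, R=16):
    """Exhaustive block matching with integer-pel accuracy.

    psi1: anchor frame, psi2: tracked frame (same H x W, multiples of N).
    Returns an array of MVs of shape (H/N, W/N, 2) minimizing the MAD."""
    H, W = psi1.shape
    mvs = np.zeros((H // N, W // N, 2), dtype=int)
    for bi in range(H // N):
        for bj in range(W // N):
            y0, x0 = bi * N, bj * N
            block = psi1[y0:y0 + N, x0:x0 + N]
            best = np.inf
            for dy in range(-R, R + 1):
                for dx in range(-R, R + 1):
                    ys, xs = y0 + dy, x0 + dx
                    if ys < 0 or xs < 0 or ys + N > H or xs + N > W:
                        continue          # candidate falls outside the frame
                    cand = psi2[ys:ys + N, xs:xs + N]
                    mad = np.mean(np.abs(cand - block))
                    if mad < best:
                        best, mvs[bi, bj] = mad, (dx, dy)
    return mvs

rng = np.random.default_rng(0)
psi1 = rng.random((64, 64))
psi2 = np.roll(psi1, shift=(3, 5), axis=(0, 1))   # global motion (dx, dy) = (5, 3)
print(ebma(psi1, psi2, N=16, R=8)[0, 0])          # -> [5 3]
```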
Figure 6.6. The search procedure of the exhaustive block matching algorithm.
Let the block size be N × N pixels, and the search range be ±R pixels in both horizontal and vertical directions (cf. Fig. 6.6). With a stepsize of one pixel, the total number of candidate matching blocks for each block in the anchor frame is (2R + 1)². Let an operation be defined as consisting of one subtraction, one absolute value computation, and one addition. The number of operations for calculating the MAD for each candidate estimate is N². The number of operations for estimating the MV for one block is then (2R + 1)² N². For an image of size M × M, there are (M/N)² blocks (assuming M is a multiple of N). The total number of operations for a complete frame is then M²(2R + 1)². It is interesting to note that the overall computational load is independent of the block size N.

As an example, consider M = 512, N = 16, R = 16; the total operation count per frame is 2.85 × 10⁸. For a video sequence with a frame rate of 30 fps, the number of operations required per second is 8.55 × 10⁹, an astronomical number! This example shows that EBMA requires intense computation, which poses a challenge to applications requiring software-only implementation. Because of this problem, various fast algorithms have been developed, which trade off estimation accuracy for reduced computation. Some fast algorithms are presented in Sec. 6.4.3. One advantage of EBMA is that it can be implemented in hardware using a simple and modular design, and speed-up can be achieved by using multiple modules in parallel. There have been many research efforts on efficient realization of the EBMA using VLSI/ASIC chips, which sometimes involve slight modifications of the algorithm to trade off accuracy for reduced computation, memory space, or memory access. For a good review of VLSI architectures for implementing EBMA and other fast algorithms for block matching, see [21, 32, 14].
Figure 6.7. Half-pel accuracy block matching. Filled circles are samples existing in the original tracked frame; open circles are samples to be interpolated for calculating the matching error, for a candidate MV d_m = (−1, 1.5). Instead of calculating these samples on demand for each candidate MV, a better approach is to pre-interpolate the entire tracked frame.
Example 6.1: Figure 6.8(c) shows the motion field estimated by a half-pel EBMA algorithm for the two frames given in Figure 6.8(a-b). Figure 6.8(d) shows the predicted anchor frame based on the estimated motion. This is obtained by replacing each block in the anchor frame by its best matching block in the tracked frame. The image size is 352 × 288 and the block size is 16 × 16. We can see that the majority of blocks are predicted accurately; however, some blocks are not well predicted. Some of these blocks undergo non-translational motions, such as those covering the eyes and the mouth. Other blocks contain both the foreground object and the background, with only the foreground object moving. There are also blocks where the image intensity change is due to the change in the reflection patterns as the head turns. The motion variation over these blocks cannot be approximated well by a constant MV, and the EBMA algorithm simply identifies a block in the tracked frame that has the smallest absolute error from the given block in the anchor frame. Furthermore, the predicted image is discontinuous along certain block boundaries, which is the notorious blocking artifact common with the EBMA algorithm. These artifacts are due to the inherent limitation of the block-wise translational motion model, and to the fact that the MV for a block is determined independently of the MVs of its adjacent blocks.

The accuracy between a predicted image and the original one is usually measured by the PSNR defined previously in Eq. (1.5.6). The PSNR of the image predicted by the half-pel EBMA is 29.86 dB. With the integer-pel EBMA, the resulting predicted image is visually very similar, although the PSNR is slightly lower.
Figure 6.8. Example motion estimation results: (a) the tracked frame; (b) the anchor frame; (c-d) motion field and predicted image for the anchor frame (PSNR = 29.86 dB) obtained by half-pel accuracy EBMA; (e-f) motion field (represented by the deformed mesh overlaid on the tracked frame) and predicted image (PSNR = 29.72 dB) obtained by the mesh-based motion estimation scheme in [43].
Figure 6.9. The 2D-logarithmic search method. The search points in the tracked frame are shown with respect to a block center at (i, j) in the anchor frame. In this example, the best matching MVs in steps 1 to 5 are (0,2), (0,4), (2,4), (2,6), and (2,6). The final MV is (2,6). From [28, Fig. 11].

Figure 6.10. The three-step search method. In this example, the best matching MVs in steps 1 to 3 are (3,3), (3,5), and (2,6). The final MV is (2,6). From [28, Fig. 12].
The three-step search method starts with a stepsize equal to or slightly larger than half of the maximum search range. In each step, nine search points are compared: the central point of the search square, and eight search points located on the search area boundaries. The stepsize is reduced by half after each step, and the search ends with a stepsize of 1 pel. At each new step, the search center is moved to the best matching point resulting from the previous step. Let R_0 represent the initial search stepsize; there are at most L = ⌊log₂ R_0 + 1⌋ search steps, where ⌊x⌋ represents the integer part of x. If R_0 = R/2, then L = ⌊log₂ R⌋. At each search step, eight points are searched, except at the very beginning, when nine points need to be examined. Therefore, the total number of search points is 8L + 1. For example, for a search range of R = 32, with EBMA the total number of search points is 4225, whereas with the three-step method the number is reduced to 41, a saving factor of more than 100. Unlike the 2D-log search method, the three-step method has a fixed, predictable number of search steps and search points. In addition, it has a more regular structure. These features make the three-step method more amenable to VLSI implementation than the 2D-log method and some other fast algorithms.
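A sketch of the three-step search for a single block (our code; candidates falling outside the frame are rejected, and ties are broken by the order of enumeration):

```python
import numpy as np

def three_step_search(psi1, psi2, y0, x0, N=16, R0=4):
    """Three-step search for the block at (y0, x0) in the anchor frame:
    test 9 points on a square of half-width R0, recenter on the best
    one, halve the stepsize, and stop after the stepsize-1 pass."""
    H, W = psi1.shape
    block = psi1[y0:y0 + N, x0:x0 + N]

    def mad(dy, dx):
        ys, xs = y0 + dy, x0 + dx
        if ys < 0 or xs < 0 or ys + N > H or xs + N > W:
            return np.inf                  # reject out-of-frame candidates
        return np.mean(np.abs(psi2[ys:ys + N, xs:xs + N] - block))

    dy = dx = 0
    s = R0
    while s >= 1:
        # Nine candidates: the current center and the 8 surrounding points.
        candidates = [(dy + i * s, dx + j * s)
                      for i in (-1, 0, 1) for j in (-1, 0, 1)]
        dy, dx = min(candidates, key=lambda c: mad(*c))
        s //= 2
    return dx, dy

# Smooth test pattern so the MAD surface guides the coarse steps.
yy, xx = np.mgrid[0:64, 0:64].astype(float)
psi1 = np.exp(-((xx - 32.0) ** 2 + (yy - 32.0) ** 2) / 200.0)
psi2 = np.roll(psi1, shift=(6, 2), axis=(0, 1))   # motion (dx, dy) = (2, 6)
print(three_step_search(psi1, psi2, 16, 16))      # -> (2, 6)
```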
Comparison of Different Fast Algorithms. Table 6.1 compares the minimum and maximum numbers of search points and the number of search steps required by several different search algorithms. As can be seen, some algorithms have a more regular structure and hence a fixed number of computations, while others have very different best-case and worst-case numbers. For VLSI implementation, structural regularity is more important, whereas for software implementation, the average-case complexity (which is in general closer to the best case) is more important. For an analysis of the implementation complexity and cost using VLSI circuits for these algorithms, see the article by Hang et al. [14].

The above discussions assume that the search accuracy is integer-pel. To achieve half-pel accuracy, one can add a final step to any fast algorithm, which searches with a half-pel stepsize in a ±1 pel neighborhood of the best matching point found by the integer-pel search.
Ψ̃(f) = Ψ_1(f) Ψ_2*(f) / |Ψ_1(f) Ψ_2*(f)| = e^{j2π dᵀf},  (6.4.5)

where the superscript * indicates complex conjugation. Taking the inverse Fourier transform results in the phase correlation function (PCF):⁴

PCF(x) = F⁻¹{Ψ̃(f)} = δ(x + d).  (6.4.6)

⁴ The name comes from the fact that it is the cross correlation between the phase portions of the two frames.
where {ψ(i, j), 0 ≤ i, j ≤ 2^N − 1} are the pixel intensity values of the original block. The sign-truncated patterns are obtained by

ST_pattern_n(i, j) = 0 if Mean_n(i, j) ≥ Mean_{n−1}(⌊i/2⌋, ⌊j/2⌋), and 1 otherwise.  (6.4.8)

The STF vector, decomposed to the n-th level for a 2^N × 2^N block, can then be represented as

STFV_N^n = {ST_pattern_N, ST_pattern_{N−1}, ..., ST_pattern_{N−n+1}, mean_{N−n}}.  (6.4.9)

When n = N, a block is fully decomposed, with the STF vector

STFV_N^N = {ST_pattern_N, ST_pattern_{N−1}, ..., ST_pattern_1, mean_0}.  (6.4.10)

All the intermediate mean vectors are only used to generate the ST patterns and can be discarded. Therefore, the final STF representation consists of a multiresolution binary sequence with (4/3)(4^N − 1) bits and a one-byte mean. This is a much reduced data set compared with the original 4^N bytes of pixel values. Also, this feature set allows binary Boolean operations to be used for block matching.
As an example, let us consider how to form the STF vector for a 4×4 block with 2 layers. First, the mean pyramid is formed as

[158  80  59  74; 80  69  59  74; 87  86  65  62; 116 100  72  58]  ⟹  [97 67; 97 64]  ⟹  81

The ST patterns are then obtained as:

[0 1 1 0; 1 1 1 0; 0 0 1 0; 0 0 0 1],  [0 1; 0 1],  81

The STF vector decomposed to one layer for the above example is {0110 1110 0010 0001, (97, 67, 97, 64)}. The completely decomposed STF vector is {0101, 0110 1110 0010 0001, 81}. It consists of a 20-bit binary pattern, which includes the 2×2 second-layer sign pattern and the 4×4 first-layer sign pattern, and a mean value. In practical implementations, either completely decomposed or mixed-layer STF vectors can be used.
Comparison of two STF vectors is accomplished by two parallel decision procedures: i) calculating the absolute error between the mean values, and ii) determining the Hamming distance between the binary patterns. The latter can be accomplished extremely fast by using an XOR Boolean operator. Therefore, the main computational load of the HFM-ME lies in the computation of the mean pyramid for the current and all candidate matching blocks. This computation can, however, be done in advance, only once for every possible block. For a detailed analysis of the computational complexity and a fast algorithm using logarithmic search, see [25].
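The sketch below (ours) constructs the mean pyramid and the sign-truncated patterns for a square block following Eqs. (6.4.8)-(6.4.10); means are rounded half up, which reproduces the 2×2 pattern and the mean of the worked example above (the printed 4×4 pattern differs in one row, possibly due to a different rounding convention in the source):

```python
import numpy as np

def mean_pyramid(block):
    """Levels from the original block down to the 1x1 mean; quad means
    are rounded half up, as in the worked example."""
    levels = [block.astype(float)]
    while levels[-1].shape[0] > 1:
        a = levels[-1]
        quad = (a[0::2, 0::2] + a[0::2, 1::2]
                + a[1::2, 0::2] + a[1::2, 1::2]) / 4.0
        levels.append(np.floor(quad + 0.5))
    return levels

def st_patterns(levels):
    """Sign-truncated patterns: bit (i, j) at a level is 0 when
    Mean_n(i, j) >= Mean_{n-1}(i//2, j//2), else 1 (Eq. (6.4.8))."""
    pats = []
    for fine, coarse in zip(levels[:-1], levels[1:]):
        parent = np.kron(coarse, np.ones((2, 2)))   # upsample parent means
        pats.append((fine < parent).astype(int))
    return pats, int(levels[-1][0, 0])              # patterns + global mean

block = np.array([[158, 80, 59, 74],
                  [80, 69, 59, 74],
                  [87, 86, 65, 62],
                  [116, 100, 72, 58]])
pats, mean0 = st_patterns(mean_pyramid(block))
# The 2x2 pattern [[0 1], [0 1]] and the mean 81 match the example above.
print(pats[0], pats[1], mean0, sep="\n")
```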
Figure 6.11. The deformable block matching algorithm finds the best matching quadrangle in the tracked frame for each block in the anchor frame; the four corners x_{m,1}, ..., x_{m,4} of block B_m move by nodal displacements d_{m,k}. The allowed block deformation depends on the motion model used for the block. Adapted from [39, Fig. 6.9].

Figure 6.12. Interpolation of the MV at an interior point of a block from the nodal MVs d_{m,k}, weighted by the interpolation kernels associated with the nodes (graphic not recovered).
following derivation applies to any block B. With the node-based motion model, the motion parameters for a block are the nodal MVs, i.e., a = [d_k, k ∈ K], where K = {1, 2, ..., K}. They can be estimated by minimizing the prediction error over this block, i.e.,

E(a) = Σ_{x∈B} |ψ_2(w(x; a)) − ψ_1(x)|^p,  (6.5.2)

where

w(x; a) = x + Σ_{k∈K} φ_k(x) d_k.  (6.5.3)
As with the BMA, there are many ways to minimize the error in Eq. (6.5.2), including exhaustive search and a variety of gradient-based search methods. The computational load required for exhaustive search, however, can be unacceptable in practice, because of the high dimension of the search space. Gradient-based search algorithms are more feasible in this case. In the following, we derive a Newton-Raphson search algorithm, following the approach in [24].
Define a = [a_xᵀ, a_yᵀ]ᵀ, with a_x = [d_{x,1}, d_{x,2}, ..., d_{x,K}]ᵀ and a_y = [d_{y,1}, d_{y,2}, ..., d_{y,K}]ᵀ. One can show that

(∂E/∂a)(a) = [ ((∂E/∂a_x)(a))ᵀ, ((∂E/∂a_y)(a))ᵀ ]ᵀ,

with

(∂E/∂a_x)(a) = 2 Σ_{x∈B} e(x; a) (∂ψ_2/∂x)|_{w(x;a)} φ(x),
(∂E/∂a_y)(a) = 2 Σ_{x∈B} e(x; a) (∂ψ_2/∂y)|_{w(x;a)} φ(x).

In the above equations, e(x; a) = ψ_2(w(x; a)) − ψ_1(x) and φ(x) = [φ_1(x), φ_2(x), ..., φ_K(x)]ᵀ. By dropping the terms involving second order gradients, the Hessian matrix can be approximated as

[H(a)] = [ H_xx(a)  H_xy(a) ]
         [ H_xy(a)  H_yy(a) ],

with

H_xx(a) = 2 Σ_{x∈B} (∂ψ_2/∂x)²|_{w(x;a)} φ(x) φ(x)ᵀ,
H_yy(a) = 2 Σ_{x∈B} (∂ψ_2/∂y)²|_{w(x;a)} φ(x) φ(x)ᵀ,
H_xy(a) = 2 Σ_{x∈B} (∂ψ_2/∂x)(∂ψ_2/∂y)|_{w(x;a)} φ(x) φ(x)ᵀ.
The Newton-Raphson update algorithm is:

a^{l+1} = a^l − [H(a^l)]⁻¹ (∂E/∂a)(a^l).  (6.5.4)

The update at each iteration thus requires the inversion of the 2K × 2K symmetric matrix [H].

To reduce the numerical computation, we can update the displacements in the x and y directions separately. A similar derivation yields:

a_x^{l+1} = a_x^l − [H_xx(a^l)]⁻¹ (∂E/∂a_x)(a^l),  (6.5.5)
a_y^{l+1} = a_y^l − [H_yy(a^l)]⁻¹ (∂E/∂a_y)(a^l).  (6.5.6)

In this case we only need to invert two K × K matrices in each update. For the four-node case, [H] is an 8 × 8 matrix, while [H_xx] and [H_yy] are 4 × 4 matrices.
As with all gradient-based iterative processes, the above update algorithm may reach a bad local minimum that is far from the global minimum if the initial solution is not chosen properly. A good initial solution can often be provided by the EBMA. For example, consider the four-node model with four nodes at the corners of each block. One can use the average of the motion vectors of the four blocks attached to each node as the initial estimate of the MV for that node. This initial estimate can then be successively updated by using Eq. (6.5.4).

Note that the above algorithm can also be applied to polynomial-based motion representations. In that case, a_x and a_y would represent the polynomial coefficients associated with the horizontal and vertical displacements, respectively, and φ_k(·) would correspond to the elementary polynomial basis functions. However, it is difficult to set the search range for a_x and a_y and to check the feasibility of the resulting motion field.
Figure 6.14. Node and element numbering in a mesh: the k-th node of the m-th element D_m has global position x_{n(m,k)} and nodal MV d_{n(m,k)}, shown for (a) triangular and (b) quadrilateral elements.
where n(m, k) specifies the global index of the k-th node in the m-th element (cf. Fig. 6.14). The function φ_{m,k}(x) is the interpolation kernel associated with node k in element m. It depends on the desired contribution of the k-th node in B_{1,m} to the MV at x. This interpolation mechanism has been shown previously in Fig. 6.12. To guarantee continuity across element boundaries, the interpolation kernels should satisfy:

0 ≤ φ_{m,k}(x) ≤ 1,  Σ_k φ_{m,k}(x) = 1,  ∀ x ∈ B_m,

and

φ_{m,k}(x_l) = δ_{k,l} = 1 if k = l, and 0 if k ≠ l.
Figure 6.15. (a) A standard triangular element; (b) a standard quadrilateral element (a square).
In finite element method (FEM) analysis, these functions are called shape functions [45]. If all the elements have the same shape, then all the shape functions are equal, i.e., φ_{m,k}(x) = φ_k(x).

Standard triangular and quadrilateral elements are shown in Fig. 6.15. The shape functions for the standard triangular element are:

φ_1^t(x, y) = x,  φ_2^t(x, y) = y,  φ_3^t(x, y) = 1 − x − y.  (6.6.2)

The shape functions for the standard quadrilateral element are:

φ_1^q(x, y) = (1 + x)(1 − y)/4,  φ_2^q(x, y) = (1 + x)(1 + y)/4,
φ_3^q(x, y) = (1 − x)(1 + y)/4,  φ_4^q(x, y) = (1 − x)(1 − y)/4.  (6.6.3)

We see that the shape functions for these two cases are affine and bilinear functions, respectively. The reader is referred to [41] for the shape functions of arbitrary triangular elements. The coefficients of these functions depend on the node positions.
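To make the interpolation concrete, the sketch below (ours) evaluates the bilinear shape functions (6.6.3) on the standard square and uses them to interpolate four nodal MVs at a point inside the element:

```python
import numpy as np

def quad_shape_functions(x, y):
    """Bilinear shape functions of the standard square [-1, 1]^2,
    Eq. (6.6.3); they are nonnegative and sum to one."""
    return np.array([(1 + x) * (1 - y) / 4,
                     (1 + x) * (1 + y) / 4,
                     (1 - x) * (1 + y) / 4,
                     (1 - x) * (1 - y) / 4])

def interpolate_mv(x, y, nodal_mvs):
    """MV at local coordinates (x, y) as the shape-weighted combination
    of the four nodal MVs, cf. Eqs. (6.5.3) and (6.6.1)."""
    phi = quad_shape_functions(x, y)
    return phi @ np.asarray(nodal_mvs)       # (4,) @ (4, 2) -> (2,)

nodal_mvs = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
print(quad_shape_functions(0.0, 0.0))        # [0.25 0.25 0.25 0.25]
print(interpolate_mv(0.0, 0.0, nodal_mvs))   # element-center MV: [0.5 0.5]
```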
Note that the representation of the motion within each element in Eq. (6.6.1) is the same as the node-based motion representation introduced in Eq. (6.5.1), except that the nodes and elements are denoted using global indices. This is necessary because the nodal MVs are not independent from element to element. It is important not to confuse the mesh-based model with the node-based model introduced in the previous section. There, although several adjacent blocks may share the same node, the nodal MVs are determined independently in each block. Going back to Fig. 6.14(b), in the mesh-based model, node n is assigned a single MV, which affects the interpolated motion functions in the four quadrilateral elements attached to this node. In the node-based model, node n can have four different MVs, depending on the block in which it is considered.
Figure 6.16. Mapping from a master element B̃ to the two corresponding elements in the anchor and tracked frames, B_{1,m} and B_{2,m}.
In general, the error function in Eq. (6.6.4) is difficult to calculate because of the irregular shape of B_{1,m}. To simplify the calculation, we can think of B_{t,m}, t = 1, 2, as being deformed from a master element with a regular shape. In general, the master elements for different elements could differ. Here, we only consider the case when all the elements have the same topology, so that they can be mapped from the same master element, denoted by B̃. Fig. 6.16 illustrates such a mapping.
Let φ̃_k(u) represent the shape function associated with the k-th node in B̃; the mapping functions from B̃ to B_{t,m} can be represented as

w̃_{t,m}(u) = Σ_{k∈K} φ̃_k(u) x_{t,n(m,k)},  u ∈ B̃,  t = 1, 2.  (6.6.5)

Then the error in Eq. (6.6.4) can be calculated over the master element as

E(d_n, n ∈ N) = Σ_{m∈M} Σ_{u∈B̃} |ẽ_m(u)|^p J_m(u),  (6.6.6)

where

ẽ_m(u) = ψ_2(w̃_{2,m}(u)) − ψ_1(w̃_{1,m}(u))  (6.6.7)

represents the error between the two image frames at the points that are both mapped from u in the master element (cf. Fig. 6.16). The function J_m(u) = |det(∂w̃_{1,m}(u)/∂u)| is the Jacobian of the mapping from B̃ to B_{1,m}.⁵
where M_n includes the indices of the elements that are attached to node n, and k(m, n) specifies the local index of node n in the m-th adjacent element. Figure 6.17 illustrates the neighboring elements and shape functions attached to a node n in the quadrilateral mesh case.

It can be seen that the gradient with respect to one node depends only on the errors in the several elements attached to it. Ideally, in each iteration of a gradient-based search algorithm, to calculate the above gradient function associated with any node, one should assume that the other nodes are fixed at the positions obtained in the previous iteration.

⁵ Strictly speaking, the use of the Jacobian is correct only when the error is defined as an integral over B̃. Here we assume that the sampling over B̃ is sufficiently dense when using the sum to approximate the integration.
Figure 6.17. Neighborhood structure in a quadrilateral mesh: for a given node n, there are four elements attached to it, each with one shape function connected to this node.
All the nodes should be updated at once at the end of the iteration, before going to the next iteration. In reality, to speed up the process, one can update one node at a time, while fixing its surrounding nodes. Of course, this sub-optimal approach could lead to divergence, or to convergence to a local minimum that is worse than the one obtained by updating all the nodes simultaneously. Instead of updating the nodes in the usual raster order, to improve the accuracy and the convergence rate, one can order the nodes so that those whose motion vectors can be estimated more accurately are updated first. Because of the uncertainty of motion estimation in smooth regions, it may be better to first update the nodes with large edge magnitude and small motion compensation error. This is known as highest confidence first [7], and this approach was taken in [2]. Another possibility is to divide all the nodes into several groups such that the nodes in the same group do not share elements, and are therefore independent in their impact on the error function. Sequential update of the nodes in the same group is then equivalent to a simultaneous update of these nodes. This is the approach adopted in [42]. Either the first order gradient descent method or the second order Newton-Raphson type of update algorithm can be used. The second order method converges much faster, but it is also more liable to converge to bad local minima.

The newly updated nodal positions based on the gradient function can lead to overly deformed elements (including flipped-over and obtuse elements). To prevent this from happening, one should limit the search range into which an updated nodal position can fall. If the updated position goes beyond this region, it should be projected back to the nearest point in the defined search range. Figure 6.18 shows the legitimate search range for the case of a quadrilateral mesh.
The above discussion applies not only to gradient-based update algorithms, but also to exhaustive search algorithms. In this case, one can update one node at a time, by searching for the nodal position that minimizes the prediction errors
Figure 6.18. The search range for node n given the positions of the other nodes: the diamond region (dashed line) is the theoretical limit; the inner diamond region (shaded) is used in practice. When x_n falls outside the diamond region (a), at least one element attached to it becomes obtuse. By projecting x_n onto the inner diamond (b), all four elements avoid excessive deformation.
in the elements attached to it, over the search range illustrated in Fig. 6.18. For each candidate position, one calculates the error accumulated over the elements attached to this node, i.e., replacing m ∈ M by m ∈ M_n in Eq. (6.6.4). The optimal position is the one with the minimal error. Here again, the search order is very important.
Example 6.2: Figure 6.8 shows the motion estimation result obtained by an exhaustive search approach for backward motion estimation, using a rectangular mesh at each new frame [43]. Figure 6.8(e) is the deformed mesh overlaid on top of the tracked frame, and Figure 6.8(f) is the predicted image for the anchor frame. Note that each deformed quadrangle in Fig. 6.8(e) corresponds to a square block in the anchor frame. Thus a narrow quadrangle on the right side of the face indicates that it is expanded in the anchor frame. We can see that the mesh is deformed smoothly, which corresponds to a smooth motion field. The predicted image does not suffer from the blocking artifact associated with the EBMA (Fig. 6.8(d) vs. Fig. 6.8(f)) and appears to be a more successful prediction of the original. A careful comparison between the predicted image (Fig. 6.8(f)) and the actual image (Fig. 6.8(b)), however, reveals that the eye closing and mouth movement are not accurately reproduced, and there are some artificial warping artifacts near the jaw and neck. In fact, the PSNR of the predicted image is lower than that obtained by the EBMA.
Until now, we have assumed that a single mesh is generated (or propagated from the previous frame, in the forward tracking case) for the entire current frame, and that every node in this mesh is tracked to one and only one node in the tracked frame, so that the nodes in the tracked frame still form a mesh that covers the entire frame. Newly appearing or disappearing objects in a scene require additional handling.
The motion field can be represented by a projective mapping only if the object surface is planar. When the object surface is spatially varying, the mapping function at any point also depends on the surface depth at that point, and cannot be represented by a global model.
model to the entire frame. These problems can be overcome by a robust estimation method [35], provided the global motion is dominant over the other local motions, in the sense that the pixels that experience the same global motion, and only the global motion, occupy a significantly larger portion of the underlying image domain than the pixels that do not.

The basic idea of robust estimation is to consider the pixels that are governed by the global motion as inliers, and the remaining pixels as outliers. Initially, one assumes that all the pixels undergo the same global motion, and estimates the motion parameters by minimizing the prediction or fitting error over all the pixels. This yields an initial set of motion parameters. With this initial solution, one can then calculate the prediction or fitting error at each pixel. The pixels where the error exceeds a certain threshold are classified as outliers and eliminated from the next iteration. The above process is then repeated on the remaining inlier pixels, and iterates until no outlier pixels remain. This approach is called a hard-threshold robust estimator.

Rather than simply labeling a pixel as either inlier or outlier at the end of each iteration, one can also assign a different weight to each pixel, with a large weight for a pixel with a small error, and vice versa. At the next minimization or fitting iteration, a weighted error measure is used, so that the pixels with larger errors in the previous iteration have less impact than those with smaller errors. This approach is known as a soft-threshold robust estimator.
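A sketch of the hard-threshold robust estimator in its simplest setting (ours): the global model is a pure translation fitted to pre-estimated per-pixel MVs, so each least squares fit is just a mean, and the threshold value is an arbitrary choice:

```python
import numpy as np

def hard_threshold_robust_mean(mvs, thresh=2.0, max_iters=20):
    """Hard-threshold robust estimation of a dominant global translation
    from pre-estimated per-pixel MVs (shape (K, 2)).

    Fit on the current inliers, discard MVs whose fitting error exceeds
    the threshold, and repeat until the inlier set stops changing."""
    inliers = np.ones(len(mvs), dtype=bool)
    for _ in range(max_iters):
        a = mvs[inliers].mean(axis=0)              # LS fit: the mean MV
        err = np.linalg.norm(mvs - a, axis=1)      # fitting error per MV
        new_inliers = err <= thresh
        if np.array_equal(new_inliers, inliers):
            break                                  # no more outliers removed
        inliers = new_inliers
    return a, inliers

# 80% of the MVs follow the global motion, 20% a local object motion.
rng = np.random.default_rng(3)
global_mvs = np.array([2.0, 1.0]) + 0.1 * rng.standard_normal((80, 2))
local_mvs = np.array([-5.0, 4.0]) + 0.1 * rng.standard_normal((20, 2))
a, inl = hard_threshold_robust_mean(np.vstack([global_mvs, local_mvs]))
print(a, inl.sum())   # close to [2, 1], ~80 inliers
```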
6.7.2 Direct Estimation

In either the hard- or the soft-threshold robust estimator, each iteration involves the minimization of an error function. Here we derive the form of this function when the model parameters are obtained directly by minimizing the prediction error. We only consider the soft-threshold case, as the hard-threshold case can be considered a special case in which the weights are either one or zero. Let the mapping function from the anchor frame to the tracked frame be denoted by w(x; a), where a is the vector that contains all the global motion parameters. Following Eq. (6.2.1), the prediction error can be written as

E_DFD = Σ_{x∈Λ} w(x) |ψ_2(w(x; a)) − ψ_1(x)|^p,  (6.7.1)

where w(x) are the weighting coefficients for the pixels x. Within each iteration of the robust estimation process, the parameter vector a is estimated by minimizing the above error, using either a gradient-based method or exhaustive search. The weighting factor w(x) in a new iteration is adjusted based on the DFD at x calculated with the motion parameters estimated in the previous iteration.
6.7.3 Indirect Estimation

In this case, we assume that the motion vectors d(x) have been estimated at a set of sufficiently dense points x ∈ Λ_0 ⊂ Λ, where Λ represents the set of all pixels.
This can be accomplished, for example, using either the block-based or the mesh-based approaches described before. One can also choose to estimate the motion vectors at only selected feature points, where the estimation accuracy is high. The task here is to determine a so that the model d(x; a) approximates the pre-estimated motion vectors d(x), x ∈ Λ_0, well. This can be accomplished by minimizing the following fitting error:

E_fitting = Σ_{x∈Λ_0} w(x) |d(x; a) − d(x)|^p.  (6.7.2)
As an example, consider the affine motion model given in Eq. (5.5.16). The motion parameter vector is a = [a_0, a_1, a_2, b_0, b_1, b_2]ᵀ, and the matrix [A(x)] is

[A(x)] = [ 1  x  y  0  0  0 ]
         [ 0  0  0  1  x  y ].

In fact, the parameters for the x and y dimensions are not coupled, and they can be estimated separately, which reduces the matrix sizes involved. For example, to estimate the x-dimension parameters a_x = [a_0, a_1, a_2]ᵀ, the associated matrix is [A_x(x)] = [1, x, y], and the solution is:

a_x = ( Σ_{x∈Λ_0} w(x) [A_x(x)]ᵀ [A_x(x)] )⁻¹ ( Σ_{x∈Λ_0} w(x) [A_x(x)]ᵀ d_x(x) ).  (6.7.4)
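Eq. (6.7.4) in code (our sketch; the points, weights, and the true affine parameters of the synthetic test are arbitrary, and the y-dimension parameters b_0, b_1, b_2 are obtained the same way from d_y):

```python
import numpy as np

def fit_affine_x(points, dx, w=None):
    """Weighted LS fit of the x-dimension affine parameters (a0, a1, a2),
    Eq. (6.7.4): dx(x, y) ~ a0 + a1*x + a2*y."""
    points = np.asarray(points, dtype=float)
    w = np.ones(len(points)) if w is None else np.asarray(w, dtype=float)
    A = np.column_stack([np.ones(len(points)), points[:, 0], points[:, 1]])
    Aw = A * w[:, None]                   # weight each row's contribution
    return np.linalg.solve(Aw.T @ A, Aw.T @ dx)

# Synthetic check: dx = 0.5 + 0.01 x - 0.02 y is recovered exactly.
rng = np.random.default_rng(4)
pts = rng.uniform(0, 100, size=(50, 2))
dx = 0.5 + 0.01 * pts[:, 0] - 0.02 * pts[:, 1]
print(fit_affine_x(pts, dx))   # -> [0.5, 0.01, -0.02]
```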
connected subset, isolated pixels may be merged into their surrounding regions, and finally region boundaries can be smoothed using morphological operators.

When the motion model for each region is not a simple translation, motion-based clustering is not as straightforward. This is because one cannot use the similarity between motion vectors as the criterion for clustering. One way is to find a set of motion parameters for each pixel by fitting the motion vectors in its neighborhood to a specified model. Then one can employ the clustering method described previously, by replacing the raw motion vector with the motion parameter vector. If the original motion field is given in the block-based representation using a higher order model, then one can cluster blocks with similar motion parameters into the same region. Similarly, with the mesh-based motion representation, one can derive a set of motion parameters for each element based on its nodal displacements, and then cluster the elements with similar parameters into the same region. This is the parallel approach described in [44].
Layered Approach. Very often, the motion field in a scene can be decomposed into layers, with the first layer representing the most dominant motion, the second layer the less dominant one, and so on. Here, the dominance of a motion is determined by the area of the region undergoing the corresponding motion. The most dominant motion is often a reflection of the camera motion, which affects the entire imaged domain. For example, in a video clip of a tennis match, the background will be the first layer, which usually undergoes a coherent global camera motion; the player forms the second layer (which usually contains several sub-object level motions corresponding to the movements of different parts of the body), the racket the third, and the ball the fourth layer. To extract the motion parameters of the different layers, one can use the robust estimator method described in Sec. 6.7.1 recursively. First, we try to model the motion field of the entire frame by a single set of parameters, and continuously eliminate outlier pixels from the remaining inlier group, until all the pixels within the inlier group can be modeled well. This yields the first dominant region (corresponding to the inlier region) and its associated motion. The same approach can then be applied to the remaining pixels (the outlier region) to identify the next dominant region and its motion. This process continues until no more outlier pixels remain. As before, postprocessing may be invoked at the end of the iterations to improve the spatial connectivity of the resulting regions. This is the sequential approach described in [44].

For the above scheme to work well, the inlier region must be larger than the outlier region at every iteration. This means that the largest region must be greater than the combined area of all the other regions, the second largest region must be greater than the combined area of the remaining regions, and so on. This condition is satisfied in most video scenes, which often contain a stationary background that covers a large portion of the underlying image, and different moving objects of varying sizes.
6.8.2 Joint Region Segmentation and Motion Estimation

Theoretically, one can formulate the joint estimation of the region segmentation map and the motion parameters of each region as an optimization problem. The function to be minimized could be a combination of the motion-compensated prediction error and a region smoothness measure. The solution of the optimization problem, however, is difficult because of the very high dimension of the parameter space and the complicated interdependence between these parameters. In practice, a sub-optimal approach is often taken, which alternates between the estimation of the segmentation and the motion parameters. Based on an initial segmentation, the motion of each region is first estimated. In the next iteration, the segmentation is refined, e.g., by eliminating outlier pixels in each region where the prediction errors are large, and by merging pixels sharing similar motion models. The motion parameters for each refined region are then re-estimated. This process continues until no more changes occur in the segmentation map.

An alternative approach is to estimate the regions and their associated motions in a layered manner, similar to the layered approach described previously. There, we assumed that the motion vector at every point is already known, and the identification of the region with the most dominant motion (i.e., the inliers) is accomplished by examining the fitting error induced by representing the individual MVs using a set of motion parameters. This is essentially the indirect robust estimator presented in Sec. 6.7.3. In the joint region segmentation and motion estimation approach, to extract the next dominant region and its associated motion from the remaining pixels, we can use the direct robust estimator. That is, we directly estimate the motion parameters by minimizing the prediction errors at these pixels. Once the parameters are determined, we determine whether a pixel belongs to the inliers by examining the prediction error at this pixel. We then re-estimate the motion parameters by minimizing the prediction errors at the inlier pixels only. This approach has been taken by Hsu [18].
Figure 6.19. Illustration of hierarchical block matching on the image pyramids Ψ_{1,l} and Ψ_{2,l}, l = 1, 2, 3: the MV d_{1,0,0} found at level 1 is interpolated to an initial estimate d̃_{2,0,1} and corrected by q_{2,0,1} at level 2; the result is in turn interpolated to d̃_{3,1,2} and corrected by q_{3,1,2} at level 3.
In this section, we first describe the multi-resolution approach to motion estimation in a general setting, which is applicable to any motion model. We then focus on the block-translation model and describe a hierarchical block matching algorithm.
is minimized. The new motion field obtained after this step is d_l(x) = q_l(x) + d̃_l(x). Upon completion of the successive refinements, the total motion at the finest resolution is

d(x) = q_L(x) + U(q_{L−1}(x) + U(q_{L−2}(x) + ··· + U(q_1(x) + d_0(x)) ··· )).

The initial condition for the above procedure is d_0(x) = 0. One can either directly specify the total motion d(x), or the motion updates at all the levels, q_l(x), l = 1, 2, ..., L. The latter represents the motion in a layered structure, which is desirable in applications requiring progressive retrieval of the motion field.
The benefits of the multi-resolution approach are twofold. First, the minimization problem at a coarser resolution is less ill-posed than at a finer resolution, and therefore, the solution obtained at a coarser level is more likely to be close to the true solution at that resolution. The interpolation of this solution to the next resolution level provides a good initial solution that is close to the true solution at that level. By repeating this step successively from the coarsest to the finest resolution level, the solution obtained at the finest resolution is more likely to be close to the true solution (the global minimum). Second, the estimation at each resolution level can be confined to a significantly smaller search range than the true motion range at the finest resolution, so that the total number of searches to be conducted is smaller than the number of searches required at the finest resolution directly. The actual number of searches will depend on the search ranges set at the different levels.

The use of multi-resolution representations for image processing was first introduced by Burt and Adelson [6]. Application to motion estimation depends on the motion model used. In the above presentation, we have assumed that the motion vectors at all pixels are to be estimated. The algorithm can be easily adapted to estimate block-based, mesh-based, global, or object-level motion parameters. Because the block-wise translational motion model is the most popular for practical applications, we consider this special case in more detail in the following.
6.9.2 Hierarchical Block Matching Algorithm (HBMA)

As indicated earlier in Sec. 6.4.1, using an exhaustive search scheme to derive block MVs requires an extremely large number of computations. In addition, the estimated block MVs often lead to a chaotic motion field. In this section, we introduce the hierarchical block matching algorithm (HBMA), which is a special case of the multi-resolution approach presented before. Here, the anchor and tracked frames are each represented by a pyramid, and the EBMA or one of its fast variants is employed to estimate the MVs of the blocks at each level of the pyramid. Fig. 6.19 illustrates the process when the spatial resolution is reduced by half both horizontally and vertically at each increasing level of the pyramid. Here, we assume that the same block size is used at the different levels, so that the number of blocks is reduced by half in each dimension as well. Let the MV for block (m, n) at level l be denoted by d_{l,m,n}. Starting from level 1, we first find the MVs for all blocks in this level, d_{1,m,n}. At each new level l > 1, for each block, its initial MV d̃_{l,m,n} is interpolated from the corresponding block in level l − 1 by

d̃_{l,m,n} = U(d_{l−1,⌊m/2⌋,⌊n/2⌋}) = 2 d_{l−1,⌊m/2⌋,⌊n/2⌋}.  (6.9.1)

Then a correction vector q_{l,m,n} is searched for, yielding the final estimated MV

d_{l,m,n} = d̃_{l,m,n} + q_{l,m,n}.  (6.9.2)
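A sketch of the complete HBMA (ours): the pyramids are built by 2×2 averaging, the same block size and search range are used at every level, and each level refines the doubled MVs from the level below, following Eqs. (6.9.1) and (6.9.2):

```python
import numpy as np

def downsample2(psi):
    """Half-resolution image by 2x2 averaging."""
    return (psi[0::2, 0::2] + psi[0::2, 1::2]
            + psi[1::2, 0::2] + psi[1::2, 1::2]) / 4

def block_search(psi1, psi2, y0, x0, N, R, init):
    """Best integer MV for one block within init +/- R (MAD criterion)."""
    H, W = psi1.shape
    block = psi1[y0:y0 + N, x0:x0 + N]
    best, best_mv = np.inf, init
    for dy in range(init[1] - R, init[1] + R + 1):
        for dx in range(init[0] - R, init[0] + R + 1):
            ys, xs = y0 + dy, x0 + dx
            if 0 <= ys <= H - N and 0 <= xs <= W - N:
                mad = np.mean(np.abs(psi2[ys:ys + N, xs:xs + N] - block))
                if mad < best:
                    best, best_mv = mad, (dx, dy)
    return best_mv

def hbma(psi1, psi2, N=4, R=2, levels=3):
    """Hierarchical block matching, Eqs. (6.9.1)-(6.9.2): estimate at
    the coarsest level, then double and refine the MVs level by level."""
    pyr1, pyr2 = [psi1], [psi2]
    for _ in range(levels - 1):
        pyr1.insert(0, downsample2(pyr1[0]))
        pyr2.insert(0, downsample2(pyr2[0]))
    mvs = None
    for psi1_l, psi2_l in zip(pyr1, pyr2):
        H, W = psi1_l.shape
        new = np.zeros((H // N, W // N, 2), dtype=int)
        for bi in range(H // N):
            for bj in range(W // N):
                init = (0, 0) if mvs is None else tuple(2 * mvs[bi // 2, bj // 2])
                new[bi, bj] = block_search(psi1_l, psi2_l,
                                           bi * N, bj * N, N, R, init)
        mvs = new
    return mvs

rng = np.random.default_rng(5)
psi1 = np.kron(rng.random((16, 16)), np.ones((4, 4)))  # 64x64, smooth blocks
psi2 = np.roll(psi1, shift=(6, 7), axis=(0, 1))        # motion (dx, dy) = (7, 6)
print(hbma(psi1, psi2)[2, 2])                          # -> [7 6]
```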
Example 6.3: In Fig. 6.20, we show two video frames of size 32 × 32, in which a gray block in the anchor frame moved by a displacement of (13, 11). We show how to use a three-level HBMA to estimate the block motion field. The block size used at each level is 4 × 4, and the search stepsize is 1 pixel. Starting from level 1, for block (0,0), the MV is found to be d_{1,0,0} = d_1 = (3, 3). Going to level 2, block (0,1) is initially assigned the MV d̃_{2,0,1} = U(d_{1,0,0}) = 2d_1 = (6, 6). Starting with this initial MV, the correction vector is found to be q_2 = (1, −1), leading to the final estimated MV d_{2,0,1} = d_2 = (7, 5). Finally, at level 3, block (1,2) is initially assigned the MV d̃_{3,1,2} = U(d_{2,0,1}) = 2d_2 = (14, 10). With a correction vector of q_3 = (−1, 1), the final estimated MV is d_{3,1,2} = d_3 = (13, 11).

Note that using a block width N at level l corresponds to a block width of 2^{L−l} N at the full resolution. The same scaling applies to the search range and the stepsize. Therefore, by using the same block size, search range, and stepsize at the different levels, we actually use a larger block size, a larger search range, and a larger stepsize at the beginning of the search, and then gradually reduce (by half) these quantities in later steps.
The number of operations involved in HBMA depends on the search range at each level. If the desired search range is $R$ at the finest resolution, then with an $L$-level pyramid, one can set the search range to $R/2^{L-1}$ at the first level. For the remaining levels, because the initial MV interpolated from the previous level is usually quite close to the true MV, the search range for the correction vector does not need to be very large. For simplicity, however, we assume every level uses a search range of $R/2^{L-1}$. If the image size is $M \times M$ and the block size is $N \times N$ at every level, the number of blocks at the $l$-th level is $(M/(2^{L-l}N))^2$, and the number of searches is $(M/(2^{L-l}N))^2 (2R/2^{L-1}+1)^2$. Because the number of operations required for each search is $N^2$, the total number of operations is
\[
\sum_{l=1}^{L} \left(\frac{M}{2^{L-l}}\right)^2 \left(\frac{2R}{2^{L-1}}+1\right)^2
= \frac{4}{3}\left(1 - 4^{-L}\right) M^2 \left(\frac{2R}{2^{L-1}}+1\right)^2
\approx \frac{1}{3}\, 4^{-(L-2)}\, 4 M^2 R^2.
\]
Figure 6.20. An example of using 3-level HBMA for block motion estimation. See
Example 6.3.
Recall that the operation count for EBMA is $M^2(2R+1)^2 \approx 4M^2R^2$ (cf. Sec. 6.4.1). Therefore, the hierarchical scheme using the above parameter selection reduces the computation by a factor of $3 \cdot 4^{L-2}$. Typically, the number of levels $L$ is 2 or 3.
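As a quick sanity check on this factor, the short C program below (our own illustration; the values M = 512 and R = 16 are arbitrary choices) evaluates the exact operation counts on both sides. The exact ratio falls somewhat below $3 \cdot 4^{L-2}$ as $L$ grows, because the +1 term in the candidate count is then no longer negligible.

#include <stdio.h>
#include <math.h>

/* Exact HBMA count: sum over levels of (M/2^(L-l))^2 * (2R/2^(L-1)+1)^2 */
static double hbma_ops(double M, double R, int L)
{
    double r = R / pow(2.0, L - 1), total = 0.0;
    for (int l = 1; l <= L; l++) {
        double Ml = M / pow(2.0, L - l);   /* image width at level l */
        total += Ml * Ml * (2.0 * r + 1.0) * (2.0 * r + 1.0);
    }
    return total;
}

int main(void)
{
    double M = 512.0, R = 16.0;
    double ebma = M * M * (2.0 * R + 1.0) * (2.0 * R + 1.0);
    for (int L = 2; L <= 4; L++)
        printf("L=%d: exact speedup %.1f  vs  3*4^(L-2) = %.0f\n",
               L, ebma / hbma_ops(M, R, L), 3.0 * pow(4.0, L - 2));
    return 0;   /* prints ~3.0 vs 3, ~10.2 vs 12, ~32.8 vs 48 */
}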
Example 6.4: Figure 6.21 shows the estimation results obtained with the HBMA approach for the same pair of video frames given in Fig. 6.8. For this example, a three-level pyramid is used. The search range at each level is set to 4, so that the equivalent search range at the original resolution is R = 16. Integer-pel accuracy search is used at all levels. The final integer-accuracy solution is further refined to half-pel accuracy by a half-pel stepsize search within a search range of one pixel. Comparing the result at the final level with those shown in Figs. 6.8(c) and 6.8(d), we can see that the multi-resolution approach indeed yields a smoother motion field than EBMA. Visual observation also reveals that this motion field represents more truthfully the motion between the two image frames in Figs. 6.8(a) and 6.8(b). This is true in spite of the fact that EBMA yields a higher PSNR. In terms of computational complexity, the half-pel accuracy EBMA algorithm used for Figs. 6.8(c-d) requires 4.3E+8 operations, whereas the three-level algorithm here uses only 1.1E+7 operations, if we neglect the final refinement step using the half-pel search.
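These two figures can be reproduced by a short calculation. The C fragment below is our own check; it assumes (an assumption on our part, since the frame size of Fig. 6.8 is not restated here) a 352 x 288 luminance frame, for which both quoted counts come out as stated.

#include <stdio.h>

int main(void)
{
    double W = 352.0, H = 288.0, R = 16.0;
    /* Half-pel EBMA: (4R+1)^2 candidates per block, N^2 ops per candidate,
       (W/N)(H/N) blocks, so the block size cancels. */
    double ebma = W * H * (4.0 * R + 1.0) * (4.0 * R + 1.0);
    /* 3-level HBMA, search range 4 with integer-pel accuracy at every level. */
    double hbma = 0.0;
    for (int l = 1; l <= 3; l++) {
        double scale = 1 << (3 - l);              /* 4, 2, 1 */
        hbma += (W / scale) * (H / scale) * (2.0 * 4.0 + 1.0) * (2.0 * 4.0 + 1.0);
    }
    printf("EBMA (half-pel): %.2g ops\n", ebma);  /* ~4.3e+08 */
    printf("HBMA (3-level):  %.2g ops\n", hbma);  /* ~1.1e+07 */
    return 0;
}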
Figure 6.21. Example motion estimation results by HBMA for the two images shown in Fig. 6.8: (a-b) the motion field and predicted image at level 1; (c-d) the motion field and predicted image at level 2; (e-f) the motion field and predicted image at the final level (PSNR = 29.32). A three-level HBMA algorithm is used. The block size is 16x16 at all levels. The search range is 4 at all levels with integer-pel accuracy. The result at the final level is further refined by a half-pel accuracy search within a search range of one pixel.
There are many variants in the implementation of HBMA. Bierling [5] was the first to apply this idea to the block-based motion model. A special case of hierarchical BMA is known as variable-size or quad-tree BMA, which starts with a larger block size and then repeatedly divides a block into four if the matching error for that block is still larger than a threshold. In this case, all the processing is done at the original image resolution.
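As a sketch of this variant (our own illustration, reusing the sad and refine routines from the HBMA sketch above; the threshold and minimum block size are free parameters):

/* Quad-tree BMA sketch: match an N x N block at the original resolution;
   if the best SAD still exceeds the threshold and the block can be split,
   recurse on its four quadrants. */
void quadtree_bma(const unsigned char *anc, const unsigned char *trk,
                  int w, int h, int x, int y, int N, int R,
                  long threshold, int min_N)
{
    int dx, dy;
    refine(anc, trk, w, h, x, y, N, R, 0, 0, &dx, &dy);  /* full search */
    long err = sad(anc, trk, w, h, x, y, dx, dy, N);
    if (err > threshold && N / 2 >= min_N) {
        int half = N / 2;                  /* divide into four quadrants */
        quadtree_bma(anc, trk, w, h, x,        y,        half, R, threshold, min_N);
        quadtree_bma(anc, trk, w, h, x + half, y,        half, R, threshold, min_N);
        quadtree_bma(anc, trk, w, h, x,        y + half, half, R, threshold, min_N);
        quadtree_bma(anc, trk, w, h, x + half, y + half, half, R, threshold, min_N);
    } else {
        /* record (x, y, N, dx, dy) as one leaf of the quad-tree;
           storage is omitted in this sketch */
    }
}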
6.10 Summary
Relation between Image Intensity and Motion
Almost all motion estimation algorithms are based on the constant intensity assumption (Eq. (6.1.1) or Eq. (5.2.11)), or on the optical flow equation (Eq. (6.1.3)) derived from this assumption. This enables us to estimate motion by identifying pixels with similar intensity, subject to some motion model. Note that this assumption is valid only when the illumination source is ambient and temporally invariant, and the object surface is diffusely reflecting (Sec. 5.2).
When the motion direction is orthogonal to the image intensity gradient, or when the image gradient is zero, motion does not induce changes in image intensity. This is the inherent limit of intensity-based motion estimation methods.
Key Components in Motion Estimation
Motion representation: This depends on the partition used to divide a frame (pixel-based, block-based, mesh-based, region-based, global), the motion model used for each region of the partition (block, mesh element, object region, or entire frame), and the constraint between motions in adjacent regions.
Motion estimation criterion: We presented three criteria for estimating the motion parameters over each region: i) minimizing the DFD (when the motion is small, this is equivalent to the method based on the optical flow equation); ii) making the resulting motion field as smooth as possible across regions, while minimizing the DFD; and iii) maximizing the a posteriori probability of the motion field given the observed frames. We showed that iii) essentially requires i) and ii). Instead of minimizing the DFD, one can also detect peaks in the PCF when the motion in a region is a pure translation.
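As an illustration of criteria i) and ii), the C sketch below (our own, not from the text) evaluates a combined objective over a block-based motion field: the DFD measured by absolute differences, plus a first-order smoothness penalty between horizontally and vertically adjacent block MVs. The weight lambda and the 4-neighbor penalty form are illustrative choices; criterion i) alone corresponds to lambda = 0.

#include <stdlib.h>

/* Evaluate E = sum |DFD| + lambda * sum |MV differences| for a given
   block motion field (dx,dy), one MV per N x N block. Pixels displaced
   outside the tracked frame are simply skipped in this sketch. */
double criterion(const unsigned char *anc, const unsigned char *trk,
                 int w, int h, int N, const int *dx, const int *dy,
                 double lambda)
{
    int bw = w / N, bh = h / N;
    double E = 0.0;
    for (int n = 0; n < bh; n++)
        for (int m = 0; m < bw; m++) {
            int b = n * bw + m;
            /* criterion i): DFD term under the block's MV */
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++) {
                    int x = m * N + i, y = n * N + j;
                    int xt = x + dx[b], yt = y + dy[b];
                    if (xt >= 0 && xt < w && yt >= 0 && yt < h)
                        E += abs(anc[y * w + x] - trk[yt * w + xt]);
                }
            /* criterion ii): smoothness w.r.t. right and bottom neighbors */
            if (m + 1 < bw)
                E += lambda * (abs(dx[b] - dx[b + 1]) + abs(dy[b] - dy[b + 1]));
            if (n + 1 < bh)
                E += lambda * (abs(dx[b] - dx[b + bw]) + abs(dy[b] - dy[b + bw]));
        }
    return E;
}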
Optimization methods: For a chosen representation and criterion, the motion estimation problem is usually converted to an optimization (minimization or maximization) problem, which can be solved by exhaustive search or gradient-based search. To speed up the search and avoid being trapped in local minima, a multi-resolution procedure can be used.
Application of Motion Estimation in Video Coding
Motion estimation is a key element in any video coding system. As will be discussed in Sec. 9.3.1, an effective video coding method is to use block-wise temporal prediction, by which a block in the frame to be coded is predicted from its corresponding block in a previously coded frame, and then the prediction error is coded. To minimize the bit rate for coding the prediction error, the appropriate criterion for motion estimation is to minimize the prediction error. The fact that the estimated motion field does not necessarily resemble the actual motion field is not problematic in such applications. Therefore, the block matching algorithms described in Sec. 6.4 (EBMA and its fast variants, including HBMA) offer simple and effective solutions. Instead of
using the MV estimated for each block directly for the prediction of that block, one can use a weighted average of the predicted values based on the MVs estimated for its neighboring blocks. This is known as overlapped block motion compensation, which will be discussed in Sec. 9.3.2.
Note that in the above video coding method, the motion vectors also need to be coded, in addition to the prediction error. Therefore, minimizing the prediction error alone is not the best criterion to use. Since a smoother motion field requires fewer bits to code, imposing smoothness on the estimated motion field, if done properly, can help improve the overall coding efficiency. More advanced motion estimation algorithms therefore operate by minimizing the total bit rate used for coding the MVs and the prediction errors. This subject is discussed further in Sec. 9.3.3.
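More concretely, such a rate-constrained search can be organized as sketched below. This is our own illustration of the Lagrangian form J = D + lambda * R; the simple differential MV bit model is a generic stand-in, not any standard's actual code table, and Sec. 9.3.3 treats the actual criterion.

#include <stdlib.h>
#include <math.h>

/* Approximate bits to code MV (dx,dy) differentially against a predictor
   (px,py): longer codes for larger differences, mimicking typical VLCs. */
static double mv_bits(int dx, int dy, int px, int py)
{
    return log2(2.0 * abs(dx - px) + 1.0) + log2(2.0 * abs(dy - py) + 1.0);
}

/* Pick the MV in a +/-R window minimizing J = D + lambda * bits, where
   the distortion D(dx,dy) (e.g., block SAD) is supplied by the caller. */
void rd_search(long (*distortion)(int dx, int dy, void *ctx), void *ctx,
               int R, int px, int py, double lambda, int *bx, int *by)
{
    double best = 1e300;
    *bx = px; *by = py;
    for (int dy = -R; dy <= R; dy++)
        for (int dx = -R; dx <= R; dx++) {
            double J = (double)distortion(dx, dy, ctx)
                     + lambda * mv_bits(dx, dy, px, py);
            if (J < best) { best = J; *bx = dx; *by = dy; }
        }
}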
To overcome the blocking artifacts produced by block-based motion estimation methods, high-order block-based methods (DBMA), mesh-based methods, or a combination of block-based, mesh-based, and/or DBMA approaches can be applied. However, these more complicated schemes usually do not lead to significant gains in coding efficiency.
In more advanced video coding schemes (Chap. 10), global motion estimation is usually applied to the entire frame, prior to block-based motion estimation, to compensate for the effect of camera motion. Moreover, an entire frame is usually segmented into several regions or objects, and the motion parameters for each region or object are estimated using the global motion estimation method discussed here.
6.11 Problems
6.1 Describe the pros and cons of different motion representation methods (pixel-based, block-based, mesh-based, global).
6.2 Describe the pros and cons of the exhaustive search and gradient descent methods. Also compare first-order and second-order gradient descent methods.
6.3 What are the main advantages of the multi-resolution estimation method, compared to an approach using a single resolution? Are there any disadvantages?
6.4 In Sec. 6.3.2, we derived the multipoint neighborhood method using the gradient descent method. Can you find a closed-form solution using the optical flow equation? Under what condition will your solution be valid?
6.5 In Sec. 6.4.1, we described an exhaustive search algorithm for determining block MVs in the block-based motion representation. Can you find a closed-form solution using the optical flow equation? Under what condition will your solution be valid?
6.6 In Eq. (6.2.7), we showed that, if the motion field is a constant, one can use the optical flow equation to set up a least squares problem and obtain a closed-form solution. Suppose that the motion field is not a constant, but can be modeled by a polynomial mapping. Can you find a closed-form solution for the polynomial coefficients?
Hint: any polynomial mapping function can be represented as $\mathbf{d}(\mathbf{x}; \mathbf{a}) = [\mathbf{A}(\mathbf{x})]\,\mathbf{a}$, where $\mathbf{a}$ contains all the polynomial coefficients.
6.7 In Sec. 6.4.5, we said that when there are several patches in a range block in $\psi_1(\mathbf{x})$ that undergo different motions, there will be several peaks in the PCF. Each peak corresponds to the motion of one patch. The location of the peak indicates the MV of the patch, whereas the amplitude of the peak is proportional to the size of the patch. Can you prove this statement, at least qualitatively? You can simplify your derivation by considering the 1-D case only.
6.8 With EBMA, does the computational requirement depend on the block size?
6.9 In Sec. 6.9.2, we derived the number of operations required by HBMA when the search range at every level is $R/2^{L-1}$. What would the number be if one used a search range of 1 pel at every level, except at the first level, where the search range is set to $R/2^{L-1}$? Is this parameter set-up appropriate?
6.10 Consider a CCIR601 format video, with a Y-component frame size of 720x480. Compare the computation required by an EBMA algorithm (integer-pel) with block size 16x16 and that required by a two-level HBMA algorithm. Assume the maximum motion range is 32. You can compare the computation by the operation count, with each operation including one subtraction, one absolute value computation, and one addition. You can make your own assumption about the search range at different levels with HBMA. For simplicity, ignore the computations required for generating the pyramid and assume only integer-pel search.
6.11 Repeat the above for a three-level HBMA algorithm.
6.12 Write a C or Matlab program implementing EBMA with integer-pel accuracy. Use a block size of 16x16. The program should allow the user to choose the search range, so that you can compare the results obtained with different search ranges. Note that the proper search range depends on the extent of the motion in your test images. Apply the program to two adjacent frames of a video sequence. Your program should produce and plot the estimated motion field, the predicted frame, and the prediction error image. It should also calculate the PSNR of the predicted frame compared to the original tracked frame. With Matlab, you can plot the motion field using the function quiver.
6.13 Repeat the above exercise for EBMA with half-pel accuracy. Compare the PSNR of the predicted image obtained using integer-pel accuracy with that using half-pel accuracy estimation. Which method gives you more accurate prediction?
6.12 Bibliography
[1] J. K. Aggarwal and N. Nandhakumar. On the computation of motion from sequences of images - a review. Proceedings of the IEEE, 76:917-935, 1988.
[2] Y. Altunbasak and A. M. Tekalp. Closed-form connectivity-preserving solutions for motion compensation using 2-D meshes. IEEE Trans. Image Process., 6:1255-1269, Sept. 1997.
[3] Y. Altunbasak and A. M. Tekalp. Occlusion-adaptive, content-based mesh design and forward tracking. IEEE Trans. Image Process., 6:1270-1280, Sept. 1997.
[4] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12:43-77, 1994.
[5] M. Bierling. Displacement estimation by hierarchical block matching. In Proc. SPIE: Visual Commun. Image Processing, volume SPIE-1001, pages 942-951, Cambridge, MA, Nov. 1988.
[6] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Trans. Commun., COM-31:532-540, 1983.
[7] P. Chou and C. Brown. The theory and practice of Bayesian image labeling. International Journal of Computer Vision, 4:185-210, 1990.
[8] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
[9] D. J. Fleet and A. D. Jepson. Computation of component image velocity from local phase information. International Journal of Computer Vision, 5:77-104, 1990.
[10] D. J. Fleet. Disparity from local weighted phase-correlation. In IEEE International Conference on Systems, Man, and Cybernetics: Humans, Information and Technology, pages 48-54, 1994.
[11] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., 6:721-741, Nov. 1984.
[12] B. Girod. Motion-compensating prediction with fractional-pel accuracy. IEEE Trans. Commun., 41:604-612, 1993.
[13] B. Girod. Motion-compensating prediction with fractional-pel accuracy. IEEE Trans. Commun., 41(4):604-612, Apr. 1993.
[14] H.-M. Hang, Y.-M. Chou, and S.-C. Cheng. Motion estimation for video coding standards. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 17(2/3):113-136, Nov. 1997.
[15] R. M. Haralick and J. S. Lee. The facet approach to optical flow. In Proc. Image Understanding Workshop, 1993.
[16] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17, 1981.
[17] B. K. P. Horn. Robot Vision. MIT Press, Cambridge, MA, 1986.