
Chapter 6

TWO DIMENSIONAL
MOTION ESTIMATION

Motion estimation is an important part of any video processing system. In this chapter, we are only concerned with the estimation of 2D motion. In Chapter 7, we will discuss estimation of actual 3D motion. As will be seen, 2D motion estimation is often a preprocessing step required for 3D structure and motion estimation. Also, 2D motion estimation itself has a wide range of applications, including video compression, video sampling rate conversion, video filtering, etc. Depending on the intended applications for the resulting 2D motion vectors, motion estimation methods can be very different. For example, for computer vision applications, where the 2D motion vectors are used to deduce 3D structure and motion parameters, a sparse set of 2D motion vectors at critical feature points may be sufficient. The motion vectors must be physically correct for them to be useful. On the other hand, for video compression applications, the estimated motion vectors are used to produce a motion-compensated prediction of a frame to be coded from a previously coded reference frame. The ultimate goal is to minimize the total bits used for coding the motion vectors and the prediction errors. There is a trade-off between the accuracy of the estimated motion and the number of bits used to specify the motion. Sometimes, even when the estimated motion is not an accurate representation of the true physical motion, it can still lead to good temporal prediction and in that regard is considered a good estimate. In this chapter, we focus on the type of motion estimation algorithms targeted at motion compensated processing (prediction, filtering, interpolation, etc.). For additional readings on this topic, the reader is referred to the review papers by Musmann et al. [28] and by Stiller and Konrad [38]. For a good treatment of motion estimation methods for computer vision applications, please see the article by Aggarwal and Nandhakumar [1].
All motion estimation algorithms are based on temporal changes in image intensities (more generally, color). In fact, the observed 2D motions based on intensity changes may not be the same as the actual 2D motions. To be more precise, the velocity of observed or apparent 2D motion vectors is referred to as optical flow. Optical flow can be caused not only by object motions, but also by camera movements or illumination condition changes. In this chapter, we start by defining optical flow. We then derive the optical flow equation, which imposes a constraint between image gradients and flow vectors. This is a fundamental equality that many motion estimation algorithms are based on. Next we present the general methodologies for 2D motion estimation. As will be seen, the motion estimation problem is usually converted into an optimization problem that involves three key components: parameterization of the motion field, formulation of the optimization criterion, and finally, searching for the optimal parameters. Finally, we present motion estimation algorithms developed based on different parameterizations of the motion field and different estimation criteria. Unless specified otherwise, the word "motion" refers to 2D motion in this chapter.

6.1 Optical Flow

6.1.1 2D Motion vs. Optical Flow


The human eye perceives motion by identifying corresponding points at different times. The correspondence is usually determined by assuming that the color or brightness of a point does not change after the motion. It is interesting to note that the observed 2D motion can be different from the actual projected 2D motion under certain circumstances. Figure 6.1 illustrates two special cases. In the first example, a sphere with a uniform flat surface is rotating under a constant ambient light. Because every point on the sphere reflects the same color, the eye cannot observe any change in the color pattern of the imaged sphere and thus considers the sphere as being stationary. In the second example, the sphere is stationary, but is illuminated by a point light source that is rotating around the sphere. The motion of the light source causes the movement of the reflecting light spot on the sphere, which in turn can make the eye believe the sphere is rotating. The observed or apparent 2D motion is referred to as optical flow in the computer vision literature. The above examples reveal that the optical flow may not be the same as the true 2D motion. When only image color information is available, the best one can hope to estimate accurately is the optical flow. However, in the remaining part of this chapter, we will use the term 2D motion or simply motion to describe optical flow. The reader should bear in mind that it may sometimes differ from the true 2D motion.

6.1.2 Optical Flow Equation and Ambiguity in Motion Estimation


Consider a video sequence whose luminance variation is represented by ψ(x, y, t).¹ Suppose an imaged point (x, y) at time t moves to (x + dx, y + dy) at time t + dt.

¹In this book, we only consider motion estimation based on the luminance intensity information, although the same methodology can be applied to the full color information.

Figure 6.1. The optical flow is not always the same as the true motion field. In (a), a sphere is rotating under a constant ambient illumination, but the observed image does not change. In (b), a point light source is rotating around a stationary sphere, causing the highlight point on the sphere to rotate. Adapted from [17, Fig. 12-2].

Under the constant intensity assumption introduced in Sec. 5.2.3 (Eq. (5.2.11)), the images of the same object point at different times have the same luminance value. Therefore,

    ψ(x + dx, y + dy, t + dt) = ψ(x, y, t).   (6.1.1)

Using Taylor's expansion, when dx, dy, dt are small, we have

    ψ(x + dx, y + dy, t + dt) = ψ(x, y, t) + (∂ψ/∂x) dx + (∂ψ/∂y) dy + (∂ψ/∂t) dt.   (6.1.2)

Combining Eqs. (6.1.1) and (6.1.2) yields

    (∂ψ/∂x) dx + (∂ψ/∂y) dy + (∂ψ/∂t) dt = 0.   (6.1.3)

The above equation is written in terms of the motion vector (dx, dy). Dividing both sides by dt yields

    (∂ψ/∂x) vx + (∂ψ/∂y) vy + ∂ψ/∂t = 0   or   ∇ψᵀ v + ∂ψ/∂t = 0,   (6.1.4)

where v = (vx, vy) represents the velocity vector and ∇ψ = [∂ψ/∂x, ∂ψ/∂y]ᵀ is the spatial gradient vector of ψ(x, y, t). In arriving at the above equation, we have assumed that dt is small, so that vx = dx/dt and vy = dy/dt. The above equation is commonly known as the optical flow equation.² The conditions for this relation to hold are the same as those for the constant intensity assumption, and have been discussed previously in Sec. 5.2.3.
²Another way to derive the optical flow equation is by representing the constant intensity assumption as dψ(x, y, t)/dt = 0. Expanding dψ(x, y, t)/dt in terms of the partials will lead to the same equation.
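To make the constraint concrete, the sketch below evaluates the optical flow equation residual and the normal flow for a pair of sampled frames. This is an illustration rather than part of the original development: the function names, the use of NumPy's gradient operator, and the unit-interval forward difference for ∂ψ/∂t are all assumptions of this sketch.

import numpy as np

def flow_constraint_residual(psi1, psi2, vx, vy):
    # Residual of Eq. (6.1.4): (dpsi/dx) vx + (dpsi/dy) vy + dpsi/dt.
    gy, gx = np.gradient(psi1.astype(float))      # spatial gradient of psi
    gt = psi2.astype(float) - psi1.astype(float)  # dpsi/dt over a unit step
    return gx * vx + gy * vy + gt

def normal_flow(psi1, psi2, eps=1e-6):
    # Normal component v_n = -(dpsi/dt)/||grad psi|| from Eq. (6.1.6);
    # it is left at zero in flat regions, where the flow is indeterminate.
    gy, gx = np.gradient(psi1.astype(float))
    gt = psi2.astype(float) - psi1.astype(float)
    mag = np.hypot(gx, gy)
    safe = np.where(mag > eps, mag, 1.0)
    return np.where(mag > eps, -gt / safe, 0.0)

Note that only the normal component is recoverable pointwise; the tangent component remains free, which is the aperture problem discussed below.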
Figure 6.2. Decomposition of motion v into its normal component vₙeₙ and tangent component vₜeₜ. Given ∇ψ and ∂ψ/∂t, any MV on the tangent line satisfies the optical flow equation.

As shown in Fig. 6.2, the flow vector v at any point x can be decomposed into two orthogonal components as

    v = vₙeₙ + vₜeₜ,   (6.1.5)

where eₙ is the direction vector of the image gradient ∇ψ, to be called the normal direction, and eₜ is orthogonal to eₙ, to be called the tangent direction. The optical flow equation in Eq. (6.1.4) can be written as

    vₙ ‖∇ψ‖ + ∂ψ/∂t = 0,   (6.1.6)

where ‖∇ψ‖ is the magnitude of the gradient vector. Three consequences of Eq. (6.1.4) or (6.1.6) are:

1. At any pixel x, one cannot determine the motion vector v based on ∇ψ and ∂ψ/∂t alone. There is only one equation for two unknowns (vx and vy, or vₙ and vₜ). In fact, the underdetermined component is vₜ. To solve for both unknowns, one needs to impose additional constraints. The most common constraint is that the flow vectors should vary smoothly spatially, so that one can make use of the intensity variation over a small neighborhood surrounding x to estimate the motion at x.

2. Given ∇ψ and ∂ψ/∂t, the projection of the motion vector along the normal direction is fixed, with vₙ = −(∂ψ/∂t)/‖∇ψ‖, whereas the projection onto the tangent direction, vₜ, is undetermined. Any value of vₜ would satisfy the optical flow equation. In Fig. 6.2, this means that any point on the tangent line will satisfy the optical flow equation. This ambiguity in estimating the motion vector is known as the aperture problem. The word "aperture" here refers to the small window over which the constant intensity assumption is applied. The motion can be estimated uniquely only if the aperture contains at least two different gradient directions, as illustrated in Fig. 6.3.

3. In regions with constant brightness, so that ‖∇ψ‖ = 0, the flow vector is indeterminate. This is because no brightness change is perceived when the underlying surface has a flat pattern. The estimation of motion is reliable only in regions with brightness variation, i.e., regions with edges or non-flat textures.

Figure 6.3. The aperture problem in motion estimation: to estimate the motion at x₁ using aperture 1, it is impossible to determine whether the motion is upward or perpendicular to the edge, because there is only one spatial gradient direction in this aperture. On the other hand, the motion at x₂ can be determined accurately, because the image has gradients in two different directions in aperture 2. Adapted from [39, Fig. 5.7].
The above observations are consistent with the relation between spatial and temporal frequencies discussed in Sec. 2.3.2. There, we showed that the temporal frequency of a moving object is zero if the spatial frequency is zero, or if the motion direction is orthogonal to the spatial frequency. When the temporal frequency is zero, no changes can be observed in image patterns, and consequently, motion is indeterminate.

As will be seen in the following sections, the optical flow equation or, equivalently, the constant intensity assumption plays a key role in all motion estimation algorithms.

6.2 General Methodologies


In this chapter, we consider the estimation of motion between two given frames, ψ(x, y, t₁) and ψ(x, y, t₂). Recall from Sec. 5.5.1 that the MV at x between times t₁ and t₂ is defined as the displacement of this point from t₁ to t₂. We will call the frame at time t₁ the anchor frame, and the frame at t₂ the tracked frame. Depending on the intended application, the anchor frame can be either before or after the tracked frame in time. As illustrated in Fig. 6.4, the problem is referred to as forward motion estimation when t₁ < t₂, and as backward motion estimation when t₁ > t₂.
For notational convenience, from now on, we use ψ₁(x) and ψ₂(x) to denote the anchor and tracked frames, respectively.

Figure 6.4. Forward and backward motion estimation. Adapted from [39, Fig. 5.5].

In general, we can represent the motion field as d(x; a), where a = [a₁, a₂, ..., a_L]ᵀ is a vector containing all the motion parameters. Similarly, the mapping function can be denoted by w(x; a) = x + d(x; a). The motion estimation problem is to estimate the motion parameter vector a. Methods that have been developed can be categorized into two groups: feature-based and intensity-based. In the feature-based approach, correspondences between pairs of selected feature points in two video frames are first established. The motion model parameters are then obtained by a least squares fitting of the established correspondences to the chosen motion model. This approach is only applicable to parametric motion models and can be quite effective in, say, determining global motions. The intensity-based approach applies the constant intensity assumption or the optical flow equation at every pixel and requires the estimated motion to satisfy this constraint as closely as possible. This approach is more appropriate when the underlying motion cannot be characterized by a simple model, and an estimate of a pixel-wise or block-wise motion field is desired.

In this chapter, we only consider intensity-based approaches, which are more widely used in applications requiring motion compensated prediction and filtering. In general, the intensity-based motion estimation problem can be converted into an optimization problem, and three key questions need to be answered: i) how to parameterize the underlying motion field? ii) what criterion to use to estimate the parameters? and iii) how to search for the optimal parameters? In this section, we first describe several ways to represent a motion field. Then we introduce different types of estimation criteria. Finally, we present search strategies commonly used for motion estimation. Specific motion estimation schemes using different motion representations and estimation criteria will be introduced in subsequent sections.

6.2.1 Motion Representation


A key problem in motion estimation is how to parameterize the motion field. As shown in Sec. 5.5, the 2D motion field resulting from a camera or object motion can usually be described by a few parameters. Usually, however, there are multiple objects in the imaged scene that move differently, so a global parameterized model is not adequate. The most direct and unconstrained approach is to specify the motion vector at every pixel. This is the so-called pixel-based representation. Such a representation is universally applicable, but it requires the estimation of a large number of unknowns (twice the number of pixels!), and the solution can often be physically incorrect unless a proper physical constraint is imposed during the estimation step. On the other hand, if only the camera is moving, or the imaged scene contains a single moving object with a planar surface, one could use a global motion representation to characterize the entire motion field. In general, for scenes containing multiple moving objects, it is more appropriate to divide an image frame into multiple regions so that the motion within each region can be characterized well by a parameterized model. This is known as region-based motion representation,³ which consists of a region segmentation map and several sets of motion parameters, one for each region. The difficulty with such an approach is that one does not know in advance which pixels have similar motions. Therefore, segmentation and estimation have to be accomplished iteratively, which requires an intensive amount of computation that may not be feasible in practice.
One way to reduce the complexity associated with region-based motion representation is to use a fixed partition of the image domain into many small blocks. As long as each block is small enough, the motion variation within each block can be characterized well by a simple model and the motion parameters for each block can be estimated independently. This brings us to the popular block-based representation. The simplest version models the motion in each block by a constant translation, so that the estimation problem becomes that of finding one MV for each block. This method provides a good compromise between accuracy and complexity, and has found great success in practical video coding systems. One main problem with the block-based approach is that it does not impose any constraint on the motion transition between adjacent blocks. The resulting motion is often discontinuous across block boundaries, even when the true motion field changes smoothly from block to block. One approach to overcome this problem is to use a mesh-based representation, in which the underlying image frame is partitioned into non-overlapping polygonal elements. The motion field over the entire frame is described by the MVs at the nodes (corners of polygonal elements) only, and the MVs at the interior points of an element are interpolated from the nodal MVs. This representation induces a motion field that is continuous everywhere. It is more appropriate than the block-based representation over interior regions of an object, which usually undergo continuous motion, but it fails to capture motion discontinuities at object boundaries. Adaptive schemes that allow discontinuities when necessary are needed for more accurate motion estimation. Figure 6.5 illustrates the effect of several motion representations described above for a head-and-shoulder scene. In the next few sections, we will introduce motion estimation methods using different motion representations.

³This is sometimes called object-based motion representation [27]. Here we use the word "region-based" to acknowledge the fact that we are only considering 2D motions, and that a region with a coherent 2D motion may not always correspond to a physical object.

Figure 6.5. Different motion representations: (a) global, (b) pixel-based, (c) block-based, and (d) region-based. From [38, Fig. 3].

6.2.2 Motion Estimation Criteria


For a chosen motion model, the problem is how to estimate the model parameters. In this section, we describe several different criteria for estimating motion parameters.

Criterion based on Displaced Frame Difference


The most popular criterion for motion estimation is to minimize the sum of the errors between the luminance values of every pair of corresponding points between the anchor frame ψ₁ and the tracked frame ψ₂. Recall that x in ψ₁ is moved to w(x; a) in ψ₂. Therefore, the objective function can be written as

    E_DFD(a) = Σ_{x∈Λ} | ψ₂(w(x; a)) − ψ₁(x) |^p,   (6.2.1)

where Λ is the domain of all pixels in ψ₁, and p is a positive number. When p = 1, the above error is called the mean absolute difference (MAD), and when p = 2, the mean squared error (MSE). The error image, e(x; a) = ψ₂(w(x; a)) − ψ₁(x), is usually called the displaced frame difference (DFD) image, and the above measure the DFD error.
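For the simplest case of a single translational MV, the DFD error of Eq. (6.2.1) can be evaluated as in the following sketch; the integer-valued displacement and the periodic boundary handling via np.roll are simplifying assumptions of this illustration.

import numpy as np

def dfd_error(psi1, psi2, dx, dy, p=1):
    # Eq. (6.2.1) for a constant integer MV d = (dx, dy):
    # psi2(x + d) is obtained by shifting psi2 by -d (with wrap-around).
    pred = np.roll(np.roll(psi2.astype(float), -dy, axis=0), -dx, axis=1)
    e = pred - psi1.astype(float)       # the DFD image e(x; a)
    return np.sum(np.abs(e) ** p)       # p = 1 gives MAD, p = 2 gives MSE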
The necessary condition for minimizing E_DFD is that its gradient vanishes. In the case of p = 2, this gradient is

    ∂E_DFD/∂a = 2 Σ_{x∈Λ} ( ψ₂(w(x; a)) − ψ₁(x) ) (∂d(x)/∂a) ∇ψ₂(w(x; a)),   (6.2.2)

where

    ∂d/∂a = [ ∂dx/∂a₁  ∂dx/∂a₂  ⋯  ∂dx/∂a_L
              ∂dy/∂a₁  ∂dy/∂a₂  ⋯  ∂dy/∂a_L ]ᵀ.

Criterion based on Optical Flow Equation


Instead of minimizing the DFD error, another approach is to solve the system of equations established from the optical flow constraint given in Eq. (6.1.3). Let ψ₁(x, y) = ψ(x, y, t) and ψ₂(x, y) = ψ(x, y, t + dt). If dt is small, we can assume (∂ψ/∂t) dt = ψ₂(x) − ψ₁(x). Then, Eq. (6.1.3) can be written as

    (∂ψ₁/∂x) dx + (∂ψ₁/∂y) dy + (ψ₂ − ψ₁) = 0   or   ∇ψ₁ᵀ d + (ψ₂ − ψ₁) = 0.   (6.2.3)

This discrete version of the optical flow equation is more often used for motion estimation in digital video. Solving the above equations for all x can be turned into a minimization problem with the following objective function:

    E_flow(a) = Σ_{x∈Λ} | ∇ψ₁(x)ᵀ d(x; a) + ψ₂(x) − ψ₁(x) |^p.   (6.2.4)

The gradient of E_flow, when p = 2, is

    ∂E_flow/∂a = 2 Σ_{x∈Λ} ( ∇ψ₁(x)ᵀ d(x; a) + ψ₂(x) − ψ₁(x) ) (∂d(x)/∂a) ∇ψ₁(x).   (6.2.5)

If the motion field is constant over a small region Λ₀, i.e., d(x; a) = d₀, x ∈ Λ₀, then Eq. (6.2.5) becomes

    ∂E_flow/∂d₀ = 2 Σ_{x∈Λ₀} ( ∇ψ₁(x)ᵀ d₀ + ψ₂(x) − ψ₁(x) ) ∇ψ₁(x).   (6.2.6)

Setting the above gradient to zero yields the least squares solution for d₀:

    d₀ = ( Σ_{x∈Λ₀} ∇ψ₁(x) ∇ψ₁(x)ᵀ )⁻¹ ( Σ_{x∈Λ₀} (ψ₁(x) − ψ₂(x)) ∇ψ₁(x) ).   (6.2.7)
When the motion is not constant, but can be related to the model parameters linearly, one can still derive a similar least squares solution. See Prob. 6.6 in the Problem section.
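For the constant-over-a-region case, Eq. (6.2.7) translates almost directly into code. The sketch below assumes a square window centered at (x0, y0), gradients obtained from np.gradient, and a well-conditioned normal matrix; all of these are illustrative choices.

import numpy as np

def local_flow_ls(psi1, psi2, x0, y0, half=8):
    # Least squares MV d0 over the window, following Eq. (6.2.7).
    f1, f2 = psi1.astype(float), psi2.astype(float)
    gy, gx = np.gradient(f1)                     # spatial gradient of psi1
    win = (slice(y0 - half, y0 + half + 1), slice(x0 - half, x0 + half + 1))
    Gx, Gy = gx[win].ravel(), gy[win].ravel()
    dpsi = (f1 - f2)[win].ravel()                # psi1(x) - psi2(x)
    A = np.array([[Gx @ Gx, Gx @ Gy],            # sum of grad grad^T
                  [Gx @ Gy, Gy @ Gy]])
    b = np.array([dpsi @ Gx, dpsi @ Gy])         # sum of (psi1 - psi2) grad
    return np.linalg.solve(A, b)                 # d0 = (dx, dy)

In a flat region or along a single straight edge, the matrix A is singular or nearly so, which is exactly the aperture problem of Sec. 6.1.2; a practical implementation would test the conditioning of A before solving.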
An advantage of the flow-based method is that the function being minimized is a quadratic function of the MVs when p = 2. If the motion parameters are linearly related to the MVs, then the function has a unique minimum and can be solved easily. This is not true for the DFD error given in Eq. (6.2.1). However, the optical flow equation is valid only when the motion is small, or when an initial motion estimate d̃(x) that is close to the true motion can be found and one can pre-update ψ₂(x) to ψ₂(x + d̃(x)). When this is not the case, it is better to use the DFD error criterion and find the minimizing solution using a gradient descent or exhaustive search method.
Regularization
Minimizing the DFD error or solving the optical flow equation does not always give a physically meaningful motion estimate. This is partly because the constant intensity assumption is not always correct: the imaged intensity of the same object point may vary after an object motion because of various reflectance and shadowing effects. Secondly, in a region with flat texture, many different motion estimates can satisfy the constant intensity assumption or the optical flow equation. Finally, if the motion parameters are the MVs at every pixel, the optical flow equation does not constrain the motion vector completely. These factors make motion estimation an ill-posed problem.

To obtain a physically meaningful solution, one needs to impose additional constraints to regularize the problem. One common regularization approach is to add a penalty term to the error function in (6.2.1) or (6.2.4), which enforces the resulting motion estimate to bear the characteristics of common motion fields. One well-known property of a typical motion field is that it usually varies smoothly from pixel to pixel, except at object boundaries. To enforce smoothness, one can use a penalty term that measures the differences between the MVs of adjacent pixels, i.e.,

    Eₛ(a) = Σ_{x∈Λ} Σ_{y∈Nₓ} ‖ d(x; a) − d(y; a) ‖²,   (6.2.8)

where ‖·‖ represents the 2-norm and Nₓ represents the set of pixels adjacent to x. Either the 4-connectivity or the 8-connectivity neighborhood can be used.

The overall minimization criterion can be written as

    E = E_DFD + wₛ Eₛ.   (6.2.9)

The weighting coefficient wₛ should be chosen based on the importance of motion smoothness relative to the prediction error. To avoid over-blurring, one should reduce the weighting at object boundaries. This, however, requires accurate detection of object boundaries, which is not a trivial task.
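For a densely sampled motion field, the smoothness penalty of Eq. (6.2.8) and the combined criterion of Eq. (6.2.9) might be evaluated as follows. The (H, W, 2) array layout and the one-sided differences (which count each neighboring pair once, i.e., half of the double sum over 4-neighborhoods) are assumptions of this sketch.

import numpy as np

def smoothness_energy(d):
    # E_s of Eq. (6.2.8) for an MV field d of shape (H, W, 2).
    dh = d[:, 1:, :] - d[:, :-1, :]   # horizontal neighbor differences
    dv = d[1:, :, :] - d[:-1, :, :]   # vertical neighbor differences
    return np.sum(dh ** 2) + np.sum(dv ** 2)

def total_energy(e_dfd, d, ws=0.1):
    # E = E_DFD + w_s E_s of Eq. (6.2.9); e_dfd as computed from Eq. (6.2.1).
    return e_dfd + ws * smoothness_energy(d)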

Bayesian Criterion
The Bayesian estimator is based on a probabilistic formulation of the motion estimation problem, pioneered by Konrad and Dubois [22, 38]. Under this formulation, given an anchor frame ψ₁, the image function at the tracked frame ψ₂ is considered a realization of a random field Ψ, and the motion field d is a realization of another random field D. The a posteriori probability distribution of the motion field D given a realization of Ψ and ψ₁ can be written, using the Bayes rule, as

    P(D = d | Ψ = ψ₂; ψ₁) = P(Ψ = ψ₂ | D = d; ψ₁) P(D = d; ψ₁) / P(Ψ = ψ₂; ψ₁).   (6.2.10)

In the above notation, the semicolon indicates that subsequent variables are deterministic parameters. An estimator based on the Bayesian criterion attempts to maximize the a posteriori probability. For given ψ₁ and ψ₂, maximizing the above probability is equivalent to maximizing the numerator only. Therefore, the maximum a posteriori (MAP) estimate of d is

    d_MAP = argmax_d { P(Ψ = ψ₂ | D = d; ψ₁) P(D = d; ψ₁) }.   (6.2.11)

The first probability denotes the likelihood of an image frame given the motion field and the anchor frame. Let E represent the random field corresponding to the DFD image e(x) = ψ₂(x + d) − ψ₁(x) for given d and ψ₁; then

    P(Ψ = ψ₂ | D = d; ψ₁) = P(E = e),

and the above equation becomes

    d_MAP = argmax_d { P(E = e) P(D = d; ψ₁) }
          = argmin_d { −log P(E = e) − log P(D = d; ψ₁) }.   (6.2.12)

From source coding theory (Sec. 8.3.1), the minimum coding length for a source X is its entropy, −log P(X = x). We see that the MAP estimate is equivalent to minimizing the sum of the coding length for the DFD image e and that for the motion field d. As will be shown in Sec. 9.3.1, this is precisely what a video coder using motion-compensated prediction needs to code. Therefore, the MAP estimate for d is equivalent to a minimum description length (MDL) estimate [34]. Because the purpose of motion estimation in video coding is to minimize the bit rate, the MAP criterion is a better choice than minimizing the prediction error alone.

The most common model for the DFD image is a zero-mean independently identically distributed (i.i.d.) Gaussian field, with distribution

    P(E = e) = (2πσ²)^(−|Λ|/2) exp( − Σ_{x∈Λ} e(x)² / (2σ²) ),   (6.2.13)

where |Λ| denotes the size of Λ (i.e., the number of pixels in Λ). With this model, minimizing the first term in Eq. (6.2.12) is equivalent to minimizing the previously defined DFD error (when p = 2).
For the motion field D, a common model is a Gibbs/Markov random field [11]. Such a model is defined by a neighborhood structure called a clique. Let C represent the set of cliques; the model assumes

    P(D = d) = (1/Z) exp( − Σ_{c∈C} V_c(d) ),   (6.2.14)

where Z is a normalization factor. The function V_c(d) is called the potential function, and is usually defined to measure the difference between pixels in the same clique:

    V_c(d) = Σ_{(x,y)∈c} |d(x) − d(y)|².   (6.2.15)

Under this model, minimizing the second term in Eq. (6.2.12) is equivalent to minimizing the smoothing function in Eq. (6.2.8). Therefore, the MAP estimate is equivalent to the DFD-based estimator with an appropriate smoothness constraint.
6.2.3 Minimization Methods
The error functions presented in Sec. 6.2.2 can be minimized using various optimization methods. Here we only consider exhaustive search and gradient-based search methods. Usually, for the exhaustive search, the MAD is used for reasons of computational simplicity, whereas for the gradient-based search, the MSE is used for its mathematical tractability.

Obviously, the advantage of the exhaustive search method is that it guarantees reaching the global minimum. However, such a search is feasible only if the number of unknown parameters is small and each parameter takes only a finite set of discrete values. To reduce the search time, various fast algorithms can be developed, which achieve sub-optimal solutions.

The most common gradient descent methods are the steepest gradient descent and the Newton-Raphson method. A brief review of these methods is provided in Appendix B. A gradient-based method can handle unknown parameters in a high dimensional continuous space. However, it can only guarantee convergence to a local minimum. The error functions introduced in the previous section are in general not convex and can have many local minima that are far from the global minimum. Therefore, it is important to obtain a good initial solution through the use of prior knowledge, or by adding a penalty term to make the error function convex.

With a gradient-based method, one must calculate the spatio-temporal gradients of the underlying signal. Appendix A reviews methods for computing first and second order gradients from digitally sampled images. Note that the methods used for calculating the gradient functions can have a profound impact on the accuracy and robustness of the associated motion estimation methods, as has been shown by Barron et al. [4]. Using a Gaussian pre-filter followed by a central difference generally leads to significantly better results than the simple two-point difference approximation.

One important search strategy is to use a multi-resolution representation of the motion field and conduct the search in a hierarchical manner. The basic idea is to first search for the motion parameters at a coarse resolution, propagate this solution into a finer resolution, and then refine the solution at the finer resolution. This can combat both the slowness of exhaustive search methods and the non-optimality of gradient-based methods. We will present the multi-resolution method in more detail in Sec. 6.9.

6.3 Pixel-Based Motion Estimation


In pixel-based motion estimation, one tries to estimate a motion vector for every pixel. Obviously, this problem is ill-defined. If one uses the constant intensity assumption, then for every pixel in the anchor frame there are many pixels in the tracked frame that have exactly the same image intensity. If one uses the optical flow equation instead, the problem is again indeterminate, because there is only one equation for two unknowns. To circumvent this problem, there are in general four approaches. First, one can use the regularization technique to enforce smoothness constraints on the motion field, so that the motion vector of a new pixel is constrained by those found for surrounding pixels. Second, one can assume the motion vectors in a neighborhood surrounding each pixel are the same, and apply the constant intensity assumption or the optical flow equation over the entire neighborhood. The third approach is to make use of additional invariance constraints. In addition to intensity invariance, which leads to the optical flow equation, one can assume that the intensity gradient is invariant under motion, as proposed in [29, 26, 15]. Finally, one can also make use of the relation between the phase functions of the frame before and after motion [9]. In [4], Barron et al. evaluated various methods for optical flow computation by testing these algorithms on both synthetic and real-world imagery. In this section, we describe the first two approaches only. We also introduce the pel-recursive type of algorithms, which were developed for video compression applications.

6.3.1 Regularization Using Motion Smoothness Constraint


Horn and Schunck [16] proposed to estimate the motion vectors by minimizing the following objective function, which is a combination of the flow-based criterion and a motion smoothness criterion:

    E(v(x)) = Σ_{x∈Λ} [ ( (∂ψ/∂x) vx + (∂ψ/∂y) vy + ∂ψ/∂t )² + wₛ ( ‖∇vx‖² + ‖∇vy‖² ) ].   (6.3.1)

In their original algorithm, the spatial gradients of vx and vy are approximated by ∇vx = [vx(x, y) − vx(x−1, y), vx(x, y) − vx(x, y−1)]ᵀ and ∇vy = [vy(x, y) − vy(x−1, y), vy(x, y) − vy(x, y−1)]ᵀ. The minimization of the above error function is accomplished by a gradient-based method known as the Gauss-Seidel method.
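A minimal sketch of this scheme is given below, using a Jacobi-style update in which the smoothness term enters through a 4-neighbor average of the current flow estimate (the original algorithm used Gauss-Seidel sweeps instead). The two-point gradient approximations, the circular boundary handling, and the fixed iteration count are assumptions of this illustration.

import numpy as np

def horn_schunck(psi1, psi2, ws=100.0, n_iter=100):
    # Iteratively minimize Eq. (6.3.1) for a dense flow field (vx, vy).
    f1, f2 = psi1.astype(float), psi2.astype(float)
    gy, gx = np.gradient(f1)          # spatial gradients of psi
    gt = f2 - f1                      # temporal gradient (unit time step)
    vx = np.zeros_like(f1)
    vy = np.zeros_like(f1)
    for _ in range(n_iter):
        # 4-neighbor averages carry the smoothness constraint.
        ax = 0.25 * (np.roll(vx, 1, 0) + np.roll(vx, -1, 0)
                     + np.roll(vx, 1, 1) + np.roll(vx, -1, 1))
        ay = 0.25 * (np.roll(vy, 1, 0) + np.roll(vy, -1, 0)
                     + np.roll(vy, 1, 1) + np.roll(vy, -1, 1))
        # Pointwise closed-form update balancing data and smoothness terms.
        t = (gx * ax + gy * ay + gt) / (ws + gx ** 2 + gy ** 2)
        vx = ax - gx * t
        vy = ay - gy * t
    return vx, vy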
Nagel and Enkelmann conducted a comprehensive evaluation of the effect of smoothness constraints on motion estimation [30]. To avoid over-smoothing of the motion field, Nagel suggested an oriented-smoothness constraint, in which smoothness is imposed along object boundaries but not across them [29]. This has resulted in significant improvement in motion estimation accuracy [4].
6.3.2 Using a Multipoint Neighborhood
In this approach, when estimating the motion vector at a pixel xₙ, we assume that the motion vectors of all the pixels in a neighborhood B(xₙ) surrounding it are the same, equal to dₙ. To determine dₙ, one can either minimize the prediction error over B(xₙ) or solve the optical flow equation using a least squares method. Here we present the first approach. To estimate the motion vector dₙ for xₙ, we minimize the DFD error over B(xₙ):

    Eₙ(dₙ) = (1/2) Σ_{x∈B(xₙ)} w(x) ( ψ₂(x + dₙ) − ψ₁(x) )²,   (6.3.2)

where w(x) are the weights assigned to pixel x. Usually, the weight decreases as the distance from x to xₙ increases.

The gradient with respect to dₙ is

    gₙ = ∂Eₙ/∂dₙ = Σ_{x∈B(xₙ)} w(x) e(x; dₙ) (∂ψ₂/∂x)|_{x+dₙ},   (6.3.3)

where e(x; dₙ) = ψ₂(x + dₙ) − ψ₁(x) is the DFD at x with the estimate dₙ. Let dₙ^(l) represent the estimate at the l-th iteration; the first order gradient descent method then yields the following update algorithm:

    dₙ^(l+1) = dₙ^(l) − α gₙ(dₙ^(l)),   (6.3.4)

where α is the step size. From Eq. (6.3.3), the update at each iteration depends on the sum of the image gradients at various pixels, scaled by the weighted DFD values at those pixels.
One can also derive an iterative algorithm using the Newton-Raphson method. From Eq. (6.3.3), the Hessian matrix is

    Hₙ = ∂²Eₙ/∂dₙ² = Σ_{x∈B(xₙ)} [ w(x) (∂ψ₂/∂x)(∂ψ₂/∂x)ᵀ |_{x+dₙ} + w(x) e(x; dₙ) (∂²ψ₂/∂x²)|_{x+dₙ} ]
       ≈ Σ_{x∈B(xₙ)} w(x) (∂ψ₂/∂x)(∂ψ₂/∂x)ᵀ |_{x+dₙ}.

The Newton-Raphson update algorithm is then (see Appendix B):

    dₙ^(l+1) = dₙ^(l) − Hₙ(dₙ^(l))⁻¹ gₙ(dₙ^(l)).   (6.3.5)

This algorithm converges faster than the first order gradient descent method, but it requires more computation in each iteration.

Instead of using gradient-based update algorithms, one can also use exhaustive search to find the dₙ that yields the minimal error within a defined search range. This leads to the exhaustive block matching algorithm (EBMA) to be presented in Sec. 6.4.1. The difference from the EBMA is that the neighborhood used here is a sliding window, and an MV is determined for each pixel by minimizing the error in its neighborhood. The neighborhood in general does not have to be a rectangular block.
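One iteration of the first order update (6.3.4) might look as follows. The uniform weights w(x) = 1, the integer rounding of the displacement (rather than interpolating ψ₂ at x + dₙ), and the step size value are all simplifying assumptions of this sketch.

import numpy as np

def gradient_descent_step(psi1, psi2, xn, yn, d, alpha=1e-4, half=4):
    # One update of d_n following Eqs. (6.3.3)-(6.3.4).
    f1, f2 = psi1.astype(float), psi2.astype(float)
    gy2, gx2 = np.gradient(f2)                 # gradient of the tracked frame
    d = np.asarray(d, dtype=float)
    dx, dy = int(round(d[0])), int(round(d[1]))
    g = np.zeros(2)
    for y in range(yn - half, yn + half + 1):
        for x in range(xn - half, xn + half + 1):
            e = f2[y + dy, x + dx] - f1[y, x]  # DFD e(x; d_n)
            g += e * np.array([gx2[y + dy, x + dx],
                               gy2[y + dy, x + dx]])
    return d - alpha * g                       # Eq. (6.3.4)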

6.3.3 Pel-Recursive Methods


In a video coder using motion compensated prediction, one needs to specify both the MVs and the DFD image. With a pixel-based motion representation, one would need to specify an MV for each pixel, which is very costly. In pel-recursive motion estimation methods, which were developed for video coding applications, the MVs are obtained recursively. Specifically, the MV at a current pixel is updated from those of its neighboring pixels that were coded before. The decoder can derive the same MV following the same update rule, so that the MVs do not need to be coded. A variety of such algorithms have been developed, where the update rules all follow some type of gradient-descent method [31].

Although pel-recursive methods are very simple, their motion estimation accuracy is quite poor. As a result, the prediction error is still large and requires a significant number of bits to code. These algorithms were used in earlier generations of video codecs because of their simplicity. Today's codecs use more sophisticated motion estimation algorithms, which can provide a better trade-off between the bits used for specifying the MVs and the DFD image. The most popular one is the block matching algorithm discussed in the next section.

6.4 Block Matching Algorithm


As already seen, a problem with pixel-based motion estimation is that one must impose smoothness constraints to regularize the problem. One way of imposing smoothness constraints on the estimated motion field is to divide the image domain into non-overlapping small regions, called blocks, and assume that the motion within each block can be characterized by a simple parametric model, e.g., constant, affine, or bilinear. If the block is sufficiently small, then this model can be quite accurate. In this section, we describe motion estimation algorithms developed using the block-based motion representation. We will use Bₘ to represent the m-th image block, M the number of blocks, and M = {1, 2, ..., M}. The partition into blocks should satisfy

    ∪_{m∈M} Bₘ = Λ   and   Bₘ ∩ Bₙ = ∅,  m ≠ n.
Theoretically, a block can have any polygonal shape. In practice, however, the square shape is used almost exclusively. The triangular shape has also been used, which is more appropriate when the motion in each block is described by an affine model. In the simplest case, the motion in each block is assumed to be constant, i.e., the entire block undergoes a translation. This is called the block-wise translational model. In this section, we consider only this simple case, where the motion estimation problem is to find a single MV for each block. This type of algorithm is collectively referred to as the block matching algorithm (BMA). In the next section, we will consider the more general case where the motion in each block is characterized by a more complex model.
6.4.1 The Exhaustive Search Block Matching Algorithm (EBMA)
Given an image block Bₘ in the anchor frame, the motion estimation problem at hand is to determine a matching block B′ₘ in the tracked frame such that the error between these two blocks is minimized. The displacement vector dₘ between the spatial positions of these two blocks (the center or a selected corner) is the MV of this block. Under the block-wise translational model, w(x; a) = x + dₘ, x ∈ Bₘ, the error in Eq. (6.2.1) can be written as

    E(dₘ, ∀m ∈ M) = Σ_{m∈M} Σ_{x∈Bₘ} | ψ₂(x + dₘ) − ψ₁(x) |^p.   (6.4.1)

Because the estimated MV for a block affects the prediction error only in that block, one can estimate the MV for each block individually, by minimizing the prediction error accumulated over that block alone:

    Eₘ(dₘ) = Σ_{x∈Bₘ} | ψ₂(x + dₘ) − ψ₁(x) |^p.   (6.4.2)

One way to determine the dₘ that minimizes the above error is exhaustive search, and this method is called the exhaustive block matching algorithm (EBMA). As illustrated in Fig. 6.6, the EBMA determines the optimal dₘ for a given block Bₘ in the anchor frame by comparing it with all candidate blocks B′ₘ in the tracked frame within a predefined search region and finding the one with the minimum error. The displacement between the two blocks is the estimated motion vector.
To reduce the computational load, the MAD error (p = 1) is often used. The search region is usually symmetric with respect to the current block, up to Rx pixels to the left and right, and up to Ry pixels above and below, as illustrated in Fig. 6.6. If it is known that the dynamic range of the motion is the same in the horizontal and vertical directions, then Rx = Ry = R. The estimation accuracy is determined by the search stepsize, which is the distance between two nearby candidate blocks in the horizontal or vertical direction. Normally, the same stepsize is used along the two directions. In the simplest case, the stepsize is one pixel, which is known as integer-pel accuracy search.

Figure 6.6. The search procedure of the exhaustive block matching algorithm. The current block Bₘ in the anchor frame is compared with all candidate blocks B′ₘ within a search region extending Rx pixels horizontally and Ry pixels vertically in the tracked frame; dₘ is the displacement to the best match.

Let the block size be N × N pixels, and the search range be R pixels in both horizontal and vertical directions (cf. Fig. 6.6). With a stepsize of one pixel, the total number of candidate matching blocks for each block in the anchor frame is (2R + 1)². Let an operation be defined as consisting of one subtraction, one absolute value computation, and one addition. The number of operations for calculating the MAD for each candidate estimate is N². The number of operations for estimating the MV for one block is then (2R + 1)²N². For an image of size M × M, there are (M/N)² blocks (assuming M is a multiple of N). The total number of operations for a complete frame is then M²(2R + 1)². It is interesting to note that the overall computational load is independent of the block size N.

As an example, consider M = 512, N = 16, R = 16; the total operation count per frame is 2.85 × 10⁸. For a video sequence with a frame rate of 30 fps, the operations required per second number 8.55 × 10⁹, an astronomical figure! This example shows that EBMA requires intense computation, which poses a challenge to applications requiring software-only implementation. Because of this problem, various fast algorithms have been developed, which trade off estimation accuracy for reduced computation. Some fast algorithms are presented in Sec. 6.4.3. One advantage of EBMA is that it can be implemented in hardware using a simple and modular design, and speed-up can be achieved by using multiple modules in parallel. There have been many research efforts on efficient realization of the EBMA using VLSI/ASIC chips, which sometimes involve slight modifications of the algorithm to trade off accuracy for reduced computation, memory space, or memory access. For a good review of VLSI architectures for implementing EBMA and other fast algorithms for block matching, see [21, 32, 14].
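The following sketch implements integer-pel EBMA with the MAD criterion. Frames stored as float arrays, dimensions divisible by N, and the skipping of candidates that extend outside the tracked frame are assumptions of this illustration.

import numpy as np

def ebma(anchor, tracked, N=16, R=16):
    # Returns an (H/N, W/N, 2) array of integer MVs (dx, dy), one per block.
    f1, f2 = anchor.astype(float), tracked.astype(float)
    H, W = f1.shape
    mvs = np.zeros((H // N, W // N, 2), dtype=int)
    for by in range(H // N):
        for bx in range(W // N):
            y0, x0 = by * N, bx * N
            block = f1[y0:y0 + N, x0:x0 + N]
            best, best_err = (0, 0), np.inf
            for dy in range(-R, R + 1):         # all (2R+1)^2 candidates
                for dx in range(-R, R + 1):
                    y1, x1 = y0 + dy, x0 + dx
                    if y1 < 0 or x1 < 0 or y1 + N > H or x1 + N > W:
                        continue                # candidate leaves the frame
                    err = np.sum(np.abs(f2[y1:y1 + N, x1:x1 + N] - block))
                    if err < best_err:
                        best_err, best = err, (dx, dy)
            mvs[by, bx] = best
    return mvs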

Figure 6.7. Half-pel accuracy block matching between the current block Bₘ and the matching block B′ₘ. Filled circles are samples existing in the original tracked frame; open circles are samples to be interpolated for calculating the matching error, for a candidate MV dₘ = (−1, 1.5). Instead of calculating these samples on demand for each candidate MV, a better approach is to pre-interpolate the entire tracked frame.

6.4.2 Fractional Accuracy Search


As already hinted, the stepsize used when searching for the matching block in the BMA does not have to be an integer. For more accurate motion representation, fractional-pel accuracy is needed. A problem with using fractional stepsizes is that there may not be corresponding sample points in the tracked frame for certain sample points in the anchor frame. These samples need to be interpolated from the available sample points. Bilinear interpolation is commonly used for this purpose. In general, to realize a stepsize of 1/K pixel, the tracked frame needs to be interpolated by a factor of K first. An example with K = 2 is shown in Fig. 6.7, which is known as half-pel accuracy search. It has been shown that half-pel accuracy search can provide a significant improvement in estimation accuracy over integer-pel accuracy search, especially for low-resolution videos.

A question that naturally arises is what is the appropriate search stepsize for motion estimation. Obviously, it depends on the intended application of the estimated motion vectors. For video coding, where the estimated motion is used to predict a current frame (the anchor frame) from a previously coded frame (the tracked frame), it is the prediction error (i.e., the DFD error) that should be minimized. A statistical analysis of the relation between the prediction error and search precision has been conducted by Girod [12] and will be presented in Sec. 9.3.5.

Obviously, with a fractional-pel stepsize, the complexity of the EBMA is further increased. For example, with half-pel search, the number of search points is quadrupled over that using integer-pel accuracy. The overall complexity is more than quadrupled, considering the extra computation required for interpolating the tracked frame.
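A sketch of the pre-interpolation step is given below, assuming bilinear interpolation and replication at the last row and column. After upsampling by K, a candidate MV on the 1/K-pel grid corresponds to an integer displacement in the upsampled frame, so the integer-pel matching machinery applies unchanged (sampling the upsampled frame with stride K).

import numpy as np

def upsample_bilinear(frame, K=2):
    # Interpolate the tracked frame by a factor K in each dimension.
    f = frame.astype(float)
    H, W = f.shape
    ys = np.arange(K * H) / K          # sample positions on the original grid
    xs = np.arange(K * W) / K
    y0 = np.minimum(ys.astype(int), H - 2)
    x0 = np.minimum(xs.astype(int), W - 2)
    wy = (ys - y0)[:, None]            # vertical interpolation weights
    wx = (xs - x0)[None, :]            # horizontal interpolation weights
    return ((1 - wy) * (1 - wx) * f[y0][:, x0]
            + (1 - wy) * wx * f[y0][:, x0 + 1]
            + wy * (1 - wx) * f[y0 + 1][:, x0]
            + wy * wx * f[y0 + 1][:, x0 + 1])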

Example 6.1: Figure 6.8(c) shows the motion field estimated by a half-pel EBMA algorithm for the two frames given in Figure 6.8(a-b). Figure 6.8(d) shows the predicted anchor frame based on the estimated motion. This is obtained by replacing each block in the anchor frame by its best matching block in the tracked frame. The image size is 352 × 288 and the block size is 16 × 16. We can see that a majority of blocks are predicted accurately; however, there are blocks that are not well predicted. Some of these blocks undergo non-translational motions, such as the blocks covering the eyes and the mouth. Other blocks contain both the foreground object and the background, with only the foreground object moving. There are also blocks where the image intensity change is due to the change in the reflection patterns when the head turns. The motion variation over these blocks cannot be approximated well by a constant MV, and the EBMA algorithm simply identifies the block in the tracked frame that has the smallest absolute error from a given block in the anchor frame. Furthermore, the predicted image is discontinuous along certain block boundaries, which is the notorious blocking artifact common with the EBMA algorithm. These artifacts are due to the inherent limitation of the block-wise translational motion model, and the fact that the MV for a block is determined independently of the MVs of its adjacent blocks.

The accuracy between a predicted image and the original one is usually measured by the PSNR defined previously in Eq. (1.5.6). The PSNR of the image predicted by the half-pel EBMA is 29.72 dB. With the integer-pel EBMA, the resulting predicted image is visually very similar, although the PSNR is slightly lower.

6.4.3 Fast Algorithms


As shown above, EBMA requires a very large amount of computation. To speed up the search, various fast algorithms for block matching have been developed. The key to reducing the computation is to reduce the number of search candidates. As described before, for a search range of R and a stepsize of 1 pixel, the total number of candidates is (2R + 1)² with EBMA. Various fast algorithms differ in how they skip the candidates that are unlikely to have small errors.
2D-Log Search Method  One popular fast search algorithm is the 2D-log search [19], illustrated in Fig. 6.9. It starts from the position corresponding to zero displacement. Each step tests five search points arranged in a diamond. In the next step, the diamond search is repeated with the center moved to the best matching point resulting from the previous step. The search stepsize (the radius of the diamond) is reduced if the best matching point is the center point or on the border of the maximum search range; otherwise, the stepsize remains the same. The final step is reached when the stepsize is reduced to 1 pel, and nine search points are examined in this last step. The initial stepsize is usually set to half of the maximum search range. With this method, one cannot pre-determine the number of steps and the total number of search points, as they depend on the actual MV. But the best case (requiring the minimum number of search points) and the worst case (requiring the maximum number) can be analyzed.

Figure 6.8. Example motion estimation results: (a) the tracked frame; (b) the anchor frame; (c-d) motion field and predicted image for the anchor frame (PSNR = 29.86 dB) obtained by half-pel accuracy EBMA; (e-f) motion field (represented by the deformed mesh overlaid on the tracked frame) and predicted image (PSNR = 29.72 dB) obtained by the mesh-based motion estimation scheme in [43].

Figure 6.9. The 2D-logarithmic search method. The search points in a tracked frame are shown with respect to a block center at (i, j) in the anchor frame. In this example, the best matching MVs in steps 1 to 5 are (0,2), (0,4), (2,4), (2,6), and (2,6). The final MV is (2,6). From [28, Fig. 11].

Figure 6.10. The three-step search method. In this example, the best matching MVs in steps 1 to 3 are (3,3), (3,5), and (2,6). The final MV is (2,6). From [28, Fig. 12].



Table 6.1. Comparison of fast search algorithms for a search range of R = 7. From [14, Table 1]

Search Algorithm    Number of Search Points    Number of Search Steps
                    Minimum      Maximum       Minimum      Maximum
EBMA                225          225           1            1
2D-log [19]         13           26            2            8
three-step [20]     25           25            3            3

Three-Step Search Method  Another popular fast algorithm is the three-step search algorithm [20]. As illustrated in Fig. 6.10, the search starts with a stepsize equal to or slightly larger than half of the maximum search range. In each step, nine search points are compared: the central point of the search square and eight search points located on the search area boundaries. The stepsize is reduced by half after each step, and the search ends with a stepsize of 1 pel. At each new step, the search center is moved to the best matching point resulting from the previous step. Let R₀ represent the initial search stepsize; there are at most L = ⌊log₂ R₀ + 1⌋ search steps, where ⌊x⌋ represents the integer part of x. If R₀ = R/2, then L = ⌊log₂ R⌋. At each search step, eight points are searched, except at the very beginning, when nine points need to be examined. Therefore, the total number of search points is 8L + 1. For example, for a search range of R = 32, the total number of search points is 4225 with EBMA, whereas with the three-step method the number is reduced to 41, a saving factor of more than 100. Unlike the 2D-log search method, the three-step method has a fixed, predictable number of search steps and search points. In addition, it has a more regular structure. These features make the three-step method more amenable to VLSI implementation than the 2D-log method and some other fast algorithms.
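A sketch of the three-step search for a single block is given below, assuming the MAD criterion, an initial stepsize of half the search range, and infinite error for out-of-frame candidates.

import numpy as np

def three_step_search(anchor, tracked, y0, x0, N=16, R0=8):
    # Returns the MV (dx, dy) for the N x N block at (y0, x0).
    f1, f2 = anchor.astype(float), tracked.astype(float)
    H, W = f2.shape
    block = f1[y0:y0 + N, x0:x0 + N]

    def mad(dy, dx):
        y1, x1 = y0 + dy, x0 + dx
        if y1 < 0 or x1 < 0 or y1 + N > H or x1 + N > W:
            return np.inf                       # candidate leaves the frame
        return np.sum(np.abs(f2[y1:y1 + N, x1:x1 + N] - block))

    dy = dx = 0
    step = R0
    while step >= 1:
        # Nine candidates: the current center and eight surrounding points.
        cands = [(dy + sy * step, dx + sx * step)
                 for sy in (-1, 0, 1) for sx in (-1, 0, 1)]
        dy, dx = min(cands, key=lambda c: mad(*c))
        step //= 2                              # halve after each step
    return dx, dy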

Comparison of Different Fast Algorithms  Table 6.1 compares the minimum and maximum numbers of search points and the numbers of search steps required by several different search algorithms. As can be seen, some algorithms have a more regular structure and hence a fixed number of computations, while others have very different best case and worst case numbers. For VLSI implementation, structural regularity is more important, whereas for software implementation, the average case complexity (which is generally closer to the best case) is more important. For an analysis of the implementation complexity and cost using VLSI circuits for these algorithms, see the article by Hang et al. [14].

The above discussions assume that the search accuracy is integer-pel. To achieve half-pel accuracy, one can add a final step to any fast algorithm, which searches with a half-pel stepsize in a 1-pel neighborhood of the best matching point found from the integer-pel search.

6.4.4 Imposing Motion Smoothness Constraints


From Figure 6.8(c), we see that the motion field obtained using EBMA is quite chaotic. This is because no constraints are imposed on the spatial variation of the block MVs. Several approaches have been developed to make the estimated motion field smoother, so that it is closer to the physical motion field. One effective approach is the hierarchical approach, which estimates the MVs at a coarser spatial resolution first, and then continuously refines the MVs at successively finer resolutions. The propagation of the MVs from a coarser resolution to a finer one is accomplished by spatial interpolation, which induces a certain degree of spatial continuity in the resulting motion field. This technique will be treated in more detail in Sec. 6.9. Another approach is to explicitly impose a smoothness constraint by adding a smoothing term to the error criterion in Eq. (6.4.2), measuring the variation of the MVs of adjacent blocks. The resulting overall error function is similar to that in Eq. (6.2.9), except that the motion vectors are defined over blocks and the prediction error is summed over a block. The challenge is to determine a proper weighting between the prediction error term and the smoothing term so that the resulting motion field is not over-smoothed. Ideally, the weighting should be adaptive: it should not be applied near object boundaries. A more difficult task is to identify the object boundaries where motion discontinuity should be allowed.

6.4.5 Phase Correlation Method


Instead of minimizing the DFD, another motion estimation approach identifies peaks in the phase-correlation function. Assume that two image frames are related by a pure translation, so that

    ψ₁(x) = ψ₂(x + d).   (6.4.3)

Taking the Fourier transform of both sides and using the Fourier shift theorem, we get

    Ψ₁(f) = Ψ₂(f) · e^{j2π dᵀf}.   (6.4.4)

The normalized cross power spectrum between ψ₁(x) and ψ₂(x) is

    Ψ̃(f) = Ψ₁(f) Ψ₂*(f) / |Ψ₁(f) Ψ₂*(f)| = e^{j2π dᵀf},   (6.4.5)

where the superscript * indicates complex conjugation. Taking the inverse Fourier transform results in the phase correlation function (PCF):⁴

    PCF(x) = F⁻¹{Ψ̃(f)} = δ(x + d).   (6.4.6)

⁴The name comes from the fact that it is the cross correlation between the phase portions of the functions ψ₁(x) and ψ₂(x), respectively.


We see that the PCF between two images that are translations of each other is an impulse function, with the impulse located at a position exactly equal to the translation between the two images. By identifying the peak location of the PCF, one can estimate the translation between two images. This approach was first used by Kuglin [23] for image alignment.
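A sketch of the basic procedure using the 2D DFT is given below. Because the DFT implies circular shifts, the peak index must be unwrapped into a signed displacement, and a small constant guards the normalization against division by zero; both choices, along with the omission of the weighting windows discussed next, are assumptions of this illustration.

import numpy as np

def phase_correlation_mv(psi1, psi2):
    # Estimate d such that psi1(x) ~ psi2(x + d) from the PCF peak.
    F1 = np.fft.fft2(psi1.astype(float))
    F2 = np.fft.fft2(psi2.astype(float))
    cross = F1 * np.conj(F2)                       # Psi1 Psi2^*
    pcf = np.fft.ifft2(cross / np.maximum(np.abs(cross), 1e-12)).real
    py, px = np.unravel_index(np.argmax(pcf), pcf.shape)
    H, W = pcf.shape
    # The PCF of Eq. (6.4.6) peaks at x = -d (modulo the frame size).
    ny = py - H if py > H // 2 else py
    nx = px - W if px > W // 2 else px
    return -nx, -ny                                # (dx, dy)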
The above derivation assumes that both images are continuous in space and of infinite size. The actual image signal, on the other hand, is discrete and finite. In practice, we apply the DSFT over the available image domain, which is equivalent to the periodic extension of the CSFT of an infinite image that is zero outside the given image domain (see Secs. 2.1 and 2.2). To suppress the aliasing effect caused by sampling, a frequency-domain weighting function W(f) is usually applied when computing (6.4.5). In [13], a Kaiser window with parameter 0.2 is used as the weighting function. To reduce the effect of boundary samples, a space-domain weighting function w(x) can also be applied to ψ₁(x) and ψ₂(x) before computing the DSFT.
Phase correlation is used extensively in image registration, where entire images have to be aligned [33, 10]. For motion estimation, the underlying two frames are usually not related by a global translation. To handle this situation, the phase correlation method is more often applied at the block level. For motion estimation over non-overlapping blocks of size N × N, both frames are usually divided into overlapping range blocks of size L × L. For a search range of R, the range block size should be L ≥ N + 2R. To determine the MV of a block in ψ₁(x), a size L × L discrete Fourier transform (DFT) is applied to both this block and its corresponding block in ψ₂(x). Then the PCF is computed using an inverse DFT of the same size, and the peak location is identified. To enable the use of fast Fourier transform (FFT) algorithms, L is usually chosen to be a power of 2. For example, if N = 16 and R = 16, L = 64 would be appropriate.

The above method assumes that there is a global translation between the two corresponding range blocks. This assumption does not hold for general video sequences. When there are several patches in the range block in ψ₁(x) that undergo different motions, we will observe several peaks in the PCF. Each peak corresponds to the motion of one patch: the location of the peak indicates the MV of the patch, whereas the amplitude of the peak is proportional to the size of the patch [40]. In this sense, the PCF reveals information similar to the MV histogram over a block. To estimate the dominant MV of the block, we first extract local maxima of the PCF. We then examine the DFD at the corresponding MVs. The MV yielding the minimum DFD is considered the block MV. Since only a small number of candidate MVs are examined, significant savings in computational complexity may be achieved compared to the full-search method.
This approach can be extended to motion estimation with fractional-pel accuracy. In [13], the integer-pel candidate motion vectors are augmented by varying the length of the candidate motion vectors by up to 1 pel. In [37] and [40], alternative methods are suggested.
An advantage of the phase correlation method for motion estimation is its insensitivity to illumination changes (see Sec. 5.2). This is because changes in the mean value of an image, or multiplication of an image by a constant, do not affect the phase information. This is not true for the DFD-based methods.
6.4.6 Binary Feature Matching
In this scheme, known as the Hierarchical Feature Matching Motion Estimation Scheme (HFM-ME) [25], a Sign Truncated Feature (STF) is defined and used for block template matching, as opposed to the pixel intensity values used in conventional block matching methods. Under the STF definition, a data block is represented by a mean and a binary bit pattern, and block matching motion estimation is divided into mean matching and binary phase matching. This technique enables a significant reduction in computational complexity compared with EBMA, because binary phase matching only involves Boolean logic operations. The use of the STF also significantly reduces the data transfer time between the frame buffer and the motion estimator. Tests have shown that HFM-ME can achieve prediction accuracy similar to EBMA under the same search ranges, but can be implemented about 64 times faster. When the search range is doubled for HFM-ME, it achieves significantly more accurate prediction than EBMA, still with nontrivial time savings [25].
The STF vector of a block of size $2^N \times 2^N$ consists of two parts. The first part is the multiresolution mean vectors, and the second part is the sign-truncated binary vectors. The mean vectors are determined recursively as follows:
$$\mathrm{Mean}_n(i,j) = \left\lfloor \frac{1}{4} \sum_{p=0}^{1} \sum_{q=0}^{1} \mathrm{Mean}_{n+1}(2i+p,\, 2j+q) \right\rfloor, \quad 0 \le i,j \le 2^n - 1, \; 0 \le n \le N-1,$$
$$\mathrm{Mean}_N(i,j) = \psi(i,j), \quad 0 \le i,j \le 2^N - 1, \qquad (6.4.7)$$
where $\{\psi(i,j),\; 0 \le i,j \le 2^N - 1\}$ are the pixel intensity values of the original block. The sign-truncated vectors are obtained by
$$\mathrm{ST\_pattern}_n(i,j) = \begin{cases} 0, & \text{if } \mathrm{Mean}_n(i,j) \ge \mathrm{Mean}_{n-1}(\lfloor i/2 \rfloor, \lfloor j/2 \rfloor), \\ 1, & \text{otherwise.} \end{cases} \qquad (6.4.8)$$
The STF vector, decomposed to the $n$-th level for a $2^N \times 2^N$ block, can then be represented as
$$\mathrm{STFV}_N^n = \{\mathrm{ST\_pattern}_N, \mathrm{ST\_pattern}_{N-1}, \ldots, \mathrm{ST\_pattern}_{N-n+1}, \mathrm{mean}_{N-n}\}. \qquad (6.4.9)$$
When $n = N$, a block is fully decomposed, with the following STF vector:
$$\mathrm{STFV}_N^N = \{\mathrm{ST\_pattern}_N, \mathrm{ST\_pattern}_{N-1}, \ldots, \mathrm{ST\_pattern}_1, \mathrm{mean}_0\}. \qquad (6.4.10)$$
All the intermediate mean vectors are used only to generate the ST patterns and can be discarded. Therefore, the final STF representation consists of a multiresolution binary sequence with $\frac{4}{3}(4^N - 1)$ bits and a one-byte mean. This is a much reduced data set compared to the original $4^N$ bytes of pixel values. Also, this feature set allows binary Boolean operations for the block matching purpose.
As an example, let us consider how to form the STF vector for a $4 \times 4$ block with 2 layers. First, the mean pyramid is formed as
$$\begin{bmatrix} 158 & 80 & 59 & 74 \\ 80 & 69 & 59 & 74 \\ 87 & 86 & 65 & 62 \\ 116 & 100 & 72 & 58 \end{bmatrix} \Longrightarrow \begin{bmatrix} 97 & 67 \\ 97 & 64 \end{bmatrix} \Longrightarrow 81.$$
The STF vectors are then obtained as:
$$\begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad \begin{pmatrix} 0 & 1 \\ 0 & 1 \end{pmatrix}, \quad 81.$$
The STF vector decomposed to one layer for the above example is \{0110 1110 0010 0001, (97, 67, 97, 64)\}. The completely decomposed STF vector is \{0101, 0110 1110 0010 0001, 81\}. It consists of a 20-bit binary pattern, which includes a $2 \times 2$ second-layer sign pattern and a $4 \times 4$ first-layer sign pattern, and a mean value. In practical implementations, either completely decomposed or mixed-layer STF vectors can be used.
Comparison of two STF vectors is accomplished by two parallel decision procedures: i) calculating the absolute error between the mean values, and ii) determining the Hamming distance between the binary patterns. The latter can be accomplished extremely fast by using an XOR Boolean operator. Therefore, the main computational load of the HFM-ME lies in the computation of the mean pyramids for the current and all candidate matching blocks. This computation can, however, be done in advance, only once for every possible block. For a detailed analysis of the computational complexity and a fast algorithm using logarithmic search, see [25].
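As a minimal sketch of these definitions (Python with NumPy; the function names are ours, and the relative weighting of the mean term in the distance is an assumption, not a detail from [25]), the following builds the fully decomposed STF vector of Eqs. (6.4.7)-(6.4.10) and compares two STF vectors by a mean difference plus a Hamming distance of the sign patterns:

    import numpy as np

    def stf_vector(block):
        # Mean pyramid, Eq. (6.4.7); means[0] is the finest level (the block).
        means = [np.asarray(block, dtype=np.float64)]
        while means[-1].shape[0] > 1:
            m = means[-1]
            means.append(np.floor((m[0::2, 0::2] + m[0::2, 1::2] +
                                   m[1::2, 0::2] + m[1::2, 1::2]) / 4.0))
        # Sign truncation, Eq. (6.4.8): 1 where a mean is below its parent mean.
        patterns = []
        for n in range(len(means) - 1):
            parent = np.repeat(np.repeat(means[n + 1], 2, axis=0), 2, axis=1)
            patterns.append((means[n] < parent).astype(np.uint8))
        return patterns, float(means[-1][0, 0])

    def stf_distance(stf1, stf2, mean_weight=1.0):
        # Mean absolute error plus Hamming distance of the binary patterns;
        # the XOR is what makes the pattern comparison a pure Boolean operation.
        (p1, m1), (p2, m2) = stf1, stf2
        hamming = sum(int(np.count_nonzero(a ^ b)) for a, b in zip(p1, p2))
        return mean_weight * abs(m1 - m2) + hamming

In a full HFM-ME implementation, the mean pyramids of all candidate blocks would be precomputed once per frame, as noted above.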

6.5 Deformable Block Matching Algorithms (DBMA)*


In the block matching algorithms introduced previously, each block is assumed to undergo a pure translation. This model is not appropriate for blocks undergoing rotation, zooming, etc. In general, a more sophisticated model, such as the affine, bilinear, or projective mapping, can be used to describe the motion of each block. Obviously, it still covers the translational model as a special case. With such models, a block in the anchor frame is in general mapped to a non-square quadrangle, as shown in Fig. 6.11. Therefore, we refer to this class of block-based motion estimation methods using higher-order models as deformable block matching algorithms (DBMA) [24]. It is also known as the generalized block matching algorithm [36]. In the following, we first discuss how to interpolate the MV at any point in a block using only the MVs at the block corners (called nodes), and then we present an algorithm for estimating nodal MVs.
Figure 6.11. The deformable block matching algorithm finds the best matching quadrangle in the tracked frame for each block in the anchor frame. The allowed block deformation depends on the motion model used for the block. Adapted from [39, Fig. 6.9].

6.5.1 Node-Based Motion Representation


In Sec. 5.5, we described several 2D motion models corresponding to different 3D motions. All these models can be used to characterize the motion within a block. In Sec. 5.5.4, we showed how the most general model, the projective mapping, can be approximated by polynomial mappings of different orders. In this section, we introduce a node-based block motion model [24], which can characterize the same type of motions as the polynomial model but is easier to interpret and specify.
In this model, we assume that a selected number of control nodes in a block can move freely and that the displacement of any interior point can be interpolated from nodal displacements. Let the number of control nodes be denoted by $K$ and the MVs of the control nodes in $B_m$ by $d_{m,k}$. The motion function over the block is then described by
$$d_m(x) = \sum_{k=1}^{K} \phi_{m,k}(x)\, d_{m,k}, \quad x \in B_m. \qquad (6.5.1)$$
The above equation expresses the displacement at any point in a block as an interpolation of nodal displacements, as shown in Fig. 6.12. The interpolation kernel $\phi_{m,k}(x)$ depends on the desired contribution of the $k$-th control node in $B_m$ to $x$. One way to design the interpolation kernels is to use the shape functions associated with the corresponding nodal structure. We will discuss the design of shape functions further in Sec. 6.6.1.
Figure 6.12. Interpolation of motion in a block from nodal MVs.
The translational, affine, and bilinear models introduced previously are special cases of the node-based model with one, three, and four nodes, respectively. A model with more nodes can characterize more complex deformations. The interpolation kernel in the one-node case (at the block center or a chosen corner) is a pulse function, corresponding to nearest-neighbor interpolation. The interpolation functions in the three-node (any three corners of a block) and four-node (the four corners) cases are affine and bilinear functions, respectively. Usually, to use an affine model with a rectangular block, the block is first divided into two triangles, and then each triangle is modeled by the three-node model.
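For the four-node case, the sketch below (Python/NumPy; the function name and the corner ordering are our own choices) evaluates Eq. (6.5.1) with bilinear kernels to produce a dense motion field over an $N \times N$ block from its four corner MVs:

    import numpy as np

    def interpolate_block_motion(corner_mvs, N):
        # corner_mvs: 4 x 2 array of (dx, dy) MVs at the corners, ordered
        # (top-left, top-right, bottom-left, bottom-right).
        u = np.linspace(0.0, 1.0, N)            # normalized block coordinates
        yy, xx = np.meshgrid(u, u, indexing='ij')
        # Bilinear kernels: each is 1 at its own corner and 0 at the others.
        phi = np.stack([(1 - xx) * (1 - yy),    # top-left
                        xx * (1 - yy),          # top-right
                        (1 - xx) * yy,          # bottom-left
                        xx * yy])               # bottom-right
        # d(x) = sum_k phi_k(x) d_k, Eq. (6.5.1), for both MV components.
        d = np.einsum('kij,kc->ijc', phi,
                      np.asarray(corner_mvs, dtype=np.float64))
        return d                                # shape (N, N, 2)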
Compared to the polynomial-based representation introduced previously, the node-based representation is easier to visualize. Can you picture in your head the deformed block given the 8 coefficients of a bilinear function? But you certainly can, given the locations of the four corner points of the block! Furthermore, the nodal MVs can be estimated more easily and specified with a lower precision than the polynomial coefficients. First, it is easier to determine appropriate search ranges and search stepsizes for the nodal MVs than for the polynomial coefficients, based on a priori knowledge about the dynamic range of the underlying motion and the desired estimation accuracy. Second, all the motion parameters in the node-based representation are equally important, while those in the polynomial representation cannot be treated equally. For example, the estimation of the high-order coefficients is much harder than that of the constant terms. Finally, specification of the polynomial coefficients requires a high degree of precision: a small change in a high-order coefficient can generate a very different motion field. On the other hand, to specify nodal MVs, integer or half-pel accuracy is usually sufficient. These advantages are important for video coding applications.

6.5.2 Motion Estimation Using Node-Based Model


Because the estimation of nodal movements is independent from block to block, we drop the subscript m that indicates which block is being considered. The

following derivation applies to any block B: With the node-based motion model,
the motion parameters for any block are the nodal MVs, i.e., a = [dk ; k 2 K];
where K = f1; 2; : : : ; K g: They can be estimated by minimizing the prediction error
over this block, i.e.,
X
E (a) = j 2 (w(x; a)) 1 (x)jp ; (6.5.2)
x2B
where X
w(x; a) = x + k (x)dk : (6.5.3)
k2K
As with the BMA, there are many ways to minimize the error in Eq. (6.5.2), including exhaustive search and a variety of gradient-based search methods. The computational load required by an exhaustive search, however, can be unacceptable in practice, because of the high dimension of the search space. Gradient-based search algorithms are more feasible in this case. In the following, we derive a Newton-Raphson search algorithm, following the approach in [24].
Define $a = [a_x^T, a_y^T]^T$ with $a_x = [d_{x,1}, d_{x,2}, \ldots, d_{x,K}]^T$ and $a_y = [d_{y,1}, d_{y,2}, \ldots, d_{y,K}]^T$. One can show that
$$\frac{\partial E}{\partial a}(a) = \left[ \left(\frac{\partial E}{\partial a_x}(a)\right)^T, \left(\frac{\partial E}{\partial a_y}(a)\right)^T \right]^T,$$
with
$$\frac{\partial E}{\partial a_x}(a) = 2 \sum_{x \in B} e(x; a) \left. \frac{\partial \psi_2}{\partial x} \right|_{w(x;a)} \phi(x), \qquad \frac{\partial E}{\partial a_y}(a) = 2 \sum_{x \in B} e(x; a) \left. \frac{\partial \psi_2}{\partial y} \right|_{w(x;a)} \phi(x).$$
In the above equations, $e(x; a) = \psi_2(w(x; a)) - \psi_1(x)$ and $\phi(x) = [\phi_1(x), \phi_2(x), \ldots, \phi_K(x)]^T$. By dropping the terms involving second-order gradients, the Hessian matrix can be approximated as
$$[H(a)] = \begin{bmatrix} H_{xx}(a) & H_{xy}(a) \\ H_{xy}(a) & H_{yy}(a) \end{bmatrix},$$
with
$$H_{xx}(a) = 2 \sum_{x \in B} \left( \left. \frac{\partial \psi_2}{\partial x} \right|_{w(x;a)} \right)^2 \phi(x)\phi(x)^T, \qquad H_{yy}(a) = 2 \sum_{x \in B} \left( \left. \frac{\partial \psi_2}{\partial y} \right|_{w(x;a)} \right)^2 \phi(x)\phi(x)^T,$$
$$H_{xy}(a) = 2 \sum_{x \in B} \left( \frac{\partial \psi_2}{\partial x} \frac{\partial \psi_2}{\partial y} \right)_{w(x;a)} \phi(x)\phi(x)^T.$$
The Newton-Raphson update algorithm is:
$$a^{l+1} = a^l - [H(a^l)]^{-1} \frac{\partial E}{\partial a}(a^l). \qquad (6.5.4)$$
The update at each iteration thus requires the inversion of the $2K \times 2K$ symmetric matrix $[H]$.
To reduce numerical computations, we can update the displacements in the x and y directions separately. A similar derivation yields:
$$a_x^{l+1} = a_x^l - [H_{xx}(a^l)]^{-1} \frac{\partial E}{\partial a_x}(a^l), \qquad (6.5.5)$$
$$a_y^{l+1} = a_y^l - [H_{yy}(a^l)]^{-1} \frac{\partial E}{\partial a_y}(a^l). \qquad (6.5.6)$$
In this case we only need to invert two $K \times K$ matrices in each update. For the four-node case, $[H]$ is an $8 \times 8$ matrix, while $[H_{xx}]$ and $[H_{yy}]$ are $4 \times 4$ matrices.
As with all gradient-based iterative processes, the above update algorithm may reach a bad local minimum that is far from the global minimum if the initial solution is not chosen properly. A good initial solution can often be provided by the EBMA. For example, consider the four-node model with the four nodes at the corners of each block. One can use the average of the motion vectors of the four blocks attached to each node as the initial estimate of the MV for that node. This initial estimate can then be successively updated by using Eq. (6.5.4).
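The sketch below (Python/NumPy) illustrates one separable update step, Eqs. (6.5.5) and (6.5.6), for arbitrary interpolation kernels. It is our own minimal rendering rather than the implementation of [24]: the bilinear sampler, the small diagonal term added to keep the Hessians invertible in flat regions, and all names are assumptions made for the example.

    import numpy as np

    def bilinear_sample(img, pts):
        # Sample img at float positions pts[..., (x, y)], clamped to borders.
        h, w = img.shape
        x = np.clip(pts[..., 0], 0, w - 1.001)
        y = np.clip(pts[..., 1], 0, h - 1.001)
        x0, y0 = x.astype(int), y.astype(int)
        fx, fy = x - x0, y - y0
        return ((1 - fx) * (1 - fy) * img[y0, x0]
                + fx * (1 - fy) * img[y0, x0 + 1]
                + (1 - fx) * fy * img[y0 + 1, x0]
                + fx * fy * img[y0 + 1, x0 + 1])

    def newton_step(psi1, psi2, block_xy, N, phi, a):
        # One step of Eqs. (6.5.5)-(6.5.6). phi: (K, N, N) kernels over the
        # block; a: (2K,) nodal MVs [a_x; a_y]; block_xy: top-left corner.
        K = phi.shape[0]
        gy, gx = np.gradient(psi2.astype(np.float64))
        ox, oy = block_xy
        yy, xx = np.mgrid[oy:oy + N, ox:ox + N].astype(np.float64)
        wx = xx + np.einsum('kij,k->ij', phi, a[:K])   # w(x, a), Eq. (6.5.3)
        wy = yy + np.einsum('kij,k->ij', phi, a[K:])
        pts = np.stack([wx, wy], axis=-1)
        e = bilinear_sample(psi2, pts) - psi1[oy:oy + N, ox:ox + N]
        g2x = bilinear_sample(gx, pts)                 # gradients of psi2,
        g2y = bilinear_sample(gy, pts)                 # taken at w(x, a)
        grad_x = 2 * np.einsum('ij,ij,kij->k', e, g2x, phi)
        grad_y = 2 * np.einsum('ij,ij,kij->k', e, g2y, phi)
        Hxx = 2 * np.einsum('ij,kij,lij->kl', g2x ** 2, phi, phi)
        Hyy = 2 * np.einsum('ij,kij,lij->kl', g2y ** 2, phi, phi)
        reg = 1e-6 * np.eye(K)            # guard against a singular Hessian
        ax = a[:K] - np.linalg.solve(Hxx + reg, grad_x)
        ay = a[K:] - np.linalg.solve(Hyy + reg, grad_y)
        return np.concatenate([ax, ay])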
Note that the above algorithm can also be applied to the polynomial-based motion representation. In that case, $a_x$ and $a_y$ would represent the polynomial coefficients associated with the horizontal and vertical displacements, respectively, and $\phi_k(\cdot)$ would correspond to the elementary polynomial basis functions. However, it is difficult to set the search ranges for $a_x$ and $a_y$ and to check the feasibility of the resulting motion field.

6.6 Mesh-Based Motion Estimation


With the block-based model used in either the BMA or the DBMA, motion parameters in individual blocks are independently specified. Unless the motion parameters of adjacent blocks are constrained to vary smoothly, the estimated motion field is often discontinuous and sometimes chaotic, as sketched in Fig. 6.13(a). One way to overcome this problem is by using mesh-based motion estimation. As illustrated in Fig. 6.13(b), the anchor frame is covered by a mesh, and the motion estimation problem is to find the motion of each node so that the image pattern within each element in the anchor frame matches well with that in the corresponding deformed element in the tracked frame. The motion within each element is interpolated from nodal MVs. As long as the nodes in the tracked frame still form a feasible mesh, the mesh-based motion representation is guaranteed to be continuous and thus free from the blocking artifacts associated with the block-based representation. Another

benefit of the mesh-based representation is that it enables continuous tracking of the same set of nodes over consecutive frames, which is desirable in applications requiring object tracking. As shown in Fig. 6.13(c), one can generate a mesh for an initial frame, and then estimate the nodal motions between every two frames. At each new frame (the anchor frame), the tracked mesh generated in the previous step is used, so that the same set of nodes is tracked over all frames. This is not possible with the block-based representation, because it requires that each new anchor frame be reset to a partition consisting of regular blocks.
Note that the inherent continuity of the mesh-based representation is not always desired. The type of motion that can be captured by this representation can be visualized as the deformation of a rubber sheet, which is continuous everywhere. In real-world video sequences, there are often motion discontinuities at object boundaries. A more accurate representation would be to use separate meshes for different objects. As with the block-based representation, the accuracy of the mesh-based representation depends on the number of nodes. A very complex motion field can be reproduced as long as a sufficient number of nodes is used. To minimize the number of nodes required, the mesh should be adapted to the imaged scene, so that the actual motion within each element is smooth (i.e., can be interpolated accurately from the nodal motions). If a regular, non-adaptive mesh is used, a larger number of nodes is usually needed to approximate the motion field accurately.
In the following, we first describe how to specify a motion field using a mesh-based representation. We then present algorithms for estimating nodal motions in a mesh.

6.6.1 Mesh-Based Motion Representation


With the mesh-based motion representation, the underlying image domain in the anchor frame is partitioned into non-overlapping polygonal elements, each defined by a few nodes and the links between them, as shown in Fig. 6.14. Such a mesh is also known as a control grid. In the mesh-based representation, the motion field over the entire frame is described by the MVs at the nodes only. The MVs at the interior points of an element are interpolated from the MVs at the nodes of this element. The nodal MVs are constrained so that the nodes in the tracked frame still form a feasible mesh, with no flip-over elements.
Let the number of elements and nodes be denoted by $M$ and $N$, respectively, and the number of nodes defining each element by $K$. For convenience, we define the following index sets: $\mathcal{M} = \{1, 2, \ldots, M\}$, $\mathcal{N} = \{1, 2, \ldots, N\}$, $\mathcal{K} = \{1, 2, \ldots, K\}$. Let the $m$-th element and $n$-th node in frame $t$ ($t = 1$ for the anchor frame and $t = 2$ for the tracked frame) be denoted by $B_{t,m}, m \in \mathcal{M}$ and $x_{t,n}, n \in \mathcal{N}$, and the MV of the $n$-th node by $d_n = x_{2,n} - x_{1,n}$. The motion field in element $B_{1,m}$ is related to the nodal MVs $d_n$ by:
$$d_m(x) = \sum_{k \in \mathcal{K}} \phi_{m,k}(x)\, d_{n(m,k)}, \quad x \in B_{1,m}, \qquad (6.6.1)$$

(a)

(b)

(c)

Figure 6.13. Comparison of block-based and mesh-based motion representations: (a)


Block-based motion estimation between two frames, using a translational model within
each block in the anchor frame; (b) Mesh-based motion estimation between two frames,
using a regular mesh at the anchor frame; (d) Mesh-based motion tracking, using the
tracked mesh for each new anchor frame.

Figure 6.14. Illustration of mesh-based motion representation: (a) using a triangular mesh, with 3 nodes attached to each element; (b) using a quadrilateral mesh, with 4 nodes attached to each element. In the shown example, the two meshes have the same number of nodes, but the triangular mesh has twice the number of elements. The left column shows the initial mesh over the anchor frame, the right column the deformed mesh in the tracked frame.

where $n(m,k)$ specifies the global index of the $k$-th node in the $m$-th element (cf. Fig. 6.14). The function $\phi_{m,k}(x)$ is the interpolation kernel associated with node $k$ in element $m$. It depends on the desired contribution of the $k$-th node in $B_{1,m}$ to the MV at $x$. This interpolation mechanism has been shown previously in Fig. 6.12. To guarantee continuity across element boundaries, the interpolation kernels should satisfy:
$$0 \le \phi_{m,k}(x) \le 1, \quad \sum_k \phi_{m,k}(x) = 1, \quad \forall x \in B_m,$$
and
$$\phi_{m,k}(x_l) = \delta_{k,l} = \begin{cases} 1, & k = l, \\ 0, & k \ne l. \end{cases}$$
Figure 6.15. (a) A standard triangular element; (b) a standard quadrilateral element (a square).

In finite element method (FEM) analysis, these functions are called shape functions [45]. If all the elements have the same shape, then all the shape functions are equal, i.e., $\phi_{m,k}(x) = \phi_k(x)$.
Standard triangular and quadrilateral elements are shown in Fig. 6.15. The shape functions for the standard triangular element are:
$$\phi_1^t(x,y) = x, \quad \phi_2^t(x,y) = y, \quad \phi_3^t(x,y) = 1 - x - y. \qquad (6.6.2)$$
The shape functions for the standard quadrilateral element are:
$$\phi_1^q(x,y) = (1+x)(1-y)/4, \quad \phi_2^q(x,y) = (1+x)(1+y)/4,$$
$$\phi_3^q(x,y) = (1-x)(1+y)/4, \quad \phi_4^q(x,y) = (1-x)(1-y)/4. \qquad (6.6.3)$$
We see that the shape functions for these two cases are affine and bilinear functions, respectively. The reader is referred to [41] for the shape functions for arbitrary triangular elements. The coefficients of these functions depend on the node positions.
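A small sketch (Python/NumPy; the helper names are ours) evaluates Eqs. (6.6.2) and (6.6.3) and verifies the two kernel properties stated above, the partition of unity and the interpolation property at the nodes:

    import numpy as np

    def tri_shape(x, y):
        # Shape functions of the standard triangular element, Eq. (6.6.2).
        return np.array([x, y, 1.0 - x - y])

    def quad_shape(x, y):
        # Shape functions of the standard quadrilateral element, Eq. (6.6.3);
        # nodes 1..4 sit at (1,-1), (1,1), (-1,1), (-1,-1).
        return np.array([(1 + x) * (1 - y), (1 + x) * (1 + y),
                         (1 - x) * (1 + y), (1 - x) * (1 - y)]) / 4.0

    # Partition of unity at an arbitrary interior point ...
    assert np.isclose(quad_shape(0.3, -0.5).sum(), 1.0)
    # ... and each kernel is 1 at its own node, 0 at the others.
    assert np.allclose(quad_shape(1, -1), [1, 0, 0, 0])
    assert np.allclose(tri_shape(0, 1), [0, 1, 0])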
Note that the representation of the motion within each element in Eq. (6.6.1) is the same as the node-based motion representation introduced in Eq. (6.5.1), except that the nodes and elements are denoted using global indices. This is necessary because the nodal MVs are not independent from element to element. It is important not to confuse the mesh-based model with the node-based model introduced in the previous section. There, although several adjacent blocks may share the same node, the nodal MVs are determined independently in each block. Going back to Fig. 6.14(b), in the mesh-based model, node $n$ is assigned a single MV, which affects the interpolated motion functions in the four quadrilateral elements attached to this node. In the node-based model, node $n$ can have four different MVs, depending on the block in which it is considered.

Figure 6.16. Mapping from a master element $\tilde{B}$ to two corresponding elements in the anchor and tracked frames, $B_{1,m}$ and $B_{2,m}$.

6.6.2 Motion Estimation Using Mesh-Based Model


With the mesh-based motion representation, there are in general two problems to be solved: 1) given a mesh, or equivalently the nodes, in the current frame, how to determine the nodal positions in the tracked frame; this is essentially a motion estimation problem. 2) How to set up the mesh in the anchor frame, so that the mesh conforms to the object boundaries. Note that a mesh in which each element corresponds to a smooth surface patch of a single object can lead to more accurate motion estimation than an arbitrarily configured mesh (e.g., a regular mesh). An object-adaptive mesh would also be more appropriate for motion tracking over a sequence of frames. In this book, we only consider the first problem. For solutions to the mesh generation problem, see, e.g., [42, 3].
With the mesh-based motion representation described by Eq. (6.6.1), the motion parameters include the nodal MVs, i.e., $a = \{d_n, n \in \mathcal{N}\}$. To estimate them, we can again use an error minimization approach. Under the mesh-based motion model, the DFD error in Eq. (6.2.1) becomes
$$E(d_n, n \in \mathcal{N}) = \sum_{m \in \mathcal{M}} \sum_{x \in B_{1,m}} \left| \psi_2(w_m(x)) - \psi_1(x) \right|^p, \qquad (6.6.4)$$
where, following Eq. (6.6.1),
$$w_m(x) = x + \sum_{k \in \mathcal{K}} \phi_{m,k}(x)\, d_{n(m,k)}, \quad x \in B_{1,m}.$$

In general, the error function in Eq. (6.6.4) is difficult to calculate because of the irregular shape of $B_{1,m}$. To simplify the calculation, we can think of $B_{t,m}, t = 1, 2,$ as being deformed from a master element with a regular shape. In general, the master elements for different elements could differ. Here, we only consider the case where all the elements have the same topology, so that they can be mapped from the same master element, denoted by $\tilde{B}$. Fig. 6.16 illustrates such a mapping.
Let $\tilde{\phi}_k(u)$ represent the shape function associated with the $k$-th node in $\tilde{B}$; the mapping functions from $\tilde{B}$ to $B_{t,m}$ can be represented as
$$\tilde{w}_{t,m}(u) = \sum_{k \in \mathcal{K}} \tilde{\phi}_k(u)\, x_{t,n(m,k)}, \quad u \in \tilde{B}, \; t = 1, 2. \qquad (6.6.5)$$
Then the error in Eq. (6.6.4) can be calculated over the master element as
$$E(d_n, n \in \mathcal{N}) = \sum_{m \in \mathcal{M}} \sum_{u \in \tilde{B}} |\tilde{e}_m(u)|^p J_m(u), \qquad (6.6.6)$$
where
$$\tilde{e}_m(u) = \psi_2(\tilde{w}_{2,m}(u)) - \psi_1(\tilde{w}_{1,m}(u)) \qquad (6.6.7)$$
represents the error between the two image frames at points that are both mapped from $u$ in the master element (cf. Fig. 6.16), and
$$J_m(u) = \left| \det\!\left( \frac{\partial \tilde{w}_{1,m}(u)}{\partial u} \right) \right| \qquad (6.6.8)$$
denotes the Jacobian of the transformation $\tilde{w}_{1,m}(u)$.5
For motion tracking over a set of frames, because the mesh used in each new anchor frame is the tracked mesh resulting from the previous step, the shape of $B_{1,m}$ is in general irregular (cf. Fig. 6.13(c)). Consequently, the mapping function $\tilde{w}_{1,m}(u)$ and the Jacobian $J_m(u)$ depend on the nodal positions in $B_{1,m}$. On the other hand, for motion estimation between two frames, to reduce the complexity, one can use a regular mesh for the anchor frame so that each element is itself equivalent to the master element (cf. Fig. 6.13(b)). In this case, the mapping function in the anchor frame is trivial, i.e., $\tilde{w}_{1,m}(u) = u$, and $J_m(u) = 1$.
The gradient of the error function in Eq. (6.6.6) is, for $p = 2$,
$$\frac{\partial E}{\partial d_n} = \sum_{m \in \mathcal{M}_n} \sum_{u \in \tilde{B}} 2\, \tilde{e}_m(u)\, \tilde{\phi}_{k(m,n)}(u) \left. \frac{\partial \psi_2(x)}{\partial x} \right|_{\tilde{w}_{2,m}(u)} J_m(u), \qquad (6.6.9)$$
where $\mathcal{M}_n$ contains the indices of the elements attached to node $n$, and $k(m,n)$ specifies the local index of node $n$ in the $m$-th adjacent element. Figure 6.17 illustrates the neighboring elements and shape functions attached to node $n$ in the quadrilateral mesh case.
It can be seen that the gradient with respect to one node only depends on the errors in the several elements attached to it. Ideally, in each iteration of a gradient-based search algorithm, to calculate the above gradient function associated with any node, one should assume the other nodes are fixed at the positions obtained in the previous iteration, and all the nodes should be updated at once at the end of the iteration, before going on to the next iteration.
5 Strictly speaking, the use of the Jacobian is correct only when the error is defined as an integral over $\tilde{B}$. Here we assume the sampling over $\tilde{B}$ is sufficiently dense when using the sum to approximate the integration.

Figure 6.17. Neighborhood structure in a quadrilateral mesh: for a given node $n$, there are four elements attached to it, each with one shape function connected to this node.

In practice, to speed up the process, one can update one node at a time while fixing its surrounding nodes. Of course, this sub-optimal approach could lead to divergence, or to convergence to a local minimum that is worse than the one obtained by updating all the nodes simultaneously. Instead of updating the nodes in the usual raster order, to improve the accuracy and convergence rate, one can order the nodes so that the nodes whose motion vectors can be estimated more accurately are updated first. Because of the uncertainty of motion estimation in smooth regions, it may be better to first update the nodes with large edge magnitude and small motion compensation error. This is known as highest confidence first [7], and this approach was taken in [2]. Another possibility is to divide all the nodes into several groups so that the nodes in the same group do not share elements, and therefore are independent in their impact on the error function. Sequential update of the nodes in the same group is then equivalent to a simultaneous update of these nodes. This is the approach adopted in [42]. Either the first-order gradient descent method or a second-order Newton-Raphson type of update algorithm can be used. The second-order method converges much faster, but it is also more liable to converge to bad local minima.
The newly updated nodal positions based on the gradient function can lead to overly deformed elements (including flip-over and obtuse elements). To prevent this from happening, one should limit the search range into which the updated nodal position can fall. If the updated position goes beyond this region, it should be projected back to the nearest point in the defined search range. Figure 6.18 shows the legitimate search range for the case of a quadrilateral mesh.
The above discussion applies not only to gradient-based update algorithms, but also to exhaustive search algorithms. In this case, one can update one node at a time, by searching for the nodal position that minimizes the prediction errors
Figure 6.18. The search range for node $n$ given the positions of the other nodes: the outer diamond region (dashed line) is the theoretical limit; the inner diamond region (shaded) is used in practice. When $x_n$ falls outside the outer diamond region (a), at least one element attached to it becomes obtuse. By projecting $x_n$ onto the inner diamond (b), none of the four elements becomes overly deformed.

in the elements attached to it, within the search range illustrated in Fig. 6.18. For each candidate position, one calculates the error accumulated over the elements attached to this node, i.e., replacing $m \in \mathcal{M}$ by $m \in \mathcal{M}_n$ in Eq. (6.6.4). The optimal position is the one with the minimal error. Here again, the search order is very important.
Example 6.2: Figure 6.8 shows the motion estimation result obtained by an exhaustive search approach for backward motion estimation using a rectangular mesh at each new frame [43]. Figure 6.8(e) is the deformed mesh overlaid on top of the tracked frame, and Figure 6.8(f) is the predicted image for the anchor frame. Note that each deformed quadrangle in Fig. 6.8(e) corresponds to a square block in the anchor frame. Thus, a narrow quadrangle on the right side of the face indicates that it is expanded in the anchor frame. We can see that the mesh is deformed smoothly, which corresponds to a smooth motion field. The predicted image does not suffer from the blocking artifacts associated with the EBMA (Fig. 6.8(d) vs. Fig. 6.8(f)) and appears to be a more successful prediction of the original. A careful comparison between the predicted image (Fig. 6.8(f)) and the actual image (Fig. 6.8(b)), however, reveals that the eye closing and mouth movement are not accurately reproduced, and there are some artificial warping artifacts near the jaw and neck. In fact, the PSNR of the predicted image is lower than that obtained by the EBMA.
Until now, we have assumed that a single mesh is generated (or propagated from the previous frame in the forward tracking case) for the entire current frame, and every node in this mesh is tracked to one and only one node in the tracked frame, so that the nodes in the tracked frame still form a mesh that covers the entire frame. In order to handle newly appearing or disappearing objects in a scene, one should

allow the deletion of nodes corresponding to disappeared objects, and the creation of new nodes in newly appearing objects. For a solution to this problem, see [3].

6.7 Global Motion Estimation


In Sec. 5.5, we showed that, depending on the camera and object motion and the object surface geometry, the mapping function between two images of the same imaged object can be described by a translation, a geometric transformation, an affine mapping, or a projective mapping. Such a model can be applied to the entire frame if the entire motion field is caused by camera motion, or if the imaged scene consists of a single object that is undergoing a rigid 3D motion.6
In practice, one can hardly find a video sequence that contains a single object. There are usually at least two objects: a stationary background and a moving foreground. More often, there is more than one foreground object. Fortunately, when the foreground object motion is small compared to the camera motion, and the camera does not move in the Z-direction, the motion field can be approximated well by a global model. This is the case, for example, when the camera pans over a scene or zooms into a particular subject at a relatively fast speed. Such camera motions are quite common in sports video and movies. Even when the actual 2D motion field cannot be represented accurately by a single global motion, as long as the effect of the camera motion dominates over the other motions (the motions of individual small objects), determination of this dominant global motion is still very useful. In this section, we discuss the estimation of global motions.
There are in general two approaches to estimating the global motion. One is to estimate the global motion parameters directly, by minimizing the prediction errors under a given set of motion parameters. The other is to first determine pixel-wise or block-wise motion vectors, using the techniques described previously, and then use a regression method to find the global motion model that best fits the estimated motion field. The latter method can also be applied to motion vectors at selected feature points, such as points with strong edges.

6.7.1 Robust Estimator


A difficulty in estimating the global motion is that a pixel may not experience only the global motion. Usually, the motion at any pixel can be decomposed into a global motion (caused by camera movement) and a local motion caused by the movement of the underlying object. Therefore, the prediction error obtained by using the global motion model alone may not be small, even if the correct global motion parameters are available. In other instances, not all the pixels in the same frame experience the global motion, and ideally one should not apply the same motion
6 Recall that in the case where the camera or the object moves in the Z-direction, the motion field can be represented by a projective mapping only if the object surface is planar. When the object surface is spatially varying, the mapping function at any point also depends on the surface depth of that point and cannot be represented by a global model.
model to the entire frame. These problems can be overcome by a robust estimation method [35], if the global motion is dominant over the other local motions, in the sense that the pixels that experience the same global motion, and only the global motion, occupy a significantly larger portion of the underlying image domain than the pixels that do not.
The basic idea of robust estimation is to consider the pixels that are governed by the global motion as inliers, and the remaining pixels as outliers. Initially, one assumes that all the pixels undergo the same global motion, and estimates the motion parameters by minimizing the prediction or fitting error over all the pixels. This yields an initial set of motion parameters. With this initial solution, one can then calculate the prediction or fitting error at each pixel. The pixels where the errors exceed a certain threshold are classified as outliers and eliminated from the next iteration. The above process is then repeated on the remaining inlier pixels, and iterates until no outlier pixels exist. This approach is called the Hard Threshold Robust Estimator.
Rather than simply labeling a pixel as either inlier or outlier at the end of each iteration, one can also assign a different weight to each pixel, with a large weight for a pixel with a small error, and vice versa. In the next minimization or fitting iteration, a weighted error measure is used, so that the pixels with larger errors in the previous iteration have less impact than those with smaller errors. This approach is known as the Soft Threshold Robust Estimator.
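The skeleton below (Python/NumPy) sketches the hard-threshold variant under the assumptions stated above. Here fit_model and model_error are caller-supplied placeholders standing for whichever direct or indirect procedure is used (Secs. 6.7.2 and 6.7.3); they are not functions from [35].

    import numpy as np

    def hard_threshold_robust_fit(points, mvs, fit_model, model_error, thresh):
        # points: (n, 2) pixel coordinates; mvs: (n, 2) observations.
        # fit_model(points, mvs) -> parameters a;
        # model_error(a, points, mvs) -> per-pixel error.
        inliers = np.ones(len(points), dtype=bool)
        while True:
            a = fit_model(points[inliers], mvs[inliers])
            err = model_error(a, points, mvs)
            new_inliers = inliers & (err < thresh)   # discard new outliers
            if new_inliers.sum() == inliers.sum() or not new_inliers.any():
                return a, inliers                    # no new outliers: stop
            inliers = new_inliers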
6.7.2 Direct Estimation
In either the hard- or soft-threshold robust estimator, each iteration involves the minimization of an error function. Here we derive the form of this function when the model parameters are obtained directly by minimizing the prediction error. We only consider the soft-threshold case, as the hard-threshold case can be considered a special case in which the weights are either one or zero. Let the mapping function from the anchor frame to the tracked frame be denoted by $w(x; a)$, where $a$ is the vector containing all the global motion parameters. Following Eq. (6.2.1), the prediction error can be written as
$$E_{\mathrm{DFD}} = \sum_{x \in \Lambda} w(x) \left| \psi_2(w(x; a)) - \psi_1(x) \right|^p, \qquad (6.7.1)$$
where $w(x)$ are the weighting coefficients for pixel $x$. Within each iteration of the robust estimation process, the parameter vector $a$ is estimated by minimizing the above error, using either a gradient-based method or exhaustive search. The weighting factor $w(x)$ in a new iteration is adjusted based on the DFD at $x$ calculated from the motion parameters estimated in the previous iteration.
6.7.3 Indirect Estimation
In this case, we assume that the motion vectors $d(x)$ have been estimated at a set of sufficiently dense points $x \in \Lambda' \subset \Lambda$, where $\Lambda$ represents the set of all pixels.

This can be accomplished, for example, using either the block-based or the mesh-based approaches described before. One can also choose to estimate the motion vectors only at selected feature points, where the estimation accuracy is high. The task here is to determine $a$ so that the model $d(x; a)$ approximates the pre-estimated motion vectors $d(x), x \in \Lambda'$, well. This can be accomplished by minimizing the following fitting error:
$$E_{\mathrm{fitting}} = \sum_{x \in \Lambda'} w(x) \left| d(x; a) - d(x) \right|^p. \qquad (6.7.2)$$
As shown in Sec. 5.5.4, a global motion can usually be described or approximated by a polynomial function. In this case, $a$ consists of the polynomial coefficients and $d(x; a)$ is a linear function of $a$, i.e., $d(x; a) = [A(x)]a$. If we choose $p = 2$, then the above minimization problem becomes a weighted least-squares problem. By setting $\partial E_{\mathrm{fitting}} / \partial a = 0$, we obtain the following solution:
$$a = \left( \sum_{x \in \Lambda'} w(x) [A(x)]^T [A(x)] \right)^{-1} \left( \sum_{x \in \Lambda'} w(x) [A(x)]^T d(x) \right). \qquad (6.7.3)$$

As an example, consider the affine motion model given in Eq. (5.5.16). The motion parameter vector is $a = [a_0, a_1, a_2, b_0, b_1, b_2]^T$, and the matrix $[A(x)]$ is
$$[A(x)] = \begin{bmatrix} 1 & x & y & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & x & y \end{bmatrix}.$$
In fact, the parameters for the x and y dimensions are not coupled and can be estimated separately, which reduces the matrix sizes involved. For example, to estimate the x-dimension parameters $a_x = [a_0, a_1, a_2]^T$, the associated matrix is $[A_x(x)] = [1, x, y]$, and the solution is:
$$a_x = \left( \sum_{x \in \Lambda'} w(x) [A_x(x)]^T [A_x(x)] \right)^{-1} \left( \sum_{x \in \Lambda'} w(x) [A_x(x)]^T d_x(x) \right). \qquad (6.7.4)$$
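A minimal sketch of Eq. (6.7.4) in Python/NumPy (the function name is ours; it solves the weighted normal equations for one motion component at a time, as suggested in the text):

    import numpy as np

    def fit_affine_1d(points, d, w=None):
        # Weighted least-squares fit of one motion component to the affine
        # model d(x, y) = a0 + a1*x + a2*y, Eq. (6.7.4).
        # points: (n, 2) pixel coordinates; d: (n,) pre-estimated component;
        # w: (n,) robust-estimation weights (uniform if omitted).
        if w is None:
            w = np.ones(len(d))
        A = np.column_stack([np.ones(len(d)), points[:, 0], points[:, 1]])
        Aw = A * w[:, None]                  # rows scaled by the weights
        return np.linalg.solve(A.T @ Aw, Aw.T @ d)

The full parameter vector is obtained by calling this once per component, e.g. once with the horizontal and once with the vertical components of the pre-estimated MVs.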

6.8 Region-Based Motion Estimation


As pointed out in the previous section, there are usually multiple types of motion in the imaged scene, corresponding to the motions of different objects. By region-based motion estimation, we mean segmenting the underlying image frame into multiple regions and estimating the motion parameters of each region. The segmentation should be such that a single parametric motion model can represent well the motion in each region. Obviously, the region segmentation depends on the motion model used for characterizing each region. The simplest approach is to require each region to undergo the same translational motion. This requirement, however, can result in too many small regions, because the 2D motion
in a region corresponding to a physical object can rarely be modeled by a simple translation. Such a region would have to be split into many small sub-regions for each sub-region to have a translational motion. For a more efficient motion representation, an affine, bilinear, or perspective motion model should be used.
In general, there are three approaches to accomplishing region-based motion estimation. With the first approach, one first segments the image frame into different regions based on texture homogeneity, edge information, and sometimes the motion boundaries obtained by analyzing the difference image between two frames, and then estimates the motion in each region. The latter can be accomplished by applying the global motion estimation method described in Sec. 6.7 to each region. We call such a method region-first. With the second approach, one first estimates the motion field over the entire frame, and then segments the resulting motion field so that the motion in each region can be modeled by a single parametric model. We call this method motion-first. The resulting regions can be further refined subject to some spatial connectivity constraints. The first step can be accomplished using the various motion estimation methods described previously, including the pixel-, block-, and mesh-based methods. The second step involves motion-based segmentation, which is discussed further in Sec. 6.8.1. The third approach is to jointly estimate the region partition and the motion in each region. In general, this is accomplished by an iterative process, which performs region segmentation and motion estimation alternately. This approach is described in Sec. 6.8.2.
6.8.1 Motion-Based Region Segmentation
As described already, motion-based segmentation refers to the partitioning of a motion field into multiple regions so that the motion within each region can be described by a single set of motion parameters. Here we present two approaches for accomplishing this task. The first approach uses a clustering technique to identify similar motion vectors. The second uses a layered method that estimates the underlying regions and associated motions sequentially, starting from the region with the most dominant motion.
Clustering Consider the case when the motion model for each region is a pure translation. Then the segmentation task is to group all spatially connected pixels with similar motion vectors into one region. This can be easily accomplished by an automated clustering method, such as the K-means or ISODATA method [8]. This is an iterative process: starting from an initial partition, the mean motion vector, known as the centroid, of each region is calculated. Then each pixel is reclassified into the region whose centroid is closest to the motion vector of this pixel. This results in a new partition, and the above two steps are repeated until the partition no longer changes. In this process, the spatial connectivity is not considered. Therefore, the resulting regions may contain pixels that are not spatially connected. Postprocessing steps may be applied at the end of the iterations to improve the spatial connectivity of the resulting regions. For example, a single region may be split into several sub-regions so that each region is a spatially
connected subset, isolated pixels may be merged into their surrounding regions, and finally, region boundaries can be smoothed using morphological operators.
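A minimal K-means sketch for this translational case (Python/NumPy; spatial connectivity is deliberately ignored here, matching the text, and the random seeding is our own assumption rather than a prescribed initialization):

    import numpy as np

    def kmeans_motion_segmentation(mvs, K, num_iter=20):
        # Cluster per-pixel motion vectors (n x 2 array) into K groups by
        # alternating the centroid and reassignment steps.
        rng = np.random.default_rng(0)
        centroids = mvs[rng.choice(len(mvs), K, replace=False)]
        for _ in range(num_iter):
            # Assign each MV to the nearest centroid.
            dist = np.linalg.norm(mvs[:, None, :] - centroids[None, :, :],
                                  axis=2)
            labels = dist.argmin(axis=1)
            # Recompute centroids; keep the old one if a cluster went empty.
            for k in range(K):
                if np.any(labels == k):
                    centroids[k] = mvs[labels == k].mean(axis=0)
        return labels, centroids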
When the motion model for each region is not a simple translation, motion-based clustering is not as straightforward, because one cannot use the similarity between motion vectors as the criterion for clustering. One way is to find a set of motion parameters for each pixel by fitting the motion vectors in its neighborhood to a specified model. Then one can employ the clustering method described previously, replacing the raw motion vector with the motion parameter vector. If the original motion field is given in the block-based representation using a higher-order model, then one can cluster blocks with similar motion parameters into the same region. Similarly, with the mesh-based motion representation, one can derive a set of motion parameters for each element based on its nodal displacements, and then cluster the elements with similar parameters into the same region. This is the parallel approach described in [44].

Layered Approach Very often, the motion field in a scene can be decomposed into layers, with the first layer representing the most dominant motion, the second layer the less dominant one, and so on. Here, the dominance of a motion is determined by the area of the region undergoing the corresponding motion. The most dominant motion is often a reflection of the camera motion, which affects the entire imaged domain. For example, in a video clip of a tennis match, the background will be the first layer, which usually undergoes a coherent global camera motion; the player is the second layer (which usually contains several sub-object-level motions corresponding to the movements of different parts of the body), the racket the third, and the ball the fourth layer. To extract the motion parameters in different layers, one can use the robust estimator method described in Sec. 6.7.1 recursively. First, we try to model the motion field of the entire frame by a single set of parameters, and continuously eliminate outlier pixels from the remaining inlier group, until all the pixels within the inlier group can be modeled well. This yields the first dominant region (corresponding to the inlier region) and its associated motion. The same approach can then be applied to the remaining pixels (the outlier region) to identify the next dominant region and its motion. This process continues until no more outlier pixels exist. As before, postprocessing may be invoked at the end of the iterations to improve the spatial connectivity of the resulting regions. This is the sequential approach described in [44].
For the above scheme to work well, the inlier region must be larger than the outlier region at every iteration. This means that the largest region must be greater than the combined area of all the other regions, the second largest region must be greater than the combined area of the remaining regions, and so on. This condition is satisfied in most video scenes, which often contain a stationary background that covers a large portion of the underlying image, and different moving objects of varying sizes.
6.8.2 Joint Region Segmentation and Motion Estimation
Theoretically, one can formulate the joint estimation of the region segmentation map and the motion parameters of each region as an optimization problem. The function to be minimized could be a combination of the motion-compensated prediction error and a region smoothness measure. The solution of the optimization problem, however, is difficult because of the very high dimension of the parameter space and the complicated interdependence between these parameters. In practice, a sub-optimal approach is often taken, which alternates between the estimation of the segmentation and the motion parameters. Based on an initial segmentation, the motion of each region is first estimated. In the next iteration, the segmentation is refined, e.g., by eliminating outlier pixels in each region where the prediction errors are large, and by merging pixels sharing similar motion models. The motion parameters for each refined region are then re-estimated. This process continues until no more changes in the segmentation map occur.
An alternative approach is to estimate the regions and their associated motions in a layered manner, similar to the layered approach described previously. There, we assumed that the motion vector at every point is already known, and the identification of the region with the most dominant motion (i.e., the inliers) is accomplished by examining the fitting error induced by representing the individual MVs with a set of motion parameters. This is essentially the indirect robust estimator presented in Sec. 6.7.3. In the joint region segmentation and motion estimation approach, to extract the next dominant region and its associated motion from the remaining pixels, we can use the direct robust estimator. That is, we directly estimate the motion parameters by minimizing the prediction errors at these pixels. Once the parameters are determined, we determine whether a pixel belongs to the inliers by examining the prediction error at this pixel. We then re-estimate the motion parameters by minimizing the prediction errors at the inlier pixels only. This approach was taken by Hsu [18].

6.9 Multi-Resolution Motion Estimation


As can be seen from the previous sections, various motion estimation approaches reduce to solving an error minimization problem. Two major difficulties stand in the way of obtaining the correct solution: i) the function to be minimized usually has many local minima, and it is not easy to reach the global minimum unless it is close to the chosen initial solution; and ii) the amount of computation involved in the minimization process is very high. Both problems can be combated by the multi-resolution approach, which searches for the solution of an optimization problem in successively finer resolutions. By first searching for the solution at a coarse resolution, one can usually obtain a solution that is close to the true motion. In addition, by limiting the search at each finer resolution to a small neighborhood of the solution obtained at the previous resolution, the total number of searches can be reduced, compared to directly searching at the finest resolution over a large range.

Figure 6.19. Illustration of the Hierarchical Block Matching Algorithm.

In this section, we first describe the multi-resolution approach for motion estimation in a general setting, which is applicable to any motion model. We then focus on the block-translation model and describe a hierarchical block matching algorithm.

6.9.1 General Formulation


As illustrated in Fig. 6.19, pyramid representations of the two raw image frames are first derived, in which each level is a reduced-resolution representation of the level below, obtained by spatial low-pass filtering and sub-sampling. The bottom level is the original image. Then the motion field between corresponding levels of the two pyramids is estimated, starting from the top (coarsest) level and progressing to the next finer level repeatedly. At each new finer resolution level, the motion field obtained at the previous coarser level is interpolated to form the initial solution for the motion at the current level. The most common pyramid structure is one in which the resolution is reduced by half both horizontally and vertically between successive levels. Usually, a simple $2 \times 2$ averaging filter is used for low-pass filtering. For better performance, a Gaussian filter can be employed.
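A minimal sketch of the pyramid construction with $2 \times 2$ averaging (Python/NumPy; the function name and the even-size assumption are ours):

    import numpy as np

    def mean_pyramid(img, num_levels):
        # Resolution pyramid by 2 x 2 averaging and subsampling; levels[0]
        # is the coarsest level, levels[-1] the original image. The image
        # size is assumed divisible by 2**(num_levels - 1).
        levels = [np.asarray(img, dtype=np.float64)]
        for _ in range(num_levels - 1):
            m = levels[0]
            levels.insert(0, (m[0::2, 0::2] + m[0::2, 1::2] +
                              m[1::2, 0::2] + m[1::2, 1::2]) / 4.0)
        return levels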
Assume that the number of levels is $L$, with the $L$-th level being the original image. Let the $l$-th level images of the anchor and tracked frames be represented by $\psi_{t,l}(x), x \in \Lambda_l, t = 1, 2$, where $\Lambda_l$ is the set of pixels at level $l$. Denote the total motion field obtained from levels 1 to $l-1$ by $d_{l-1}(x)$. At the $l$-th level, we first interpolate $d_{l-1}(x)$ to the resolution of level $l$, to produce an initial motion estimate $\tilde{d}_l(x) = U(d_{l-1}(x))$, where $U$ represents the interpolation operator. We then determine the update $q_l(x)$ at this level so that the error
$$\sum_{x \in \Lambda_l} \left| \psi_{2,l}(x + \tilde{d}_l(x) + q_l(x)) - \psi_{1,l}(x) \right|^p$$
is minimized. The new motion field obtained after this step is $d_l(x) = q_l(x) + \tilde{d}_l(x)$. Upon completion of successive refinements, the total motion at the finest resolution is
$$d(x) = q_L(x) + U(q_{L-1}(x) + U(q_{L-2}(x) + \cdots + U(q_1(x) + d_0(x)) \cdots )).$$
The initial condition for the above procedure is $d_0(x) = 0$. One can either directly specify the total motion $d(x)$, or the motion updates at all levels, $q_l(x), l = 1, 2, \ldots, L$. The latter represents the motion in a layered structure, which is desired in applications requiring progressive retrieval of the motion field.
The benefits of the multi-resolution approach are twofold. First, the minimization problem at a coarser resolution is less ill-posed than at a finer resolution; therefore, the solution obtained at a coarser level is more likely to be close to the true solution at that resolution. The interpolation of this solution to the next resolution level provides a good initial solution that is close to the true solution at that level. By repeating this step successively from the coarsest to the finest resolution level, the solution obtained at the finest resolution is more likely to be close to the true solution (the global minimum). Second, the estimation at each resolution level can be confined to a significantly smaller search range than the true motion range at the finest resolution, so that the total number of searches to be conducted is smaller than the number of searches required at the finest resolution directly. The actual number of searches will depend on the search ranges set at the different levels.
The use of multi-resolution representations for image processing was first introduced by Burt and Adelson [6]. The application to motion estimation depends on the motion model used. In the above presentation, we have assumed that motion vectors at all pixels are to be estimated. The algorithm can be easily adapted to estimate block-based, mesh-based, global, or object-level motion parameters. Because the block-wise translational motion model is the most popular in practical applications, we consider this special case in more detail in the following.
6.9.2 Hierarchical Block Matching Algorithm (HBMA)
As indicated earlier in Sec. 6.4.1, using an exhaustive search scheme to derive block MVs requires an extremely large number of computations. In addition, the estimated block MVs often lead to a chaotic motion field. In this section, we introduce a hierarchical block matching algorithm (HBMA), which is a special case of the multi-resolution approach presented before. Here, the anchor and tracked frames are each represented by a pyramid, and the EBMA or one of its fast variants is employed to estimate the MVs of blocks at each level of the pyramid. Fig. 6.19 illustrates the process when the spatial resolution is reduced by half both horizontally
and vertically at each increasing level of the pyramid. Here, we assume that the same block size is used at different levels, so that the number of blocks is reduced by half in each dimension as well. Let the MV for block $(m,n)$ at level $l$ be denoted by $d_{l,m,n}$. Starting from level 1, we first find the MVs for all blocks in this level, $d_{1,m,n}$. At each new level $l > 1$, for each block, its initial MV $\tilde{d}_{l,m,n}$ is interpolated from the corresponding block in level $l-1$ by
$$\tilde{d}_{l,m,n} = U(d_{l-1,\lfloor m/2 \rfloor,\lfloor n/2 \rfloor}) = 2\, d_{l-1,\lfloor m/2 \rfloor,\lfloor n/2 \rfloor}. \qquad (6.9.1)$$
Then a correction vector $q_{l,m,n}$ is searched, yielding the final estimated MV
$$d_{l,m,n} = \tilde{d}_{l,m,n} + q_{l,m,n}. \qquad (6.9.2)$$
Example 6.3: In Fig. 6.20, we show two video frames, of size $32 \times 32$, in which a gray block in the anchor frame moved by a displacement of (13, 11). We show how to use a three-level HBMA to estimate the block motion field. The block size used at each level is $4 \times 4$, and the search stepsize is 1 pixel. Starting from level 1, for block (0,0), the MV is found to be $d_{1,0,0} = d_1 = (3, 3)$. When going to level 2, block (0,1) is initially assigned the MV $\tilde{d}_{2,0,1} = U(d_{1,0,0}) = 2d_1 = (6, 6)$. Starting with this initial MV, the correction vector is found to be $q_2 = (1, -1)$, leading to the final estimated MV $d_{2,0,1} = d_2 = (7, 5)$. Finally, at level 3, block (1,2) is initially assigned an MV of $\tilde{d}_{3,1,2} = U(d_{2,0,1}) = 2d_2 = (14, 10)$. With a correction vector of $q_3 = (-1, 1)$, the final estimated MV is $d_{3,1,2} = d_3 = (13, 11)$.
Note that using a block width $N$ at level $l$ corresponds to a block width of $2^{L-l}N$ at the full resolution. The same scaling applies to the search range and stepsize. Therefore, by using the same block size, search range, and stepsize at different levels, we actually use a larger block size, a larger search range, and a larger stepsize at the beginning of the search, and then gradually reduce (by half) these quantities in later steps.
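The coarse-to-fine structure can be sketched as follows (Python/NumPy). This is a skeleton under our own naming: block_match stands for any integer-pel matcher (exhaustive or fast) that searches a range of R around an initial MV and returns one MV per block; it is assumed, not spelled out here, and mean_pyramid is the sketch given in Sec. 6.9.1.

    import numpy as np

    def hbma(anchor, tracked, block_match, N=16, R=4, num_levels=3):
        # Coarse-to-fine HBMA: estimate block MVs at the coarsest level, then
        # at each finer level double the inherited MV, Eq. (6.9.1), and search
        # only a small correction around it, Eq. (6.9.2).
        pyr1 = mean_pyramid(anchor, num_levels)
        pyr2 = mean_pyramid(tracked, num_levels)
        mvs = None
        for f1, f2 in zip(pyr1, pyr2):            # coarsest level first
            nby, nbx = f1.shape[0] // N, f1.shape[1] // N
            new_mvs = np.zeros((nby, nbx, 2))
            for m in range(nby):
                for n in range(nbx):
                    init = (2 * mvs[m // 2, n // 2]
                            if mvs is not None else np.zeros(2))
                    new_mvs[m, n] = block_match(f1, f2, (m, n), N, R, init)
            mvs = new_mvs
        return mvs                                # MVs at the finest level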
The number of operations involved in HBMA depends on the search range at each level. If the desired search range is $R$ at the finest resolution, then with an $L$-level pyramid, one can set the search range to $R/2^{L-1}$ at the first level. For the remaining levels, because the initial MV interpolated from the previous level is usually quite close to the true MV, the search range for the correction vector does not need to be very large. However, for simplicity, we assume every level uses a search range of $R/2^{L-1}$. If the image size is $M \times M$ and the block size is $N \times N$ at every level, the number of blocks at the $l$-th level is $(M/2^{L-l}N)^2$, and the number of searches is $(M/2^{L-l}N)^2 (2R/2^{L-1}+1)^2$. Because the number of operations required for each search is $N^2$, the total number of operations is
$$\sum_{l=1}^{L} (M/2^{L-l})^2 (2R/2^{L-1}+1)^2 = \frac{4}{3}\left(1 - 4^{-L}\right) M^2 (2R/2^{L-1}+1)^2 \approx \frac{1}{3}\, 4^{-(L-2)} \cdot 4M^2R^2.$$

Figure 6.20. An example of using 3-level HBMA for block motion estimation. See Example 6.3.

Recall that the operation count for EBMA is $M^2(2R+1)^2 \approx 4M^2R^2$ (cf. Sec. 6.4.1). Therefore, the hierarchical scheme with the above parameter selection reduces the computation by a factor of $3 \cdot 4^{L-2}$. Typically, the number of levels $L$ is 2 or 3.
Example 6.4: Figure 6.21 shows the estimation results obtained with the HBMA approach, for the same pair of video frames given in Figure 6.8. For this example, a three-level pyramid is used. The search range at each level is set to 4, so that the equivalent search range at the original resolution is $R = 16$. Integer-pel accuracy search is used at all levels. The final integer-accuracy solution is further refined to half-pel accuracy by using a half-pel stepsize search within a search range of one pixel. Comparing the result at the final level with the ones shown in Figs. 6.8(c) and 6.8(d), we can see that the multi-resolution approach indeed yields a smoother motion field than the EBMA. Visual observation also reveals that this motion field represents more truthfully the motion between the two image frames in Figs. 6.8(a) and 6.8(b). This is true in spite of the fact that the EBMA yields a higher PSNR. In terms of computational complexity, the half-pel accuracy EBMA algorithm used for

Figure 6.21. Example motion estimation results by HBMA for the two images shown in Fig. 6.8: (a-b) the motion field and predicted image at level 1; (c-d) the motion field and predicted image at level 2; (e-f) the motion field and predicted image at the final level (PSNR = 29.32). A three-level HBMA algorithm is used. The block size is $16 \times 16$ at all levels. The search range is 4 at all levels with integer-pel accuracy. The result at the final level is further refined by a half-pel accuracy search in the range of $\pm 1$.

Fig. 6.8(c-d) requires 4.3E+8 operations, while the three-level algorithm here uses only 1.1E+7 operations, if we neglect the final refinement step using the half-pel search.

There are many variants in the implementation of HBMA. Bierling [5] was the first to apply this idea to the block-based motion model. A special case of hierarchical BMA is known as variable-size or quad-tree BMA, which starts with a large block size, and then repeatedly divides a block into four if the matching error for the block is still larger than a threshold. In this case, all processing is done at the original image resolution. A sketch of this splitting rule is given below.
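The following C sketch illustrates the quad-tree splitting rule; it is an illustration only, with the frame size, error threshold, and minimum block size chosen arbitrarily rather than taken from the text.

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

/* Frames are 8-bit grayscale, row-major, WIDTH x HEIGHT (assumed sizes). */
enum { WIDTH = 64, HEIGHT = 64 };

/* Sum of absolute differences for the block at (x, y) displaced by (dx, dy). */
static long sad(const unsigned char *a, const unsigned char *t,
                int x, int y, int dx, int dy, int size)
{
    long s = 0;
    for (int j = 0; j < size; j++)
        for (int i = 0; i < size; i++) {
            int d = a[(y + j) * WIDTH + (x + i)]
                  - t[(y + j + dy) * WIDTH + (x + i + dx)];
            s += d < 0 ? -d : d;
        }
    return s;
}

/* Exhaustive search over +/- range, clipped to the frame boundary. */
static long best_match(const unsigned char *a, const unsigned char *t,
                       int x, int y, int size, int range, int *bx, int *by)
{
    long best = LONG_MAX;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            if (x + dx < 0 || y + dy < 0 ||
                x + dx + size > WIDTH || y + dy + size > HEIGHT)
                continue;
            long e = sad(a, t, x, y, dx, dy, size);
            if (e < best) { best = e; *bx = dx; *by = dy; }
        }
    return best;
}

/* Split a block into four while its matching error stays above threshold. */
static void quadtree_bma(const unsigned char *a, const unsigned char *t,
                         int x, int y, int size, int range,
                         long threshold, int min_size)
{
    int dx = 0, dy = 0;
    long err = best_match(a, t, x, y, size, range, &dx, &dy);
    if (err > threshold && size / 2 >= min_size) {
        int h = size / 2;
        quadtree_bma(a, t, x,     y,     h, range, threshold, min_size);
        quadtree_bma(a, t, x + h, y,     h, range, threshold, min_size);
        quadtree_bma(a, t, x,     y + h, h, range, threshold, min_size);
        quadtree_bma(a, t, x + h, y + h, h, range, threshold, min_size);
    } else {
        printf("block (%2d,%2d) size %2d -> MV (%d,%d), SAD %ld\n",
               x, y, size, dx, dy, err);
    }
}

int main(void)
{
    static unsigned char a[WIDTH * HEIGHT], t[WIDTH * HEIGHT];
    for (int i = 0; i < WIDTH * HEIGHT; i++)   /* synthetic test frames */
        a[i] = t[i] = (unsigned char)(rand() & 255);
    quadtree_bma(a, t, 0, 0, 32, 4, 100, 8);
    return 0;
}

With this rule, blocks in flat or well-predicted areas keep a large size, while blocks straddling motion boundaries shrink until either the error falls below the threshold or the minimum block size is reached.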
6.10 Summary

Relation between Image Intensity and Motion

Almost all motion estimation algorithms are based on the constant intensity assumption (Eq. (6.1.1) or Eq. (5.2.11)), or on the optical flow equation (Eq. (6.1.3)) derived from this assumption. This enables us to estimate motion by identifying pixels with similar intensity, subject to some motion models. Note that this assumption is valid only when the illumination source is ambient and temporally invariant, and the object surface is diffusely reflecting (Sec. 5.2).

When the motion direction is orthogonal to the image intensity gradient, or if the image gradient is zero, motion does not induce changes in image intensity. This is the inherent limit of intensity-based motion estimation methods.
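A one-line example of this limit, in generic notation: for a pattern that varies only along x and translates horizontally with speed u, $I(x,y,t) = f(x - ut)$, the optical flow equation
\[
I_t + I_x u' + I_y v' = -uf' + u'f' + 0\cdot v' = 0
\]
is satisfied by $u' = u$ together with any vertical component $v'$, so the vertical flow is unobservable.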
Key Components in Motion Estimation
Motion representation: This depends on the partition used to divide a frame
(pixel-based, block-based, mesh-based, region-based, global), the motion model used
for each region of the partition (block, mesh-element, object-region, or entire frame),
and the constraint between motions in adjacent regions.
Motion estimation criterion: We presented three criteria for estimating the motion parameters over each region: i) minimizing the DFD (when the motion is small, this is equivalent to the method based on the optical-flow equation); ii) making the resulting motion field as smooth as possible across regions, while minimizing the DFD; and iii) maximizing the a posteriori probability of the motion field given the observed frames. We showed that iii) essentially requires i) and ii). Instead of minimizing the DFD, one can also detect peaks in the PCF when the motion in a region is a pure translation.
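In compact form (using the frame notation $\psi_1$, $\psi_2$ of this chapter; the error exponent $p$, the weight $\lambda$, and the smoothness term $E_{\mathrm{smooth}}$ are generic placeholders, and signs follow the chapter's conventions):
\[
\text{i)}\;\; E_{\mathrm{DFD}}(\mathbf d)=\sum_{\mathbf x}\big|\psi_2(\mathbf x+\mathbf d(\mathbf x))-\psi_1(\mathbf x)\big|^{p};
\qquad
\text{ii)}\;\; E = E_{\mathrm{DFD}} + \lambda E_{\mathrm{smooth}};
\]
\[
\text{iii)}\;\; \hat{\mathbf d} = \arg\max_{\mathbf d}\, P(\mathbf d \mid \psi_1,\psi_2)
= \arg\max_{\mathbf d}\, P(\psi_2 \mid \mathbf d, \psi_1)\, P(\mathbf d),
\]
where the last equality follows from Bayes' rule.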
Optimization methods: For a chosen representation and criterion, the motion estimation problem is usually converted to an optimization (minimization or maximization) problem, which can be solved by exhaustive search or gradient-based search. To speed up the search and avoid being trapped in local minima, a multi-resolution procedure can be used.
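Schematically, with $E$ the chosen error criterion and $\mathbf a$ the motion parameter vector, the first-order gradient descent update reads ($\alpha$ is a step size; the Newton-Raphson variant replaces $\alpha$ by the inverse Hessian):
\[
\mathbf a^{(k+1)} = \mathbf a^{(k)} - \alpha \left.\frac{\partial E}{\partial \mathbf a}\right|_{\mathbf a = \mathbf a^{(k)}}.
\]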
Application of Motion Estimation in Video Coding

Motion estimation is a key element in any video coding system. As will be discussed in Sec. 9.3.1, an effective video coding method is to use block-wise temporal prediction, by which a block in the frame to be coded is predicted from its corresponding block in a previously coded frame, and then the prediction error is coded. To minimize the bit rate for coding the prediction error, the appropriate criterion for motion estimation is to minimize the prediction error. The fact that the estimated motion field does not necessarily resemble the actual motion field is not problematic in such applications. Therefore, the block matching algorithms described in Sec. 6.4 (EBMA and its fast variants, including HBMA) offer simple and effective solutions.
Instead of using the MV estimated for each block directly for the prediction of that block, one can use a weighted average of the predicted values based on the MVs estimated for its neighboring blocks. This is known as overlapping block motion compensation, which will be discussed in Sec. 9.3.2.
Note that in the above video coding method, the motion vectors also need to be coded, in addition to the prediction error. Therefore, minimizing the prediction error alone is not the best criterion to use. Since a smoother motion field requires fewer bits to code, imposing smoothness in the estimated motion field, if done properly, can help improve the overall coding efficiency. More advanced motion estimation algorithms therefore operate by minimizing the total bit rate used for coding the MVs and the prediction errors. This subject is discussed further in Sec. 9.3.3.
To overcome the blocking artifacts produced by block-based motion estimation methods, high-order block-based (DBMA), mesh-based, or a combination of block-based, mesh-based, and/or DBMA approaches can be applied. However, these more complicated schemes usually do not lead to significant gain in the coding efficiency.
In more advanced video coding schemes (Chap. 10), global motion estimation is usually applied to the entire frame, prior to block-based motion estimation, to compensate for the effect of camera motion. Moreover, an entire frame is usually segmented into several regions or objects, and the motion parameters for each region or object are estimated using the global motion estimation methods discussed here.

6.11 Problems
6.1 Describe the pros and cons of different motion representation methods (pixel-based, block-based, mesh-based, global).
6.2 Describe the pros and cons of the exhaustive search and gradient descent methods. Also compare first-order and second-order gradient descent methods.
6.3 What are the main advantages of the multi-resolution estimation method, compared to an approach using a single resolution? Are there any disadvantages?
6.4 In Sec. 6.3.2, we derived the multipoint neighborhood method using the gradient descent method. Can you find a closed-form solution using the optical flow equation? Under what condition will your solution be valid?
6.5 In Sec. 6.4.1, we described an exhaustive search algorithm for determining block MVs in the block-based motion representation. Can you find a closed-form solution using the optical flow equation? Under what condition will your solution be valid?
6.6 In Eq. (6.2.7), we showed that, if the motion field is constant, one can use the optical flow equation to set up a least squares problem and obtain a closed-form solution. Suppose that the motion field is not constant, but can be modeled by a polynomial mapping. Can you find a closed-form solution for the polynomial coefficients?
Hint: any polynomial mapping function can be represented as $\mathbf d(\mathbf x; \mathbf a) = [\mathbf A(\mathbf x)]\,\mathbf a$, where $\mathbf a$ contains all the polynomial coefficients.
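For instance, for an affine motion model the hint takes the concrete form below (the ordering of the six coefficients in $\mathbf a$ is one possible convention):
\[
\mathbf d(\mathbf x;\mathbf a)=
\begin{bmatrix} 1 & x & y & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & x & y \end{bmatrix}
\begin{bmatrix} a_1 & a_2 & a_3 & a_4 & a_5 & a_6 \end{bmatrix}^{T}.
\]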
6.7 In Sec. 6.4.5, we said that when there are several patches in a range block in $\psi_1(\mathbf x)$ that undergo different motions, there will be several peaks in the PCF. Each peak corresponds to the motion of one patch: the location of the peak indicates the MV of the patch, whereas the amplitude of the peak is proportional to the size of the patch. Can you prove this statement, at least qualitatively? You can simplify your derivation by considering the 1-D case only.
6.8 With EBMA, does the computational requirement depend on the block size?
6.9 In Sec. 6.9.2, we derived the number of operations required by HBMA when the search range at every level is $R/2^{L-1}$. What would the number be if one uses a search range of ±1 pel at every level, except at the first level, where the search range is set to $R/2^{L-1}$? Is this parameter set-up appropriate?
6.10 Consider a CCIR601 format video, with a Y-component frame size of 720 × 480. Compare the computation required by an EBMA algorithm (integer-pel) with block size 16 × 16 and that by a two-level HBMA algorithm. Assume the maximum motion range is ±32. You can compare the computation by the operation number, with each operation including one subtraction, one absolute value computation, and one addition. You can make your own assumption about the search range at different levels with HBMA. For simplicity, ignore the computations required for generating the pyramid and assume only integer-pel search.
6.11 Repeat the above for a three-level HBMA algorithm.
6.12 Write a C or Matlab code for implementing EBMA with integer-pel accuracy. Use a block size of 16 × 16. The program should allow the user to choose the search range, so that you can compare the results obtained with different search ranges. Note that the proper search range depends on the extent of the motion in your test images. Apply the program to two adjacent frames of a video sequence. Your program should produce and plot the estimated motion field, the predicted frame, and the prediction error image. It should also calculate the PSNR of the predicted frame compared to the original tracked frame. With Matlab, you can plot the motion field using the function quiver.
6.13 Repeat the above exercise for EBMA with half-pel accuracy. Compare the PSNR of the predicted image obtained using integer-pel and that using half-pel accuracy estimation. Which method gives you more accurate prediction? Which requires more computation time?
6.14 You can obtain a dense (i.e., pixel-based) motion field from a block-based one by spatial interpolation. Write a C or Matlab code that can interpolate the motion field resulting from Prob. 6.12 by assuming the MV for each block is actually the MV of the block center. Use bilinear interpolation. Using the interpolated pel-wise motion field, you can again produce the predicted image and the prediction error image. Compare the motion field, predicted image, and prediction error image obtained in Probs. 6.12 and 6.13 with those obtained here. Which method gives you more accurate prediction? Which requires more computation time?
6.15 Implement the HBMA method in C or Matlab. You can choose to use either two or three levels of resolution. You can use integer-pel search at all levels, but refine your final result by a half-pel accuracy search within a ±1 neighborhood. Use a block size of 16 × 16 at all levels. The search range should be chosen so that the equivalent search range in the original resolution is 32. Compare the results with those obtained in Probs. 6.12 and 6.13, in terms of both accuracy and computation time.
6.16 In Sec. 6.7, we say that the fitting error in Eq. (6.7.2) is minimized by the solution given in Eq. (6.7.3). Prove this result yourself.
6.17 Assume the motion between two frames can be modeled by a global affine model. We want to estimate the affine parameters directly based on the DFD criterion. Set up the optimization problem, and derive an iterative algorithm for solving it. You can use either the first-order gradient descent or the Newton-Raphson method. Write a C or Matlab code for implementing your algorithm. Apply it to two video frames that are undergoing predominantly camera motion (e.g., the flower garden sequence). Compare the resulting motion field and predicted frame with those obtained with EBMA.
6.18 Repeat Prob. 6.17, but use an indirect method to derive the affine parameters from given block motion vectors. Derive the regression equation and the closed-form solution. Write a C or Matlab code for implementing your algorithm. You can use your previous code for integer-pel EBMA to generate block MVs. Compare the result with that obtained with the direct method (Prob. 6.17).

6.12 Bibliography
[1] J. K. Aggarwal and N. Nandhakumar. On the computation of motion from sequences of images - a review. Proceedings of the IEEE, 76:917–935, 1988.
[2] Y. Altunbasak and A. M. Tekalp. Closed-form connectivity-preserving solutions for motion compensation using 2-D meshes. IEEE Trans. Image Process., 6:1255–1269, Sept. 1997.
[3] Y. Altunbasak and A. M. Tekalp. Occlusion-adaptive, content-based mesh design and forward tracking. IEEE Trans. Image Process., 6:1270–1280, Sept. 1997.
[4] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12:43–77, 1994.
[5] M. Bierling. Displacement estimation by hierarchical block matching. In Proc. SPIE: Visual Commun. Image Processing, volume SPIE-1001, pages 942–951, Cambridge, MA, Nov. 1988.
[6] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Trans. Commun., COM-31:532–540, 1983.
[7] P. Chou and C. Brown. The theory and practice of Bayesian image labeling. International Journal of Computer Vision, 4:185–210, 1990.
[8] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
[9] D. J. Fleet and A. D. Jepson. Computation of component image velocity from local phase information. International Journal of Computer Vision, 5:77–104, 1990.
[10] D. J. Fleet. Disparity from local weighted phase-correlation. In IEEE International Conference on Systems, Man, and Cybernetics: Humans, Information and Technology, pages 48–54, 1994.
[11] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., 6:721–741, Nov. 1984.
[12] B. Girod. Motion-compensating prediction with fractional-pel accuracy. IEEE Trans. Commun., 41:604–612, 1993.
[13] B. Girod. Motion-compensating prediction with fractional-pel accuracy. IEEE Trans. Commun., 41(4):604–612, Apr. 1993.
[14] H.-M. Hang, Y.-M. Chou, and S.-C. Cheng. Motion estimation for video coding standards. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 17(2/3):113–136, Nov. 1997.
[15] R. M. Haralick and J. S. Lee. The facet approach to optical flow. In Proc. Image Understanding Workshop, 1993.
[16] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17, 1981.
[17] B. K. P. Horn. Robot Vision. MIT Press, Cambridge, MA, 1986.
[18] S. Hsu, P. Anandan, and S. Peleg. Accurate computation of optical flow using layered motion representations. In Proc. Int. Conf. Patt. Recog., pages 743–746, Jerusalem, Israel, Oct. 1994.
[19] J. R. Jain and A. K. Jain. Displacement measurement and its application in interframe image coding. IEEE Trans. Commun., COM-29:1799–1808, Dec. 1981.
[20] T. Koga et al. Motion-compensated interframe coding for video conferencing. In Proc. Nat. Telecommun. Conf., pages G5.3.1–G5.3.5, New Orleans, LA, Nov. 1981.
[21] T. Komarek and P. Pirsch. Array architecture for block matching algorithms. IEEE Trans. Circuits and Systems, 36:269–277, Oct. 1989.
[22] J. Konrad and E. Dubois. Bayesian estimation of motion vector fields. IEEE Trans. Pattern Anal. Machine Intell., 14:910–927, Sept. 1992.
[23] C. Kuglin and D. Hines. The phase correlation image alignment method. In Proc. IEEE Int. Conf. Cybern. Soc., pages 163–165, 1975.
[24] O. Lee and Y. Wang. Motion compensated prediction using nodal based deformable block matching. Journal of Visual Communications and Image Representation, 6:26–34, Mar. 1995.
[25] X. Lee and Y.-Q. Zhang. A fast hierarchical motion compensation scheme for video coding using block feature matching. IEEE Trans. Circuits Syst. for Video Technology, 6:627–635, Dec. 1996.
[26] A. Mitiche, Y. F. Wang, and J. K. Aggarwal. Experiments in computing optical flow with gradient-based multiconstraint methods. Pattern Recognition, 16, June 1983.
[27] H. G. Musmann, M. Hotter, and J. Ostermann. Object oriented analysis-synthesis coding of moving images. Signal Processing: Image Commun., 1:119–138, Oct. 1989.
[28] H. G. Musmann, P. Pirsch, and H.-J. Grallert. Advances in picture coding. Proceedings of the IEEE, 73(4):523–548, Apr. 1985.
[29] H. H. Nagel. Displacement vectors derived from second-order intensity variations in image sequences. Computer Graphics and Image Processing, 21:85–117, 1983.
[30] H. H. Nagel and W. Enkelmann. An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Trans. Pattern Anal. Machine Intell., 8:565–593, Sept. 1986.
[31] A. N. Netravali and J. D. Robbins. Motion compensated coding: some new results. Bell System Technical Journal, Nov. 1980.
[32] P. Pirsch, N. Demassieux, and W. Gehrke. VLSI architectures for video compression - a survey. Proc. IEEE, 83:220–246, Feb. 1995.
[33] B. S. Reddy and B. N. Chatterji. An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Transactions on Image Processing, 5(8):1266–1271, Aug. 1996.
[34] J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416–431, 1983.
[35] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, New York, 1987.
[36] V. Seferidis and M. Ghanbari. General approach to block matching motion estimation. Optical Engineering, 32:1464–1474, July 1993.
[37] H. Shekarforoush, M. Berthod, and J. Zerubia. Subpixel image registration by estimating the polyphase decomposition of cross power spectrum. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 532–537, 1996.
[38] C. Stiller and J. Konrad. Estimating motion in image sequences. IEEE Signal Processing Magazine, 16:70–91, July 1999.
[39] A. M. Tekalp. Digital Video Processing. Prentice Hall PTR, Upper Saddle River, NJ, 1995.
[40] G. A. Thomas. Television motion measurements for DATV and other applications. Research report 1987/11, BBC, Sept. 1987.
[41] Y. Wang and O. Lee. Use of 2-D deformable mesh structures for video compression, Part I - the synthesis problem: mesh-based function approximation and mapping. IEEE Trans. Circuits Syst. for Video Technology, 6:636–646, Dec. 1996.
[42] Y. Wang and O. Lee. Active mesh - a feature seeking and tracking image sequence representation scheme. IEEE Trans. Image Process., 3:610–624, Sept. 1994.
[43] Y. Wang and J. Ostermann. Evaluation of mesh-based motion estimation in H.263-like coders. IEEE Trans. Circuits Syst. for Video Technology, 8:243–252, June 1998.
[44] Y. Wang, X.-M. Hsieh, J.-H. Hu, and O. Lee. Region segmentation based on active mesh representation of motion: comparison of parallel and sequential approaches. In Proc. Second IEEE International Conference on Image Processing (ICIP'95), pages 185–188, Washington, DC, Oct. 1995.
[45] O. C. Zienkiewicz and R. L. Taylor. The Finite Element Method, volume 1. Prentice Hall, 4th edition, 1989.
