
100 Rubner, Tomasi and Guibas
compressed representation. A signature is a set of the
main clusters or modes of a distribution, each repre-
sented by a single point (the cluster center) in the un-
derlying space, together with a weight that denotes the
size of that cluster. Simple images have short signa-
tures, complex images have long ones. Of course, in
some applications, fixed-size histograms may still be
adequate, and can be considered as special cases of
signatures.
In addition to histograms and signatures which are
based on global or local tessellation of the space into
non-overlapping regions, there are other techniques to
describe non-parametric distributions. For example, in
kernel density estimation (Duda and Hart, 1973), each
data point is replaced by some kernel and the density
estimations is regarded as the superposition of all these
kernels. These techniques are out of the scope of this
paper.
Given two distributions, it is often useful to define a
quantitative measure of their dissimilarity, with the in-
tent of approximating perceptual dissimilarity as well
as possible. This is particularly important in image re-
trieval applications, but has fundamental implications
alsofor the understandingoftexturediscriminationand
color perception. Defining a distance between two dis-
tributions requires first a notion of distance between
the basic features that are aggregated into the distribu-
tions. We call this distance the ground distance.For
instance, in the case of color, the ground distance mea-
sures dissimilarity between individual colors. Fortu-
nately, color ground distance has been carefully stud-
ied in the literature of psychophysics, and has led to
measures like the CIE-Lab color space (Wyszecki and
Stiles, 1982). To be sure, thisspacewasdesignedbased
on psychophysical experiments where colors were pre-
sented in pairs and ona neutral background. While this
limits the appropriateness of this space for the more
complex situations encountered in retrieval, we believe
that it is hard to do better than CIE-Lab without explic-
itly modelling thegeometric layoutof colors inimages.
While RGB space has proven clearly inadequate in our
experiments, it is possible that other spaces, such as
HSV, may lead to performance similar to that obtained
with CIE-Lab.
In this paper, we address the problem of lifting these
distances from individual features to full distributions.
In other words, we want to define a consistent mea-
sure of distance, or dissimilarity, between two distri-
butions of mass in a space that is itself endowed with a
grounddistance.Forcolor, thismeansfindingdistances
between image color distributions. For texture, we
locally describe the texture content of a small neigh-
borhood in an image as distribution of energy in the
frequency domain. The “lifted” distance is a distance
between distributions of such local descriptors over the
entire images, regarded as distribution of textures.
Mathematically, it would be convenient if these dis-
tribution distances were true metrics, which would
lead to more efficient data structures and search al-
gorithms (Bozkaya and Ozsoyoglu, 1997; Clarkson,
1997). Practically, it is important that distances be-
tween distributions correlate with human perception.
In this paper we strive to achieve both goals. For the
first we have proof, for the second we show experi-
ments. We also would like these distances to allow
for partial matches when one distribution is compared
to a subset of the other. For partial matches, the dis-
tances we define are not metric. Concerning this point,
we refer to Tversky’s discussion (Tversky, 1977) of
the non-metric nature of perceptual distances. From
a practical point of view, our measure deals naturally
both with full, metric matches and with partial, non-
metric matches.
In this paper we capitalize on the old transportation
problem (Rachev, 1984; Hitchcock, 1941) from linear
optimization, which was first introduced into computer
vision by Peleg et al. (1989) to measure the distance
between two gray-scale images. For image retrieval,
weuse this distancemeasuretocompare two signatures
in coloror texture space. As discussedin moredetail in
the next section, this leads to very different computa-
tional properties, mainly because signaturesrather than
pixels are compared to each other. We give the name
of Earth Mover’s Distance (EMD), suggested by Stolfi
(1994), to this metric in this new context. The trans-
portation problem is to find the minimal cost that must
be paid totransformone distribution into theother. The
EMD is based on a solution to the transportation prob-
lem for which efficient algorithms are available, and
it has many desirable properties for image retrieval, as
we will see. It is also more robust in comparison to
other histogram matching techniques, in that it suffers
from no arbitrary quantization problems due to rigid
binning, and it tolerates well some amount of defor-
mations that shift features in the feature space. This
robustness results in increased precision for image re-
trieval. It allows for partial matching, and hence natu-
rally supports partial image retrieval queries. It can be
applied to signatures with different sizes, which leads
to better storage utilization. When used to compare