
a geodesic-flow kernel. While these methods have the advantage of providing easily computable out-of-sample extensions (by projecting unseen samples onto the latent-space eigenvectors), the transformation they define remains global and is applied in the same way to the whole target domain. An approach combining sample re-weighting with representation transfer is found in [53], where the authors extend sample re-weighting to reproducing kernel Hilbert spaces through the use of surrogate kernels. The transformation achieved is again a global linear transformation that helps align the domains.
Our proposition strongly differs from those reviewed above, as it defines a local transformation for each sample in the source domain. In this sense, the domain adaptation problem can be seen as a graph matching problem [35], [10], [11], as each source sample has to be mapped onto target samples under the constraint of marginal distribution preservation.
Optimal Transport and Machine Learning. The optimal transport problem was first introduced by the French mathematician Gaspard Monge at the end of the 18th century as a way to find a minimal-effort solution to the transport of a given mass of dirt into a given hole. The problem reappeared in the middle of the 20th century in the work of Kantorovitch [30] and has recently found surprising new developments as a polyvalent tool for several fundamental problems [49]. It has been applied in a wide range of fields, including computational fluid mechanics [3], color transfer between multiple images or morphing in the context of image processing [40], [20], [5], interpolation schemes in computer graphics [6], and economics, via matching and equilibrium problems [12].
Despite these appealing properties and application success stories, the machine learning community has considered optimal transport only recently (see, for instance, works considering the computation of distances between histograms [15] or label propagation in graphs [45]); the main reason is the high computational cost induced by the computation of the optimal transportation plan. However, new computing strategies have emerged [15], [17], [5] and have made the use of OT distances possible in operational settings.
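Among these strategies, the entropic regularization of [15] replaces the exact linear program by simple matrix-scaling (Sinkhorn) iterations. The sketch below is only an illustration of this idea; the variable names, the toy cost matrix and the reliance on numpy are our own assumptions, not material from [15].

import numpy as np

def sinkhorn(a, b, C, reg=1e-1, n_iter=1000):
    # Entropic-regularized OT between histograms a and b with cost matrix C.
    # Illustrative sketch only: returns an approximate transport plan.
    K = np.exp(-C / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)             # Sinkhorn scaling updates
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # plan gamma = diag(u) K diag(v)

# toy example: two histograms over 3 bins, squared-distance ground cost
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.2, 0.3, 0.5])
C = (np.arange(3)[:, None] - np.arange(3)[None, :]) ** 2.0
gamma = sinkhorn(a, b, C)
print(gamma.sum(axis=1), gamma.sum(axis=0))  # marginals approximately equal to a and b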
2 OPTIMAL TRANSPORT AND APPLICATION
TO DOMAIN ADAPTATION
In this section, we present the general unsupervised
domain adaptation problem and show how it can be
addressed from an optimal transport perspective.
2.1 Problem and theoretical motivations
Let $\Omega \subset \mathbb{R}^d$ be an input measurable space of dimension $d$ and $C$ the set of possible labels. $\mathcal{P}(\Omega)$ denotes the set of all probability measures over $\Omega$. The standard learning paradigm assumes the existence of a set of training data $X^s = \{\mathbf{x}_i^s\}_{i=1}^{N_s}$ associated with a set of class labels $Y^s = \{y_i^s\}_{i=1}^{N_s}$, with $y_i^s \in C$, and a testing set $X^t = \{\mathbf{x}_i^t\}_{i=1}^{N_t}$ with unknown labels. In order to infer the set of labels $Y^t$ associated with $X^t$, one usually relies on an empirical estimate of the joint probability distribution $P(\mathbf{x}, y) \in \mathcal{P}(\Omega \times C)$ computed from $(X^s, Y^s)$, and assumes that $X^s$ and $X^t$ are drawn from the same distribution $P(\mathbf{x}) \in \mathcal{P}(\Omega)$.
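As a concrete, purely illustrative instance of this paradigm, the following sketch fits a classifier on the labeled source pairs and applies it to the unlabeled test samples. The synthetic data, the scikit-learn classifier and all names are our own assumptions; the procedure is only justified when both sets are drawn from the same $P(\mathbf{x})$.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# labeled training set {(x_i^s, y_i^s)} and unlabeled testing set {x_i^t}
Xs = rng.randn(100, 2)
ys = (Xs[:, 0] + Xs[:, 1] > 0).astype(int)
Xt = rng.randn(50, 2)

# estimate the decision rule from (X^s, Y^s) and infer Y^t,
# assuming X^s and X^t are drawn from the same P(x)
clf = LogisticRegression().fit(Xs, ys)
yt_pred = clf.predict(Xt)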
2.2 Domain adaptation as a transportation problem
In domain adaptation problems, one assumes the existence of two distinct joint probability distributions $P_s(\mathbf{x}^s, y)$ and $P_t(\mathbf{x}^t, y)$, related respectively to a source and a target domain, denoted $\Omega_s$ and $\Omega_t$. In the following, $\mu_s$ and $\mu_t$ are their respective marginal distributions over $\mathbf{x}$. We also denote by $f_s$ and $f_t$ the true labeling functions, i.e. the Bayes decision functions, in each domain.
At least one of the two following assumptions is
generally made by most domain adaptation methods:
• Class imbalance: the label distributions differ between the two domains ($P_s(y) \neq P_t(y)$), but the conditional distributions of the samples with respect to the labels are the same ($P_s(\mathbf{x}^s|y) = P_t(\mathbf{x}^t|y)$);
• Covariate shift: the conditional distributions of the labels with respect to the data are equal ($P_s(y|\mathbf{x}^s) = P_t(y|\mathbf{x}^t)$, or equivalently $f_s = f_t = f$), but the data distributions in the two domains are supposed to be different ($P_s(\mathbf{x}^s) \neq P_t(\mathbf{x}^t)$). For the adaptation techniques to be effective, this difference needs to be small [2] (see the toy sketch after this list).
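The following toy construction (ours, not part of the paper) makes the covariate-shift assumption concrete: the two domains share the same labeling rule $P(y|\mathbf{x})$, while their marginals $P_s(\mathbf{x}^s)$ and $P_t(\mathbf{x}^t)$ differ.

import numpy as np

rng = np.random.RandomState(1)

def f(X):
    # shared labeling rule, i.e. P_s(y|x) = P_t(y|x)
    return (X[:, 0] > 0).astype(int)

# covariate shift: same conditional P(y|x), different marginals P(x)
Xs = rng.randn(500, 2)               # P_s(x): standard normal
Xt = 0.5 * rng.randn(500, 2) + 1.0   # P_t(x): rescaled and shifted
ys, yt = f(Xs), f(Xt)

print(Xs.mean(axis=0), Xt.mean(axis=0))  # the marginals clearly differ
print(ys.mean(), yt.mean())              # label proportions induced by the same rule f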
In real-world applications, the drift occurring between the source and the target domains generally implies a change in both the marginal and the conditional distributions. In our work, we assume that the domain drift is due to an unknown, possibly nonlinear transformation of the input space $T: \Omega_s \rightarrow \Omega_t$. This transformation may have a physical interpretation (e.g. a change in the acquisition conditions, sensor drift, thermal noise, etc.), or it can be directly caused by the unknown process that generates the data. Additionally, we suppose that the transformation preserves the conditional distribution, i.e.
$$P_s(y|\mathbf{x}^s) = P_t(y|T(\mathbf{x}^s)).$$
This means that the label information is preserved by the transformation, and that the Bayes decision functions are tied through the relation $f_t(T(\mathbf{x})) = f_s(\mathbf{x})$.
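A small synthetic example (again ours) of such a label-preserving transformation: the target samples are obtained as $T(\mathbf{x}^s)$ for a fixed nonlinear map $T$ of our choosing, and they keep the source labels, so that $f_t(T(\mathbf{x})) = f_s(\mathbf{x})$ holds by construction.

import numpy as np

rng = np.random.RandomState(2)

def T(X):
    # an arbitrary fixed nonlinear deformation of the input space (our choice)
    rot = np.array([[np.cos(0.6), -np.sin(0.6)],
                    [np.sin(0.6),  np.cos(0.6)]])
    return np.tanh(X) @ rot + np.array([2.0, 0.0])

Xs = rng.randn(300, 2)
ys = (Xs[:, 0] > 0).astype(int)  # source labels, y_i^s = f_s(x_i^s)

Xt = T(Xs)   # the target domain is the image of the source domain by T
yt = ys      # labels are carried along, so f_t(T(x)) = f_s(x) by construction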
Another insight can be provided regarding the transformation $T$. From a probabilistic point of view, $T$ transforms the measure $\mu$ into its image measure, noted $T\#\mu$, which is another probability measure over $\Omega_t$ satisfying
$$T\#\mu(\mathbf{x}) = \mu\left(T^{-1}(\mathbf{x})\right), \quad \forall \mathbf{x} \in \Omega_t. \qquad (1)$$
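In the discrete case, the image measure takes a particularly simple form (a standard fact, stated here in our notation): if $\mu$ is an empirical measure supported on the source samples, $T$ simply displaces the Dirac masses while keeping their weights,
$$\mu = \sum_{i=1}^{N_s} p_i \,\delta_{\mathbf{x}_i^s} \qquad \Longrightarrow \qquad T\#\mu = \sum_{i=1}^{N_s} p_i \,\delta_{T(\mathbf{x}_i^s)},$$
since for any measurable set $B \subset \Omega_t$ one has $T\#\mu(B) = \mu(T^{-1}(B)) = \sum_{i:\, T(\mathbf{x}_i^s) \in B} p_i$.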