
Randomized Transport Plans via Hierarchical Fully Probabilistic Design

Sarah Boufelja Y.¹ Anthony Quinn¹,² Robert Shorten¹
¹Imperial College London Corresponding author. Dyson School of Design Engineering
²Trinity College Dublin School of Engineering
{s.boufelja21,a.quinn,r.shorten}@imperial.ac.uk

Abstract

An optimal randomized strategy for the design of balanced, normalized mass transport plans is developed. It replaces, but specializes to, the deterministic, regularized optimal transport (OT) strategy, which yields only a certainty-equivalent plan. The incompletely specified, and therefore uncertain, transport plan is acknowledged to be a random process. Therefore, hierarchical fully probabilistic design (HFPD) is adopted, yielding an optimal hyperprior supported on the set of possible transport plans, and consistent with prior mean constraints on the marginals of the uncertain plan. This Bayesian resetting of the design problem for transport plans, which we call HFPD-OT, confers new opportunities. These include (i) a strategy for the generation of a random sample of joint transport plans; (ii) randomized marginal contracts for individual source-target pairs; and (iii) consistent measures of uncertainty in the plan and its contracts. An application in fair market matching is outlined, in which HFPD-OT enables the recruitment of a more diverse subset of contracts into the delivery of an expected plan than is possible in classical OT.

Keywords— Optimal transport, Bayesian hierarchical modelling, Fully probabilistic design, Convex optimization, Algorithmic fairness, Market matching

1 Main Contributions

Optimal transport (OT) refers to the classical design of a deterministic transport plan, $\pi$, for taking a unit mass, distributed across a source domain, $\mathbb{\Omega}_{X}$, and redistributing it across a target domain, $\mathbb{\Omega}_{Y}$. (Throughout this paper, we address only the balanced, normalized transport problem.) The transport plan is expressed as an unknown, deterministic, joint distribution, $\pi$, with support in $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$. The distributed source and target are therefore the marginals of $\pi$, and are specified a priori by $\mu_{0}$ and $\nu_{0}$ on $\mathbb{\Omega}_{X}$ and $\mathbb{\Omega}_{Y}$, respectively. Consequently, $\pi$ is confined to the space, $\mathbb{\Pi}(\mu_{0},\nu_{0})$, of distributions on $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$ with $\mu_{0}$ and $\nu_{0}$ as their marginals. An optimal choice, $\pi^{o}$, of $\pi$ (called the OT plan) is achieved by minimizing the expected value, under $\pi$, of a pre-specified cost of transport, ${\mathsf{C}}(x,y)$, from $\mathbb{\Omega}_{X}$ to $\mathbb{\Omega}_{Y}$.

In this paper, we reformulate the design of transport plans in the Bayesian (i.e. fully probabilistic) way. In particular, deterministic optimization, yielding $\pi^{o}$, is replaced by the hierarchical fully probabilistic design (HFPD) of an optimal randomized decision-making strategy, $\pi\sim\mathsf{S}^{o}$ (i.e. a hyperprior), for choosing $\pi$. This approach recognizes that the unknown transport plan, $\pi$, is a (generally nonparametric) random process. We therefore equip it with a prior, $\mathsf{S}(\pi|K)$, where $K$ denotes marginal (mean) knowledge constraints which will be detailed in the sequel. Following the axioms of FPD at this hierarchical level (i.e. HFPD), we equip the space, ${\mathbb{S}}_{K}$, of $\mathsf{S}$ (the randomized strategy for choosing the transport plan, $\pi$) with an appropriately formulated loss function, and we minimize the expected value of the latter under $\mathsf{S}$. This yields the optimal randomized strategy, $\mathsf{S}^{o}(\pi|K)$, for choosing $\pi$, which is also the optimal hyperprior for the uncertain $\pi$. We show that this procedure is equivalent to minimization of a Kullback-Leibler divergence (KLD), leading to a Gibbs form for $\mathsf{S}^{o}(\pi|K)$:

\begin{equation}
\mathsf{S}^{o}(\pi|K)\propto\mathsf{S}_{\mathsf{I}}(\pi|K)e^{-{\mathsf{D}_{\mathsf{KL}}}(\pi||\pi_{\mathsf{I}})}e^{-\lambda_{1}^{o}{\mathsf{D}_{\mathsf{KL}}}(\mu||\mu_{0})}e^{-\lambda_{2}^{o}{\mathsf{D}_{\mathsf{KL}}}(\nu||\nu_{0})}\in{\mathbb{S}}_{K}.
\tag{1}
\end{equation}

Here, $\mu$ and $\nu$ are the uncertain (i.e. random) marginals of the random transport plan, $\pi$ . The KLDs, ${\mathsf{D}_{\mathsf{KL}}}(\cdot||\cdot)$ , act as Gibbs energies. Meanwhile, $\mathsf{S}_{\mathsf{I}}$ and $\pi_{\mathsf{I}}$ are the freely but necessarily pre-specified zero-loss choices of $\mathsf{S}$ and $\pi$ , respectively, referred to as the ideal or target choices.
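To make the Gibbs form (1) concrete, the following minimal Python sketch evaluates its unnormalized logarithm for a finitely supported plan. It assumes a flat hierarchical ideal, $\mathsf{S}_{\mathsf{I}}$ (so that its logarithm contributes only a constant, set to zero here), and it treats the multipliers as given numbers. The function names and the values of `lam1` and `lam2` are illustrative assumptions, not quantities taken from the paper.

```python
import numpy as np

def kl(p, q):
    """Discrete KLD D_KL(p||q); +inf if p is not absolutely continuous w.r.t. q."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    if np.any((q == 0) & (p > 0)):
        return np.inf
    mask = p > 0  # convention: 0 * log 0 = 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def log_hyperprior_unnorm(pi, pi_ideal, mu0, nu0, lam1, lam2, log_s_ideal=0.0):
    """Unnormalized log of the Gibbs form (1) at a finite plan `pi`.

    log_s_ideal is log S_I(pi|K); the default 0.0 (flat ideal) is an assumption.
    """
    mu = pi.sum(axis=1)  # random source marginal of the plan
    nu = pi.sum(axis=0)  # random target marginal of the plan
    return (log_s_ideal
            - kl(pi, pi_ideal)
            - lam1 * kl(mu, mu0)
            - lam2 * kl(nu, nu0))

# Illustrative 2x2 example (all numbers are hypothetical).
mu0 = np.array([0.5, 0.5])
nu0 = np.array([0.5, 0.5])
pi_ideal = np.outer(mu0, nu0)
plan_a = pi_ideal.copy()                       # matches both ideals exactly
plan_b = np.array([[0.7, 0.1], [0.1, 0.1]])    # marginals deviate from mu0, nu0

score_a = log_hyperprior_unnorm(plan_a, pi_ideal, mu0, nu0, lam1=5.0, lam2=5.0)
score_b = log_hyperprior_unnorm(plan_b, pi_ideal, mu0, nu0, lam1=5.0, lam2=5.0)
print(score_a, score_b)
```

In this toy setting, the plan that agrees with the base-level ideal and with the nominal marginals receives the highest unnormalized log-probability, while each KLD term acts as an energy that penalizes deviation.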

By resetting OT as a problem of Bayesian decision making via HFPD, we achieve the following principal goals:

(i)

The deterministic, regularized OT choice, $\pi^{o}$ , obtained via constrained optimization at the base level of modelling, $(x,y)\sim\pi$ , is replaced by an optimal generator of randomized plans (i.e. a randomized strategy for designing transport plans, $\pi$ ) at the hierarchical level of complete modelling, $\pi\sim\mathsf{S}^{o}(\pi|K)$ .
(ii)

In the parametric case, in which the support set, ${\mathbb{\Omega}_{X}}\times{\mathbb{\Omega}_{Y}}$, is finite, we can compute optimal univariate (marginal and/or conditional) distributions, $\pi_{i,j}\sim\mathsf{S}^{o}_{i,j}$, for modelling and randomization of the transport contract, $\pi_{i,j}\in(0,1)$, from the agent (at) $x_{i}$ to the agent (at) $y_{j}$.
(iii)

In line with all Bayesian decision-making strategies, we can summarize $\pi\sim\mathsf{S}^{o}(\pi|K)$ via a certainty-equivalent (CE) transport plan, $\hat{\pi}\in\mathbb{\Pi}_{K}$ —such as its expected or maximally probable value—and equip this with a summary of our uncertainty in $\pi$ (e.g. via the Bayesian standard intervals for the contracts, $\pi_{i,j}\sim\mathsf{S}^{o}_{i,j}$ ).

By equipping transport plans with an optimal hyperprior from which candidate plans can be generated, we are able to encode our prior knowledge and our ranking of preferences. This HFPD resetting of OT can have significant impact in applications. We consider one such application, in algorithmic fairness. Specifically, we address the problem of labour market matching, in which fairness is induced by optimally randomizing the matching strategy (a transport plan) via HFPD-OT, thereby increasing a diversity index among contracts.

2 Introduction to transport plan design and optimal transport

Optimal Transport (OT) techniques have received increasing attention in the past decade, in a wide range of domains such as machine learning and generative adversarial learning [Arjovsky et al., 2017], domain adaptation [Courty et al., 2017], image processing and watermarking [Mathon et al., 2014], and hallucination detection in neural machine translation [Guerreiro et al., 2023]. In addition to traditional applications in economics and market matching [Galichon, 2016], and in fluid mechanics and diffusion processes [Saumier et al., 2015], OT has also been used to perform sampling and Bayesian inference [El Moselhy and Marzouk, 2012].

OT is concerned with the least costly transport plan (in expectation) between a source and a target probability measure. The unregularized OT plan induces a natural distance in the space of probability measures (the Kantorovitch-Rubinstein distance) [Villani, 2008], introducing a rich topological structure by lifting key geometric properties associated with the ground metric to the space of probability measures [Villani, 2008, Peyré and Cuturi, 2019]. For example, if the ground space is Euclidean, concepts like gradient, barycentre and convexity are naturally extended to the space of probability measures.

Notwithstanding the wide range of applications, the classical formalism of OT confines it to a purely deterministic setting, which regards the transport plan as a crisp object and assumes perfect knowledge of the marginals (Figure 1(a)). It fails to model and (critically) translate uncertainty in the marginals to uncertainty in the design of transport plans. In this regard, classical OT is an instance of certainty-equivalence (CE) decision making, which produces myopic transport strategies that do not account for the uncertain and random nature of many real systems. One might think that recasting the classical OT problem in terms of robust optimization might address these issues. A robust optimization formulation relies on a deterministic, unknown but bounded description of the uncertainty in the marginals [Ben-Tal et al., 2009]. Such a design choice may be overly conservative: it indeed considers all possible outcomes in the uncertainty set, but may assign non-negligible weights even to plans that are highly improbable. Furthermore, the robust design is not equipped with a quantifier of the intrinsic uncertainty of the transport plan.

In this manuscript, we propose the HFPD-OT approach to the design of uncertain transport plans. It departs from the conventional OT setting by considering the transport plan as a random process. Consistent hierarchical Bayesian modelling endows the uncertain plan with its own hyperprior (Figure 1(b)). Its optimal choice provides a randomized strategy for choosing transport plans in the space of plans consistent with prior-imposed knowledge constraints on its marginals. It also acts as a generative model for random sampling of transport plans. By treating transport plans as random processes, we effectively recast the transport design problem as one of inference. This contrasts with the OT literature, which is only concerned with deterministic optimization strategies for choosing deterministic plans. As we will see in the literature review, next, the tools provided by HFPD-OT—intended for modelling and reasoning about uncertainty in transport plans—are not available in the classical OT setting.

2.1 Approaches to modelling uncertainty in OT

There are precedents in eliciting and processing uncertainty in OT, but they are generally couched in terms of base-level modelling, and not in terms of the hierarchical Bayesian approach developed here. Specifically, (i) our method is primarily concerned with the design of a fully probabilistic model over the space of transport plans; (ii) as such, the transport plan is modeled as a (generally nonparametric) process endowed with its own (hyper)prior; and (iii) we rely on randomization techniques for choosing plans, in contrast to existing methods which are mainly based on deterministic optimization techniques.

Copulas [Sklar, 1959] are historically among the first methods proposed for the design of multivariate distributions with arbitrary, but perfectly known marginals. Other techniques relaxed this assumption to address situations where exactly one marginal is uncertain. This is the case in [Goodman, 1953], for instance, where the uncertainty in one marginal is modeled with Gaussian noise. In ecological inference (a case of parametric transport design on a finite support), [Wakefield, 2004] studied the case where one marginal is uncertain, adopting a hierarchical multinomial-Dirichlet-based model. We highlight two distinctions in our work: (i) we do not impose a parametric constraint in general, and we allow for uncertainty in both marginals; and (ii) the authors of the earlier paper pursue statistical inference objectives that differ markedly from those of OT.

Interestingly, the connection between ecological inference and OT was not established until later, in [Frogner and Poggio, 2019], where the authors extended the previous model and studied the case where both marginals are uncertain. The questions we address in this paper again differ from those in [Frogner and Poggio, 2019] in the following ways: (i) they solve a base-level MAP optimization problem using a Bregman projection method, once again recovering a certainty-equivalent OT plan, whereas our primary goal is to depart from such a certainty-equivalence setting and design an optimal hierarchical Bayesian model from which random transport plans can be generated and used in lieu of an OT plan. If required (as we will see), the expected plan takes the place of the MAP plan as the Bayesian minimum-risk decision (i.e. estimate) of the uncertain plan, with asymptotic convergence to the MAP plan; and (ii) the derivations in [Frogner and Poggio, 2019] rely on parametric and structural assumptions, mainly full separability. Separability is a strong assumption in that it excludes the modelling of rich structures and interactions that may exist in real-world data. We do not require these assumptions in our hyperprior, and we leave it to the designer—via the specification of ideal designs (to be explained in the sequel)—to impose any relevant structural requirements.

Uncertainty in the cost matrix in the finite case is considered by [Mallasto et al., 2021]. Given a finite sample of these cost matrices, they model the induced uncertainty in the (finitely supported) OT plan. They do not allow for any uncertainty in the marginals, and so their distribution over OT plans is geometrically constrained to the OT polytope. They impose various standard parametric priors over this set, without any optimality claims for them. Our work significantly extends this treatment by modelling uncertainty in the marginals, so that our hierarchical model has support in the geometrically unconstrained space of transport plans, and extends to the nonparametric setting of continuously supported plans. Importantly—and in contrast to [Mallasto et al., 2021]—we do not impose an optimality constraint on the base-level plans themselves, but, rather, on the hierarchical (generative) distribution of (all possible) plans, $\mathsf{S}^{o}(\pi|K)$ (1). In this way, the random generator of the plans, $\mathsf{S}^{o}(\pi|K)$ , is optimal, and not the uncertain transport plan, $\pi$ , itself (although subsequent projections of $\mathsf{S}^{o}(\pi|K)$ can yield optimal Bayesian decisions about $\pi$ , in the conventional manner of Bayesian decision-making). The main contribution of our work is to deduce this optimal hyperprior for transport design (1) via the foundational methods of fully probabilistic design (FPD) [Kárný and Kroupa, 2012]. We show how this HFPD-OT hyperprior concentrates to the classical regularized OT solution as uncertainty in the marginals diminishes (28).

An interesting line of work on unbalanced OT (UOT) in [Séjourné et al., 2023] relaxes the strict marginal constraints $\mathbb{\Pi}(\mu_{0},\nu_{0})$ , and replaces them by a soft penalization, using Kullback-Leibler balls centred on the nominal marginals (as we do in this paper). This ensures feasibility of the UOT problem, allowing transport between unequal (non-probability) measures (which we do not allow in our work). Once again, their solution involves a base-level deterministic optimization.

Finally, entropy-regularized OT (EOT) [Cuturi, 2013] is a foundational work on deterministic OT that will be recovered asymptotically via HFPD-OT. In EOT, the classical OT linear program is relaxed by means of an entropy regularization term, yielding a strictly convex problem, which is amenable to efficient matrix scaling algorithms, notably Sinkhorn-Knopp [Cuturi, 2013]. In our own recent paper [Quinn et al., 2025], we formally establish the relationship between base-level EOT under the usual deterministic marginal constraints—therefore yielding a certainty-equivalent (i.e. singular) OT plan, $\pi^{o}$ , in the conventional manner—and fully probabilistic design (FPD). In this paper, our goal is to extend the base-level EOT setting by deriving an optimal hyperprior, $\mathsf{S}^{o}(\pi|K)$ (1), over the set of uncertain transport plans.
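For readers less familiar with the matrix scaling step mentioned above, the following is a minimal NumPy sketch of Sinkhorn-Knopp iterations for finitely supported EOT under deterministic marginal constraints. It is a textbook-style illustration, not the implementation used in the paper; the function name `sinkhorn`, the stopping rule and the toy cost matrix are assumptions made for this example.

```python
import numpy as np

def sinkhorn(cost, mu0, nu0, eps=0.1, n_iter=2000, tol=1e-10):
    """Entropy-regularized OT plan via alternating marginal scaling.

    cost : (m, k) array of transport costs C(x_i, y_j)
    mu0, nu0 : fixed source and target marginals (each sums to one)
    eps : regularization parameter (smaller means closer to unregularized OT)
    """
    K = np.exp(-cost / eps)            # Gibbs kernel
    u = np.ones_like(mu0)
    v = np.ones_like(nu0)
    for _ in range(n_iter):
        u = mu0 / (K @ v)              # scale rows to match mu0
        v = nu0 / (K.T @ u)            # scale columns to match nu0
        row_err = np.abs(u * (K @ v) - mu0).max()
        if row_err < tol:              # column marginals are exact after the v-update
            break
    return u[:, None] * K * v[None, :]

# Toy example (hypothetical numbers).
cost = np.array([[0.0, 1.0], [1.0, 0.0]])
mu0 = np.array([0.3, 0.7])
nu0 = np.array([0.6, 0.4])
plan = sinkhorn(cost, mu0, nu0, eps=0.5)
print(plan, plan.sum(axis=1), plan.sum(axis=0))
```

The output is a single, certainty-equivalent plan with exactly the prescribed marginals, which is the object that the hierarchical design replaces with a distribution over plans.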

2.2 Notational conventions, technical preliminaries for non-hierarchical OT, and outline of the paper

In the following, we review the key mathematical conventions used throughout the paper. Specifically, all probability measures will be referred to as (probability) distributions. The context will make clear whether the distribution in question is a probability density function (pdf) or a probability mass function (pmf). A superscript $o$ refers to optimal distributions, e.g. $\mathsf{S}^{o}$, whereas a subscript $\mathsf{I}$ designates ideal distributions, e.g. $\mathsf{S}_{\mathsf{I}}$. Moreover, all fixed and prior-elicited quantities are referred to using a subscript $0$ ($\mu_{0}$, $\nu_{0}$, etc.). Sets will be denoted by a blackboard typeface (e.g. $\mathbb{\Omega}_{X},\mathbb{\Omega}_{Y},\mathbb{M}$, etc.), and deterministic functionals will be denoted by a math sans serif typeface (e.g. $\mathsf{S}$, $\mathsf{C}$, $\mathsf{D}$, etc.). Instantiated distributions will be assigned a math calligraphic typeface ($\mathcal{U}$, $\mathcal{N}$, etc.).

•

The conventional non-hierarchical (which we call the base-level) probability space (triple) is $(\mathbb{\Omega},\mathcal{F},\mathcal{P})$, where $\mathbb{\Omega}$ is the sample space, $\mathcal{F}$ denotes the ($\sigma$-)algebra of measurable subsets of $\mathbb{\Omega}$, and $\mathcal{P}$ is a probability measure defined on $\mathcal{F}$.
•

Consider two random variables (rvs), $X:\mathbb{\Omega}\to\mathbb{\Omega}_{X}$ and $Y:\mathbb{\Omega}\to\mathbb{\Omega}_{Y}$, whose images, $\mathbb{\Omega}_{X}$ and $\mathbb{\Omega}_{Y}$, are, respectively, compact subsets of topological spaces of unspecified dimensions. In the standard setting of optimal transport (OT) [Villani, 2008, Peyré and Cuturi, 2019], their marginal distributions under $(\mathbb{\Omega},\mathcal{F},\mathcal{P})$ are prior-specified (i.e. known) to be $\mu_{0}\in\mathbbm{P}(\mathbb{\Omega}_{X})$ and $\nu_{0}\in\mathbbm{P}(\mathbb{\Omega}_{Y})$, respectively, while their joint distribution, $\pi\in\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$, is unknown, and is the subject of design.
•

The reference measure on $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$ is denoted by $\lambda(x,y)$. Depending on the context, $\lambda$ denotes either the Lebesgue measure (in the continuous case) or the counting measure (in the discrete case). $\pi$, $\mu_{0}$ and $\nu_{0}$ are absolutely continuous w.r.t. $\lambda$. We do not distinguish notationally between a probability measure and its Radon-Nikodym derivative w.r.t. $\lambda$, e.g. $\frac{d\pi}{d\lambda}\equiv\pi$, etc., and we refer to all as distributions.

•

The prior-specified marginal constraints, $\mu_{0}$ and $\nu_{0}$ , constrain $\pi$ to the following knowledge-constrained set:

\[
\mathbb{\Pi}(\mu_{0},\nu_{0})\equiv\Bigl{\{}\pi\in\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})\;|\;\int_{\mathbb{\Omega}_{Y}}\pi d\lambda(y)\equiv\mu_{0},\;\;\int_{\mathbb{\Omega}_{X}}\pi d\lambda(x)\equiv\nu_{0}\Bigr{\}}.
\]

•

Consider an alternative distribution, $\zeta\in\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ . The Kullback-Leibler divergence (KLD) of $\zeta$ to $\pi$ is:

\begin{equation}
\mathsf{D}_{\mathsf{KL}}(\pi||\zeta)\equiv\left\{\begin{aligned} \int_{\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}}\pi(x,y)\log\Bigl(\frac{\pi(x,y)}{\zeta(x,y)}\Bigr)d\lambda(x,y)&\;\;\text{if}\;\;\pi\ll\zeta,\\ +\infty&\;\;\text{otherwise,}\end{aligned}\right.
\tag{2}
\end{equation}

where $\pi\ll\zeta$ indicates the absolute continuity (a.c.) of $\pi$ w.r.t. $\zeta$ .

•

If $\mathsf{q}$ is an integrable function with domain $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, then its expectation w.r.t. $\pi$ is defined as

\[
\mathsf{E}_{\pi}\left[\mathsf{q}\right]\equiv\int_{\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}}\mathsf{q}(x,y)\pi(x,y)d\lambda(x,y)\;<\infty.
\]

•

$K$ —in, for example, $\mathsf{S}(\pi|K)$ —is Jeffreys’ notation [Jeffreys, 1939], encoding the knowledge which acts as a condition on a probability model. It effectively confines $\mathsf{S}$ to a particular knowledge-constrained set, ${\mathbb{S}}_{K}$ . Its specific meaning will be defined in context, at both the base level and hierarchical level, as appropriate.
•

$\mathsf{supp}(\mu)$ denotes the support of the distribution, $\mu$.
•

$\langle\cdot,\cdot\rangle$ denotes the standard inner product between vectors in a Euclidean space. When required, it will be generalized to the canonical duality pairing in the infinite-dimensional setting.
•

$\succeq$ denotes an element-wise comparison between vectors $u,v\in\mathbb{R}^{p}$ : $u\succeq v\iff u_{i}\geq v_{i},\;\;\forall i\in\{1,2,\dots,p\}$ . Other relational operators between vectors should also be understood element-wise.

•

The indicator function of a set $\mathbb{A}$ is:

\[
\chi_{\mathbb{A}}(x)\equiv\left\{\begin{aligned} 1&\;\;\text{if}\;\;x\in\mathbb{A},\\ 0&\;\;\text{otherwise.}\end{aligned}\right.
\]

•

$\delta_{x_{0}}(x)$ denotes the distribution that is singular at $x=x_{0}$ , being the Dirac delta-function w.r.t. Lebesgue measure in the case of continuous $x$ .
•

$\Delta_{q}$ , $1\leq q<\infty$ , denotes the open probability simplex of dimension $q$ . If $q>1$ and $x\in\Delta_{q}$ , then the support of the conditional distribution, $\mathsf{F}(x_{\backslash i}|x_{i})$ , is denoted by $(1-x_{i})\Delta_{q-1}$ , $0<x_{i}<1$ .
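In the discrete case, where $\lambda$ is the counting measure, several of the preliminaries above reduce to elementary array operations. The sketch below illustrates membership of the knowledge-constrained set $\mathbb{\Pi}(\mu_{0},\nu_{0})$, the KLD (2) including its $+\infty$ branch, and the expectation of a cost under a plan. The helper names (`kld`, `in_coupling_set`, `expectation`) and the numerical values are hypothetical and chosen only for this illustration.

```python
import numpy as np

def kld(pi, zeta):
    """D_KL(pi||zeta) w.r.t. counting measure, following (2)."""
    pi = np.asarray(pi, dtype=float)
    zeta = np.asarray(zeta, dtype=float)
    if np.any((zeta == 0) & (pi > 0)):   # pi is not a.c. w.r.t. zeta
        return np.inf
    m = pi > 0                            # convention: 0 * log 0 = 0
    return float(np.sum(pi[m] * np.log(pi[m] / zeta[m])))

def in_coupling_set(pi, mu0, nu0, atol=1e-12):
    """True if pi is nonnegative and has marginals mu0 (rows) and nu0 (columns)."""
    pi = np.asarray(pi, dtype=float)
    return bool(np.all(pi >= 0)
                and np.allclose(pi.sum(axis=1), mu0, atol=atol)
                and np.allclose(pi.sum(axis=0), nu0, atol=atol))

def expectation(q, pi):
    """E_pi[q] for a function q tabulated on the finite support."""
    return float(np.sum(np.asarray(q, dtype=float) * np.asarray(pi, dtype=float)))

mu0 = np.array([0.5, 0.5])
nu0 = np.array([0.25, 0.75])
cost = np.array([[0.0, 1.0], [1.0, 0.0]])

independent = np.outer(mu0, nu0)                  # product coupling
sparse = np.array([[0.25, 0.25], [0.0, 0.5]])     # coupling with a zero contract

print(in_coupling_set(independent, mu0, nu0), in_coupling_set(sparse, mu0, nu0))
print(kld(sparse, independent))   # finite: sparse is a.c. w.r.t. independent
print(kld(independent, sparse))   # inf: absolute continuity fails
print(expectation(cost, sparse))
```

The asymmetry of the two KLD values shows why the order of the arguments, and the absolute continuity condition, matter in (2).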

The outline of the paper is as follows. In Section 3, we state the mathematical problem and establish the duality result in the infinite dimensional case, hence deriving a formal characterization of the optimal Bayesian hyperprior (1). Section 4 introduces the parametric hyperprior, and we provide a descriptive analysis in a low dimensional setting in Section 4.1. Meanwhile, Section 4.2 proposes an algorithm for the computation of the optimal Kantorovitch potentials in this parametric setting. In Section 5, we apply the HFPD-OT formalism to a market matching problem in order to improve a contract diversity index, before closing the paper with our main conclusions in Section 6.

Figure 1: (a) In the conventional base-level OT setting, the transport plan, $\pi$, is deterministic, and so are all the contracts, $\pi_{i,j}\in[0,1]$. Their respective (marginal) distributions are therefore singular at $\pi^{o}_{i,j}$, where $\pi^{o}$ denotes the OT plan (5).

3 Hierarchical Fully Probabilistic Design for (Optimal) Transport: HFPD-OT

The classical OT setting contemplates the transport plan as a purely deterministic object and frames the OT problem solely from an optimization perspective (Figure 1(a)). More precisely, FPD-OT [Quinn et al., 2025], which is a generalization of the classical EOT problem [Cuturi, 2013], is built upon the following optimization problem:

\begin{equation}
\pi^{o}_{\mathsf{OT},\epsilon,\phi}(x,y|K)=\operatorname*{argmin}_{\pi\in\mathbb{\Pi}(\mu_{0},\nu_{0})}\mathsf{D}_{\mathsf{KL}}(\pi(x,y)||\pi_{\mathsf{I}}(x,y|K)),
\tag{3}
\end{equation}

where the base-level ideal design, $\pi_{\mathsf{I}}$, with support in $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, is defined as the following extended Gibbs kernel:

\begin{equation}
\pi_{\mathsf{I}}(x,y|K)\propto\exp\Bigl(\frac{-\mathsf{C}(x,y)}{\epsilon}\Bigr)\phi(x,y).
\tag{4}
\end{equation}

$\mathsf{C}:\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}\rightarrow\mathbbm{R}^{+}$ denotes a continuous cost function, $\epsilon>0$ is a smoothness (i.e. regularizing) parameter, and $\phi$ is a fixed distribution, which may be used to encode additional structural preferences in the design of the OT plan. $K$ (Section 2.2) denotes the deterministic, domain-specific knowledge constraints, consisting of external or side-information gathered from the environment, and any other prior knowledge related to the problem being modeled. In the conventional base-level (i.e. deterministic) EOT setting, we impose these knowledge constraints in the form of the deterministic marginal constraints, $\mathbb{\Pi}(\mu_{0},\nu_{0})$ (Section 2.2). Importantly, when $\phi$ is instantiated as the uniform distribution, $\mathcal{U}$, with support in $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, the resulting EOT solution converges in the $\Gamma$-sense to the Monge-Kantorovitch solution [Carlier et al., 2017]:

\begin{equation}
\pi^{o}_{\mathsf{OT},\epsilon,\mathcal{U}}(x,y|K)\xrightarrow[]{\epsilon\rightarrow 0}\pi^{o}_{\mathsf{OT}}(x,y|K)\equiv\operatorname*{argmin}_{\pi\in\mathbb{\Pi}(\mu_{0},\nu_{0})}\int_{\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}}\mathsf{C}(x,y)\pi(x,y)d\lambda(x,y).
\tag{5}
\end{equation}

In the sequel, we will denote the base-level OT solution simply by $\pi^{o}(x,y)$ , and will not distinguish between EOT and OT solutions, unless required by the context.
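As a brief worked check of how (3) and (4) relate to the usual entropy-regularized objective, assume $\phi\equiv\mathcal{U}$ and that the entropy, $\mathsf{H}(\pi)\equiv-\int\pi\log\pi\,d\lambda$, is finite. The symbols $\mathsf{H}$ and $Z_{\epsilon}$ are introduced here for illustration only, with $Z_{\epsilon}$ denoting the normalizing constant of the Gibbs kernel, so that $\pi_{\mathsf{I}}=e^{-\mathsf{C}/\epsilon}/Z_{\epsilon}$. Then

\begin{align*}
\mathsf{D}_{\mathsf{KL}}(\pi||\pi_{\mathsf{I}})
&=\int_{\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}}\pi\log\pi\,d\lambda+\frac{1}{\epsilon}\int_{\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}}\mathsf{C}\,\pi\,d\lambda+\log Z_{\epsilon}\\
&=\frac{1}{\epsilon}\Bigl(\mathsf{E}_{\pi}\left[\mathsf{C}\right]-\epsilon\,\mathsf{H}(\pi)\Bigr)+\log Z_{\epsilon}.
\end{align*}

Since $\log Z_{\epsilon}$ does not depend on $\pi$, the minimizer in (3) coincides with the minimizer of $\mathsf{E}_{\pi}\left[\mathsf{C}\right]-\epsilon\,\mathsf{H}(\pi)$ over $\mathbb{\Pi}(\mu_{0},\nu_{0})$, which is the familiar EOT objective. The entropy term is weighted by $\epsilon$, which is consistent with the limiting behaviour stated in (5).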

In contrast to conventional, base-level OT—in which the transport plan, $\pi(x,y)$ , is a deterministic object—HFPD-OT acknowledges that $\pi(x,y)$ is uncertain (i.e. a random process), and needs to be equipped with an appropriate hierarchical probability model (i.e. triple) (Figure 1(b)). Next, we deduce this optimal model, $\pi\sim\mathsf{S}^{o}$ (1), using the axiomatic Bayesian decision-making framework of hierarchical fully probabilistic design (HFPD).

3.1 The HFPD formulation of optimal transport

Consider a probability model in the hierarchical measurable space, $(\mathbb{\Omega}_{\mathsf{H}},\mathcal{F}_{\mathbb{\Omega}_{\mathsf{H}}})$, where $\mathbb{\Omega}_{\mathsf{H}}\equiv\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}\times\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ and $\mathcal{F}_{\mathbb{\Omega}_{\mathsf{H}}}$ is the $\sigma$-algebra of measurable sets in $\mathbb{\Omega}_{\mathsf{H}}$. Then, $\pi(x,y)\in\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ is a random process endowed with its own distribution, called the hyperprior, and denoted by $\mathsf{S}(\pi|K)$. The notation $\pi\sim\mathsf{S}(\pi|K)$ means that $\pi$ is distributed according to a hyperprior, $\mathsf{S}(\pi|K)$, which is shaped by the knowledge constraints, $K$ (specified below). Moreover, let $\mathcal{L}(\pi)$ denote the reference measure at the hierarchical level of the probability space. In the discrete case, when $\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ specializes to the probability simplex, $\Delta$, $\mathcal{L}(\pi)$ is instantiated as the Lebesgue measure, $\lambda(\pi)$. As in the conventional base-level OT setting, we assume that $\mathsf{S}(\pi|K)$ is absolutely continuous with respect to $\mathcal{L}(\pi)$, and we overload $\mathsf{S}(\pi|K)$ to denote its Radon-Nikodym derivative with respect to $\mathcal{L}(\pi)$.

Let $\mathbb{M}_{\mathsf{H}}$ be the set of joint hierarchical Bayesian models with support in $\mathbb{\Omega}_{\mathsf{H}}$ . The joint hierarchical Bayesian model $\mathsf{M}(x,y,\pi|\mathsf{S},K)\in\mathbb{M}_{\mathsf{H}}$ —our new variational object—reads as follows:

\begin{equation}
\begin{split}\mathsf{M}(x,y,\pi|\mathsf{S},K)&=\mathsf{M}(x,y|\pi,\mathsf{S},K)\mathsf{M}(\pi|\mathsf{S},K)\\ &=\pi(x,y|K)\mathsf{S}(\pi|K).\end{split}
\tag{6}
\end{equation}

Equation (6) is a direct consequence of the conditional independence structure intrinsic to hierarchical modelling (Figure 2), and the fundamental definitions of $\pi$ and $\mathsf{S}$.

Definition (Expected transport plan). The random transport plan, $\pi\sim\mathsf{S}(\pi|K)$ (6), has the expected value,

\begin{equation}
\hat{\pi}_{\mathsf{S}}(x,y|K)\equiv\mathsf{E}_{\mathsf{S}}[\pi]\equiv\int_{\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\pi(x,y|K)\mathsf{S}(\pi|K)d\mathcal{L}(\pi).
\tag{7}
\end{equation}

Hence, the marginal model of $(x,y)$ (and, therefore, the base-level transport plan induced by $\mathsf{S}$) is $\hat{\pi}_{\mathsf{S}}$ (7), as may be seen by integrating both sides of (6) over $\pi\in\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$:

\begin{equation}
\mathsf{M}(x,y|K)=\hat{\pi}_{\mathsf{S}}(x,y|K).
\tag{8}
\end{equation}

This is a necessary condition for consistent hierarchical Bayesian modelling, and arises because of the deterministic mapping, $(x,y)\rightarrow\pi(x,y)$ , imposed by any realization of $\pi\sim\mathsf{S}(\pi|K)$ .
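The relationship between (7) and (8) can be checked numerically in the finite case. The sketch below uses a Dirichlet distribution as a stand-in hyperprior, $\mathsf{S}$; this choice, the parameter array `alpha` and the sample size are assumptions made purely for illustration, and the Dirichlet is not claimed to be the optimal hyperprior. It estimates the expected plan as a sample mean, and compares it with the empirical distribution of $(x,y)$ obtained by two-stage sampling, $\pi\sim\mathsf{S}$ followed by $(x,y)\sim\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 2, 3, 20000

# Hypothetical hyperprior S: a Dirichlet on the simplex of m*k contracts.
alpha = np.array([[4.0, 2.0, 2.0],
                  [1.0, 3.0, 8.0]])
plans = rng.dirichlet(alpha.ravel(), size=n).reshape(n, m, k)

# (7): expected plan, estimated by the sample mean of the random plans.
expected_plan = plans.mean(axis=0)

# (8): marginal model of (x, y) via two-stage sampling (inverse-CDF draw per plan).
flat = plans.reshape(n, m * k)
u = rng.random(n)
cells = (np.cumsum(flat, axis=1) < u[:, None]).sum(axis=1)
cells = np.minimum(cells, m * k - 1)   # guard against rounding in the cumulative sum
freq = np.bincount(cells, minlength=m * k).reshape(m, k) / n

# Equal-tailed 95% intervals for each contract, as a simple uncertainty summary.
lower, upper = np.quantile(plans, [0.025, 0.975], axis=0)

print(np.round(expected_plan, 3))
print(np.round(freq, 3))
```

Up to Monte Carlo error, the two printed arrays agree, illustrating that the base-level model of $(x,y)$ induced by a hyperprior is its expected plan. The quantile arrays give one simple way of reporting uncertainty in each contract.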

From the foregoing, it is evident that the problem of hierarchical transport model design is one of optimization of deterministic $\mathsf{S}\in{\mathbb{S}}(\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$, noting that $\mathsf{S}$ appears as a condition in (6). The challenge in designing the optimal hierarchical model over the set of transport plans in (6) is to optimally process the stochastic knowledge constraints imposed by the uncertain environment while remaining close to an ideal design, $\mathsf{M}_{\mathsf{I}}$, which is used by the modeler to encode additional inductive biases and preferences in the HFPD-OT problem.

The generalized Bayesian inference framework considered here for the purpose of designing the optimal hierarchical model is Fully Probabilistic Design (FPD), introduced in [Kárný and Kroupa, 2012] and extended later to the hierarchical setting in [Quinn et al., 2016]. Generalized Bayesian inference (GBI) is a set of techniques that extend the classical Bayesian inference method by updating the prior belief distribution using a loss function rather than the traditional likelihood function. Under incomplete model specification, the latter may indeed not exist [Bissiri et al., 2016]. However, FPD differs from other GBI techniques in two ways. First, FPD relies on the concept of ideal design in place of a prior, and allows the designer to elicit their personal preferences in the design process through an ideal, and usually unattainable, distribution $\mathsf{M}_{\mathsf{I}}(x,y,\pi|K)\in{\mathbb{M}}_{\mathsf{H}}^{c}\equiv\mathbb{M}\smallsetminus\mathbb{M}_{\mathsf{H}}$ (Figure 3). More precisely, we assume that the ideal design factorizes as follows:

\begin{equation}
\mathsf{M}_{\mathsf{I}}(x,y,\pi|K)\equiv\pi_{\mathsf{I}}(x,y|K)\mathsf{S}_{\mathsf{I}}(\pi|K).
\tag{9}
\end{equation}

In other words, the joint ideal design, $\mathsf{M}_{\mathsf{I}}(x,y,\pi|K)$ , is the base-level ideal design $\pi_{\mathsf{I}}$ , modulated by the hierarchical ideal design $\mathsf{S}_{\mathsf{I}}$ . Note that $\mathsf{M}_{\mathsf{I}}(x,y,\pi|K)$ is unattainable because $\pi_{\mathsf{I}}$ and $\mathsf{S}_{\mathsf{I}}$ are statistically independent models, and, as such, they may be conflicting in the following sense:

\begin{equation}
\mathsf{E}_{\mathsf{S}_{\mathsf{I}}}[\pi]\neq\pi_{\mathsf{I}}(x,y).
\tag{10}
\end{equation}

This is reasonable when we recall that the ideal design is an entirely subjective object used to encode the designer’s preferences (and representing their unattainable, zero-loss state of knowledge). By ranking the designer’s preferences against this ideal design, (hierarchical) FPD is consistent with Savage’s framework for Bayesian decision making [Savage, 1971]. The consistent ranking of knowledge-constrained models (6) is via the KLD referenced to $\mathsf{M}_{\mathsf{I}}$ . Hence, the optimal hierarchical design, $\mathsf{M}^{o}(x,y,\pi|K)$ , is formulated as follows:

\begin{equation}
(P):\;\;\;\;\;\mathsf{M}^{o}\in\operatorname*{argmin}_{\mathsf{M}\in\mathbb{M}_{\mathsf{H}}}\left\{\mathsf{D}_{\mathsf{KL}}\bigl(\mathsf{M}(x,y,\pi|K)||\mathsf{M}_{\mathsf{I}}(x,y,\pi|K)\bigr)\right\},
\tag{11}
\end{equation}

subject to:

\[
\left\{\begin{aligned} \mathsf{E}_{\mathsf{S}}\left[\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})\right]&\leq\eta,\\ \mathsf{E}_{\mathsf{S}}\left[\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\right]&\leq\zeta.\end{aligned}\right.
\]

We note the following:

1.

Since $\mathsf{D}_{\mathsf{KL}}\bigl(\cdot\;||\;\mathsf{M}_{\mathsf{I}}\bigr)$ is continuous, the space of joint hierarchical Bayesian distributions, $\mathbb{M}_{\mathsf{H}}$, is compact in the weak-* topology (see Appendix 7), and the constraint set is nonempty (we can, for instance, choose $\mathsf{S}\equiv\delta_{\mu_{0}\otimes\nu_{0}}$), the minimum is attained.
2.

Moreover, the optimal joint hierarchical model $\mathsf{M}^{o}$ is unique up to a set of measure 0.

The ideal design $\mathsf{M}_{\mathsf{I}}$ enters the KL divergence as the second fixed argument against which all feasible Bayesian hierarchical models are ranked. Importantly, note that the marginals in (11) are no longer modeled as deterministic, crisp objects. This assumption is now relaxed, allowing the modeler to express their uncertainty by viewing the marginals as random realizations of some underlying stochastic process. In particular, we describe this uncertainty in the form of moment constraints: the random marginals belong to uncertainty sets in the form of Kullback-Leibler balls, centered on $\mu_{0}\in\mathbbm{P}(\mathbb{\Omega}_{X})$ and $\nu_{0}\in\mathbbm{P}(\mathbb{\Omega}_{Y})$ . The new knowledge-constrained set of consistent hierarchical Bayesian models—denoted by $\mathbb{M}_{K}\subseteq\mathbb{M}_{\mathsf{H}}$ —is augmented with the following linear moment constraints over the marginals:

\begin{split}\mathbb{M}_{K}\equiv\Bigl{\{}\mathsf{M}(x,y,\pi|K)\;|\;\mathsf{M}% (x,y,\pi|K)\in\mathbb{M}_{\mathsf{H}}\;,\;\mu\in\bbmu\;\text{and}\;\nu\in\bbnu% \Bigr{\}}\end{split}

(12)

with the sets $\bbmu$ and $\bbnu$ defined as follows (Figure 1(b)):

\bbmu\equiv\{\mu\in\mathbb{P}(\mathbb{\Omega}_{X})\;|\;\mathsf{E}_{\mathsf{S}}% \left[\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})\right]\leq\eta\}

(13)

\bbnu\equiv\{\nu\in\mathbb{P}(\mathbb{\Omega}_{Y})\;|\;\mathsf{E}_{\mathsf{S}}% \left[\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\right]\leq\zeta\}

(14)

where $\eta\geq 0$ and $\zeta\geq 0$ are prior-elicited KL radii that express the degree of uncertainty the designer places over the marginals.

As we will see in the sequel, the interaction between the base-level and hierarchical ideals, on one hand, and the knowledge constraints on the other, is what gives rise to the Gibbsian form of the hyperprior in (1).

We now state the main result of the paper.

{theorem}

Let (P) be the HFPD-OT Primal problem, defined in (11).

(P) is equivalent to the following optimization problem over the set of hierarchical Bayesian models $\mathbb{M}_{\mathsf{H}}$ (12):

(P):\;\;\;\;\;\mathsf{M}^{o}(x,y,\pi)\in\operatorname*{argmin}_{\mathsf{M}\in% \mathbb{M}_{\mathsf{H}}}\left\{\mathsf{D}_{\mathsf{KL}}\bigl{(}\mathsf{M}(x,y,% \pi|K)||\hat{\pi}_{\tilde{\mathsf{S}}}(x,y)\tilde{\mathsf{S}}(\pi|K)\bigr{)}% \right\},

(15)

subject to

\left\{\begin{aligned} \mathsf{E}_{\mathsf{S}}(\mathsf{D}_{\mathsf{KL}}(\mu||% \mu_{0}))\leq\eta\\ \mathsf{E}_{\mathsf{S}}(\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0}))\leq\zeta\end{% aligned}\right.

where

\tilde{\mathsf{S}}(\pi|K)\equiv\mathsf{S}_{\mathsf{I}}(\pi)\exp\Bigl{(}-% \mathsf{D}_{\mathsf{KL}}\bigl{(}\pi(x,y)||\pi_{\mathsf{I}}(x,y)\bigr{)}\Bigr{)}.

(16)

The optimal hyperprior $\mathsf{S}^{o}(\pi|K)$ reads as follows:

\mathsf{S}^{o}(\pi|K)\propto\exp\left(-\lambda_{1}^{o}{\mathsf{D}_{\mathsf{KL}% }}(\mu||\mu_{0})\right)\tilde{\mathsf{S}}(\pi|K)\exp\left(-\lambda_{2}^{o}{% \mathsf{D}_{\mathsf{KL}}}(\nu||\nu_{0})\right),\;\;\mathcal{L}\text{-a.e.}

(17)

The Dual program associated with the primal $(P)$ (15) reads

\begin{split}(D):\;\;\;\;\sup_{\boldsymbol{\lambda}\succeq 0}\left\{\log\left(% \mathsf{N}(\boldsymbol{\lambda})\right)-\boldsymbol{\lambda}^{\intercal}% \boldsymbol{\theta}\right\},\end{split}

(18)

where

\mathsf{N}(\boldsymbol{\lambda})\equiv\left(\int_{\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\tilde{\mathsf{S}}(\pi|K)\exp\left(-\langle\boldsymbol{\lambda},\mathsf{R}(\pi)\rangle-1\right)d\mathcal{L}(\pi)\right)^{-1},

(19)

and $\boldsymbol{\lambda}$ $\equiv$ $\begin{bmatrix}\lambda_{1}\\ \lambda_{2}\end{bmatrix}$ , $\boldsymbol{\theta}\equiv\begin{bmatrix}\eta\\ \zeta\end{bmatrix}$ , $\mathsf{R}(\pi)$ $\equiv$ $\begin{bmatrix}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})\\ \mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\end{bmatrix}$ .

Moreover, strong duality holds, i.e. the optimal Kantorovitch potentials, $\boldsymbol{\lambda}^{o}$ in (17), are the solution of the dual problem (18),

\boldsymbol{\lambda}^{o}\equiv\operatorname*{argmax}_{\boldsymbol{\lambda}% \succeq 0}\left\{\log\left(\mathsf{N}(\boldsymbol{\lambda})\right)-\boldsymbol% {\lambda}^{\intercal}\boldsymbol{\theta}\right\},

(20)

and the maximum of the dual problem is attained: $\min_{\mathsf{M}}(P)=\max_{\boldsymbol{\lambda}}(D)$ .

Proof method.

Results (1) and (2) of the Theorem can be proved using basic algebraic manipulations. However, we opt here for a derivation based on information processing arguments, so as to gain more intuition about the design of the hyperprior in the hierarchical setting.

Given the factorized joint ideal design in (9), the optimal hyperprior $\mathsf{S}^{o}(\pi|K)$ emerges via two sequential knowledge-processing steps (Figure 3), addressed in the first two of the following items:

1.

Adapting the ideal design and processing the hyperprior without knowledge constraints $K$ . The purpose of this first step is to guide the optimization problem $(P)$ in (11) from a possibly inconsistent ideal, $\mathsf{M}_{\mathsf{I}}$ (10), to a new consistent target (step 1 in Figure 3). The adapted hyperprior, $\tilde{\mathsf{S}}$ (16), expresses the best compromise between possibly conflicting ideals. It involves the Gibbs-type modulation of the hierarchical ideal design $\mathsf{S}_{\mathsf{I}}$ via a term that depends on the base-level ideal design $\pi_{\mathsf{I}}$ (Theorem 1 in [Quinn et al., 2016]). The optimal hierarchical model $\tilde{\mathsf{M}}\in\mathbb{M}_{\mathsf{H}}$ is a boundary point in the convex set $\mathbb{M}_{\mathsf{H}}$ and is inferred from (6) as follows:

\tilde{\mathsf{M}}(x,y,\pi|K)=\hat{\pi}_{\tilde{\mathsf{S}}}(x,y|K)\tilde{% \mathsf{S}}(\pi|K)

(21)

where $\hat{\pi}_{\tilde{\mathsf{S}}}$ is the expected transport plan w.r.t $\tilde{\mathsf{S}}$ and follows from (7).

2.

Processing the two marginal constraints specified in the knowledge set $K$ . This step leads to the new optimization problem stated in (15), which results in the optimal hyperprior (17) (Theorem 3 in [Quinn et al., 2016]). Each of the marginal constraints induces a MaxEnt Gibbs term that modulates the hyperprior obtained in Step 1. The resulting optimal hierarchical model, $\mathsf{M}^{o}\in\mathbb{M}_{K}\subseteq\mathbb{M}_{\mathsf{H}}$ , which is also a boundary point in the convex set $\mathbb{M}_{K}$ , reads as follows:

\mathsf{M}^{o}(x,y,\pi|K)=\hat{\pi}_{\mathsf{S}^{o}}(x,y|K)\mathsf{S}^{o}(\pi|K)

(22)

where $\hat{\pi}_{\mathsf{S}^{o}}$ follows similarly from (7).

3.

It remains to prove the strong duality result and formally characterize the Kantorovitch potentials in (18). The details of this proof are provided in Appendix 7. There, we prove strong duality in the infinite dimensional case by relying on the classical Fenchel-Rockafellar duality theorem [Rockafellar, 1967], [Villani, 2008]. More precisely, we demonstrate that the conditions required by the theorem are satisfied in the hierarchical Bayesian setting of HFPD-OT, and we derive the dual problem $(D)$ .

∎

By sampling random realizations from our optimal hyperprior, we can design randomized and diverse transport policies in lieu of an immutable and fixed OT plan. This randomization principle is depicted in Figure 4. More precisely, the design of the optimal hyperprior over the space of transport plans is a twofold process:

1.

The knowledge constraints $K$ are processed to yield the optimal hyperprior (17). This mainly requires conditioning the Kantorovitch potentials on the uncertainty radii, $(\eta,\zeta)$ (Figure 4(a)).
2.

Once the optimal hyperprior is available, random transport strategies are sampled and used in subsequent transport problems, in lieu of a crisp OT plan. Importantly, having access to a generative model over the space of transport plans provides us with the statistical devices to assess and reason about the intrinsic uncertainty in the transport problem (Figure 4(b)). The expected transport plan, $\hat{\pi}_{\mathsf{S}^{o}}$ , is obtained from (7).

Remark 1.

The Kantorovitch potentials $\lambda_{1}=\lambda_{1}(\eta,\zeta)$ and $\lambda_{2}=\lambda_{2}(\eta,\zeta)$ express the degree of uncertainty in the input data—i.e. the marginals. Depending on their values, they give rise to two interesting extremal modalities, that vary from high uncertainty to perfect characterization of the marginals:

•

If $\eta\rightarrow\infty$ and $\zeta\rightarrow\infty$ , it is straightforward from (18) that the solution of the dual is achieved when $\boldsymbol{\lambda}^{o}=\boldsymbol{0}$ . This is also a direct consequence of complementary slackness. It follows that

\mathsf{S}^{o}(\pi|K)\xrightarrow[]{\eta\to\infty,\;\zeta\to\infty}\tilde{% \mathsf{S}}(\pi|K).

(23)

In other words, when the uncertainty around the marginals is unbounded, the optimal hyperprior is mainly characterized—see (16)—by the hierarchical ideal design modulated by a Gibbsian term that depends on $\pi_{\mathsf{I}}$ .

•

If $\eta\rightarrow 0$ and $\zeta\rightarrow 0$ , the uncertainty in the marginals vanishes and learning\footnote{In the context of (H)FPD, learning (i.e. inductive inference) refers to the optimal processing of knowledge constraints into the hyperprior: $K\rightarrow\mathsf{S}^{o}(\pi|K)$ . For more discussion on the role of FPD in furnishing generalized settings of Bayes’ rule, see [Kracík and Kárný, 2005].} is maximal, leading to $\mu\rightarrow\mu_{0}$ and $\nu\rightarrow\nu_{0}$ , or equivalently $\pi\rightarrow\tilde{\pi}\in\mathbb{\Pi}(\mu_{0},\nu_{0})$ . It follows from the dual (18) that the maximum is attained when $\boldsymbol{\lambda}^{o}\rightarrow\infty$ , and we achieve the limit,

\mathsf{S}^{o}(\pi|K)\xrightarrow[]{\eta\to 0,\;\zeta\to 0}\tilde{\mathsf{S}}(% \pi|K)\chi_{\mathbb{\Pi}(\mu_{0},\nu_{0})}(\pi).

(24)

In other words, the hyperprior concentrates on the OT manifold, $\mathbb{\Pi}(\mu_{0},\nu_{0})$ (Section 2.2). This concentration behaviour is reminiscent of the Laplace-Bernstein-Von Mises convergence theorem [Kolmogorov and Sarmanov, 1960].
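The two extremal modalities of Remark 1 can be illustrated numerically in the finite case. The following is a minimal sketch, assuming discrete $2\times 2$ plans; the helper names `kl` and `marginal_gibbs_weight` and the example plan are hypothetical, and only the marginal-dependent Gibbs factor that modulates $\tilde{\mathsf{S}}$ in (17) is evaluated, not the full hyperprior.

```python
import math

def kl(p, q):
    """Discrete KL divergence D_KL(p || q), with 0 * log(0 / q) taken as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def marginal_gibbs_weight(plan, mu0, nu0, lam1, lam2):
    """Factor exp(-lam1 * KL(mu||mu0) - lam2 * KL(nu||nu0)) multiplying S-tilde in (17)."""
    mu = [sum(row) for row in plan]        # source marginal (row sums)
    nu = [sum(col) for col in zip(*plan)]  # target marginal (column sums)
    return math.exp(-lam1 * kl(mu, mu0) - lam2 * kl(nu, nu0))

# Illustrative nominal marginals and a plan whose source marginal is (0.4, 0.6),
# i.e. a plan lying off the set Pi(mu0, nu0).
mu0, nu0 = [0.3, 0.7], [0.8, 0.2]
off_manifold = [[0.30, 0.10], [0.50, 0.10]]

# Any plan in Pi(mu0, nu0) has weight 1, so these values are relative weights.
weights = {lam: marginal_gibbs_weight(off_manifold, mu0, nu0, lam, lam)
           for lam in (0.0, 10.0, 100.0, 1000.0)}
```

With vanishing potentials the factor is identically one, so the hyperprior reduces to $\tilde{\mathsf{S}}$ as in (23); as the potentials grow, the relative weight of any plan outside $\mathbb{\Pi}(\mu_{0},\nu_{0})$ decays towards zero, in line with (24).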

Remark 2.

Conventional base-level OT. Consider further the regime of perfect specification of the marginals, i.e. $\eta\rightarrow 0$ , $\zeta\rightarrow 0$ (Remark 1). The conjugate choice of the ideal hyperprior, $\mathsf{S}_{\mathsf{I}}$ , has the following Gibbs form:

\mathsf{S}_{\mathsf{I}}(\pi|K)\propto\exp(-\alpha\mathsf{D}_{\mathsf{KL}}(\pi(% x,y)||\pi_{\mathsf{I}}(x,y))).

(25)

Here, $\alpha>0$ plays the role of the inverse-temperature. Substituting (25) into (24), the optimal hyperprior becomes

\mathsf{S}^{o}(\pi|K)\xrightarrow[]{\eta\to 0,\;\zeta\to 0}\exp(-(\alpha+1)% \mathsf{D}_{\mathsf{KL}}(\pi||\pi_{\mathsf{I}}))\chi_{\mathbb{\Pi}(\mu_{0},\nu% _{0})}(\pi).

(26)

When $\pi_{\mathsf{I}}$ is the extended Gibbs kernel (4)—where we instantiate $\phi$ as the uniform distribution with support in $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$ —the minimum of $\mathsf{D}_{\mathsf{KL}}(\pi||\pi_{\mathsf{I}})$ in (26) is exactly achieved at the EOT solution (5):

\pi^{o}(x,y|K)=\operatorname*{argmin}_{\pi\in\mathbb{\Pi}(\mu_{0},\nu_{0})}% \mathsf{D}_{\mathsf{KL}}(\pi(x,y)||\pi_{\mathsf{I}}(x,y)).

(27)

The latter can be recovered when $\alpha\rightarrow\infty$ , for example by simulated annealing [Delahaye et al., 2019]:

\mathsf{S}^{o}(\pi|K)\xrightarrow[]{\eta\to 0,\;\zeta\to 0,\alpha\to\infty}% \delta_{\pi^{o}}(\pi).

(28)

4 The HFPD-OT hyperprior in the parametric case

As already noted, no special assumptions have been made in respect of the hierarchical transport model (6), and so (17) is the HFPD-OT hyperprior for the nonparametric (transport) process, $\pi\in\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ . The finite case—i.e. $\#(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})<\infty$ —induces the parametric setting of HFPD-OT, with $\pi$ defined in the usual way w.r.t. the counting measure, and $\mathsf{S}^{o}(\pi|K)$ defined on a $K$ -constrained subset (3.1) of the simplex. This allows us to easily visualize key properties of $\mathsf{S}^{o}(\pi|K)$ in a low dimensional setting, and, importantly, to develop algorithms for computing random draws (Figure 4(b)), $\pi^{(k)}\sim\mathsf{S}^{o}(\pi|K)$ , from the HFPD-OT parametric hyperprior (17), via approximation of the Kantorovitch potentials (20).

4.1 Descriptive analysis of the parametric HFPD-OT hyperprior, $\mathsf{S}^{o}(\pi|K)$

In the finite, parametric case (which we will pursue in the rest of this paper), $x\in\mathbb{\Omega}_{X}\equiv\{x_{1},\ldots,x_{i},\ldots,x_{m}\}$ and $y\in\mathbb{\Omega}_{Y}\equiv\{y_{1},\ldots,y_{j},\ldots,y_{n}\}$ , with $2\leq m<\infty$ and $2\leq n<\infty$ . We refer to $\mathbb{\Omega}_{X}$ and $\mathbb{\Omega}_{Y}$ as the sets of source agents and target agents, respectively. Then, the base-level distributions are uncertain multinomials, with densities $\mu=\sum_{i=1}^{m}\mu_{i}\delta_{x_{i}}$ , $\nu=\sum_{j=1}^{n}\nu_{j}\delta_{y_{j}}$ and $\pi=\sum_{j=1}^{n}\sum_{i=1}^{m}\pi_{i,j}\delta_{x_{i},y_{j}}$ . The associated pmfs are structured as vector-matrix objects, and also denoted by the same symbols: $\mu\in\Delta_{m-1}$ , $\nu\in\Delta_{n-1}$ and $\pi\in\Delta_{mn-1}$ . Without loss of generality, we consider the following class of conjugate\footnote{We consider a weak form of conjugacy [Quinn, 2012], where the processing of the ideal design, $\mathsf{S}_{\mathsf{I}}(\pi|K)$ , via hierarchical FPD yields an optimal hyperprior, $\mathsf{S}^{o}(\pi|K)$ , of the same functional form.} hierarchical ideal designs, parameterized by fixed ${\boldsymbol{\lambda}}_{\mathsf{I}}\succeq 0$ (we absorb the parameter conditions, here ${\boldsymbol{\lambda}}_{\mathsf{I}}$ , $\mu_{0}$ and $\nu_{0}$ , into the Jeffreys’ notation, $K$ ):

\mathsf{S}_{\mathsf{I}}(\pi|K)\propto\prod_{i=1}^{m}\Bigl{(}\frac{\mu_{i}}{\mu% _{0,i}}\Bigr{)}^{-\lambda_{\mathsf{I},1}\mu_{i}}\prod_{j=1}^{n}\Bigl{(}\frac{% \nu_{j}}{\nu_{0,j}}\Bigr{)}^{-\lambda_{\mathsf{I},2}\nu_{j}}

(29)

The base-level ideal design, $\pi_{\mathsf{I}}(x,y|K)$ , has the form of the extended Gibbs kernel (4), consistent with the FPD-OT setting. We further specialize $\phi(x,y)$ to the uniform case, $\phi(\cdot)\equiv{\mathcal{U}}$ , yielding the following form of the parametric hyperprior:

{definition}

[HFPD-OT hyperprior for the parametric transport plan] The transport hyperprior (17) in the case of a domain, $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$ , of finite cardinality, $m\times n$ , is parametric, with parameters $(\lambda_{1}^{o},\lambda_{2}^{o},\lambda_{\mathsf{I},1},\lambda_{\mathsf{I},2}% ,\mu_{0},\nu_{0},\pi_{\mathsf{I}})$ , and support on the probability simplex $\Delta_{m\times n-1}$ . It is absolutely continuous w.r.t. Lebesgue measure, $\lambda$ , with density

\mathsf{S}^{o}(\pi|K)\propto\prod_{i=1}^{m}\Bigl{(}\frac{\mu_{i}}{\mu_{0,i}}\Bigr{)}^{-(\lambda_{\mathsf{I},1}+\lambda_{1}^{o})\mu_{i}}\prod_{j=1}^{n}\Bigl{(}\frac{\nu_{j}}{\nu_{0,j}}\Bigr{)}^{-(\lambda_{\mathsf{I},2}+\lambda_{2}^{o})\nu_{j}}\prod_{i=1}^{m}\prod_{j=1}^{n}\Bigl{(}\frac{\pi_{i,j}}{\pi_{\mathsf{I},i,j}}\Bigr{)}^{-\pi_{i,j}}\;\;\lambda\text{-a.e.},

(30)

with the ideal design having the following Gibbs form:

\pi_{\mathsf{I},i,j}\propto\exp\Bigl{(}-\frac{\mathsf{C}(x_{i},y_{j})}{% \epsilon}\Bigr{)}

The number of prior parameters, encoding $K$ in (30), is $(m+1)\times(n+1)$ . This endows the HFPD-OT hyperprior design with far more expressivity (i.e. degrees-of-freedom (dofs)) than default distributions on the probability simplex. For instance, a Dirichlet distribution of $\pi$ in this finite setting has $m+n+1$ fewer dofs.
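As a hedged illustration, the parametric hyperprior can be evaluated up to its unknown normalizing constant by substituting (16) and (29) into (17), which gives a sum of three weighted KL terms in the log domain. The sketch below assumes nested Python lists for the plan; the function names `gibbs_kernel` and `log_hyperprior` and the example values are hypothetical.

```python
import math

def gibbs_kernel(cost, eps):
    """Normalized base-level ideal design: pi_I[i][j] proportional to exp(-C[i][j] / eps)."""
    raw = [[math.exp(-c / eps) for c in row] for row in cost]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]

def kl(p, q):
    """Discrete KL divergence D_KL(p || q), with 0 * log(0 / q) taken as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def log_hyperprior(plan, mu0, nu0, pi_ideal, lam_opt, lam_ideal=(0.0, 0.0)):
    """Unnormalized log-density of the parametric hyperprior:
    -(lam_I1 + lam_1) KL(mu||mu0) - (lam_I2 + lam_2) KL(nu||nu0) - KL(pi||pi_I)."""
    mu = [sum(row) for row in plan]        # source marginal (row sums)
    nu = [sum(col) for col in zip(*plan)]  # target marginal (column sums)
    flat_plan = [p for row in plan for p in row]
    flat_ideal = [q for row in pi_ideal for q in row]
    return (-(lam_ideal[0] + lam_opt[0]) * kl(mu, mu0)
            - (lam_ideal[1] + lam_opt[1]) * kl(nu, nu0)
            - kl(flat_plan, flat_ideal))

# Illustrative 2 x 2 setting (values chosen for the example only).
cost = [[0.0, 1.0], [1.0, 0.0]]
pi_ideal = gibbs_kernel(cost, eps=1.0)
mu0, nu0 = [0.3, 0.7], [0.8, 0.2]
plan = [[0.30, 0.10], [0.50, 0.10]]  # source marginal (0.4, 0.6) differs from mu0
log_s_weak = log_hyperprior(plan, mu0, nu0, pi_ideal, lam_opt=(1.0, 1.0))
log_s_strong = log_hyperprior(plan, mu0, nu0, pi_ideal, lam_opt=(100.0, 100.0))
```

Working in the log domain sidesteps the intractable normalizing constant, which is all that a gradient-based sampler such as HMC requires.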

Remark 3 (Inference with the HFPD-OT hyperprior, $\mathsf{S}^{o}(\pi|K)$ ).

The normalizing constant of the HFPD-OT hyperprior (30) is not available in closed form. A full study of its numerical approximation will be the subject of future work.

The marginal distribution of $\pi_{1:k,1:l}\in\Delta_{k\times l-1}$ , being the sub-matrix of $\pi$ associated with the contracts, $\pi_{ij}$ , $1\leq i\leq k<m$ and $1\leq j\leq l<n$ , is

\mathsf{S}^{o}(\pi_{1:k,1:l}|K)=\int\displaylimits_{(1-w_{kl})\Delta_{mn-kl-1}% }\mathsf{S}^{o}(\pi|K)d\pi_{\backslash(1:k,1:l)},

(31)

where $w_{kl}\equiv\sum_{j=1}^{l}\sum_{i=1}^{k}\pi_{ij}$ , and $\pi_{\backslash(1:k,1:l)}$ denotes the complement of $\pi_{1:k,1:l}$ in $\pi$ . In particular, the marginal distribution of $\pi_{k,l}\in(0,1)$ , i.e. of the $(k,l)$th random contract, being the normalized mass (probability) transported from the $k$th source node to the $l$th target node, is

\mathsf{S}^{o}(\pi_{kl}|K)=\int\displaylimits_{(1-\pi_{kl})\Delta_{mn-2}}% \mathsf{S}^{o}(\pi|K)d\pi_{\backslash(k,l)}.

(32)

Finally, the HFPD-optimal full conditional distribution of the $(k,l)$ th contract—having fixed all the others at specific probabilities, $\pi_{0}{{}_{\backslash(k,l)}}$ —is

\mathsf{S}^{o}(\pi_{k,l}|\pi_{0}{{}_{\backslash(k,l)}},K)\propto\mathsf{S}^{o}% (\pi_{k,l},\pi_{0}{{}_{\backslash(k,l)}}|K)\;\chi_{(0,1-c_{kl})}(\pi_{kl}),

(33)

where $c_{kl}\equiv\underbrace{\sum_{j=1}^{n}\sum_{i=1}^{m}}_{(i,j)\notin\{(k,l),(m,n% )\}}\pi_{0}{{}_{i,j}}$ .

4.1.1 Illustration in the $m=n=2$ case

To gain further insight into the parametric HFPD-OT hyperprior, $\mathsf{S}^{o}(\pi|K)$ (30), we explore its location and shape in the $m=n=2$ case. Then, $\mathsf{S}^{o}(\pi_{11},\pi_{12},\pi_{21}|K)$ has support in the three-dimensional simplex, i.e. $(\pi_{11},\pi_{12},\pi_{21})\in\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{% \Omega}_{Y})\equiv\Delta_{3}$ (Figure 5). We assume that $\boldsymbol{\lambda}^{o}\gg\boldsymbol{\lambda}_{\mathsf{I}}$ , which corresponds to the knowledge-dominated regime [Jeffreys, 1939] in which the ideal in (16) is diffuse in comparison with the $K$ -dependent modulating terms in (17). In this case, (30) specializes to:

\begin{split}\mathsf{S}^{o}(\pi_{11},\pi_{12},\pi_{21}|K)\propto\Bigl{(}\frac{% \pi_{11}+\pi_{12}}{\mu_{0,1}}\Bigr{)}^{-\lambda_{1}^{o}(\pi_{11}+\pi_{12})}% \Bigl{(}\frac{1-\pi_{11}-\pi_{12}}{1-\mu_{0,1}}\Bigr{)}^{-\lambda_{1}^{o}(1-% \pi_{11}-\pi_{12})}\Bigl{(}\frac{\pi_{11}+\pi_{21}}{\nu_{0,1}}\Bigr{)}^{-% \lambda_{2}^{o}(\pi_{11}+\pi_{21})}\times&\\ \Bigl{(}\frac{1-\pi_{11}-\pi_{21}}{1-\nu_{0,1}}\Bigr{)}^{-\lambda_{2}^{o}(1-% \pi_{11}-\pi_{21})}\Bigl{(}\frac{\pi_{11}}{\pi_{\mathsf{I},11}}\Bigr{)}^{-\pi_% {11}}\Bigl{(}\frac{\pi_{12}}{\pi_{\mathsf{I},12}}\Bigr{)}^{-\pi_{12}}\Bigl{(}% \frac{\pi_{21}}{\pi_{\mathsf{I},21}}\Bigr{)}^{-\pi_{21}}\Bigl{(}\frac{1-\pi_{1% 1}-\pi_{12}-\pi_{21}}{1-\pi_{\mathsf{I},11}-\pi_{\mathsf{I},12}-\pi_{\mathsf{I% },21}}\Bigr{)}^{-(1-\pi_{11}-\pi_{12}-\pi_{21})}\end{split}

(34)

Its parameters are $\mu_{0,1}\in(0,1)$ , $\nu_{0,1}\in(0,1)$ , $(\pi_{\mathsf{I},11},\pi_{\mathsf{I},12},\pi_{\mathsf{I},21})\in\Delta_{3}$ and the Kantorovitch potentials, $\boldsymbol{\lambda}^{o}\succeq\boldsymbol{0}$ . The purpose of the following simulations is to study the influence of the Kantorovitch potentials, $\boldsymbol{\lambda}^{o}$ (20), and the nominal marginals, $\mu_{0}$ and $\nu_{0}$ , on the location and shape of the hyperprior. For ease of visualization (in $\Delta_{2}$ ), we focus primarily on the bivariate marginal distribution\footnote{All integrals in this section are computed using Gaussian quadrature integration, yielding results with an average integration error of $\approx 1.46\times 10^{-8}$ .} (31), i.e. the hyperprior concentrated on the two contracts forming the first row of the uncertain transport plan (Figure 5):

\mathsf{S}^{o}(\pi_{11},\pi_{12}|K)\propto\int_{0}^{1-\pi_{11}-\pi_{12}}% \mathsf{S}^{o}(\pi_{11},\pi_{12},\pi_{21}|K)d\pi_{21}

(35)
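The marginalization in (35) can be sketched with a fixed-order Gauss-Legendre rule. The code below is a minimal illustration, not the authors' implementation: the 5-point rule, the function names and the parameter values ( $\boldsymbol{\lambda}^{o}=(10,10)$ , a symmetric ideal design) are assumptions made for the example, and a production integrator would use a higher-order or adaptive rule.

```python
import math

# 5-point Gauss-Legendre nodes and weights on [-1, 1] (exact up to polynomial degree 9).
_NODES = (0.0, -0.5384693101056831, 0.5384693101056831,
          -0.9061798459386640, 0.9061798459386640)
_WEIGHTS = (0.5688888888888889, 0.4786286704993665, 0.4786286704993665,
            0.2369268850561891, 0.2369268850561891)

def gauss_legendre(f, a, b):
    """Approximate the integral of f over [a, b] by mapping the rule from [-1, 1]."""
    half, mid = 0.5 * (b - a), 0.5 * (a + b)
    return half * sum(w * f(mid + half * x) for x, w in zip(_NODES, _WEIGHTS))

def xlog(p, q):
    """p * log(p / q), with the convention that the term vanishes for p <= 0."""
    return p * math.log(p / q) if p > 0.0 else 0.0

def joint_density(p11, p12, p21, mu01, nu01, pi_ideal, lam1, lam2):
    """Unnormalized density (34); returns 0 outside the simplex."""
    p22 = 1.0 - p11 - p12 - p21
    if min(p11, p12, p21, p22) < 0.0:
        return 0.0
    mu1, nu1 = p11 + p12, p11 + p21
    log_s = (-lam1 * (xlog(mu1, mu01) + xlog(1.0 - mu1, 1.0 - mu01))
             - lam2 * (xlog(nu1, nu01) + xlog(1.0 - nu1, 1.0 - nu01))
             - xlog(p11, pi_ideal[0][0]) - xlog(p12, pi_ideal[0][1])
             - xlog(p21, pi_ideal[1][0]) - xlog(p22, pi_ideal[1][1]))
    return math.exp(log_s)

def marginal_density(p11, p12, **params):
    """Unnormalized bivariate marginal (35): integrate pi_21 over (0, 1 - pi_11 - pi_12)."""
    upper = 1.0 - p11 - p12
    if p11 < 0.0 or p12 < 0.0 or upper <= 0.0:
        return 0.0
    return gauss_legendre(lambda p21: joint_density(p11, p12, p21, **params), 0.0, upper)

# Illustrative parameters: symmetric Gibbs ideal with a = 1 / (2 + 2 exp(-1)).
a = 1.0 / (2.0 + 2.0 * math.exp(-1.0))
params = dict(mu01=0.3, nu01=0.8, pi_ideal=[[a, 0.5 - a], [0.5 - a, a]],
              lam1=10.0, lam2=10.0)
near = marginal_density(0.2, 0.1, **params)  # first-row mass 0.3 matches mu_{0,1}
far = marginal_density(0.6, 0.3, **params)   # first-row mass 0.9 is far from mu_{0,1}
```

Evaluating `marginal_density` on a grid over $\Delta_{2}$ gives the kind of contour data that a plot of the marginal hyperprior requires.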

Shape parameters:

The cost matrix (4) and nominal marginals are respectively set to the following values:

\mathsf{C}\equiv\begin{bmatrix}0&1\\ 1&0\end{bmatrix},\;\;\;(\mu_{0},\nu_{0})\equiv\left\{\begin{bmatrix}0.3\\ 0.7\end{bmatrix},\begin{bmatrix}0.8\\ 0.2\end{bmatrix}\right\}.

For now, we fix the smoothness parameter $\epsilon=1$ and study its influence on the shape of the hyperprior in a separate section. We examine the influence of the Kantorovitch potentials, $\boldsymbol{\lambda}^{o}$ , on the shape of the hyperprior, by varying their values as follows: $\boldsymbol{\lambda}^{o}\in\{0.05,10,100\}^{2}$ .

As discussed earlier, these potentials—through their connection to the KLD radii, $(\eta,\zeta)$ —quantify the uncertainty in the marginals and induce two asymptotic learning modes. The first is attained when $\boldsymbol{\lambda}^{o}\rightarrow\boldsymbol{0}$ , and coincides with the non-specification of the marginals, and the absence of effective learning. The second is attained when $\boldsymbol{\lambda}^{o}\rightarrow\infty$ , i.e. when there is perfect specification of the marginals. The visualizations in Figure 6—which shows the contour plots of the marginal hyperprior, $\mathsf{S}^{o}(\pi_{11},\pi_{12}|K)$ for the chosen values of $\boldsymbol{\lambda}^{o}$ —illustrate this concentration behaviour, as we progress from the first to the second modality. By increasing the potentials, the contours gradually concentrate on a thin statistical manifold, namely $\mathbb{\Pi}(\mu_{0},\nu_{0})$ . In addition to the marginal hyperprior, we show the first row, ( $\pi_{11}$ , $\pi_{12}$ ), of the expected transport plan, $\hat{\pi}_{\mathsf{S}}$ (7) (red dot). The latter is obtained by averaging samples drawn from the joint hyperprior: $\pi^{(k)}\sim\mathsf{S}^{o}(\pi_{11},\pi_{12},\pi_{21}|K)$ . The blue dot, on the other hand, corresponds to the first row of the EOT plan, $\pi^{o}(x_{i},y_{j}|K)$ (5), computed for the nominal marginals, $(\mu_{0},\nu_{0})$ , using the Sinkhorn-Knopp algorithm [Cuturi, 2013] (and is, of course, invariant with $\boldsymbol{\lambda}^{o}$ ). The expected transport plan gradually converges towards the OT plan, as the support of the marginal hyperprior contracts towards $\mathbb{\Pi}(\mu_{0},\nu_{0})$ when $\boldsymbol{\lambda}^{o}\rightarrow\infty$ , which is consistent with the Laplace-Bernstein concentration theorem.
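For reference, the EOT plan marked by the blue dot can be reproduced with a few lines of Sinkhorn-Knopp scaling. This is a minimal sketch using the cost matrix, nominal marginals and $\epsilon=1$ stated above; the function name `sinkhorn` and the fixed iteration count are assumptions, and no convergence check is included.

```python
import math

def sinkhorn(cost, mu0, nu0, eps, n_iter=500):
    """Sinkhorn-Knopp scaling: returns diag(u) K diag(v) with K = exp(-C / eps)."""
    m, n = len(mu0), len(nu0)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(n)] for i in range(m)]
    u, v = [1.0] * m, [1.0] * n
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the nominal marginals.
        u = [mu0[i] / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [nu0[j] / sum(kernel[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n)] for i in range(m)]

cost = [[0.0, 1.0], [1.0, 0.0]]
mu0, nu0 = [0.3, 0.7], [0.8, 0.2]
eot_plan = sinkhorn(cost, mu0, nu0, eps=1.0)
first_row = eot_plan[0]  # the pair (pi_11, pi_12) plotted in the marginal simplex
```

Because the scaling only multiplies rows and columns of the Gibbs kernel, the cross-ratio $\pi_{11}\pi_{22}/(\pi_{12}\pi_{21})$ of the result equals that of the kernel, which offers a quick sanity check on the output.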

Location parameters:

The nominal marginals, $(\mu_{0},\nu_{0})$ , play the role of location parameters for the hyperprior. To illustrate this, we fix the Kantorovitch potentials and the smoothness parameter, respectively, to default values: $\boldsymbol{\lambda}^{o}=(1,1)$ , $\epsilon=1$ and vary the nominal marginals as follows:

(\mu_{0},\nu_{0})\in\left\{\begin{bmatrix}(0.9,0.1)\\ (0.1,0.9)\end{bmatrix}^{\intercal},\begin{bmatrix}(0.5,0.5)\\ (0.5,0.5)\end{bmatrix}^{\intercal},\begin{bmatrix}(0.1,0.9)\\ (0.9,0.1)\end{bmatrix}^{\intercal}\right\}.

(36)

For each pair of the nominal marginals in (36), we show in Figure 7 the contour plot of the marginal hyperprior, $\mathsf{S}^{o}(\pi_{11},\pi_{12}|K)$ . Moreover, we plot the first row of the expected transport plan, $\hat{\pi}_{\mathsf{S}}$ , in red and the EOT plan, $\pi^{o}$ in blue. The location of the mode is clearly influenced by the nominal marginals, $(\mu_{0},\nu_{0})$ , and more precisely, by their symmetry and skewness. The expected plan, $\hat{\pi}_{\mathsf{S}}$ (7), is attracted by the mode of the marginal hyperprior; the optimal plan, on the other hand, initially has a low probability under the marginal hyperprior but contracts gradually towards the mode.

Influence of the ideal hyperprior:

Finally, we explore the influence of the ideal design (9), and, more precisely, its smoothness parameter, $\epsilon$ , which enters at the base-level of the ideal specification (4). We hold the nominal marginals, $(\mu_{0},\nu_{0})$ , constant, as indicated. By varying $\epsilon\in\{0.1,0.5,10\}$ , it is clear from Figure 8 that this parameter affects the location of the hyperprior, $\mathsf{S}^{o}(\pi_{11},\pi_{12}|K)$ .

4.2 Stochastic approximation of the optimal Kantorovitch potentials

We now focus on the derivation of the optimal Kantorovitch potentials, $\boldsymbol{\lambda}^{o}$ . This requires processing the knowledge constraints, $(\eta,\zeta)$ , in the hyperprior, by solving the dual program (18). To this end, we leverage a combination of second-order optimization and MCMC techniques.

Computing $\boldsymbol{\lambda}^{o}$ by means of the dual program in (18) is a critical step in the design of the optimal hyperprior, $\mathsf{S}^{o}(\pi|K)$ (30). However, deriving their exact values in high-dimensional settings is not trivial, as it requires manipulating the intractable normalizing constant (19). The methodology proposed herein approximates these potentials using a combination of Quasi-Newton [Nocedal and Wright, 2006], [Nesterov, 2018] and Hamiltonian Monte Carlo (HMC) [Betancourt, 2017], thus circumventing the need to evaluate $\mathsf{N}(\boldsymbol{\lambda})$ . In particular, HMC provides a rigorous and efficient framework for sampling in high-dimensional settings: compared to other MCMC techniques, the number of gradient estimations in HMC is less sensitive to the dimension of the problem [Mangoubi and Smith, 2019], making it a convenient choice when generating random transport plans $\pi^{(k)}\sim\mathsf{S}^{o}$ .

As established by the dual program (18) and its solution (20), the optimal Kantorovitch potentials equivalently minimize the negated dual objective:

\boldsymbol{\lambda}^{o}=\operatorname*{argmin}_{\boldsymbol{\lambda}\succeq 0% }\left\{\boldsymbol{\lambda}^{\intercal}\boldsymbol{\theta}-\log\left(\mathsf{% N}(\boldsymbol{\lambda})\right)\right\}.

(37)

Let $\varrho(\boldsymbol{\lambda})\equiv\boldsymbol{\lambda}^{\intercal}\boldsymbol{\theta}-\log\left(\mathsf{N}(\boldsymbol{\lambda})\right)$ denote the optimization objective in (37). Its gradient vector can be written conveniently using the following expectation:

\begin{split}\nabla_{\boldsymbol{\lambda}}\varrho(\boldsymbol{\lambda})&=% \boldsymbol{\theta}-\mathsf{E}_{\mathsf{S}}\bigl{[}\mathsf{R}(\pi)\bigr{]}\end% {split}

(38)

We define $s_{t}$ and $n_{t}$ as the increments of the Kantorovitch potentials and of their gradients, respectively:

s_{t}\equiv\boldsymbol{\lambda}_{t+1}-\boldsymbol{\lambda}_{t}

and

n_{t}\equiv\nabla\varrho(\boldsymbol{\lambda}_{t+1})-\nabla\varrho(\boldsymbol% {\lambda}_{t})

where $t>0$ indexes the Quasi-Newton iterations. The recursive approximation of the inverse Hessian can be written as follows [Nocedal and Wright, 2006]:

\mathsf{H}_{t+1}=(\boldsymbol{\mathsf{I}}-\varsigma_{t}s_{t}n_{t}^{\intercal})% \mathsf{H}_{t}(\boldsymbol{\mathsf{I}}-\varsigma_{t}n_{t}s_{t}^{\intercal})+% \varsigma_{t}s_{t}s_{t}^{\intercal}\;\;\;,\;\ \varsigma_{t}\equiv\frac{1}{n_{t% }^{\intercal}s_{t}}

(39)

where $\boldsymbol{\mathsf{I}}$ denotes the identity matrix. We note that the inverse Hessian $\mathsf{H}_{t}$ depends only on the stochastic gradients $\nabla\varrho(\boldsymbol{\lambda}_{t})$ (38). Thus, we avoid the stability issues that arise with ill-conditioned stochastic inverse Hessian approximations, as is the case with high-variance MC samplers.
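The recursion (39) translates directly into code. The following is a minimal sketch on nested lists; the function names and the numerical values of $s_{t}$ and $n_{t}$ are hypothetical, and the curvature guard (skipping the update when $n_{t}^{\intercal}s_{t}\leq 0$ ) is a common safeguard added here as an assumption, not something stated in the text.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def bfgs_inverse_update(h, s, n):
    """Inverse-Hessian update (39): (I - c s n^T) H (I - c n s^T) + c s s^T, c = 1 / (n^T s)."""
    dim = len(s)
    curvature = sum(ni * si for ni, si in zip(n, s))
    if curvature <= 0.0:
        # Assumed safeguard: keep the previous approximation if the curvature is not positive.
        return [row[:] for row in h]
    sigma = 1.0 / curvature
    left = [[(1.0 if i == j else 0.0) - sigma * s[i] * n[j] for j in range(dim)]
            for i in range(dim)]
    right = [[(1.0 if i == j else 0.0) - sigma * n[i] * s[j] for j in range(dim)]
             for i in range(dim)]
    core = matmul(matmul(left, h), right)
    return [[core[i][j] + sigma * s[i] * s[j] for j in range(dim)] for i in range(dim)]

# Illustrative increments for a two-dimensional potential vector.
h0 = [[1.0, 0.0], [0.0, 1.0]]
s_t = [0.5, 1.0]  # lambda_{t+1} - lambda_t
n_t = [1.0, 4.0]  # gradient difference
h1 = bfgs_inverse_update(h0, s_t, n_t)
```

A useful check is the secant condition $\mathsf{H}_{t+1}n_{t}=s_{t}$ , which the update satisfies by construction and which holds for `h1` above up to rounding.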

Once computed, the gradient and the inverse Hessian are plugged into the usual BFGS iterative updates [Nocedal and Wright, 2006]:

\boldsymbol{\lambda}_{t+1}\xleftarrow{}\boldsymbol{\lambda}_{t}-\rho_{t}% \mathsf{H}_{t}\nabla_{\boldsymbol{\lambda}}\varrho(\boldsymbol{\lambda}_{t})

(40)

where $\rho_{t}>0$ is the step size at the $t^{th}$ iteration in the search direction given by:

\mathsf{d}(\boldsymbol{\lambda}_{t})\equiv-\mathsf{H}_{t}\nabla_{\boldsymbol{% \lambda}}\varrho(\boldsymbol{\lambda}_{t})

(41)

The step size $\rho_{t}$ should be adapted carefully to ensure convergence to the global minimum $\boldsymbol{\lambda}^{o}$ . It is usually computed by solving an auxiliary line search problem, using techniques such as backtracking line search (BTLS) [Nesterov, 2018]. However, most line search techniques require the evaluation of the objective $\varrho(\boldsymbol{\lambda})$ at each step. To avoid explicit function evaluations, we propose a simple local approximation that estimates the position of the minimum along the search line (41), based solely on two gradient evaluations [Snyman, 2005].

More precisely, the optimal step size that yields sufficient decrease in the search direction (41) can be found by solving the following problem:

\rho_{t}^{*}=\operatorname*{argmin}_{\rho\in[0,1]}\varrho(\boldsymbol{\lambda}% _{t}+\rho\;\mathsf{d}(\boldsymbol{\lambda}_{t}))

(42)

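Assuming that $\varrho$ is locally quadratic at $\boldsymbol{\lambda}_{t}$ , it follows that solving (42) reduces to finding the $\rho$ at which the directional derivative of $\varrho$ along $\mathsf{d}(\boldsymbol{\lambda}_{t})$ vanishes:

\mathsf{d}(\boldsymbol{\lambda}_{t})^{\intercal}\nabla\varrho(\boldsymbol{\lambda}_{t}+\rho\;\mathsf{d}(\boldsymbol{\lambda}_{t}))=0

(43)

Under the quadratic model, $\nabla\varrho(\boldsymbol{\lambda}_{t}+\rho\;\mathsf{d}(\boldsymbol{\lambda}_{t}))\approx\nabla\varrho(\boldsymbol{\lambda}_{t})+\rho\nabla^{2}\varrho(\boldsymbol{\lambda}_{t})\mathsf{d}(\boldsymbol{\lambda}_{t})$ , which yields the following optimal step size: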

\begin{split}\rho_{t}^{*}&=\frac{-\mathsf{d}(\boldsymbol{\lambda}_{t})^{% \intercal}\nabla\varrho(\boldsymbol{\lambda}_{t})}{{\mathsf{d}(\boldsymbol{% \lambda}_{t})}^{\intercal}\nabla^{2}\varrho(\boldsymbol{\lambda}_{t})\mathsf{d% }(\boldsymbol{\lambda}_{t})}\end{split}

(44)

Finally, by a second-order Taylor expansion at $\boldsymbol{\lambda}_{t}$ and $\boldsymbol{\lambda}_{t}+\mathsf{d}(\boldsymbol{\lambda}_{t})$ , the denominator in (44) can be computed using two gradient estimations, as follows:

{\mathsf{d}(\boldsymbol{\lambda}_{t})}^{\intercal}\nabla^{2}\varrho(% \boldsymbol{\lambda}_{t})\mathsf{d}(\boldsymbol{\lambda}_{t})\approx{\mathsf{d% }(\boldsymbol{\lambda}_{t})}^{\intercal}\left[\nabla\varrho(\boldsymbol{% \lambda}_{t}+\mathsf{d}(\boldsymbol{\lambda}_{t}))-\nabla\varrho(\boldsymbol{% \lambda}_{t})\right]

(45)
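Combining (44) and (45), the step size needs only the gradients at $\boldsymbol{\lambda}_{t}$ and at the trial point $\boldsymbol{\lambda}_{t}+\mathsf{d}(\boldsymbol{\lambda}_{t})$ . The sketch below is illustrative: the function names, the fallback to a unit step when the estimated curvature is not positive, and the clipping to $[0,1]$ (mirroring the feasible set in (42)) are assumptions, and a deterministic quadratic stands in for the HMC-based gradient estimate.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def secant_step_size(direction, grad_now, grad_trial):
    """Step size (44), with the curvature term approximated from two gradients as in (45)."""
    curvature = dot(direction, [gt - g for gt, g in zip(grad_trial, grad_now)])
    if curvature <= 0.0:
        return 1.0  # assumed fallback: take the full quasi-Newton step
    rho = -dot(direction, grad_now) / curvature
    return min(max(rho, 0.0), 1.0)  # keep the step inside [0, 1], as in (42)

def grad_quadratic(lam):
    """Gradient of the toy objective lam_1^2 + 2 lam_2^2 (a stand-in, not the dual objective)."""
    return [2.0 * lam[0], 4.0 * lam[1]]

lam_t = [1.0, 1.0]
g_t = grad_quadratic(lam_t)
d_t = [-g for g in g_t]  # search direction with H_t equal to the identity
g_trial = grad_quadratic([l + d for l, d in zip(lam_t, d_t)])
rho_star = secant_step_size(d_t, g_t, g_trial)
lam_next = [l + rho_star * d for l, d in zip(lam_t, d_t)]
```

For an exactly quadratic objective the approximation (45) is exact, so the directional derivative at `lam_next` vanishes; for the dual objective it holds only locally, and with noisy gradient estimates the result should be treated as a heuristic.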

What remains is to compute the gradient terms, which can be estimated using HMC. If $n_{\mathsf{samp}}>0$ is the number of independent realizations $\pi^{(i)}\sim\mathsf{S}(\pi|K)$ , then the expectation in (38) can be approximated as follows:

\mathsf{E}_{\mathsf{S}}\bigl{[}\mathsf{R}(\pi)\bigr{]}\approx\frac{1}{n_{\mathsf{samp}}}\sum\limits_{i=1}^{n_{\mathsf{samp}}}\mathsf{R}(\pi^{(i)})

(46)
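Given a batch of sampled plans, the estimator (46) and the gradient (38) amount to a few sums. The following is a minimal sketch in which the sampler itself is out of scope: the two $2\times 2$ plans are hard-coded, hypothetical stand-ins for HMC draws, and the function names are illustrative.

```python
import math

def kl(p, q):
    """Discrete KL divergence D_KL(p || q), with 0 * log(0 / q) taken as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def constraint_vector(plan, mu0, nu0):
    """R(pi) = (KL(mu||mu0), KL(nu||nu0)) for the marginals of a single plan."""
    mu = [sum(row) for row in plan]
    nu = [sum(col) for col in zip(*plan)]
    return [kl(mu, mu0), kl(nu, nu0)]

def dual_gradient_estimate(plans, mu0, nu0, eta, zeta):
    """Monte Carlo version of (38): theta minus the sample mean of R(pi), as in (46)."""
    n_samp = len(plans)
    mean_r = [0.0, 0.0]
    for plan in plans:
        r = constraint_vector(plan, mu0, nu0)
        mean_r = [acc + ri / n_samp for acc, ri in zip(mean_r, r)]
    return [eta - mean_r[0], zeta - mean_r[1]]

mu0, nu0 = [0.3, 0.7], [0.8, 0.2]
plans = [
    [[0.24, 0.06], [0.56, 0.14]],  # marginals equal to (mu0, nu0)
    [[0.30, 0.10], [0.50, 0.10]],  # source marginal (0.4, 0.6)
]
grad = dual_gradient_estimate(plans, mu0, nu0, eta=0.02, zeta=0.01)
```

A positive gradient component indicates that the sampled marginals sit, on average, inside the corresponding KL ball, so a descent step decreases that potential; a negative component increases it.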

At each iteration $t$ , the error (stopping criterion) is measured by means of the following Newton’s decrement, which corresponds to the inverse Hessian norm of the gradient. This quantity provides a good indication of the proximity to the optimal Kantorovitch potentials:

\mathsf{err}\equiv\nabla_{\boldsymbol{\lambda}}\varrho(\boldsymbol{\lambda}_{t% })^{\intercal}\nabla_{\boldsymbol{\lambda}}^{2}\varrho(\boldsymbol{\lambda}_{t% })^{-1}\nabla_{\boldsymbol{\lambda}}\varrho(\boldsymbol{\lambda}_{t})

(47)

The optimal potentials $\boldsymbol{\lambda}^{o}$ are then plugged into (17) and the optimal hyperprior can be used to generate random transport plans, by means of another HMC sampler.

Input: nominal marginals $(\mu_{0},\nu_{0})$, KLD radii $(\eta,\zeta)$, target precision $\tau$, base-level ideal design $\pi_{\mathsf{I}}$, hierarchical ideal design $\mathsf{S}_{\mathsf{I}}$, number of samples $n_{\mathsf{samp}}$

Result: $\boldsymbol{\lambda}^{o}$

1 Initialization: $t=1$, $\boldsymbol{\lambda}_{t}\succeq 0$, $\rho_{t}=1$, $\mathsf{H}_{t}=\boldsymbol{\mathsf{I}}$, $\mathsf{err}=\infty$;

2 while $\tau<\mathsf{err}$ do

3     Sample $\{\pi^{(l)}_{t}\}_{l=1}^{n_{\mathsf{samp}}}\sim\mathsf{S}(\pi|K_{-\boldsymbol{\lambda}},\boldsymbol{\lambda}_{t})$ $\triangleright$ HMC sampler. $K_{-\boldsymbol{\lambda}}$ denotes all parameters in the knowledge set $K$, except $\boldsymbol{\lambda}$;

4     Estimate $\mathsf{E}_{\mathsf{S}(\pi|K_{-\boldsymbol{\lambda}},\boldsymbol{\lambda}_{t})}\left[\mathsf{R}(\pi)\right]$;

5     Estimate $\nabla_{\boldsymbol{\lambda}}\varrho(\boldsymbol{\lambda}_{t})$;

6     Compute $\check{\boldsymbol{\lambda}}_{t+1}\leftarrow\check{\boldsymbol{\lambda}}_{t}-\mathsf{H}_{t}\nabla\varrho(\boldsymbol{\lambda}_{t})$;

7     Sample $\{\check{\pi}^{(l)}_{t+1}\}_{l=1}^{n_{\mathsf{samp}}}\sim\mathsf{S}(\pi|K_{-\boldsymbol{\lambda}},\check{\boldsymbol{\lambda}}_{t+1})$;

8     Estimate $\nabla_{\boldsymbol{\lambda}}\varrho(\check{\boldsymbol{\lambda}}_{t+1})$;

9     Compute $\rho^{*}_{t}$;

10     Update $\boldsymbol{\lambda}_{t+1}\xleftarrow{}\boldsymbol{\lambda}_{t}-\rho^{*}_{t}\mathsf{H}_{t}\nabla\varrho(\boldsymbol{\lambda}_{t})$;

11     Compute $s_{t}$, $n_{t}$ and $\varsigma_{t}$;

12     Update $\mathsf{H}_{t+1}\leftarrow(\boldsymbol{\mathsf{I}}-\varsigma_{t}s_{t}n_{t}^{\intercal})\mathsf{H}_{t}(\boldsymbol{\mathsf{I}}-\varsigma_{t}n_{t}s_{t}^{\intercal})+\varsigma_{t}s_{t}s_{t}^{\intercal}$;

13     Update $\mathsf{err}$;

14     Update $t\leftarrow t+1$

return $\boldsymbol{\lambda}_{t+1}$

Algorithm 1 Approximation of the Kantorovitch potentials
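Lines 12 and 13 of Algorithm 1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the section does not restate $s_{t}$, $n_{t}$ and $\varsigma_{t}$, so the sketch assumes the standard BFGS choices (step, gradient difference, and $\varsigma_{t}=1/(n_{t}^{\intercal}s_{t})$), and it evaluates the decrement (47) with the quasi-Newton matrix $\mathsf{H}_{t}$ in place of the exact inverse Hessian. The function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]


def bfgs_inverse_update(H, s, n):
    """Update of the inverse-Hessian approximation (line 12 of Algorithm 1).

    ASSUMPTION (standard BFGS, not restated in this section):
    s = lambda_{t+1} - lambda_t, n = grad_{t+1} - grad_t,
    varsigma = 1 / (n^T s), with n^T s > 0.
    """
    d = len(s)
    varsigma = 1.0 / dot(n, s)
    # Left factor L = I - varsigma * s n^T; the right factor is its transpose.
    L = [[(1.0 if i == j else 0.0) - varsigma * s[i] * n[j] for j in range(d)]
         for i in range(d)]
    LH = [[sum(L[i][k] * H[k][j] for k in range(d)) for j in range(d)]
          for i in range(d)]
    # H_new = L H L^T + varsigma * s s^T
    return [[sum(LH[i][k] * L[j][k] for k in range(d)) + varsigma * s[i] * s[j]
             for j in range(d)] for i in range(d)]


def newton_decrement(grad, H):
    """grad^T H grad, using H as a stand-in for the inverse Hessian in (47)."""
    return dot(grad, matvec(H, grad))
```

By construction, the updated matrix satisfies the secant condition $\mathsf{H}_{t+1}n_{t}=s_{t}$ and remains symmetric, which is a convenient sanity check when the gradients are noisy Monte Carlo estimates.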

Remark 4.

Computational complexity. In Algorithm 1, we replace each approximation of the normalising constant, $\mathsf{N}(\boldsymbol{\lambda})$ (19), with two gradient approximations. Therefore, the overall computational complexity is mainly driven by the sampling operations in lines $3$ and $7$ of the Algorithm, whose complexity is, in turn, contingent upon the number of gradient evaluations used in the leapfrog integrator of the HMC sampler [Betancourt, 2017]. Under certain regularity conditions, this number is of order $\mathcal{O}(\sqrt{mn})$ [Mangoubi and Smith, 2019]. Though these regularity conditions are not fully satisfied here (see Remark 5), this provides us with a good lower bound on the computational complexity. Using a mean-field variational Bayes method at each iteration of the quasi-Newton method, which assumes that all the parameters (i.e. contracts), $\pi_{i,j}$, are independent, would result in a linear time complexity in the number of parameters, that is $\mathcal{O}(mn)$.

Remark 5.

On HMC mixing properties. It is worth noting that the main convergence results of HMC, when sampling from a log-concave function, $\mathsf{e}^{-f}$ , require strongly convex and Lipschitz smooth (i.e. Lipschitz $\nabla f$ ) potential functions, $f$ [Chen and Vempala, 2022]. However, the KLD is not Lipschitz smooth and the theoretical convergence results are not guaranteed in our setting. This results in a longer integration time and biased estimators, especially when $(\eta,\zeta)\rightarrow(0,0)$ . For the time being, we will use HMC while carefully tuning its main parameters (integrator step size, adaptation step, etc.), and will explore specialized samplers in a separate work.

5 HFPD-OT for Algorithmic fairness in market matching

The goal of algorithmic fairness is to detect and mitigate algorithmic biases induced by automated decision-making systems [Barocas et al., 2023]. This is a compelling setting for HFPD-OT, since we can benefit from randomized transport plans to elicit fair transport strategies in the presence of uncertainty. Note that OT for fairness has already been proposed in other works (see [Gordaliza et al., 2019] and references therein), with the focus being on notions of data repair and learning fair models. In contrast, we are concerned, here, with fair OT, whose purpose is to design transport plans that are fair per se. The literature on fair OT is sparse: in [Hughes and Chen, 2021], the authors address the fair OT problem by proposing a dynamic and distributed fair OT algorithm. In this manuscript, we propose a different approach that leverages randomized policies, which are induced naturally by the HFPD-OT setting.

To appreciate the implications for fair OT of the randomization and diversity allowed by HFPD-OT, we study the problem of fair market matching [Galichon, 2021], [Echenique et al., 2024], and more precisely the question of worker-job matching, in which the nominal marginals, $\mu_{0}$ and $\nu_{0}$, are estimates of the distributions of workers and jobs, respectively. An agent $x_{i}\in\mathbb{\Omega}_{X}$ represents a category of workers or skills, while an agent $y_{j}\in\mathbb{\Omega}_{Y}$ is a job opportunity or a company. In particular, we study vertically-differentiated agents: workers in one category may exhibit skills not available in other categories. Similarly, some companies may differ in their size or may have unique production technologies (footnote 5: this is in contrast to horizontally differentiated agents, where some hierarchy may exist between agents). A contract $\pi_{i,j}$ seeks to match some of the workers in category $x_{i}$ with some of the job opportunities offered by $y_{j}$.

Our purpose is to study the following question: Can randomized transport plans elicit long-term fair matching strategies in a worker-job matching problem, for agents as well as for individual contracts? Our notion of fairness is asymptotic, in the sense that fairness is achieved in the long run. This is in contrast to the static (i.e. invariant) designs of classical OT, which may, indeed, satisfy a standard fairness metric based on the ensemble of contracts on $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, but which, unfortunately, harm the same individual agents or contracts, either because of:

(i)

misspecification of the marginals for some of the agents, $\mathbb{\Omega}_{X}$ or $\mathbb{\Omega}_{Y}$ ; and/or
(ii)

an invariant and uneven distribution of mass across the contracts, $\pi_{i,j}^{o}$ .

Before addressing the problem of fair labour market matching (Section 5.4), we review the fairness-related concept of diversity.

5.1 Simulation study

We consider the following setting:

•

$m\equiv n\equiv d\equiv 20$
•

$\epsilon\equiv 10^{-3}$
•

$\mathsf{C}_{i,j}\equiv\|x_{i}-y_{j}\|_{2}^{2},\;\;(i,j)\in[\![m]\!]\times[\![n]\!]$
•

$\eta\equiv 2$ , $\zeta\equiv 2$
•

$\lambda_{\mathsf{I},1}\equiv 0.5$, $\lambda_{\mathsf{I},2}\equiv 0.5$
•

We simulate the nominal worker and job distributions as $\mu_{0}\sim\mathcal{tN}(2,5)$ and $\nu_{0}\sim\mathcal{tN}(6,3)$, respectively, where $\mathcal{tN}(a,b)$ denotes the truncated Gaussian distribution with positive support, mean $a$ and variance $b$.
•

To sample from the hyperprior, $\mathsf{S}^{o}(\pi|K)$ (30), we leverage the Hamiltonian Monte Carlo (HMC) module available in TensorFlow Probability (version 0.24.0; https://www.tensorflow.org/probability), with the following configuration:
  – Number of burn-in steps: $8000$
  – Number of adaptation steps: $0.8\times\text{number of burn-in steps}$
  – Target acceptance probability (fixed): $0.6$
  – The length traveled by the leapfrog integrator is adjusted using a No U-Turn Sampler (NUTS) [Hoffman and Gelman, 2011].
  – The step size is optimized using a dual averaging policy [Hoffman and Gelman, 2011].
  – The sampler is compiled using XLA (Accelerated Linear Algebra).
  – The optimal Kantorovitch potentials (37) are computed using Algorithm 1.
•

The base-level EOT model (3) is computed using the POT library [Flamary et al., 2021].

5.2 Quantifying diversity in HFPD-OT

Our definition of long-term fairness—to follow—relies on the notion of diversity, which we quantify using the following diversity index:

Definition (Diversity index). Let $m\times n$ be the dimension of the parametric random transport plan, $\pi\sim\mathsf{S}^{o}(\pi|K)$ (30). The 1-diversity index (or perplexity score [Jelinek et al., 1977]) associated with $\mathsf{S}^{o}$ is:

\mathsf{D}(\mathsf{S}^{o}(\pi|K))\equiv\mathsf{E}_{\mathsf{S}^{o}}\left[\exp\bigl{(}\mathsf{H}(\pi)\bigr{)}\right],

(48)

where $\mathsf{H}(\pi)$ denotes the entropy of $\pi$ :

\mathsf{H}(\pi)\equiv-\sum_{i=1}^{m}\sum_{j=1}^{n}\pi_{i,j}\log(\pi_{i,j})

(49)

In Figure 9(a), we compute and graph $\mathsf{D}(\cdot)$ for different values of the Kantorovitch potentials (37), and we compare the diversity of random HFPD-OT matching policies to that of the base-level EOT policy (3). While increasing the Kantorovitch potentials decreases the diversity, it remains substantially higher than that of the EOT policy, even when the smoothness parameter is fixed at a relatively high value: $\epsilon=0.1$. In practical terms, a higher $\mathsf{D}(\cdot)$ ensures that a more diverse set of skills is allocated to each company, in expectation. Similarly, workers are expected to have access to a more diverse set of job opportunities. We use this insight in the sequel, to formalize the meanings of diversity and fairness both for agents (Definition 5.3) and contracts (Definition 5.4).
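The diversity index (48) lends itself to a simple Monte Carlo estimate once plans have been sampled from the hyperprior. The following is a minimal sketch, assuming each sampled plan is stored as a list of rows that sum to one overall; the function names are hypothetical and the convention $0\log 0\equiv 0$ is assumed.

```python
import math


def entropy(plan):
    """Entropy H(pi) of a transport plan, as in (49), with 0*log(0) taken as 0."""
    return -sum(p * math.log(p) for row in plan for p in row if p > 0.0)


def diversity_index(plans):
    """Monte Carlo estimate of (48): the sample mean of exp(H(pi)).

    `plans` is assumed to be a non-empty list of plans drawn from the hyperprior.
    """
    return sum(math.exp(entropy(pi)) for pi in plans) / len(plans)
```

The two extreme cases are useful for calibration: a plan that places all mass on a single contract has index $1$, while the uniform plan on $m\times n$ contracts attains the maximum, $mn$.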

Remark 6.

One might argue that the smoothness parameter, $\epsilon>0$ , in base-level EOT (3) can be used to induce some level of diversity for fair OT (i.e. objective (ii) in Section 5). However, it does not address objective (i). Note that the randomness in HFPD-OT is informed, since it emerges from modelling the uncertainty in the marginals, whereas the smoothness in EOT is mainly a computational convenience that is not informed by a mathematical model of uncertainty.

5.3 Long-term fairness for agents through randomization

We first discuss the notion of fairness for agents (groups of workers and companies, in our application) enabled by a randomized transport strategy and propose the following definition.

Definition (Fairness for agents via diversified transport plans). A transport policy fulfills fairness for agents if:

1.

It acknowledges that marginals (the supply and demand) may not have been fairly measured and incorporates this knowledge in the design of the transport policy.
2.

It allows asymptotically for a diversified allocation of resources. This diversification should be proportional to the uncertainty in the marginals.

Underestimating the supply of a category of workers can produce a matching policy in which all workers in that category are unfairly assigned to closer companies (in the sense of the cost $\mathsf{C}$ ). Accounting for uncertainty in the supply, however, would allow, in expectation, for a more diverse mix of skills to be transferred to companies. To illustrate this point further, we analyze the diversity of workers matched to companies and compute the mean diversity index of the random conditional transport policy $\pi(.|Y=y_{0})$ associated with each company, $y_{0}\in\mathbb{\Omega}_{Y}$ . Figure 9(b) shows that the diversity of skills allowed by HFPD-OT remains consistently higher than that of the base-level EOT, thus allowing each company $y_{0}$ to benefit from a more diverse set of skills.
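The per-company diversity described above can be sketched as follows. This is an illustration only: it assumes that the conditional policy for company $y_{j}$ is obtained by normalising the $j$-th column of a sampled plan, that this column carries positive mass, and that diversity is measured by the perplexity of the conditional distribution. All function names are hypothetical.

```python
import math


def conditional_given_target(plan, j):
    """Conditional policy pi(. | Y = y_j): column j of the plan, normalised.

    Assumes the column carries strictly positive mass.
    """
    column = [row[j] for row in plan]
    total = sum(column)
    return [p / total for p in column]


def perplexity(dist):
    """exp of the entropy of a discrete distribution, with 0*log(0) taken as 0."""
    return math.exp(-sum(p * math.log(p) for p in dist if p > 0.0))


def mean_conditional_diversity(plans, j):
    """Mean diversity of the workers matched to company y_j over sampled plans."""
    values = [perplexity(conditional_given_target(pi, j)) for pi in plans]
    return sum(values) / len(values)
```

A value close to $1$ indicates that company $y_{j}$ is served by a single worker category, whereas a value close to $m$ indicates an even mix of all categories.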

5.4 Long-term fairness for contracts through randomization

In our worker-company matching problem, as in many other transport problems, contracts correspond to a physical infrastructure, deployed to match resources to demand (agencies, recruitment processes, crowd-sourcing labour market platforms, etc.). By design, the OT model yields a sparse transport strategy where the transport burden is supported by a small number of contracts, and though the base-level EOT may allow for smoother, i.e. more diverse, transport strategies, this diversity does not emerge from a proper mathematical modelling of uncertainty (Remark 6). In contrast, randomized HFPD-OT strategies allow the activation of a more diverse set of contracts, yielding a fairer utilization of the transport infrastructure. In this regard, HFPD-OT is closely related to maximum diversity problems [Martí et al., 2022].

To formalize the previous point, we start by introducing the notion of eligible contracts:

\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)\equiv\bigl{\{}\pi_{i,j},\;(i,j)\in[\![m]\!]\times[\![n]\!]\;|\;\pi\in\mathsf{supp}(\mathsf{S}^{o})\;\text{and}\;\mathsf{E}_{\mathsf{S}^{o}}\left[\mathbb{1}(\pi_{i,j}\geq\upsilon)\right]>0\bigr{\}}.

(50)

Here, $\upsilon>0$ is an activation threshold, imposed by design constraints (technical specifications, design requirements, etc.). Eligible contracts are those with a positive probability of being active under the hyperprior, $\mathsf{S}^{o}(\pi|K)$ . The set $\mathbb{\Pi}_{\mathsf{E}}$ is better understood through its asymptotic behaviour:

•

In the absence of any constraint on the marginals, $\mathbb{\Pi}_{\mathsf{E}}$ is fully determined by the base-level and hierarchical ideal designs (23), and:

\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)\xrightarrow{\eta\rightarrow\infty,\zeta\rightarrow\infty}\bigl{\{}\pi_{i,j},\;(i,j)\in[\![m]\!]\times[\![n]\!]\;|\;\pi\in\mathsf{supp}(\tilde{\mathsf{S}})\;\text{and}\;\mathsf{E}_{\tilde{\mathsf{S}}}\left[\mathbb{1}(\pi_{i,j}\geq\upsilon)\right]>0\bigr{\}}.

In particular, if the base-level and hierarchical ideal designs are chosen to be uninformative, it follows that

\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)\xrightarrow{\eta\rightarrow\infty,\zeta\rightarrow\infty}\bigl{\{}\pi_{i,j},\;(i,j)\in[\![m]\!]\times[\![n]\!]\;|\;\mathsf{E}_{\mathcal{U}}\left[\mathbb{1}(\pi_{i,j}\geq\upsilon)\right]>0\bigr{\}}.

•

In the case of crisp marginals (i.e. no marginal uncertainties), $\mathbb{\Pi}_{\mathsf{E}}$ contracts to a subset of $\mathbb{\Pi}(\mu_{0},\nu_{0})$ (Section 2.2):

\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)\xrightarrow{\eta\rightarrow 0,\zeta\rightarrow 0}\bigl{\{}\pi_{i,j},\;(i,j)\in[\![m]\!]\times[\![n]\!]\;|\;\pi\in\mathbb{\Pi}(\mu_{0},\nu_{0})\;\text{and}\;\mathsf{E}_{\mathsf{S}^{o}}\left[\mathbb{1}(\pi_{i,j}\geq\upsilon)\right]>0\bigr{\}}\subset\mathbb{\Pi}(\mu_{0},\nu_{0}).

We use $\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)$ to introduce our definition of fairness for contracts.

Definition (Fairness for contracts via diversified transport plans). A random transport plan, $\pi\sim\mathsf{S}^{o}(\pi|K)$, achieves fairness for contracts if it distributes the transport burden over all eligible contracts in $\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)$.

For the purpose of illustration, we fix the optimal Kantorovitch potentials (20) to arbitrarily small values: $\lambda^{o}_{1}=\lambda^{o}_{2}=0.05$ (or, equivalently, large uncertainty radii $(\eta,\zeta)$), and the activation threshold to $\upsilon=2\times 10^{-2}$. Both the base-level and hierarchical ideal designs (9) are chosen to be uniform. We generate a sequence of relative frequency maps, each providing estimates of the probabilities that the respective contracts, $\pi_{i,j}\in\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)$, are active. We compare these to the base-level EOT matching policy (Figure 10(a)), which, being oblivious to the uncertainty in the marginals, yields a sparse transport policy and thus fails to achieve fairness for contracts (Definition 5.4). In contrast, the random HFPD-OT matching policies enable a greater diversity by ensuring that more of the contracts are active, as shown in Figures 10(b), 10(c) and 10(d). These are the estimated activation probability maps, averaged over $N\in\{10,50,100\}$ randomly sampled transport plans, $\pi^{(i)}\sim\mathsf{S}^{o}(\pi|K)$, for $i\in[\![{N}]\!]$. As $N\rightarrow\infty$, these activation estimates converge to the ergodic limit, in which all eligible contracts have the same probability of being active.
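The relative frequency maps and the empirical counterpart of the eligible set (50) can be sketched as follows. This is an illustration under the assumption that $N$ plans have already been sampled from the hyperprior and are stored as lists of rows; the function names are hypothetical.

```python
def activation_frequencies(plans, upsilon):
    """Relative frequency with which each contract satisfies pi_ij >= upsilon.

    Estimates E[1(pi_ij >= upsilon)] from the sampled plans.
    """
    n_plans = len(plans)
    m, n = len(plans[0]), len(plans[0][0])
    return [[sum(1 for pi in plans if pi[i][j] >= upsilon) / n_plans
             for j in range(n)] for i in range(m)]


def eligible_contracts(frequencies):
    """Index pairs (i, j) whose estimated activation probability is positive."""
    return {(i, j)
            for i, row in enumerate(frequencies)
            for j, freq in enumerate(row) if freq > 0.0}
```

With a finite sample, a contract whose true activation probability is positive but small may be missed, so the empirical set is an inner approximation of the eligible set and grows with $N$.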

Remark 7.

Another way to appreciate fairness for contracts induced by randomized HFPD-OT plans is to study the random marginal cost, $c_{i,j}$ , associated with the contract $\pi_{i,j}$ (Figure 1(b)):

c_{i,j}\equiv\pi_{i,j}\mathsf{C}_{i,j}\;\;,\;\;\pi\sim\mathsf{S}^{o}

(51)

Recall that the squared 2-Kantorovitch distance,

\mathsf{KD}_{2}^{2}(\mu_{0},\nu_{0})\equiv\min_{\pi\in\mathbb{\Pi}(\mu_{0},\nu_{0})}\sum_{i,j}\mathsf{C}_{i,j}\pi_{i,j},

(52)

is the minimum expected transport cost between $\mu_{0}$ and $\nu_{0}$, for the squared Euclidean cost function, $\mathsf{C}$ (Section 5.1) [Villani, 2008]. The base-level OT objective in (52) yields a fixed optimal solution, in which each cost $c_{i,j}$ is immutable. Consequently, the transport burden is always supported by the same set of contracts. Let $\pi_{i_{0},j_{0}}$ be one such contract where:

c_{i_{0},j_{0}}>\mathsf{KD}_{2}^{2}(\mu_{0},\nu_{0}),\;\;(i_{0},j_{0})\in[\![m]\!]\times[\![n]\!]

(53)

On the other hand, in HFPD-OT, and by virtue of the random nature of $c_{i_{0},j_{0}}$ , we can write the following Markov inequality:

\Pr\left[c_{i_{0},j_{0}}\geq\mathsf{KD}_{2}^{2}(\mu_{0},\nu_{0})\right]\leq\frac{\mathsf{E}_{\pi\sim\mathsf{S}^{o}}\left[c_{i_{0},j_{0}}\right]}{\mathsf{KD}_{2}^{2}(\mu_{0},\nu_{0})}

(54)

Hence, this probability upper bound depends on the ratio of the expected marginal transport cost associated with the contract, $\pi_{i_{0},j_{0}}$ (51), to the squared 2-Kantorovitch distance between the nominal marginals (52). Essentially, it provides an upper bound on the probability of a fairness-related proposition (Definition 5.4). Insights such as these may be used to establish operating conditions that are conducive to fairness. Such statistical handles on transport fairness are, of course, unavailable in conventional base-level OT.
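The Markov bound (54) can be checked against sampled marginal costs. The sketch below assumes that non-negative samples of $c_{i_{0},j_{0}}$ (51) are available, and takes the threshold (here standing in for $\mathsf{KD}_{2}^{2}(\mu_{0},\nu_{0})$) as a positive input; the function name and the sample values in the usage note are hypothetical.

```python
def markov_check(costs, threshold):
    """Empirical exceedance frequency and Markov bound for non-negative costs.

    Returns (frequency of cost >= threshold, sample mean / threshold).
    The bound may exceed 1, in which case it is uninformative.
    """
    if threshold <= 0.0:
        raise ValueError("threshold must be positive")
    if any(c < 0.0 for c in costs):
        raise ValueError("Markov's inequality requires non-negative costs")
    mean_cost = sum(costs) / len(costs)
    frequency = sum(1 for c in costs if c >= threshold) / len(costs)
    return frequency, mean_cost / threshold
```

For example, with the illustrative samples `[0.1, 0.2, 0.9, 0.4]` and threshold `0.8`, one sample in four exceeds the threshold, while the bound evaluates to one half. The inequality holds for any empirical distribution of non-negative costs, so a violation would indicate a data handling error rather than a property of the sampler.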

6 Conclusions and next steps

This paper recasts the optimal transport problem into a broader class of fully probabilistic design and generalized Bayesian inference techniques. In this new formalism, the transport plan is no longer regarded as a crisp, deterministic object, but is modeled as a random (i.e. uncertain) distribution in a hierarchical Bayesian setting. This is in clear contrast with the existing, certainty-equivalence-based OT paradigm. In this way, we augment the conventional base-level (i.e. deterministic) OT framework with the necessary tools to reason about uncertainty and design robust transport algorithms. In this new hierarchical setting, the object of interest is no longer the optimal transport plan, which may not even exist—since the marginals are themselves noisy, uncertain realizations of some underlying stochastic process—but is rather the optimal hyperprior, which is effectively a generative model over the set of transport plans.

We now recall some key results on HFPD-OT, obtained in this paper:

•

The functional form of the optimal hyperprior has been characterized in both the non-parametric and parametric settings. Importantly, we proved that the HFPD-OT setting is a generalization of the classical EOT in that the optimal transport plan can be recovered asymptotically when uncertainty in the marginals decreases.
•

Considering the parametric setting, we proposed an algorithm to approximate the Kantorovitch potentials and described some of the inferential properties of the hyperprior, highlighting its shape and location parameters.
•

To illustrate the importance of HFPD-OT, we studied the problem of algorithmic fairness as it arises in fair market matching problems. First, we explored the role of randomization and diversification in eliciting fairer transport policies for agents, that is, for specific categories of workers and the companies which need their skills. Second, we investigated the role of randomization in eliciting fair matching policies for individual contracts between agents, by allowing the distribution of the transport burden across a larger set of contracts.

There remain important open questions to be studied and improvements to be implemented in subsequent work. The stochastic algorithms leveraged here enable a first approximation of the optimal hyperprior, but better samplers can be derived. Interestingly, sampling from the hyperprior may require new MCMC techniques that leverage the geometry of the support of $\mathsf{S}^{o}(\pi|K)$. Moreover, the HFPD-OT application covered in this paper is algorithmic fairness; however, we contend that the set of possible applications is broader: randomized policies indeed play an important role in a diversity of problems related to generalizability and robustness in machine learning. Finally, a notable contribution of this paper has been to expand duality results from the classical setting in OT to the hierarchical framework of HFPD-OT. However, key theoretical results in base-level deterministic OT, mainly those related to its geometry ([Gangbo and McCann, 1996], [Villani, 2008], etc.), need careful consideration within the extended framework of HFPD-OT.

Acknowledgement

This work has been supported by the European Union’s Horizon Europe research and innovation programme, under grant agreement no. 101070568. It has also been supported by the European Union’s competitive HORIZON-MSCA-2021-DN-01 (Marie Sklodowska-Curie Doctoral Networks) programme, under grant agreement no. 101073508. The authors also acknowledge the support of Innovate UK in underwriting both of the above grants, under the Horizon Europe Guarantee.

7 Appendix: proof of strong duality in Theorem 3.1 (step 3)

The following additional mathematical definitions are required, supplementing the preliminaries in Section 2.2.

•

We assume henceforth that, besides being compact, $\mathbb{\Omega}_{X}$ and $\mathbb{\Omega}_{Y}$ are Hausdorff spaces. This separation property guarantees uniqueness of limits of sequences.
•

From compactness of $\mathbb{\Omega}_{X}$ and $\mathbb{\Omega}_{Y}$ , it follows, by the Riesz-Markov-Kakutani Theorem [Folland, 1999], that the topological dual of $\mathbb{C}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ —the set of continuous functions on $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$ —is the set of Radon measures with support in $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$ . This also implies that $\mathbb{C}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ is a Banach space. Thus, by the Banach-Alaoglu Theorem, $\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ is compact in the weak-* topology [Billingsley, 1999].

•

The previous compactness result allows us to again invoke the Riesz-Markov-Kakutani representation Theorem, which states that the topological dual of $\mathbb{C}(\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ is the hierarchical space of Radon measures with support in $\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$. We denote this dual space by $\mathbb{S}$. The canonical duality pairing reads as follows [Folland, 1999]:

\langle f,\mathsf{S}\rangle\equiv\int_{\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}fd\mathsf{S}

(55)

with $f\in\mathbb{C}(\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ and $\mathsf{S}\in\mathbb{S}$ . Later in the proof, we will constrain $\mathbb{S}$ to the set of hierarchical (probability) distributions.

•

If $\mathsf{O}:\mathbb{S}\rightarrow\mathbbm{R}^{p}$ is a linear map, its adjoint is defined as the map $\mathsf{O}^{*}\colon\mathbbm{R}^{p}\rightarrow\mathbb{C}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ such that:

\langle\mathsf{O}(\mathsf{S}),z\rangle=\langle\mathsf{S},\mathsf{O}^{*}(z)\rangle

(56)

for $\mathsf{S}\in\mathbb{S}$ and $z\in\mathbb{R}^{p}$ .

•

$f^{*}$ denotes the Legendre-Fenchel transform of $f$ defined in $\mathbb{C}(\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$. It is given by:

f^{*}(u)\equiv\sup_{v\in\mathbb{C}(\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))}\bigl(\langle u,v\rangle-f(v)\bigr)

(57)

•

$\mathsf{dom}(h)$ denotes the effective domain of the function $h\in\mathbb{C}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$, defined as: $\mathsf{dom}(h)\equiv\{\pi\in\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})\;|\;h(\pi)<\infty\}$.
•

Our proof relies on the notion of decomposable spaces, as originally defined in Theorem 1 of [Rockafellar, 1971]. A space is decomposable if it is stable under bounded alterations over sets of finite measure.
•
Let $\mathbb{L}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ denote the set of integrable functions defined on $\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$. $\mathbb{L}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ is decomposable, since it satisfies the following conditions [Rockafellar, 1971]:
  – $\mathbb{L}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ contains every bounded and measurable function defined on $\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$.
  – If $h\in\mathbb{L}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ and $\mathbb{l}\in\mathcal{F}_{\mathbb{\Omega}_{\mathsf{H}}}$ is an arbitrary set of finite measure in $\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ (3), then $\mathbb{L}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ contains $\chi_{\mathbb{l}}\boldsymbol{\cdot}h$, where $\boldsymbol{\cdot}$ denotes the pointwise product between the indicator function $\chi_{\mathbb{l}}$ of $\mathbb{l}$ and the function $h$.

•

The characteristic function of a (convex) set $\mathbb{A}$ is the convex function:

\mathbb{1}_{\mathbb{A}}(x)\equiv\left\{\begin{aligned} 0&\;\;\text{if}\;\;x\in\mathbb{A},\\ +\infty&\;\;\text{otherwise.}\\ \end{aligned}\right.

For the sake of completeness, we recall the main duality Theorem [Rockafellar, 1974] in the general setting, before specializing it to our problem later in the proof:

Theorem (Fenchel-Rockafellar). Let $(E,E^{*})$ and $(F,F^{*})$ be two topologically paired spaces. Let $\mathsf{O}\colon E\rightarrow F$ be a continuous linear operator and $\mathsf{O}^{*}\colon F^{*}\rightarrow E^{*}$ its adjoint. Let $f$ and $g$ be two lower semi-continuous and proper convex functions defined on $E$ and $F$, respectively. If the following qualification condition is satisfied: $\exists\;y^{*}\in\mathsf{dom}(g^{*})$ s.t. $f^{*}$ is continuous at $\mathsf{O}^{*}(y^{*})$, then:

\max_{x\in E}-f(-x)-g(\mathsf{O}(x))=\min_{y^{*}\in F^{*}}f^{*}(\mathsf{O}^{*}(y^{*}))+g^{*}(y^{*})

(58)

Proof.

Consider the primal problem in (11). Using Fubini's Theorem and the Bayesian hierarchical modelling consistency condition stated in (7), it can be shown that this original problem can be formulated equivalently, over the set of hyperpriors $\mathbb{S}$, as follows:

(P):\;\;\;\;\;\mathsf{S}^{o}\in\operatorname*{argmin}_{\mathsf{S}\in\mathbb{S}}\left\{\mathsf{D}_{\mathsf{KL}}\bigl{(}\mathsf{S}(\pi|K)||\tilde{\mathsf{S}}(\pi|K)\bigr{)}\right\}

subject to:

\left\{\begin{aligned} \mathsf{E}_{\mathsf{S}}(\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0}))\leq\eta\\ \mathsf{E}_{\mathsf{S}}(\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0}))\leq\zeta\end{aligned}\right.

where $\tilde{\mathsf{{S}}}$ is defined in (16). The constraints involve the following linear map:

\mathsf{I}(\mathsf{S})\equiv\int_{\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\mathsf{S}(\pi|K)d\mathcal{L}(\pi)

besides our usual moment constraints:

\mathsf{O}_{1}(\mathsf{S})\equiv\mathsf{E}_{\mathsf{S}}(\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0}))\;\;\;,\;\;\mathsf{O}_{2}(\mathsf{S})\equiv\mathsf{E}_{\mathsf{S}}(\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0}))

For convenience, we denote by $\bar{\mathsf{O}}$ the linear map given by:

\bar{\mathsf{O}}(\mathsf{S})\equiv(\mathsf{O}_{1}(\mathsf{S}),\mathsf{O}_{2}(\mathsf{S}),\mathsf{I}(\mathsf{S}))\in\mathbbm{R}^{3}

(59)

As usual, we can use the characteristic function to encode the constraints directly in the objective $(P)$ , yielding the following equivalent unconstrained problem:

(P^{\prime}):\;\;\;\mathsf{S}^{o}\in\operatorname*{argmin}_{\mathsf{S}}\left\{\mathsf{D}_{\mathsf{KL}}\bigl{(}\mathsf{S}(\pi|K)||\tilde{\mathsf{S}}(\pi|K)\bigr{)}+g_{0}(\bar{\mathsf{O}}(\mathsf{S}))\right\}

(60)

where we define $g_{0}$ as follows:

g_{0}(z_{0},z_{1},z_{2})\equiv\mathbb{1}_{[0,\eta]}(z_{0})+\mathbb{1}_{[0,\zeta]}(z_{1})+\mathbb{1}_{\{1\}}(z_{2})\;\;,\;\;(z_{0},z_{1},z_{2})\in\mathbbm{R}^{3}

We begin by deriving the adjoint of $\bar{\mathsf{O}}(\cdot)$ and the Legendre-Fenchel conjugates of $g_{0}(\cdot)$ and $\mathsf{D}_{\mathsf{KL}}(\cdot||\cdot)$, respectively. By the definition of the adjoint (56), applied to the linear map in (59), it is straightforward to show that $\mathsf{\bar{O}}^{*}$ is given by:

\mathsf{\bar{O}}^{*}(\lambda_{1},\lambda_{2},\lambda_{3})=\lambda_{1}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})+\lambda_{2}\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})+\lambda_{3}\;\;,\;\;(\lambda_{1},\lambda_{2},\lambda_{3})\in\mathbbm{R}^{3}

Moreover, applying the definition of Legendre-Fenchel transform (57) yields the following conjugate of $g_{0}$ :

g_{0}^{*}(\lambda_{1},\lambda_{2},\lambda_{3})=\lambda_{1}\eta+\lambda_{2}\zeta+\lambda_{3}

We now turn our attention to the conjugate of $\mathsf{D}_{\mathsf{KL}}(\cdot||\cdot)$ . To this aim, we first consider the following integral functional [Rockafellar, 1971]:

\begin{aligned}\mathcal{I}_{\mathsf{f}}\colon\mathbb{C}(\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))&\longrightarrow\mathbbm{R}\\ u&\longmapsto\int_{\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\mathsf{f}(\pi,u(\pi))d\mathcal{L}(\pi)\end{aligned}

(61)-(62)

where:

\mathsf{f}(\pi,u(\pi))\equiv\tilde{\mathsf{S}}(\pi|K)\exp\bigl{(}u(\pi)-1\bigr{)}

$\mathsf{f}(\pi,\cdot)$ is clearly an integrable, proper and convex function. As we saw earlier, the space $\mathbb{L}(\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ is decomposable. Therefore, by Theorem 2 in [Rockafellar, 1971], we can perform the Legendre-Fenchel transform of $\mathcal{I}_{\mathsf{f}}$ through the integral sign and write:

\mathcal{I}^{*}_{\mathsf{f}}(\mathsf{S})=\mathcal{I}_{\mathsf{f}^{*}}(\mathsf{S})\equiv\int_{\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\mathsf{f}^{*}(\pi,\mathsf{S}(\pi|K))d\mathcal{L}(\pi)

(63)

$\mathsf{f}^{*}$ is obtained by again using the definition of the Legendre-Fenchel transform (57):

\begin{split}\mathsf{f}^{*}(\pi,\mathsf{S}(\pi|K))&\equiv\sup_{u\in\mathbb{C}(\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))}\bigl{\{}u(\pi)\mathsf{S}(\pi|K)-\mathsf{f}(\pi,u(\pi))\bigr{\}}\\ &=\mathsf{S}(\pi|K)\log\Bigl{(}\frac{\mathsf{S}(\pi|K)}{\tilde{\mathsf{S}}(\pi|K)}\Bigr{)}\end{split}
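The closed form above follows from a pointwise stationarity argument. As a sketch, assume $\mathsf{S}(\pi|K)>0$ and $\tilde{\mathsf{S}}(\pi|K)>0$ at the given $\pi$, and write $u$ for the scalar value $u(\pi)$:

```latex
\begin{aligned}
&\frac{\partial}{\partial u}\Bigl\{u\,\mathsf{S}(\pi|K)-\tilde{\mathsf{S}}(\pi|K)\exp(u-1)\Bigr\}=0
\;\;\Longrightarrow\;\;
u^{o}=1+\log\frac{\mathsf{S}(\pi|K)}{\tilde{\mathsf{S}}(\pi|K)},\\
&\mathsf{f}^{*}(\pi,\mathsf{S}(\pi|K))
=u^{o}\,\mathsf{S}(\pi|K)-\tilde{\mathsf{S}}(\pi|K)\exp(u^{o}-1)
=\mathsf{S}(\pi|K)\Bigl(1+\log\frac{\mathsf{S}(\pi|K)}{\tilde{\mathsf{S}}(\pi|K)}\Bigr)-\mathsf{S}(\pi|K)
=\mathsf{S}(\pi|K)\log\frac{\mathsf{S}(\pi|K)}{\tilde{\mathsf{S}}(\pi|K)}.
\end{aligned}
```

The bracketed expression is strictly concave in $u$, so the stationary point $u^{o}$ attains the supremum.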

It follows that:

\mathcal{I}^{*}_{\mathsf{f}}(\mathsf{S})=\mathsf{D}_{\mathsf{KL}}(\mathsf{S}||\tilde{\mathsf{S}})

There exists at least one hyperprior $\mathsf{S}\in\mathbb{S}$ s.t. $\mathsf{f}^{*}$ is an integrable function of $\pi$ (consider for instance $\mathsf{S}=\mathsf{\tilde{S}}$ ). It follows, by Theorem 1 in [Rockafellar, 1971], that $\mathcal{I}_{\mathsf{f}}$ is a well-defined convex functional. Thus, the conjugacy operator acts as an involution on $\mathcal{I}_{\mathsf{f}}$ , yielding:

\mathsf{D}_{\mathsf{KL}}^{*}=\mathcal{I}_{\mathsf{f}}^{**}=\mathcal{I}_{\mathsf{f}}

Going back to our main Theorem in (7), it is obvious that $\mathsf{D}_{\mathsf{KL}}(\cdot||\cdot)$ and $g_{0}(\cdot)$ are lower semicontinuous, proper and convex. Furthermore, $\mathsf{D}_{\mathsf{KL}}^{*}(\cdot||\cdot)$ is continuous everywhere w.r.t. the uniform norm (Theorem 4 in [Rockafellar, 1974]). It follows that strong duality holds, i.e. the optimal values of the primal and dual problems coincide, the dual reading as follows:

\begin{split}(D):\;\;\;\;\sup_{(\lambda_{1},\lambda_{2},\lambda_{3})\in\mathbbm{R}^{3}}\left\{-\int_{\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\tilde{\mathsf{S}}(\pi|K)\exp\bigl{(}\bar{\mathsf{O}}^{*}(-\lambda_{1},-\lambda_{2},-\lambda_{3})-1\bigr{)}d\mathcal{L}(\pi)-\lambda_{1}\eta-\lambda_{2}\zeta-\lambda_{3}\right\}\end{split}

(64)

The previous result can be simplified further by maximizing (64) w.r.t. $\lambda_{3}$ for fixed $(\lambda_{1},\lambda_{2})$, yielding the following value for $\lambda^{o}_{3}$:

\lambda^{o}_{3}=\log\left(\int_{\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\tilde{\mathsf{S}}(\pi|K)\exp\left(-\lambda_{1}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})-\lambda_{2}\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})-1\right)d\mathcal{L}(\pi)\right)

(65)

By substituting $\lambda^{o}_{3}$ back into (64), we obtain (18).
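The substitution can be made explicit. As a sketch, write $\mathsf{Z}(\lambda_{1},\lambda_{2})$ for the integral below; this symbol is introduced here only for illustration, and it presumably coincides with the normalising constant $\mathsf{N}(\boldsymbol{\lambda})$ of (19), which is not restated in this appendix.

```latex
\begin{aligned}
\mathsf{Z}(\lambda_{1},\lambda_{2})
&\equiv\int_{\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}
\tilde{\mathsf{S}}(\pi|K)\exp\bigl(-\lambda_{1}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})
-\lambda_{2}\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\bigr)d\mathcal{L}(\pi),\\
0&=\frac{\partial}{\partial\lambda_{3}}
\Bigl\{-\mathsf{e}^{-\lambda_{3}-1}\,\mathsf{Z}(\lambda_{1},\lambda_{2})
-\lambda_{1}\eta-\lambda_{2}\zeta-\lambda_{3}\Bigr\}
=\mathsf{e}^{-\lambda_{3}-1}\,\mathsf{Z}(\lambda_{1},\lambda_{2})-1
\;\;\Longrightarrow\;\;
\lambda_{3}^{o}=\log\mathsf{Z}(\lambda_{1},\lambda_{2})-1,\\
(D)&:\;\;\sup_{(\lambda_{1},\lambda_{2})\in\mathbbm{R}^{2}}
\Bigl\{-\lambda_{1}\eta-\lambda_{2}\zeta-\log\mathsf{Z}(\lambda_{1},\lambda_{2})\Bigr\}.
\end{aligned}
```

The second line agrees with (65), since $\log\mathsf{Z}-1=\log(\mathsf{Z}\,\mathsf{e}^{-1})$. The last line uses $\mathsf{e}^{-\lambda_{3}^{o}-1}\,\mathsf{Z}=1$, so that the integral term contributes $-1$, which cancels against the $+1$ arising from $-\lambda_{3}^{o}$.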

The optimality condition:

0\in\partial\mathsf{D}_{\mathsf{KL}}(\mathsf{S}(\pi|K))+\partial{g_{0}}\bigl{(}\bar{\mathsf{O}}(\mathsf{S})\bigr{)}

implies that the primal and dual optimal solutions should satisfy the following extremality conditions [Rockafellar, 1967]:

\left\{\begin{aligned} \mathsf{S}^{o}\in\partial{\mathcal{I}_{\mathsf{f}}(-\mathsf{O}^{*}(\lambda_{1},\lambda_{2}))}\;\;\;\mathcal{L}\text{-a.e.}\\ (-\lambda_{1}^{o},-\lambda_{2}^{o})\in\partial{g_{0}\bigl{(}\mathsf{O}_{1}(\mathsf{S}),\mathsf{O}_{2}(\mathsf{S})\bigr{)}}\end{aligned}\right.

$\mathcal{I}_{\mathsf{f}}$ being differentiable everywhere, its sub-differential reduces to the usual gradient, leading to the same optimal hyperprior derived earlier using information processing arguments (17):

\mathsf{S}^{o}\propto\exp\left(-\lambda_{1}^{o}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})\right)\tilde{\mathsf{S}}(\pi|K)\exp\left(-\lambda_{2}^{o}\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\right)\;\;\;\mathcal{L}\text{-a.e.}
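
The exponential form can be checked pointwise. As a sketch, assume that the integrand underlying the KL objective is $\mathsf{f}(s)=s\log(s/\tilde{\mathsf{S}})$ for $s>0$. This is an assumption made here for illustration only, since $\mathcal{I}_{\mathsf{f}}$ is defined in the main text rather than in this appendix. The convex conjugate of $\mathsf{f}$ and its derivative are then:

\mathsf{f}^{*}(t)=\sup_{s>0}\left\{ts-s\log\frac{s}{\tilde{\mathsf{S}}}\right\}=\tilde{\mathsf{S}}\,e^{t-1},\qquad\frac{d\mathsf{f}^{*}}{dt}(t)=\tilde{\mathsf{S}}\,e^{t-1}

where the supremum is attained at $s=\tilde{\mathsf{S}}\,e^{t-1}$. This agrees with the integrand in (64). Evaluating the derivative at $t=-\lambda_{1}^{o}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})-\lambda_{2}^{o}\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})-\lambda_{3}^{o}$ (again an assumed form of the argument), with $\lambda_{3}^{o}$ as in (65), gives:

\mathsf{S}^{o}(\pi|K)=\frac{\tilde{\mathsf{S}}(\pi|K)\exp\left(-\lambda_{1}^{o}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})-\lambda_{2}^{o}\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\right)}{\int_{\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\tilde{\mathsf{S}}(\pi^{\prime}|K)\exp\left(-\lambda_{1}^{o}\mathsf{D}_{\mathsf{KL}}(\mu^{\prime}||\mu_{0})-\lambda_{2}^{o}\mathsf{D}_{\mathsf{KL}}(\nu^{\prime}||\nu_{0})\right)d\mathcal{L}(\pi^{\prime})}

Here $\mu^{\prime}$ and $\nu^{\prime}$ denote the marginals of the integration variable $\pi^{\prime}$. The factor $e^{-1}$ cancels between numerator and denominator, so $\lambda_{3}^{o}$ acts purely as the log-normalizer and $\mathsf{S}^{o}$ integrates to one, consistent with the proportionality stated above.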

On the other hand, the subdifferential of the indicator function $g_{0}$ is the normal cone $\bar{\mathsf{N}}_{\mathbb{Q}}\bigl(\mathsf{O}_{1}(\mathsf{S}),\mathsf{O}_{2}(\mathsf{S})\bigr)$, defined as follows:

\bar{\mathsf{N}}_{\mathbb{Q}}\bigl(\mathsf{O}_{1}(\mathsf{S}),\mathsf{O}_{2}(\mathsf{S})\bigr)\equiv\left\{v\in\mathbbm{R}^{2}\;\;|\;\;v^{\intercal}\left[\boldsymbol{x}-\begin{bmatrix}\mathsf{O}_{1}(\mathsf{S})\\ \mathsf{O}_{2}(\mathsf{S})\end{bmatrix}\right]\preceq 0,\forall\boldsymbol{x}\in[0,\eta]\times[0,\zeta]\right\}

(66)

Substituting the particular choice $\boldsymbol{x}=(\eta,\zeta)$ into (66) yields the following optimality conditions:

\left\{\begin{aligned} \lambda_{1}^{o}\Bigl(\eta-\mathsf{E}_{\mathsf{S}}\bigl(\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})\bigr)\Bigr)\geq 0\\ \lambda_{2}^{o}\Bigl(\zeta-\mathsf{E}_{\mathsf{S}}\bigl(\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\bigr)\Bigr)\geq 0\end{aligned}\right.

Thus: $\boldsymbol{\lambda}\equiv(\lambda_{1}^{o},\lambda_{2}^{o})\succeq 0$ . ∎

References

Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223. PMLR, 06–11 Aug 2017.
Courty et al. [2017] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017. doi: 10.1109/TPAMI.2016.2615921.
Mathon et al. [2014] Benjamin Mathon, François Cayre, Patrick Bas, and Benoit Macq. Optimal transport for secure spread-spectrum watermarking of still images. IEEE Transactions on Image Processing, 23:1694–1705, 2014. doi: 10.1109/TIP.2014.2305873.
Guerreiro et al. [2023] Nuno M. Guerreiro, Pierre Colombo, Pablo Piantanida, and André F. T. Martins. Optimal transport for unsupervised hallucination detection in neural machine translation, 2023.
Galichon [2016] Alfred Galichon. Optimal Transport Methods in Economics. Princeton University Press, 2016.
Saumier et al. [2015] Louis-Philippe Saumier, Boualem Khouider, and Martial Agueh. Optimal transport for particle image velocimetry: Real data and postprocessing algorithms. SIAM Journal on Applied Mathematics, 75(6):2495–2514, 2015. ISSN 00361399.
El Moselhy and Marzouk [2012] Tarek A. El Moselhy and Youssef M. Marzouk. Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815–7850, 2012. ISSN 0021-9991. doi: 10.1016/j.jcp.2012.07.022.
Villani [2008] Cédric Villani. Optimal Transport: Old and New. Springer, 2008.
Peyré and Cuturi [2019] Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
Ben-Tal et al. [2009] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton University Press, 2009. ISBN 9781400831050. doi: 10.1515/9781400831050.
Sklar [1959] Abe Sklar. Fonctions de répartition à n dimensions et leurs marges. pages 229–231. Publications de l’Institut de Statistique de l’Université de Paris, 8, 1959.
Goodman [1953] Leo Goodman. Ecological regressions and behavior of individuals. American Sociological Review, 18:663, 1953.
Wakefield [2004] Jon Wakefield. Ecological inference for 2x2 tables (with discussion). Journal of the Royal Statistical Society Series A, 167:385–445, 08 2004. doi: 10.1111/j.1467-985x.2004.02046.x.
Frogner and Poggio [2019] Charlie Frogner and Tomaso Poggio. Fast and flexible inference of joint distributions from their marginals. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2002–2011. PMLR, 09–15 Jun 2019.
Mallasto et al. [2021] Anton Mallasto, Markus Heinonen, and Samuel Kaski. Bayesian inference for optimal transport with stochastic cost. In Proceedings of The 13th Asian Conference on Machine Learning, volume 157 of Proceedings of Machine Learning Research, pages 1601–1616. PMLR, 2021.
Kárný and Kroupa [2012] Miroslav Kárný and Tomáš Kroupa. Axiomatisation of fully probabilistic design. Information Sciences, 186(1):105–113, 2012. ISSN 0020-0255. doi: 10.1016/j.ins.2011.09.018.
Séjourné et al. [2023] Thibault Séjourné, Gabriel Peyré, and François-Xavier Vialard. Chapter 12 - Unbalanced optimal transport, from theory to numerics. In Emmanuel Trélat and Enrique Zuazua, editors, Numerical Control: Part B, volume 24 of Handbook of Numerical Analysis, pages 407–471. Elsevier, 2023. doi: 10.1016/bs.hna.2022.11.003.
Cuturi [2013] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
Quinn et al. [2025] Anthony Quinn, Sarah Boufelja Yacobi, Martin Corless, and Robert Shorten. Fully probabilistic design for optimal transport. Communications in Optimization Theory, 2025. To appear.
Jeffreys [1939] Harold Jeffreys. Theory of Probability. Clarendon Press, Oxford, England, 1939.
Carlier et al. [2017] Guillaume Carlier, Vincent Duval, Gabriel Peyré, and Bernhard Schmitzer. Convergence of entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1385–1418, 2017. doi: 10.1137/15M1050264.
Quinn et al. [2016] Anthony Quinn, Miroslav Kárný, and Tatiana V. Guy. Fully probabilistic design of hierarchical Bayesian models. Information Sciences, 369:532–547, 2016. ISSN 0020-0255. doi: 10.1016/j.ins.2016.07.035.
Bissiri et al. [2016] Pier Giovanni Bissiri, Chris Holmes, and Stephen Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):1103–1130, feb 2016. doi: 10.1111/rssb.12158.
Savage [1971] Leonard J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971. doi: 10.1080/01621459.1971.10482346.
Rockafellar [1967] Ralph Tyrrell Rockafellar. Duality and stability in extremum problems involving convex functions. Pacific Journal of Mathematics, 21(1):167 – 187, 1967.
Kracík and Kárný [2005] Jan Kracík and Miroslav Kárný. Merging of data knowledge in Bayesian estimation. In International Conference on Informatics in Control, Automation and Robotics, volume 2, pages 229–232, 2005.
Kolmogorov and Sarmanov [1960] Andrey Nikolaevich Kolmogorov and Oleg Vasilévich Sarmanov. The work of S. N. Bernshtein on the theory of probability. Theory of Probability & Its Applications, 5(2):197–203, 1960. doi: 10.1137/1105017.
Delahaye et al. [2019] Daniel Delahaye, Supatcha Chaimatanan, and Marcel Mongeau. Simulated Annealing: From Basics to Applications, pages 1–35. Springer International Publishing, Cham, 2019. ISBN 978-3-319-91086-4. doi: 10.1007/978-3-319-91086-4_1.
Quinn [2012] Anthony Quinn. Recursive inference for inverse problems using variational Bayes methodology. In 1st International ICST Workshop on New Computational Methods for Inverse Problems. ACM, 2012.
Nocedal and Wright [2006] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, 2nd edition, 2006.
Nesterov [2018] Yurii Nesterov. Lectures on Convex Optimization. Springer Publishing Company, Incorporated, 2nd edition, 2018. ISBN 3319915770.
Betancourt [2017] Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv: Methodology, 2017.
Mangoubi and Smith [2019] Oren Mangoubi and Aaron Smith. Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions 2: Numerical integrators. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 586–595. PMLR, 16–18 Apr 2019.
Snyman [2005] Jan Snyman. A gradient-only line search method for the conjugate gradient method applied to constrained optimization problems with severe noise in the objective function. International Journal for Numerical Methods in Engineering, 62:72 – 82, 01 2005. doi: 10.1002/nme.1189.
Chen and Vempala [2022] Zongchen Chen and Santosh S. Vempala. Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. Theory of Computing, 18(9):1–18, 2022. doi: 10.4086/toc.2022.v018a009.
Barocas et al. [2023] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and machine learning: Limitations and opportunities. MIT press, 2023.
Gordaliza et al. [2019] Paula Gordaliza, Eustasio Del Barrio, Fabrice Gamboa, and Jean-Michel Loubes. Obtaining fairness using optimal transport theory. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2357–2365. PMLR, 09–15 Jun 2019.
Hughes and Chen [2021] Jason Hughes and Juntao Chen. Fair and distributed dynamic optimal transport for resource allocation over networks. In 2021 55th Annual Conference on Information Sciences and Systems (CISS), pages 1–6, 2021. doi: 10.1109/CISS50987.2021.9400236.
Galichon [2021] Alfred Galichon. The unreasonable effectiveness of optimal transport in economics. arXiv preprint arXiv:2107.04700, 2021.
Echenique et al. [2024] Federico Echenique, Joseph Root, and Fedor Sandomirskiy. Stable matching as transportation. In Proceedings of the 25th ACM Conference on Economics and Computation, EC ’24, page 418, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400707049. doi: 10.1145/3670865.3673585.
Hoffman and Gelman [2011] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1593–1623, 2011.
Flamary et al. [2021] Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. POT: Python optimal transport. J. Mach. Learn. Res., 22(1), January 2021. ISSN 1532-4435.
Jelinek et al. [1977] Frederick Jelinek, Robert L. Mercer, Lalit R. Bahl, and Janet M. Baker. Perplexity—a measure of the difficulty of speech recognition tasks. Journal of the Acoustical Society of America, 62, 1977.
Martí et al. [2022] Rafael Martí, Anna Martínez-Gavara, Sergio Pérez-Peló, and Jesús Sánchez-Oro. A review on discrete diversity and dispersion maximization from an OR perspective. European Journal of Operational Research, 299(3):795–813, 2022. ISSN 0377-2217. doi: 10.1016/j.ejor.2021.07.044.
Gangbo and McCann [1996] Wilfrid Gangbo and Robert J. McCann. The geometry of optimal transportation. Acta Mathematica, 177(2):113 – 161, 1996. doi: 10.1007/BF02392620.
Folland [1999] Gerald Budge Folland. Real Analysis : Modern Techniques and Their Applications. Wiley, New York, 1999.
Billingsley [1999] Patrick Billingsley. Convergence of probability measures. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons Inc., New York, second edition, 1999. ISBN 0-471-19745-9. A Wiley-Interscience Publication.
Rockafellar [1971] Ralph Tyrrell Rockafellar. Integrals which are convex functionals. II. Pacific Journal of Mathematics, 39(2):439 – 469, 1971.
Rockafellar [1974] Ralph Tyrrell Rockafellar. Conjugate duality and optimization. Society for Industrial and Applied Mathematics, 1974.
