Randomized Transport Plans via Hierarchical Fully Probabilistic Design

Sarah Boufelja Y. (1), Anthony Quinn (1, 2), Robert Shorten (1)
(1) Dyson School of Design Engineering, Imperial College London
(2) School of Engineering, Trinity College Dublin
Corresponding author.
{s.boufelja21,a.quinn,r.shorten}@imperial.ac.uk

Abstract

An optimal randomized strategy for the design of balanced, normalized mass transport plans is developed. It replaces, but specializes to, the deterministic, regularized optimal transport (OT) strategy, which yields only a certainty-equivalent plan. The incompletely specified, and therefore uncertain, transport plan is acknowledged to be a random process. Therefore, hierarchical fully probabilistic design (HFPD) is adopted, yielding an optimal hyperprior supported on the set of possible transport plans, and consistent with prior mean constraints on the marginals of the uncertain plan. This Bayesian resetting of the design problem for transport plans, which we call HFPD-OT, confers new opportunities. These include (i) a strategy for the generation of a random sample of joint transport plans; (ii) randomized marginal contracts for individual source-target pairs; and (iii) consistent measures of uncertainty in the plan and its contracts. An application in fair market matching is outlined, in which HFPD-OT enables the recruitment of a more diverse subset of contracts than is possible in classical OT into the delivery of an expected plan.

Keywords: Optimal transport, Bayesian hierarchical modelling, Fully probabilistic design, Convex optimization, Algorithmic fairness, Market matching

1 Main Contributions

Optimal transport (OT) refers to the classical design of a deterministic transport plan, $\pi$, for taking a unit mass, distributed across a source domain, $\mathbb{\Omega}_{X}$, and redistributing it across a target domain, $\mathbb{\Omega}_{Y}$. (Throughout this paper, we address only the balanced, normalized transport problem.) The transport plan is expressed as an unknown, deterministic, joint distribution, $\pi$, with support in $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$. The distributed source and target are therefore the marginals of $\pi$, and are specified a priori by $\mu_{0}$ and $\nu_{0}$ on $\mathbb{\Omega}_{X}$ and $\mathbb{\Omega}_{Y}$, respectively.
Consequently, $\pi$ is confined to the space, $\mathbb{\Pi}(\mu_{0},\nu_{0})$, of distributions on $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, with $\mu_{0}$ and $\nu_{0}$ as its marginals. An optimal choice, $\pi^{o}$, of $\pi$, called the OT plan, is achieved by minimizing the expected value, under $\pi$, of a pre-specified cost of transport, $\mathsf{C}(x,y)$, from $\mathbb{\Omega}_{X}$ to $\mathbb{\Omega}_{Y}$.

In this paper, we reformulate the design of transport maps in the Bayesian (i.e. fully probabilistic) way. In particular, deterministic optimization, yielding $\pi^{o}$, is replaced by the hierarchical fully probabilistic design (HFPD) of an optimal randomized decision-making strategy, $\pi\sim\mathsf{S}^{o}$ (i.e. a hyperprior), for choosing $\pi$. This approach recognizes that the unknown transport plan, $\pi$, is a (generally nonparametric) random process. We therefore equip it with a prior, $\mathsf{S}(\pi|K)$, where $K$ denotes marginal (mean) knowledge constraints which will be detailed in the sequel. Following the axioms of FPD at this hierarchical level (i.e. HFPD), we equip the space, $\mathbb{S}_{K}$, of $\mathsf{S}$ (the randomized strategy for choosing the transport plan, $\pi$) with an appropriately formulated loss function, and we minimize the expected value of the latter under $\mathsf{S}$. This yields the optimal randomized strategy, $\mathsf{S}^{o}(\pi|K)$, for choosing $\pi$, being also the optimal hyperprior for uncertain $\pi$. We show that this procedure is equivalent to minimization of a Kullback-Leibler divergence (KLD), leading to a Gibbs form for $\mathsf{S}^{o}(\pi|K)$:

\[
\mathsf{S}^{o}(\pi|K)\propto\mathsf{S}_{\mathsf{I}}(\pi|K)\,e^{-\mathsf{D}_{\mathsf{KL}}(\pi||\pi_{\mathsf{I}})}\,e^{-\lambda_{1}^{o}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})}\,e^{-\lambda_{2}^{o}\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})}\in\mathbb{S}_{K}. \tag{1}
\]

Here, $\mu$ and $\nu$ are the uncertain (i.e. random) marginals of the random transport plan, $\pi$. The KLDs, $\mathsf{D}_{\mathsf{KL}}(\cdot||\cdot)$, act as Gibbs energies. Meanwhile, $\mathsf{S}_{\mathsf{I}}$ and $\pi_{\mathsf{I}}$ are the freely but necessarily pre-specified zero-loss choices of $\mathsf{S}$ and $\pi$, respectively, referred to as the ideal or target choices.
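To make the structure of (1) concrete in the finite case, the following minimal sketch evaluates the logarithm of its right-hand side, up to the normalizing constant, for a plan stored as a matrix. Several ingredients are assumptions made only for illustration and are not taken from the paper: the ideal hyperprior $\mathsf{S}_{\mathsf{I}}$ is treated as flat (`log_s_ideal=0.0`), the ideal plan $\pi_{\mathsf{I}}$ is uniform, all entries are strictly positive, and the multipliers `lam1`, `lam2` are arbitrary positive numbers standing in for the optimal $\lambda_{1}^{o}$, $\lambda_{2}^{o}$. The function and variable names are hypothetical.

```python
import numpy as np

def kl(p, q):
    """Discrete KLD, D_KL(p || q), assuming strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def log_gibbs_unnormalized(pi, pi_ideal, mu0, nu0, lam1, lam2, log_s_ideal=0.0):
    """Log of the right-hand side of (1), without the normalizing constant.

    lam1, lam2 are placeholders for the optimal multipliers in (1);
    log_s_ideal = 0 corresponds to an (assumed) flat ideal hyperprior.
    """
    mu = pi.sum(axis=1)  # source marginal of the candidate plan
    nu = pi.sum(axis=0)  # target marginal of the candidate plan
    return (log_s_ideal
            - kl(pi, pi_ideal)
            - lam1 * kl(mu, mu0)
            - lam2 * kl(nu, nu0))

mu0 = np.array([0.5, 0.5])                 # nominal source marginal
nu0 = np.array([0.25, 0.75])               # nominal target marginal
pi_ideal = np.full((2, 2), 0.25)           # assumed uniform ideal plan
pi_matched = np.outer(mu0, nu0)            # marginals equal mu0 and nu0
pi_mismatched = np.array([[0.4, 0.1],
                          [0.1, 0.4]])     # target marginal differs from nu0

for lam in (1.0, 10.0):
    print(lam,
          log_gibbs_unnormalized(pi_matched, pi_ideal, mu0, nu0, lam, lam),
          log_gibbs_unnormalized(pi_mismatched, pi_ideal, mu0, nu0, lam, lam))
```

In this sketch, larger multipliers leave the score of a plan with matching marginals unchanged and penalize a plan whose marginals deviate from $\mu_{0}$ or $\nu_{0}$ more heavily, which is the role the marginal KLD terms play as Gibbs energies.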

By resetting OT as a problem of Bayesian decision making via HFPD, we achieve the following principal goals:

  • (i)

    The deterministic, regularized OT choice, $\pi^{o}$, obtained via constrained optimization at the base level of modelling, $(x,y)\sim\pi$, is replaced by an optimal generator of randomized plans (i.e. a randomized strategy for designing transport plans, $\pi$) at the hierarchical level of complete modelling, $\pi\sim\mathsf{S}^{o}(\pi|K)$.

  • (ii)

    In the parametric case, in which the support set, $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, is finite, we can compute optimal univariate (marginal and/or conditional) distributions, $\pi_{i,j}\sim\mathsf{S}^{o}_{i,j}$, for modelling and randomization of the transport contract, $\pi_{i,j}\in(0,1)$, from the agent (at) $x_{i}$ to the agent (at) $y_{j}$.

  • (iii)

    In line with all Bayesian decision-making strategies, we can summarize $\pi\sim\mathsf{S}^{o}(\pi|K)$ via a certainty-equivalent (CE) transport plan, $\hat{\pi}\in\mathbb{\Pi}_{K}$, such as its expected or maximally probable value, and equip this with a summary of our uncertainty in $\pi$ (e.g. via the Bayesian standard intervals for the contracts, $\pi_{i,j}\sim\mathsf{S}^{o}_{i,j}$).
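The kind of summary described in (iii) can be sketched numerically once a sample of plans is available. The snippet below is only an illustration under loud assumptions: the sampler is a flat Dirichlet over the $m\times n$ contracts, used purely as a stand-in for the optimal hyperprior $\mathsf{S}^{o}(\pi|K)$, and equal-tailed 95% quantile intervals are used as one possible interval summary for each contract. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 3  # numbers of source and target agents (illustrative)

# Stand-in sampler (an assumption, NOT the paper's hyperprior):
# each draw is a random m-by-n plan whose entries sum to one.
samples = rng.dirichlet(np.ones(m * n), size=5000).reshape(-1, m, n)

# Certainty-equivalent summary: the expected plan (Monte Carlo estimate).
expected_plan = samples.mean(axis=0)

# Uncertainty summary: equal-tailed 95% interval for each contract pi_{i,j}.
lower, upper = np.quantile(samples, [0.025, 0.975], axis=0)

print(expected_plan)
print(lower)
print(upper)
```

With draws from the actual hyperprior in place of the stand-in, the same three arrays would give an expected plan together with a per-contract measure of uncertainty.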

By equipping transport plans with an optimal hyperprior from which candidate plans can be generated, we are able to encode our prior knowledge and our ranking of preferences. This HFPD resetting of OT can have significant impact in applications. We consider one such application, in algorithmic fairness. Specifically, we address the problem of labour market matching, in which fairness is induced by optimally randomizing the matching strategy (a transport plan) via HFPD-OT, thereby increasing a diversity index among contracts.

2 Introduction to transport plan design and optimal transport

Optimal Transport (OT) techniques have received increasing attention in the past decade, in a wide range of domains such as machine learning and generative adversarial learning [Arjovsky et al., 2017], domain adaptation [Courty et al., 2017], image processing and watermarking [Mathon et al., 2014], and hallucination detection in neural machine translation [Guerreiro et al., 2023]. In addition to traditional applications in economics and market matching [Galichon, 2016], and in fluid mechanics and diffusion processes [Saumier et al., 2015], OT has also been used to perform sampling and Bayesian inference [El Moselhy and Marzouk, 2012].

OT is concerned with the least costly transport plan (in expectation) between a source and a target probability measure. The unregularized OT plan induces a natural distance in the space of probability measures (the Kantorovitch-Rubinstein distance) [Villani, 2008], introducing a rich topological structure by lifting key geometric properties associated with the ground metric to the space of probability measures [Villani, 2008, Peyré and Cuturi, 2019]. For example, if the ground space is Euclidean, concepts like gradient, barycentre and convexity are naturally extended to the space of probability measures.

Notwithstanding the wide range of applications, the classical formalism of OT confines it to a purely deterministic setting, which regards the transport plan as a crisp object and assumes perfect knowledge of the marginals (Figure 1(a)). It fails to model and (critically) translate uncertainty in the marginals to uncertainty in the design of transport plans. In this regard, classical OT is an instance of certainty-equivalence (CE) decision making, which produces myopic transport strategies that do not account for the uncertain and random nature of many real systems. One might think that recasting the classical OT problem in terms of robust optimization might address these issues. A robust optimization formulation relies on a deterministic, unknown but bounded description of the uncertainty in the marginals [Ben-Tal et al., 2009]. Such a design choice may be overly conservative: it indeed considers all possible outcomes in the uncertainty set, but may assign non-negligible weights even to plans that are highly improbable. Furthermore, the robust design is not equipped with a quantifier of the intrinsic uncertainty of the transport plan.

In this manuscript, we propose the HFPD-OT approach to the design of uncertain transport plans. It departs from the conventional OT setting by considering the transport plan as a random process. Consistent hierarchical Bayesian modelling endows the uncertain plan with its own hyperprior (Figure 1(b)). Its optimal choice provides a randomized strategy for choosing transport plans in the space of plans consistent with prior-imposed knowledge constraints on its marginals. It also acts as a generative model for random sampling of transport plans. By treating transport plans as random processes, we effectively recast the transport design problem as one of inference. This contrasts with the OT literature, which is only concerned with deterministic optimization strategies for choosing deterministic plans. As we will see in the literature review, next, the tools provided by HFPD-OT—intended for modelling and reasoning about uncertainty in transport plans—are not available in the classical OT setting.

2.1 Approaches to modelling uncertainty in OT

There are precedents in eliciting and processing uncertainty in OT, but they are generally couched in terms of base-level modelling, and not in terms of the hierarchical Bayesian approach developed here. Specifically, (i) our method is primarily concerned with the design of a fully probabilistic model over the space of transport plans; (ii) as such, the transport plan is modelled as a (generally nonparametric) process endowed with its own (hyper)prior; and (iii) we rely on randomization techniques for choosing plans, in contrast to existing methods which are mainly based on deterministic optimization techniques.

Copulas [Sklar, 1959] are historically among the first methods proposed for the design of multivariate distributions with arbitrary, but perfectly known, marginals. Other techniques relaxed this assumption to address situations where exactly one marginal is uncertain. This is the case in [Goodman, 1953], for instance, where the authors model the uncertainty in one marginal with Gaussian noise. In ecological inference (a case of parametric transport design on a finite support), [Wakefield, 2004] studied the case where one marginal is uncertain, adopting a hierarchical multinomial-Dirichlet-based model. We highlight two distinctions in our work: (i) we do not impose a parametric constraint in general, and we allow for uncertainty in both marginals; and (ii) the authors of the earlier paper pursue markedly different statistical inference objectives from OT.

Interestingly, the connection between ecological inference and OT was not established until later, in [Frogner and Poggio, 2019], where the authors extended the previous model and studied the case where both marginals are uncertain. The questions we address in this paper again differ from those in [Frogner and Poggio, 2019] in the following ways: (i) they solve a base-level MAP optimization problem using a Bregman projection method, once again recovering a certainty-equivalent OT plan, whereas our primary goal is to depart from such a certainty-equivalence setting and design an optimal hierarchical Bayesian model from which random transport plans can be generated and used in lieu of an OT plan. If required (as we will see), the expected plan takes the place of the MAP plan as the Bayesian minimum-risk decision (i.e. estimate) of the uncertain plan, with asymptotic convergence to the MAP plan; and (ii) the derivations in [Frogner and Poggio, 2019] rely on parametric and structural assumptions, mainly full separability. Separability is a strong assumption in that it excludes the modelling of rich structures and interactions that may exist in real-world data. We do not require these assumptions in our hyperprior, and we leave it to the designer—via the specification of ideal designs (to be explained in the sequel)—to impose any relevant structural requirements.

Uncertainty in the cost matrix in the finite case is considered by [Mallasto et al., 2021]. Given a finite sample of these cost matrices, they model the induced uncertainty in the (finitely supported) OT plan. They do not allow for any uncertainty in the marginals, and so their distribution over OT plans is geometrically constrained to the OT polytope. They impose various standard parametric priors over this set, without any optimality claims for them. Our work significantly extends this treatment by modelling uncertainty in the marginals, so that our hierarchical model has support in the geometrically unconstrained space of transport plans, and extends to the nonparametric setting of continuously supported plans. Importantly, and in contrast to [Mallasto et al., 2021], we do not impose an optimality constraint on the base-level plans themselves, but, rather, on the hierarchical (generative) distribution of (all possible) plans, $\mathsf{S}^{o}(\pi|K)$ (1). In this way, the random generator of the plans, $\mathsf{S}^{o}(\pi|K)$, is optimal, and not the uncertain transport plan, $\pi$, itself (although subsequent projections of $\mathsf{S}^{o}(\pi|K)$ can yield optimal Bayesian decisions about $\pi$, in the conventional manner of Bayesian decision-making). The main contribution of our work is to deduce this optimal hyperprior for transport design (1) via the foundational methods of fully probabilistic design (FPD) [Kárný and Kroupa, 2012]. We show how this HFPD-OT hyperprior concentrates to the classical regularized OT solution as uncertainty in the marginals diminishes (28).

An interesting line of work on unbalanced OT (UOT) in [Séjourné et al., 2023] relaxes the strict marginal constraints, $\mathbb{\Pi}(\mu_{0},\nu_{0})$, and replaces them by a soft penalization, using Kullback-Leibler balls centred on the nominal marginals (as we do in this paper). This ensures feasibility of the UOT problem, allowing transport between unequal (non-probability) measures (which we do not allow in our work). Once again, their solution involves a base-level deterministic optimization.

Finally, entropy-regularized OT (EOT) [Cuturi, 2013] is a foundational work on deterministic OT that will be recovered asymptotically via HFPD-OT. In EOT, the classical OT linear program is relaxed by means of an entropy regularization term, yielding a strictly convex problem, which is amenable to efficient matrix scaling algorithms, notably Sinkhorn-Knopp [Cuturi, 2013]. In our own recent paper [Quinn et al., 2025], we formally establish the relationship between base-level EOT and fully probabilistic design (FPD), where EOT is posed under the usual deterministic marginal constraints and therefore yields a certainty-equivalent (i.e. singular) OT plan, $\pi^{o}$, in the conventional manner. In this paper, our goal is to extend the base-level EOT setting by deriving an optimal hyperprior, $\mathsf{S}^{o}(\pi|K)$ (1), over the set of uncertain transport plans.
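For readers unfamiliar with the base-level EOT solution referred to above, the following is a minimal sketch of Sinkhorn-Knopp matrix scaling on a small finite problem. The cost matrix, marginals, regularization strength `eps`, and iteration count are illustrative assumptions rather than values from this paper, and no numerical safeguards (such as log-domain updates) are included.

```python
import numpy as np

def sinkhorn(cost, mu0, nu0, eps=0.5, n_iter=500):
    """Entropy-regularized OT plan via Sinkhorn-Knopp scaling (sketch).

    cost: (m, n) transport costs; mu0: (m,) and nu0: (n,) fixed marginals;
    eps: entropy regularization strength (an illustrative choice).
    """
    kernel = np.exp(-cost / eps)       # Gibbs kernel
    u = np.ones_like(mu0)
    v = np.ones_like(nu0)
    for _ in range(n_iter):
        u = mu0 / (kernel @ v)         # rescale rows to match mu0
        v = nu0 / (kernel.T @ u)       # rescale columns to match nu0
    return u[:, None] * kernel * v[None, :]

# Illustrative problem: squared-distance cost between points on a line.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0])
cost = (x[:, None] - y[None, :]) ** 2
mu0 = np.array([0.2, 0.5, 0.3])
nu0 = np.array([0.6, 0.4])

plan = sinkhorn(cost, mu0, nu0)
print(plan)
print(plan.sum(axis=1), plan.sum(axis=0))
```

The output is a single deterministic plan whose marginals match `mu0` and `nu0` up to numerical tolerance, that is, a certainty-equivalent plan in the terminology of this paper, in contrast to the distribution over plans targeted by HFPD-OT.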

2.2 Notational conventions, technical preliminaries for non-hierarchical OT, and outline of the paper

In the following, we will review the key mathematical conventions used throughout the paper. Specifically, all probability measures will be referred to as (probability) distributions. The context will make clear whether the distribution in question is a probability density function (pdf) or a probability mass function (pmf). A superscript $o$ refers to optimal distributions, e.g. $\mathsf{S}^{o}$, whereas a subscript $\mathsf{I}$ designates ideal distributions, e.g. $\mathsf{S}_{\mathsf{I}}$. Moreover, all fixed and prior-elicited quantities are referred to using a subscript $0$ ($\mu_{0}$, $\nu_{0}$, etc.). Sets will be denoted by a blackboard typeface (e.g. $\mathbb{\Omega}_{X},\mathbb{\Omega}_{Y},\mathbb{M}$, etc.), and deterministic functionals will be denoted by a math sans serif typeface (e.g. $\mathsf{S}$, $\mathsf{C}$, $\mathsf{D}$, etc.). Instantiated distributions will be assigned a math calligraphic typeface ($\mathcal{U}$, $\mathcal{N}$, etc.).

  • The conventional non-hierarchical probability space (triple), which we call the base level, is $(\mathbb{\Omega},\mathcal{F},\mathcal{P})$, where $\mathbb{\Omega}$ is the sample space, $\mathcal{F}$ denotes the ($\sigma$-)algebra of measurable subsets of $\mathbb{\Omega}$, and $\mathcal{P}$ is a probability measure defined on $\mathcal{F}$.

  • Consider two random variables (rvs), $X:\mathbb{\Omega}\mapsto\mathbb{\Omega}_{X}$ and $Y:\mathbb{\Omega}\mapsto\mathbb{\Omega}_{Y}$, whose images, $\mathbb{\Omega}_{X}$ and $\mathbb{\Omega}_{Y}$, are, respectively, compact subsets of topological spaces of unspecified dimensions. In the standard setting of optimal transport (OT) [Villani, 2008, Peyré and Cuturi, 2019], their marginal distributions under $(\mathbb{\Omega},\mathcal{F},\mathcal{P})$ are prior-specified (i.e. known) to be $\mu_{0}\in\mathbbm{P}(\mathbb{\Omega}_{X})$ and $\nu_{0}\in\mathbbm{P}(\mathbb{\Omega}_{Y})$, respectively, while their joint distribution, $\pi\in\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$, is unknown, and is the subject of design.

  • The reference measure in $(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ is denoted by $\lambda(x,y)$. Depending on the context, $\lambda$ denotes either the Lebesgue measure (in the continuous case) or the counting measure (in the discrete case). The distributions $\pi$, $\mu_{0}$ and $\nu_{0}$ are absolutely continuous w.r.t. $\lambda$. We do not distinguish notationally between a probability measure and its Radon-Nikodym derivative w.r.t. $\lambda$, e.g. $\frac{d\pi}{d\lambda}\equiv\pi$, etc., and we refer to all as distributions.

  • The prior-specified marginal constraints, $\mu_{0}$ and $\nu_{0}$, constrain $\pi$ to the following knowledge-constrained set:

    \[
    \mathbb{\Pi}(\mu_{0},\nu_{0})\equiv\Bigl\{\pi\in\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})\;\Big|\;\int_{\mathbb{\Omega}_{Y}}\pi\,d\lambda(y)\equiv\mu_{0},\;\;\int_{\mathbb{\Omega}_{X}}\pi\,d\lambda(x)\equiv\nu_{0}\Bigr\}
    \]
  • Consider an alternative distribution, $\zeta\in\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$. The Kullback-Leibler divergence (KLD) of $\zeta$ to $\pi$ is:

    \[
    \mathsf{D}_{\mathsf{KL}}(\pi||\zeta)\equiv\begin{cases}\displaystyle\int_{\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}}\pi(x,y)\log\Bigl(\frac{\pi(x,y)}{\zeta(x,y)}\Bigr)\,d\lambda(x,y)&\text{if }\pi\ll\zeta,\\+\infty&\text{otherwise,}\end{cases} \tag{2}
    \]

    where $\pi\ll\zeta$ indicates the absolute continuity (a.c.) of $\pi$ w.r.t. $\zeta$.

  • If $\mathsf{q}$ is an integrable function with domain $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, then its expectation w.r.t. $\pi$ is defined as

    \[
    \mathsf{E}_{\pi}\left[\mathsf{q}\right]\equiv\int_{\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}}\mathsf{q}(x,y)\,\pi(x,y)\,d\lambda(x,y)\;<\infty
    \]
  • $K$, as in $\mathsf{S}(\pi|K)$ for example, is Jeffreys’ notation [Jeffreys, 1939], encoding the knowledge which acts as a condition on a probability model. It effectively confines $\mathsf{S}$ to a particular knowledge-constrained set, $\mathbb{S}_{K}$. Its specific meaning will be defined in context, at both the base level and hierarchical level, as appropriate.

  • $\mathsf{supp}(\mu)$ denotes the support of the distribution, $\mu$.

  • $\langle\cdot,\cdot\rangle$ denotes the standard inner product between vectors in a Euclidean space. When required, it will be generalized to the canonical duality pairing in the infinite-dimensional setting.

  • $\succeq$ denotes an element-wise comparison between vectors $u,v\in\mathbb{R}^{p}$: $u\succeq v\iff u_{i}\geq v_{i},\;\forall i\in\{1,2,\dots,p\}$. Other relational operators between vectors should also be understood element-wise.

  • The indicator function of a set $\mathbb{A}$ is:

    $$\chi_{\mathbb{A}}(x)\equiv\begin{cases}1 & \text{if } x\in\mathbb{A},\\ 0 & \text{otherwise.}\end{cases}$$
  • $\delta_{x_{0}}(x)$ denotes the distribution that is singular at $x=x_{0}$, being the Dirac delta-function w.r.t. Lebesgue measure in the case of continuous $x$.

  • $\Delta_{q}$, $1\leq q<\infty$, denotes the open probability simplex of dimension $q$. If $q>1$ and $x\in\Delta_{q}$, then the support of the conditional distribution, $\mathsf{F}(x_{\backslash i}|x_{i})$, is denoted by $(1-x_{i})\Delta_{q-1}$, $0<x_{i}<1$.

The outline of the paper is as follows. In Section 3, we state the mathematical problem and establish the duality result in the infinite dimensional case, hence deriving a formal characterization of the optimal Bayesian hyperprior (1). Section 4 introduces the parametric hyperprior, and we provide a descriptive analysis in a low dimensional setting in Section 4.1. Meanwhile, Section 4.2 proposes an algorithm for the computation of the optimal Kantorovitch potentials in this parametric setting. In Section 5, we apply the HFPD-OT formalism to a market matching problem in order to improve a contract diversity index, before closing the paper with our main conclusions in Section 6.

(a) In the conventional base-level OT setting, the transport plan, $\pi$, is deterministic, and so all the contracts, $\pi_{i,j}\in[0,1]$, are as well. Their respective (marginal) distributions are therefore singular at $\pi^{o}_{i,j}$, where $\pi^{o}$ denotes the OT plan (5).
(b) HFPD-OT acknowledges that uncertainty in $\mu$ and $\nu$ induces uncertainty in the transport plan, $\pi$, and therefore in the individual contracts, $\pi_{i,j}$. Hierarchical fully probabilistic design (HFPD) endows $\pi$ with an optimal hyperprior, $\pi\sim\mathsf{S}^{o}(\pi|K)$, whose marginals, $\mathsf{S}^{o}(\pi_{i,j})$, are the distributions of the contracts, $\pi_{i,j}\in[0,1]$.
(c) For a fixed $y_{0}\in\mathbb{\Omega}_{Y}$, HFPD-OT also acknowledges the conditional plans, $\pi_{x|y_{0}}$, as random processes, again equipped with their own optimal distributions, $\mathsf{S}^{o}(\pi_{x|y_{0}})$, consistent with $\pi\sim\mathsf{S}^{o}(\pi|K)$.
Figure 1: Schematics which distinguish conventional base-level (i.e. deterministic) OT, in (a), from HFPD-OT, in (b) and (c). For ease of illustration, we consider the finite dimensional specialization in Section 4, but the ideas extend to the continuous setting. In HFPD-OT, uncertainty in the marginals, $\mu_{0}$ and $\nu_{0}$, induces uncertainty in the joint ($\pi$) and conditional ($\pi_{x|y_{0}}$) transport plans, as well as in the individual contracts ($\pi_{i,j}$). All are optimally modeled in probability (i.e. they are random processes or variables, per the setting). Here, a contract, $\pi_{i,j}\in[0,1]$ (see (a) and (b)), refers to the normalized quantity of resource (information, assets, stock, etc.) transported from agent $x_{i}\in\mathbb{\Omega}_{X}$ to agent $y_{j}\in\mathbb{\Omega}_{Y}$, in delivering the (global) transport plan, $\pi$.

3 Hierarchical Fully Probabilistic Design for (Optimal) Transport: HFPD-OT

The classical OT setting contemplates the transport plan as a purely deterministic object and frames the OT problem solely from an optimization perspective (Figure 1(a)). More precisely, FPD-OT [Quinn et al., 2025], which is a generalization of the classical EOT problem [Cuturi, 2013], is built upon the following optimization problem:

$$\pi^{o}_{\mathsf{OT},\epsilon,\phi}(x,y|K)=\operatorname*{argmin}_{\pi\in\mathbb{\Pi}(\mu_{0},\nu_{0})}\mathsf{D}_{\mathsf{KL}}\bigl(\pi(x,y)\,||\,\pi_{\mathsf{I}}(x,y|K)\bigr), \tag{3}$$

where the base-level ideal design, $\pi_{\mathsf{I}}$, with support in $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, is defined as the following extended Gibbs kernel:

$$\pi_{\mathsf{I}}(x,y|K)\propto\exp\Bigl(\frac{-\mathsf{C}(x,y)}{\epsilon}\Bigr)\phi(x,y). \tag{4}$$

$\mathsf{C}:\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}\rightarrow\mathbb{R}^{+}$ denotes a continuous cost function, $\epsilon>0$ is a smoothness (i.e. regularizing) parameter, and $\phi$ is a fixed distribution, which may be used to encode additional structural preferences in the design of the OT plan. $K$ (Section 2.2) denotes the deterministic, domain-specific knowledge constraints, consisting of external or side-information gathered from the environment, and any other prior knowledge related to the problem being modeled. In the conventional base-level (i.e. deterministic) EOT setting, we impose these knowledge constraints in the form of deterministic marginal constraints, $\mathbb{\Pi}(\mu_{0},\nu_{0})$ (2.2). Importantly, when $\phi$ is instantiated as the uniform distribution, $\mathcal{U}$, with support in $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, the resulting EOT solution converges in the $\Gamma$-sense to the Monge-Kantorovitch solution [Carlier et al., 2017]:

$$\pi^{o}_{\mathsf{OT},\epsilon,\mathcal{U}}(x,y|K)\xrightarrow[]{\epsilon\rightarrow 0}\pi^{o}_{\mathsf{OT}}(x,y|K)\equiv\operatorname*{argmin}_{\pi\in\mathbb{\Pi}(\mu_{0},\nu_{0})}\int_{\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}}\mathsf{C}(x,y)\,\pi(x,y)\,d\lambda(x,y). \tag{5}$$

In the sequel, we will denote the base-level OT solution simply by $\pi^{o}(x,y)$, and will not distinguish between EOT and OT solutions, unless required by the context.
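In the finite dimensional case, the KL projection in (3) of the Gibbs kernel (4) onto the marginal constraint set can be computed by iterative proportional fitting (Sinkhorn scaling). The following is a minimal sketch under stated assumptions: the function names `gibbs_kernel` and `kl_projection`, the squared-distance cost, and the toy marginals are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def gibbs_kernel(C, eps, phi=None):
    """Discrete analogue of the ideal design (4): exp(-C/eps) * phi, normalized."""
    K = np.exp(-C / eps)
    if phi is not None:
        K = K * phi
    return K / K.sum()

def kl_projection(pi_I, mu0, nu0, n_iter=2000, tol=1e-12):
    """Sinkhorn scaling: minimize KL(pi || pi_I) over plans with marginals (mu0, nu0)."""
    u = np.ones_like(mu0)
    v = np.ones_like(nu0)
    for _ in range(n_iter):
        u = mu0 / (pi_I @ v)        # rescale rows to match mu0
        v = nu0 / (pi_I.T @ u)      # rescale columns to match nu0
        pi = u[:, None] * pi_I * v[None, :]
        if np.abs(pi.sum(axis=1) - mu0).max() < tol:
            break
    return pi

# Hypothetical toy problem: 3 sources, 2 targets, squared-distance cost.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5])
C = (x[:, None] - y[None, :]) ** 2
mu0 = np.array([0.2, 0.5, 0.3])
nu0 = np.array([0.6, 0.4])

pi_I = gibbs_kernel(C, eps=0.5)          # phi uniform, so it drops out
pi_o = kl_projection(pi_I, mu0, nu0)     # deterministic base-level plan
print(pi_o)
```

Decreasing `eps` in this sketch concentrates the plan on low-cost pairs, in line with the limit described in (5), although very small values would require a log-domain implementation for numerical stability.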

In contrast to conventional, base-level OT, in which the transport plan, $\pi(x,y)$, is a deterministic object, HFPD-OT acknowledges that $\pi(x,y)$ is uncertain (i.e. a random process), and needs to be equipped with an appropriate hierarchical probability model (i.e. triple) (Figure 1(b)). Next, we deduce this optimal model, $\pi\sim\mathsf{S}^{o}$ (1), using the axiomatic Bayesian decision-making framework of hierarchical fully probabilistic design (HFPD).

3.1 The HFPD formulation of optimal transport

Consider a probability model in the hierarchical measurable space, $(\mathbb{\Omega}_{\mathsf{H}},\mathcal{F}_{\mathbb{\Omega}_{\mathsf{H}}})$, where $\mathbb{\Omega}_{\mathsf{H}}\equiv\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}\times\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ and $\mathcal{F}_{\mathbb{\Omega}_{\mathsf{H}}}$ is the $\sigma$-algebra of measurable sets in $\mathbb{\Omega}_{\mathsf{H}}$. Then, $\pi(x,y)\in\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ is a random process endowed with its own distribution, called the hyperprior, and denoted by $\mathsf{S}(\pi|K)$.
The notation $\pi\sim\mathsf{S}(\pi|K)$ means that $\pi$ is distributed according to a hyperprior, $\mathsf{S}(\pi|K)$, which is shaped by the knowledge constraints, $K$ (specified below). Moreover, let $\mathcal{L}(\pi)$ denote the reference measure at the hierarchical level of the probability space. In the discrete case, when $\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ specializes to the probability simplex, $\Delta$, $\mathcal{L}(\pi)$ is instantiated as the Lebesgue measure, $\lambda(\pi)$. As in the conventional base-level OT setting, we assume that $\mathsf{S}(\pi|K)$ is absolutely continuous with respect to $\mathcal{L}(\pi)$, and we overload $\mathsf{S}(\pi|K)$ to denote its Radon-Nikodym derivative with respect to $\mathcal{L}(\pi)$.

Let $\mathbb{M}_{\mathsf{H}}$ be the set of joint hierarchical Bayesian models with support in $\mathbb{\Omega}_{\mathsf{H}}$. The joint hierarchical Bayesian model, $\mathsf{M}(x,y,\pi|\mathsf{S},K)\in\mathbb{M}_{\mathsf{H}}$, which is our new variational object, reads as follows:

$$\begin{split}\mathsf{M}(x,y,\pi|\mathsf{S},K)&=\mathsf{M}(x,y|\pi,\mathsf{S},K)\,\mathsf{M}(\pi|\mathsf{S},K)\\ &=\pi(x,y|K)\,\mathsf{S}(\pi|K)\end{split} \tag{6}$$

Equation (6) is a direct consequence of the conditional independence structure intrinsic to hierarchical modelling (Figure 2), and the fundamental definitions of $\pi$ and $\mathsf{S}$.

Definition (Expected transport plan). The random transport plan, $\pi\sim\mathsf{S}(\pi|K)$ (6), has the expected value,

$$\hat{\pi}_{\mathsf{S}}(x,y|K)\equiv\mathsf{E}_{\mathsf{S}}[\pi]\equiv\int_{\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\pi(x,y|K)\,\mathsf{S}(\pi|K)\,d\mathcal{L}(\pi). \tag{7}$$

Hence, the marginal model of $(x,y)$, and therefore the base-level transport plan induced by $\mathsf{S}$, is $\hat{\pi}_{\mathsf{S}}$ (7), as may be seen by integrating both sides of (6) over $\pi\in\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$:

$$\mathsf{M}(x,y|K)=\hat{\pi}_{\mathsf{S}}(x,y|K). \tag{8}$$

This is a necessary condition for consistent hierarchical Bayesian modelling, and arises because of the deterministic mapping, $(x,y)\rightarrow\pi(x,y)$, imposed by any realization of $\pi\sim\mathsf{S}(\pi|K)$.
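In the discrete setting, the expected transport plan (7) can be approximated by averaging sampled plans. The sketch below assumes, purely for illustration, that the hyperprior is a Dirichlet distribution on the simplex of 3-by-2 plans; the concentration matrix `alpha` is hypothetical, and the Dirichlet is a stand-in rather than the optimal hyperprior derived in the paper. It is used here only because its mean is known in closed form, which gives a check on the Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 2

# Hypothetical Dirichlet concentration parameters, one per contract pi_{i,j}.
alpha = np.array([[4.0, 1.0],
                  [2.0, 3.0],
                  [1.0, 5.0]])

# Each sample is one random transport plan: non-negative entries summing to 1.
samples = rng.dirichlet(alpha.ravel(), size=20000).reshape(-1, m, n)

# Monte Carlo estimate of the expected plan in (7).
pi_hat = samples.mean(axis=0)

# Closed-form mean of the Dirichlet, for comparison.
exact = alpha / alpha.sum()

print(np.abs(pi_hat - exact).max())
```

Each sampled plan is itself a valid joint distribution, and so is their average, which is the sense in which (8) identifies the marginal model of $(x,y)$ with the expected plan.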

From the foregoing, it is evident that the problem of hierarchical transport model design is one of optimization of deterministic $\mathsf{S}\in\mathbb{S}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$, noting that $\mathsf{S}$ appears as a condition in (6). The challenge in designing the optimal hierarchical model over the set of transport plans in (6) is to optimally process the stochastic knowledge constraints imposed by the uncertain environment while being close to an ideal design, $\mathsf{M}_{\mathsf{I}}$, which is used by the modeler to encode additional inductive biases and preferences in the HFPD-OT problem.

Figure 2: The conditional independence graph associated with HFPD-OT. Shaded nodes are observed. The arrows indicate the causal structure, where an arrow from one variable to a second indicates that the first variable causes the second.

The generalized Bayesian inference framework considered here for the purpose of designing the optimal hierarchical model is Fully Probabilistic Design (FPD), introduced in [Kárný and Kroupa, 2012] and extended later to the hierarchical setting in [Quinn et al., 2016]. Generalized Bayesian inference (GBI) is a set of techniques that extend the classical Bayesian inference method by updating the prior belief distribution using a loss function rather than the traditional likelihood function. Under incomplete model specification, the latter may indeed not exist [Bissiri et al., 2016]. However, FPD differs from other GBI techniques in two ways. First, FPD relies on the concept of ideal design in place of a prior, and allows the designer to elicit their personal preferences in the design process through an ideal, and usually unattainable, distribution $\mathsf{M}_{\mathsf{I}}(x,y,\pi|K)\in\mathbb{M}_{\mathsf{H}}^{c}\equiv\mathbb{M}\smallsetminus\mathbb{M}_{\mathsf{H}}$ (Figure 3). More precisely, we assume that the ideal design factorizes as follows:

$$\mathsf{M}_{\mathsf{I}}(x,y,\pi|K)\equiv\pi_{\mathsf{I}}(x,y|K)\,\mathsf{S}_{\mathsf{I}}(\pi|K) \tag{9}$$

In other words, the joint ideal design, $\mathsf{M}_{\mathsf{I}}(x,y,\pi|K)$, is the base-level ideal design, $\pi_{\mathsf{I}}$, modulated by the hierarchical ideal design, $\mathsf{S}_{\mathsf{I}}$. Note that $\mathsf{M}_{\mathsf{I}}(x,y,\pi|K)$ is unattainable because $\pi_{\mathsf{I}}$ and $\mathsf{S}_{\mathsf{I}}$ are statistically independent models, and, as such, they may be conflicting in the following sense:

$$\mathsf{E}_{\mathsf{S}_{\mathsf{I}}}(\pi)\neq\pi_{\mathsf{I}}(x,y) \tag{10}$$

This is reasonable when we recall that the ideal design is an entirely subjective object used to encode the designer’s preferences (and representing their unattainable, zero-loss state of knowledge). By ranking the designer’s preferences against this ideal design, (hierarchical) FPD is consistent with Savage’s framework for Bayesian decision making [Savage, 1971]. The consistent ranking of knowledge-constrained models (6) is via the KLD referenced to $\mathsf{M}_{\mathsf{I}}$. Hence, the optimal hierarchical design, $\mathsf{M}^{o}(x,y,\pi|K)$, is formulated as follows:

$$(P):\;\;\;\;\;\mathsf{M}^{o}\in\operatorname*{argmin}_{\mathsf{M}\in\mathbb{M}_{\mathsf{H}}}\left\{\mathsf{D}_{\mathsf{KL}}\bigl(\mathsf{M}(x,y,\pi|K)\,||\,\mathsf{M}_{\mathsf{I}}(x,y,\pi)\bigr)\right\}, \tag{11}$$

subject to:

$$\left\{\begin{aligned}\mathsf{E}_{\mathsf{S}}\bigl(\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})\bigr)&\leq\eta\\ \mathsf{E}_{\mathsf{S}}\bigl(\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\bigr)&\leq\zeta\end{aligned}\right.$$

We note the following:

  1. Since $\mathsf{D}_{\mathsf{KL}}\bigl(\cdot\;||\;\mathsf{M}_{\mathsf{I}}\bigr)$ is continuous, the space of joint hierarchical Bayesian distributions, $\mathbb{M}_{\mathsf{H}}$, is compact in the weak-* topology (see Appendix 7), and the constraint set is nonempty (we can, for instance, choose $\mathsf{S}\equiv\delta_{\mu_{0}\otimes\nu_{0}}$), the minimum is attained.

  2. Moreover, the optimal joint hierarchical model, $\mathsf{M}^{o}$, is unique up to a set of measure 0.
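To see how the objective in (11) reduces to a functional of the hyperprior alone, it may help to expand the KLD using the factorizations (6) and (9). The following is a sketch only, assuming that all densities are positive where required, that the integrals are finite, and that $\tilde{\mathsf{S}}$ is the unnormalized tilted ideal defined in (16). The symbol $Z\equiv\int\tilde{\mathsf{S}}(\pi|K)\,d\mathcal{L}(\pi)$ is introduced here for illustration and does not appear in the original text.

```latex
\begin{align*}
\mathsf{D}_{\mathsf{KL}}(\mathsf{M}\,||\,\mathsf{M}_{\mathsf{I}})
&= \int \mathsf{S}(\pi|K) \int \pi(x,y)\,
   \log\frac{\pi(x,y)\,\mathsf{S}(\pi|K)}
            {\pi_{\mathsf{I}}(x,y)\,\mathsf{S}_{\mathsf{I}}(\pi)}
   \,d\lambda(x,y)\,d\mathcal{L}(\pi) \\
&= \mathsf{E}_{\mathsf{S}}\bigl[\mathsf{D}_{\mathsf{KL}}(\pi\,||\,\pi_{\mathsf{I}})\bigr]
   + \mathsf{D}_{\mathsf{KL}}(\mathsf{S}\,||\,\mathsf{S}_{\mathsf{I}})
   \qquad \text{(using } \textstyle\int \pi\,d\lambda = 1\text{)} \\
&= \int \mathsf{S}(\pi|K)\,
   \log\frac{\mathsf{S}(\pi|K)}
            {\mathsf{S}_{\mathsf{I}}(\pi)\exp\bigl(-\mathsf{D}_{\mathsf{KL}}(\pi\,||\,\pi_{\mathsf{I}})\bigr)}
   \,d\mathcal{L}(\pi) \\
&= \mathsf{D}_{\mathsf{KL}}\bigl(\mathsf{S}\,||\,\tilde{\mathsf{S}}/Z\bigr) - \log Z .
\end{align*}
```

Since $\log Z$ does not depend on $\mathsf{S}$, minimizing the joint KLD over $\mathbb{M}_{\mathsf{H}}$ amounts, under these assumptions, to minimizing the divergence of $\mathsf{S}$ from the normalized tilted ideal, subject to the marginal constraints.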

The ideal design, $\mathsf{M}_{\mathsf{I}}$, enters the KL divergence as the second fixed argument against which all feasible Bayesian hierarchical models are ranked. Importantly, note that the marginals in (11) are no longer modeled as deterministic, crisp objects. This assumption is now relaxed, allowing the modeler to express their uncertainty by viewing the marginals as random realizations of some underlying stochastic process. In particular, we describe this uncertainty in the form of moment constraints: the random marginals belong to uncertainty sets in the form of Kullback-Leibler balls, centered on $\mu_{0}\in\mathbb{P}(\mathbb{\Omega}_{X})$ and $\nu_{0}\in\mathbb{P}(\mathbb{\Omega}_{Y})$. The new knowledge-constrained set of consistent hierarchical Bayesian models, denoted by $\mathbb{M}_{K}\subseteq\mathbb{M}_{\mathsf{H}}$, is augmented with the following linear moment constraints over the marginals:

$$\mathbb{M}_{K}\equiv\Bigl\{\mathsf{M}(x,y,\pi|K)\;|\;\mathsf{M}(x,y,\pi|K)\in\mathbb{M}_{\mathsf{H}}\;,\;\mu\in\bbmu\;\text{and}\;\nu\in\bbnu\Bigr\} \tag{12}$$

with the sets $\bbmu$ and $\bbnu$ defined as follows (Figure 1(b)):

$$\bbmu\equiv\{\mu\in\mathbb{P}(\mathbb{\Omega}_{X})\;|\;\mathsf{E}_{\mathsf{S}}\left[\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})\right]\leq\eta\} \tag{13}$$
$$\bbnu\equiv\{\nu\in\mathbb{P}(\mathbb{\Omega}_{Y})\;|\;\mathsf{E}_{\mathsf{S}}\left[\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\right]\leq\zeta\} \tag{14}$$

where $\eta\geq 0$ and $\zeta\geq 0$ are prior-elicited KL radii, which express the degree of uncertainty the designer places on the marginals.
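The expected-KL constraints in (13) and (14) can be checked numerically for a candidate hyperprior by sampling plans, forming their random marginals, and averaging the divergences. The sketch below is an illustration under assumptions: the candidate hyperprior is a Dirichlet distribution centered on the independent coupling of $\mu_{0}$ and $\nu_{0}$, and the concentration `kappa` and the radii `eta`, `zeta` are hypothetical values, none of which come from the paper.

```python
import numpy as np

def kl(p, q):
    """KL divergence between strictly positive probability vectors (last axis)."""
    return np.sum(p * np.log(p / q), axis=-1)

rng = np.random.default_rng(1)
mu0 = np.array([0.2, 0.5, 0.3])
nu0 = np.array([0.6, 0.4])

# Hypothetical candidate hyperprior: Dirichlet whose mean is the outer product
# of mu0 and nu0; larger kappa means less uncertainty about the plan.
kappa = 200.0
alpha = kappa * np.outer(mu0, nu0)
plans = rng.dirichlet(alpha.ravel(), size=5000).reshape(-1, 3, 2)

mu = plans.sum(axis=2)   # random source marginal of each sampled plan
nu = plans.sum(axis=1)   # random target marginal of each sampled plan

# Monte Carlo estimates of the expectations in (13) and (14).
exp_kl_mu = kl(mu, mu0).mean()
exp_kl_nu = kl(nu, nu0).mean()

eta, zeta = 0.05, 0.05   # hypothetical KL radii
feasible = (exp_kl_mu <= eta) and (exp_kl_nu <= zeta)
print(exp_kl_mu, exp_kl_nu, feasible)
```

Note that the constraints bound the divergences in expectation under the hyperprior, so individual sampled marginals may lie further from $\mu_{0}$ or $\nu_{0}$ than the radii suggest.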

As we will see in the sequel, the interaction between the base-level and hierarchical ideals, on one hand, and the knowledge constraints on the other, is what gives rise to the Gibbsian form of the hyperprior in (1).

We now state the main result of the paper.

Theorem. Let (P) be the HFPD-OT primal problem, defined in (11).

  1. (P) is equivalent to the following optimization problem over the set of hierarchical Bayesian models, $\mathbb{M}_{\mathsf{H}}$ (12):

    $$(P):\;\;\;\;\;\mathsf{M}^{o}(x,y,\pi)\in\operatorname*{argmin}_{\mathsf{M}\in\mathbb{M}_{\mathsf{H}}}\left\{\mathsf{D}_{\mathsf{KL}}\bigl(\mathsf{M}(x,y,\pi|K)\,||\,\hat{\pi}_{\tilde{\mathsf{S}}}(x,y)\,\tilde{\mathsf{S}}(\pi|K)\bigr)\right\}, \tag{15}$$

    subject to

    $$\left\{\begin{aligned}\mathsf{E}_{\mathsf{S}}\bigl(\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})\bigr)&\leq\eta\\ \mathsf{E}_{\mathsf{S}}\bigl(\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\bigr)&\leq\zeta\end{aligned}\right.$$

    where

    $$\tilde{\mathsf{S}}(\pi|K)\equiv\mathsf{S}_{\mathsf{I}}(\pi)\exp\Bigl(-\mathsf{D}_{\mathsf{KL}}\bigl(\pi(x,y)\,||\,\pi_{\mathsf{I}}(x,y)\bigr)\Bigr). \tag{16}$$
  2. The optimal hyperprior $\mathsf{S}^{o}(\pi|K)$ reads as follows:

    \[
    \mathsf{S}^{o}(\pi|K) \propto \exp\left( -\lambda_{1}^{o}\, \mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0}) \right) \tilde{\mathsf{S}}(\pi|K) \exp\left( -\lambda_{2}^{o}\, \mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0}) \right), \quad \mathcal{L}\text{-a.e.} \tag{17}
    \]
  3. The Dual program associated with the primal $(P)$ (15) reads

    \[
    (D):\quad \sup_{\boldsymbol{\lambda}\succeq 0} \left\{ \log\left( \mathsf{N}(\boldsymbol{\lambda}) \right) - \boldsymbol{\lambda}^{\intercal}\boldsymbol{\theta} \right\}, \tag{18}
    \]

    where

    \[
    \mathsf{N}(\boldsymbol{\lambda}) \equiv \left( \int_{\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})} \tilde{\mathsf{S}}(\pi|K) \exp\left( -\langle \boldsymbol{\lambda}, \mathsf{R}(\pi) \rangle - 1 \right) d\mathcal{L}(\pi) \right)^{-1}, \tag{19}
    \]

    and $\boldsymbol{\lambda} \equiv \begin{bmatrix} \lambda_{1} \\ \lambda_{2} \end{bmatrix}$, $\boldsymbol{\theta} \equiv \begin{bmatrix} \eta \\ \zeta \end{bmatrix}$, $\mathsf{R}(\pi) \equiv \begin{bmatrix} \mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0}) \\ \mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0}) \end{bmatrix}$.

    Moreover, strong duality holds, i.e. the optimal Kantorovitch potentials, $\boldsymbol{\lambda}^{o}$ in (17), are the solution of the dual problem (18),

    \[
    \boldsymbol{\lambda}^{o} \equiv \operatorname*{argmax}_{\boldsymbol{\lambda}\succeq 0} \left\{ \log\left( \mathsf{N}(\boldsymbol{\lambda}) \right) - \boldsymbol{\lambda}^{\intercal}\boldsymbol{\theta} \right\}, \tag{20}
    \]

    and the maximum of the dual problem is attained: $\min_{\mathsf{M}}(P) = \max_{\boldsymbol{\lambda}}(D)$.

Proof method.

Results (1) and (2) of the Theorem can be proved using basic algebraic manipulations. However, we opt here for a derivation based on information processing arguments, so as to gain more intuition about the design of the hyperprior in the hierarchical setting.

Given the factorized joint ideal design in (9), the optimal hyperprior $\mathsf{S}^{o}(\pi|K)$ emerges via two sequential knowledge-processing steps (Figure 3), addressed in the first two of the following items:

  1. Adapting the ideal design and processing the hyperprior without knowledge constraints $K$. The purpose of this first step is to guide the optimization problem $(P)$ in (11) from a possibly inconsistent ideal, $\mathsf{M}_{\mathsf{I}}$ (10), to a new consistent target (step 1 in Figure 3). The adapted hyperprior, $\tilde{\mathsf{S}}$ (16), expresses the best compromise between possibly conflicting ideals. It involves the Gibbs-type modulation of the hierarchical ideal design $\mathsf{S}_{\mathsf{I}}$ via a term that depends on the base-level ideal design $\pi_{\mathsf{I}}$ (Theorem 1 in [Quinn et al., 2016]). The optimal hierarchical model $\tilde{\mathsf{M}} \in \mathbb{M}_{\mathsf{H}}$ is a boundary point in the convex set $\mathbb{M}_{\mathsf{H}}$ and is inferred from (6) as follows:

    \[
    \tilde{\mathsf{M}}(x,y,\pi|K) = \hat{\pi}_{\tilde{\mathsf{S}}}(x,y|K)\, \tilde{\mathsf{S}}(\pi|K) \tag{21}
    \]

    where $\hat{\pi}_{\tilde{\mathsf{S}}}$ is the expected transport plan w.r.t. $\tilde{\mathsf{S}}$ and follows from (7).

  2. Processing the two marginal constraints specified in the knowledge set $K$. This step leads to the new optimization problem stated in (15), which results in the optimal hyperprior (17) (Theorem 3 in [Quinn et al., 2016]). Each of the marginal constraints induces a MaxEnt Gibbs term that modulates the hyperprior obtained in Step 1. The resulting optimal hierarchical model $\mathsf{M}^{o} \in \mathbb{M}_{K} \subseteq \mathbb{M}_{\mathsf{H}}$, which is also a boundary point in the convex set $\mathbb{M}_{K}$, reads as follows:

    \[
    \mathsf{M}^{o}(x,y,\pi|K) = \hat{\pi}_{\mathsf{S}^{o}}(x,y|K)\, \mathsf{S}^{o}(\pi|K) \tag{22}
    \]

    where $\hat{\pi}_{\mathsf{S}^{o}}$ follows similarly from (7).

  3. It remains to prove the strong duality result and formally characterize the Kantorovitch potentials in (18). The details of this proof are provided in Appendix 7. There, we prove strong duality in the infinite dimensional case by relying on the classical Fenchel-Rockafellar duality theorem [Rockafellar, 1967], [Villani, 2008]. More precisely, we demonstrate that the conditions required by the theorem are satisfied in the hierarchical Bayesian setting of HFPD-OT, and we derive the dual problem $(D)$.

Figure 3: A sequential information-processing view of the optimal hierarchical model, $\mathsf{M}^{o} \equiv \hat{\pi}_{\mathsf{S}^{o}} \mathsf{S}^{o}$, used in the proof method (22). First, the inductive biases expressed via the hierarchical ideal model, $\mathsf{M}_{\mathsf{I}} \in \mathbb{M}_{\mathsf{H}}^{c}$ (9), are processed to yield a new optimization problem over a constrained set $\mathbb{M}_{\mathsf{H}}$, whose solution, $\tilde{\mathsf{M}}$ (21), is on the boundary of $\mathbb{M}_{\mathsf{H}}$. Second, the knowledge constraints, $K$, are processed, further reducing the feasible set to the subset, $\mathbb{M}_{K}$ (12). The optimal hierarchical model is $\mathsf{M}^{o}$ (on the boundary of $\mathbb{M}_{K}$), s.t. $\pi \sim \mathsf{S}^{o}(\pi|K)$ (17).

By sampling random realizations from our optimal hyperprior, we can design randomized and diverse transport policies in lieu of an immutable and fixed OT plan. This randomization principle is depicted in Figure 4. More precisely, the design of the optimal hyperprior over the space of transport plans is a twofold process:

  1. The knowledge constraints $K$ are processed to yield the optimal hyperprior (17). This mainly requires conditioning the Kantorovitch potentials on the uncertainty radii, $(\eta,\zeta)$ (Figure 4(a)).

  2. Once the optimal hyperprior is available, random transport strategies are sampled and used in subsequent transport problems, in lieu of a crisp OT plan. Importantly, having access to a generative model over the space of transport plans provides us with the statistical devices to assess and reason about the intrinsic uncertainty in the transport problem (Figure 4(b)). The expected transport plan, $\hat{\pi}_{\mathsf{S}^{o}}$, is obtained from (7).
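The two steps above can be illustrated in the finite case with a minimal sketch. It is not the sampler developed in this paper: it assumes a flat ideal hyperprior $\mathsf{S}_{\mathsf{I}}$, takes the potentials `lam1` and `lam2` as given rather than solving the dual (20), and uses self-normalized importance weighting with a uniform proposal on the simplex. All function names and numerical values are hypothetical.

```python
import math
import random

def kl(p, q):
    """Discrete KL divergence D_KL(p || q), with the convention 0 * log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def log_hyperprior(plan, mu0, nu0, pi_ideal, lam1, lam2):
    """Unnormalized log-density of (17), ASSUMING a flat ideal hyperprior S_I."""
    mu = [sum(row) for row in plan]          # source marginal (row sums)
    nu = [sum(col) for col in zip(*plan)]    # target marginal (column sums)
    flat = [v for row in plan for v in row]
    flat_ideal = [v for row in pi_ideal for v in row]
    return -lam1 * kl(mu, mu0) - lam2 * kl(nu, nu0) - kl(flat, flat_ideal)

def uniform_plan(m, n, rng):
    """One m x n plan drawn uniformly from the simplex (Dirichlet, all parameters 1)."""
    g = [rng.gammavariate(1.0, 1.0) for _ in range(m * n)]
    total = sum(g)
    return [[g[i * n + j] / total for j in range(n)] for i in range(m)]

def weighted_plans(num, mu0, nu0, pi_ideal, lam1, lam2, seed=0):
    """Uniform proposals with normalized importance weights targeting (17)."""
    rng = random.Random(seed)
    plans = [uniform_plan(len(mu0), len(nu0), rng) for _ in range(num)]
    logw = [log_hyperprior(p, mu0, nu0, pi_ideal, lam1, lam2) for p in plans]
    top = max(logw)                          # subtract the max for numerical stability
    w = [math.exp(v - top) for v in logw]
    total = sum(w)
    return plans, [v / total for v in w]

def mean_and_variance(plans, weights):
    """Weighted estimates of the expected plan and of the entry-wise (contract) variances."""
    m, n = len(plans[0]), len(plans[0][0])
    mean = [[sum(w * p[i][j] for p, w in zip(plans, weights)) for j in range(n)]
            for i in range(m)]
    var = [[sum(w * (p[i][j] - mean[i][j]) ** 2 for p, w in zip(plans, weights))
            for j in range(n)] for i in range(m)]
    return mean, var

# Hypothetical 2 x 2 problem; the potentials are illustrative, not dual-optimal.
mu0, nu0 = [0.5, 0.5], [0.4, 0.6]
pi_ideal = [[0.3, 0.2], [0.1, 0.4]]
plans, weights = weighted_plans(2000, mu0, nu0, pi_ideal, lam1=20.0, lam2=20.0)
mean_plan, var_plan = mean_and_variance(plans, weights)
# Approximate draws from the hyperprior by resampling according to the weights.
draws = random.Random(1).choices(plans, weights=weights, k=5)
```

Importance weighting is chosen here only because it is short and easy to check. For larger $m \times n$ or large potentials the weights degenerate, and a Markov chain sampler would likely be needed.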

Remark 1.

The Kantorovitch potentials $\lambda_{1} = \lambda_{1}(\eta,\zeta)$ and $\lambda_{2} = \lambda_{2}(\eta,\zeta)$ express the degree of uncertainty in the input data, i.e. the marginals. Depending on their values, they give rise to two interesting extremal modalities, which vary from high uncertainty to perfect characterization of the marginals:

  • If $\eta \rightarrow \infty$ and $\zeta \rightarrow \infty$, it is straightforward from (18) that the solution of the dual is achieved when $\boldsymbol{\lambda}^{o} = \boldsymbol{0}$. This is also a direct consequence of complementary slackness. It follows that

    \[
    \mathsf{S}^{o}(\pi|K) \xrightarrow[]{\eta\to\infty,\;\zeta\to\infty} \tilde{\mathsf{S}}(\pi|K). \tag{23}
    \]

    In other words, when the uncertainty around the marginals is unbounded, the optimal hyperprior is mainly characterized (see (16)) by the hierarchical ideal design modulated by a Gibbsian term that depends on $\pi_{\mathsf{I}}$.

  • If $\eta \rightarrow 0$ and $\zeta \rightarrow 0$, the uncertainty in the marginals vanishes and learning\footnote{In the context of (H)FPD, learning (i.e. inductive inference) refers to the optimal processing of knowledge constraints into the hyperprior: $K \rightarrow \mathsf{S}^{o}(\pi|K)$. For more discussion on the role of FPD in furnishing generalized settings of Bayes' rule, see [Kracík and Kárný, 2005].} is maximal, leading to $\mu \rightarrow \mu_{0}$ and $\nu \rightarrow \nu_{0}$, or equivalently $\pi \rightarrow \tilde{\pi} \in \mathbb{\Pi}(\mu_{0},\nu_{0})$. It follows from the dual (18) that the maximum is attained when $\boldsymbol{\lambda}^{o} \rightarrow \infty$, and we achieve the limit,

    \[
    \mathsf{S}^{o}(\pi|K) \xrightarrow[]{\eta\to 0,\;\zeta\to 0} \tilde{\mathsf{S}}(\pi|K)\, \chi_{\mathbb{\Pi}(\mu_{0},\nu_{0})}(\pi). \tag{24}
    \]

    In other words, the hyperprior concentrates on the OT manifold, $\mathbb{\Pi}(\mu_{0},\nu_{0})$ (2.2). This concentration behaviour is reminiscent of the Laplace-Bernstein-Von Mises convergence theorem [Kolmogorov and Sarmanov, 1960].
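The dependence of the potentials on the radii $(\eta,\zeta)$ can be explored numerically in the finite case. The sketch below is an illustration under stated assumptions, not the algorithm of this paper: it assumes a flat $\mathsf{S}_{\mathsf{I}}$, uniform $\mu_{0}$, $\nu_{0}$ and $\pi_{\mathsf{I}}$ on a $2 \times 2$ domain, replaces the integral in (19) by a Monte Carlo sum over uniform draws from the simplex, and runs a fixed number of projected gradient steps on the dual (18). It relies on the gradient of the dual objective being $\mathsf{E}_{\boldsymbol{\lambda}}[\mathsf{R}(\pi)] - \boldsymbol{\theta}$, which follows by differentiating $\log \mathsf{N}(\boldsymbol{\lambda})$. All names are hypothetical.

```python
import math
import random

def kl(p, q):
    """Discrete KL divergence D_KL(p || q), with the convention 0 * log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def draw_features(num, mu0, nu0, pi_ideal, seed=0):
    """Uniform plans on the simplex, each reduced to (R_1, R_2, log S~), flat S_I assumed."""
    rng = random.Random(seed)
    m, n = len(mu0), len(nu0)
    flat_ideal = [v for row in pi_ideal for v in row]
    feats = []
    for _ in range(num):
        g = [rng.gammavariate(1.0, 1.0) for _ in range(m * n)]
        total = sum(g)
        flat = [v / total for v in g]                         # row-major plan
        mu = [sum(flat[i * n:(i + 1) * n]) for i in range(m)]  # row sums
        nu = [sum(flat[j::n]) for j in range(n)]               # column sums
        feats.append((kl(mu, mu0), kl(nu, nu0), -kl(flat, flat_ideal)))
    return feats

def expected_r(lam, feats):
    """Monte Carlo estimate of E[R(pi)] under the Gibbs-tilted hyperprior with potentials lam."""
    logw = [ls - lam[0] * r1 - lam[1] * r2 for r1, r2, ls in feats]
    top = max(logw)                                           # numerical stability
    w = [math.exp(v - top) for v in logw]
    total = sum(w)
    return [sum(wi * f[k] for wi, f in zip(w, feats)) / total for k in range(2)]

def dual_ascent(theta, feats, step=4.0, n_iter=200):
    """Projected gradient ascent on the Monte Carlo dual; gradient = E_lam[R] - theta.
    The step size assumes uniform mu0, nu0 on two points, so each R_k is at most log 2."""
    lam = [0.0, 0.0]
    for _ in range(n_iter):
        e = expected_r(lam, feats)
        lam = [max(0.0, lam[k] + step * (e[k] - theta[k])) for k in range(2)]
    return lam

mu0, nu0 = [0.5, 0.5], [0.5, 0.5]
pi_ideal = [[0.25, 0.25], [0.25, 0.25]]
feats = draw_features(1000, mu0, nu0, pi_ideal)
lam_loose = dual_ascent([1e6, 1e6], feats)    # very large radii: constraints inactive
lam_tight = dual_ascent([0.02, 0.02], feats)  # small radii: constraints active
```

With very large radii the gradient is negative everywhere, so the iterates stay at $\boldsymbol{\lambda} = \boldsymbol{0}$, in line with (23). With small radii both potentials become strictly positive. The Monte Carlo approximation degrades as the radii shrink, so this sketch should not be used to probe the limit (24) itself.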

Remark 2.

Conventional base-level OT. Consider further the regime of perfect specification of the marginals, i.e. $\eta \rightarrow 0$, $\zeta \rightarrow 0$ (Remark 1). The conjugate choice of the ideal hyperprior, $\mathsf{S}_{\mathsf{I}}$, has the following Gibbs form:

\[
\mathsf{S}_{\mathsf{I}}(\pi|K) \propto \exp\bigl( -\alpha\, \mathsf{D}_{\mathsf{KL}}(\pi(x,y)\,||\,\pi_{\mathsf{I}}(x,y)) \bigr). \tag{25}
\]

Here, $\alpha > 0$ plays the role of the inverse-temperature. Substituting (25) into (24), the optimal hyperprior becomes

\[
\mathsf{S}^{o}(\pi|K) \xrightarrow[]{\eta\to 0,\;\zeta\to 0} \exp\bigl( -(\alpha+1)\, \mathsf{D}_{\mathsf{KL}}(\pi||\pi_{\mathsf{I}}) \bigr)\, \chi_{\mathbb{\Pi}(\mu_{0},\nu_{0})}(\pi). \tag{26}
\]

When $\pi_{\mathsf{I}}$ is the extended Gibbs kernel (4), where we instantiate $\phi$ as the uniform distribution with support in $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, the minimum of $\mathsf{D}_{\mathsf{KL}}(\pi||\pi_{\mathsf{I}})$ in (26) is exactly achieved at the EOT solution (5):

\[
\pi^{o}(x,y|K) = \operatorname*{argmin}_{\pi\in\mathbb{\Pi}(\mu_{0},\nu_{0})} \mathsf{D}_{\mathsf{KL}}(\pi(x,y)\,||\,\pi_{\mathsf{I}}(x,y)). \tag{27}
\]

The latter can be recovered when $\alpha \rightarrow \infty$, for example by simulated annealing [Delahaye et al., 2019]:

\[
\mathsf{S}^{o}(\pi|K) \xrightarrow[]{\eta\to 0,\;\zeta\to 0,\;\alpha\to\infty} \delta_{\pi^{o}}(\pi). \tag{28}
\]
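In the finite case, the minimizer in (27) can be computed by iterative proportional fitting (Sinkhorn scaling), which is a standard way to obtain the KL projection of a positive matrix onto the set of couplings with prescribed marginals. The sketch below assumes that the ideal $\pi_{\mathsf{I}}$ is a normalized kernel proportional to $\exp(-c_{i,j}/\epsilon)$; the cost matrix `cost`, the temperature `eps` and the function names are hypothetical, and (4) should be consulted for the exact form of the kernel.

```python
import math

def kl(p, q):
    """Discrete KL divergence D_KL(p || q), with the convention 0 * log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def flatten(matrix):
    return [v for row in matrix for v in row]

def gibbs_kernel(cost, eps):
    """Normalized kernel proportional to exp(-cost / eps); an assumed stand-in for pi_I."""
    raw = [[math.exp(-c / eps) for c in row] for row in cost]
    total = sum(flatten(raw))
    return [[v / total for v in row] for row in raw]

def kl_projection(pi_ideal, mu0, nu0, n_iter=200):
    """Alternately rescale rows and columns of pi_ideal to match mu0 and nu0."""
    plan = [row[:] for row in pi_ideal]
    m, n = len(mu0), len(nu0)
    for _ in range(n_iter):
        for i in range(m):                       # match the source marginal
            s = sum(plan[i])
            plan[i] = [v * mu0[i] / s for v in plan[i]]
        for j in range(n):                       # match the target marginal
            s = sum(plan[i][j] for i in range(m))
            for i in range(m):
                plan[i][j] *= nu0[j] / s
    return plan

# Hypothetical 2 x 3 example.
cost = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]
mu0, nu0 = [0.5, 0.5], [0.2, 0.3, 0.5]
pi_ideal = gibbs_kernel(cost, eps=0.5)
pi_eot = kl_projection(pi_ideal, mu0, nu0)
# Any other coupling, e.g. the independent one, is at least as far from pi_I in KL.
independent = [[a * b for b in nu0] for a in mu0]
```

The comparison with the independent coupling $\mu_{0} \otimes \nu_{0}$ is only a sanity check that the returned plan has a KL divergence from $\pi_{\mathsf{I}}$ no larger than that of another feasible plan.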
(a) First, the optimal hyperprior, $\mathsf{S}^{o}(\pi|K)$, is computed, by processing the marginal knowledge constraints into the optimal Kantorovitch potentials (20).
(b) Once elicited, the optimal hyperprior, $\mathsf{S}^{o}(\pi|K)$, can be used to sample random transport plans, $\pi^{(k)}$. These random samples of plans can be used in two important inference steps: 1. the expected transport plan (bottom left), $\hat{\pi}_{\mathsf{S}^{o}}(x,y|K)$ (7), can be used in downstream transport tasks, in lieu of the conventional base-level OT plan, $\pi^{o}$ (3); and 2. measures of uncertainty (bottom right) in the form of entry-wise (i.e. contract) variances, or other summary statistics (including higher-order correlation structure between contracts) can be designed to inform the decision-making process. The asterisk (*) highlights an example of a contract that experiences diversified transport policies, enabled by randomized HFPD-OT.
Figure 4: The two-step principle underlying HFPD-OT. $\mathsf{S}^{o}(\pi|K)$ is a generative model (i.e. a distribution) of random transport plans, $\pi$. Realizations, $\pi^{(k)}$, of $\pi$ can be sampled from $\mathsf{S}^{o}(\pi|K)$, and these samples can then be used to estimate an expected transport plan (7) for downstream transport problems, via ergodic averaging. In addition, HFPD-OT enables a principled analysis of the intrinsic uncertainty in the transport problem.

4 The HFPD-OT hyperprior in the parametric case

As already noted, no special assumptions have been made in respect of the hierarchical transport model (6), and so (17) is the HFPD-OT hyperprior for the nonparametric (transport) process, $\pi \in \mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$. The finite case, i.e. $\#(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}) < \infty$, induces the parametric setting of HFPD-OT, with $\pi$ defined in the usual way w.r.t. the counting measure, and $\mathsf{S}^{o}(\pi|K)$ defined on a $K$-constrained subset (3.1) of the simplex. This allows us to easily visualize key properties of $\mathsf{S}^{o}(\pi|K)$ in a low dimensional setting, and, importantly, to develop algorithms for computing random draws (Figure 4(b)), $\pi^{(k)} \sim \mathsf{S}^{o}(\pi|K)$, from the HFPD-OT parametric hyperprior (17), via approximation of the Kantorovitch potentials (20).

4.1 Descriptive analysis of the parametric HFPD-OT hyperprior, $\mathsf{S}^{o}(\pi|K)$

In the finite, parametric case, which we will pursue in the rest of this paper, $x \in \mathbb{\Omega}_{X} \equiv \{x_{1},\ldots,x_{i},\ldots,x_{m}\}$ and $y \in \mathbb{\Omega}_{Y} \equiv \{y_{1},\ldots,y_{j},\ldots,y_{n}\}$, with $2 \leq m < \infty$ and $2 \leq n < \infty$. We refer to $\mathbb{\Omega}_{X}$ and $\mathbb{\Omega}_{Y}$ as the sets of source agents and target agents, respectively.
Then, the base-level distributions are uncertain multinomials, with densities $\mu = \sum_{i=1}^{m} \mu_{i}\delta_{x_{i}}$, $\nu = \sum_{j=1}^{n} \nu_{j}\delta_{y_{j}}$ and $\pi = \sum_{j=1}^{n}\sum_{i=1}^{m} \pi_{i,j}\delta_{x_{i},y_{j}}$. The associated pmfs are structured as vector-matrix objects, and also denoted by the same symbols: $\mu \in \Delta_{m-1}$, $\nu \in \Delta_{n-1}$ and $\pi \in \Delta_{mn-1}$.
Without loss of generality, we consider the following class of conjugate\footnote{We consider a weak form of conjugacy [Quinn, 2012], where the processing of the ideal design, $\mathsf{S}_{\mathsf{I}}(\pi|K)$, via hierarchical FPD yields an optimal hyperprior, $\mathsf{S}^{o}(\pi|K)$, of the same functional form.} hierarchical ideal designs, parameterized by fixed $\boldsymbol{\lambda}_{\mathsf{I}} \succeq 0$ (we absorb the parameter conditions, here $\boldsymbol{\lambda}_{\mathsf{I}}$, $\mu_{0}$ and $\nu_{0}$, into the Jeffreys' notation, $K$):

\[
\mathsf{S}_{\mathsf{I}}(\pi|K) \propto \prod_{i=1}^{m} \Bigl( \frac{\mu_{i}}{\mu_{0,i}} \Bigr)^{-\lambda_{\mathsf{I},1}\mu_{i}} \prod_{j=1}^{n} \Bigl( \frac{\nu_{j}}{\nu_{0,j}} \Bigr)^{-\lambda_{\mathsf{I},2}\nu_{j}} \tag{29}
\]
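The product form in (29) is a Gibbs form in disguise: taking logarithms gives $\log \mathsf{S}_{\mathsf{I}}(\pi|K) = -\lambda_{\mathsf{I},1}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0}) - \lambda_{\mathsf{I},2}\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})$ up to an additive constant. The short check below evaluates both expressions on a hypothetical $2 \times 2$ plan. The function names and numerical values are illustrative assumptions.

```python
import math

def kl(p, q):
    """Discrete KL divergence D_KL(p || q), with the convention 0 * log 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def marginals(plan):
    """Row sums (source marginal mu) and column sums (target marginal nu) of a plan."""
    mu = [sum(row) for row in plan]
    nu = [sum(col) for col in zip(*plan)]
    return mu, nu

def log_ideal_product_form(plan, mu0, nu0, lam_i1, lam_i2):
    """Log of the right-hand side of (29), term by term, without the normalizing constant."""
    mu, nu = marginals(plan)
    total = 0.0
    for mu_i, mu0_i in zip(mu, mu0):
        if mu_i > 0.0:
            total += -lam_i1 * mu_i * math.log(mu_i / mu0_i)
    for nu_j, nu0_j in zip(nu, nu0):
        if nu_j > 0.0:
            total += -lam_i2 * nu_j * math.log(nu_j / nu0_j)
    return total

def log_ideal_gibbs_form(plan, mu0, nu0, lam_i1, lam_i2):
    """The same quantity, written with the two marginal KL divergences."""
    mu, nu = marginals(plan)
    return -lam_i1 * kl(mu, mu0) - lam_i2 * kl(nu, nu0)

# Hypothetical plan, nominal marginals and ideal parameters.
plan = [[0.10, 0.30], [0.25, 0.35]]
mu0, nu0 = [0.5, 0.5], [0.4, 0.6]
log_product = log_ideal_product_form(plan, mu0, nu0, 2.0, 3.0)
log_gibbs = log_ideal_gibbs_form(plan, mu0, nu0, 2.0, 3.0)
```

The two values agree up to floating-point error, and both are non-positive because the KL divergences are non-negative and $\boldsymbol{\lambda}_{\mathsf{I}} \succeq 0$.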

The base-level ideal design, $\pi_{\mathsf{I}}(x,y|K)$, has the form of the extended Gibbs kernel (4), consistent with the FPD-OT setting. We further specialize $\phi(x,y)$ to the uniform case, $\phi(\cdot) \equiv \mathcal{U}$, yielding the following form of the parametric hyperprior:

{definition}

[HFPD-OT hyperprior for the parametric transport plan] The transport hyperprior (17) in the case of a domain, $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, of finite cardinality, $m \times n$, is parametric, with parameters $(\lambda_{1}^{o},\lambda_{2}^{o},\lambda_{\mathsf{I},1},\lambda_{\mathsf{I},2},\mu_{0},\nu_{0},\pi_{\mathsf{I}})$, and support on the probability simplex $\Delta_{m\times n-1}$. It is absolutely continuous w.r.t. Lebesgue measure, $\lambda$, with density

\[
\mathsf{S}^{o}(\pi|K)\propto\prod_{i=1}^{m}\prod_{j=1}^{n}\Bigl(\frac{\mu_{i}}{\mu_{0,i}}\Bigr)^{-(\lambda_{\mathsf{I},1}+\lambda_{1}^{o})\mu_{i}}\Bigl(\frac{\nu_{j}}{\nu_{0,j}}\Bigr)^{-(\lambda_{\mathsf{I},2}+\lambda_{2}^{o})\nu_{j}}\Bigl(\frac{\pi_{i,j}}{\pi_{\mathsf{I},i,j}}\Bigr)^{-\pi_{i,j}}\;\;\lambda\text{-a.e.}, \tag{30}
\]

with the ideal design having the following Gibbs form:

\[
\pi_{\mathsf{I},i,j}\propto\exp\Bigl(-\frac{\mathsf{C}(x_{i},y_{j})}{\epsilon}\Bigr)
\]

The number of prior parameters, encoding $K$ in (30), is $(m+1)\times(n+1)$. This endows the HFPD-OT hyperprior design with far more expressivity (i.e. degrees-of-freedom (dofs)) than default distributions on the probability simplex. For instance, a Dirichlet distribution of $\pi$ in this finite setting has $m+n+1$ fewer dofs.
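To make the parametric form concrete, the following minimal Python sketch evaluates the unnormalized log-density of the hyperprior for a plan stored as nested lists. It is an illustration, not the authors' implementation. The function names (`gibbs_ideal`, `log_hyperprior`) are hypothetical, and the sketch assumes that each marginal factor enters the density once, as in the $2\times 2$ specialization (34). The default `lam_ideal=(0.0, 0.0)` corresponds to neglecting $\boldsymbol{\lambda}_{\mathsf{I}}$.

```python
import math

def gibbs_ideal(cost, eps):
    """Normalized Gibbs kernel: pi_I[i][j] proportional to exp(-C[i][j] / eps)."""
    w = [[math.exp(-c / eps) for c in row] for row in cost]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]

def xlogx_ratio(p, q):
    """p * log(p / q), with the convention 0 * log(0) = 0."""
    return 0.0 if p == 0.0 else p * math.log(p / q)

def log_hyperprior(pi, pi_ideal, mu0, nu0, lam, lam_ideal=(0.0, 0.0)):
    """Unnormalized log-density of the hyperprior at an m x n plan pi.

    Assumption (not stated in the text): each marginal factor enters once,
    as in the 2 x 2 form (34).
    """
    mu = [sum(row) for row in pi]          # row marginal of the plan
    nu = [sum(col) for col in zip(*pi)]    # column marginal of the plan
    kl_mu = sum(xlogx_ratio(a, b) for a, b in zip(mu, mu0))
    kl_nu = sum(xlogx_ratio(a, b) for a, b in zip(nu, nu0))
    kl_pi = sum(xlogx_ratio(p, q)
                for prow, qrow in zip(pi, pi_ideal)
                for p, q in zip(prow, qrow))
    return (-(lam_ideal[0] + lam[0]) * kl_mu
            - (lam_ideal[1] + lam[1]) * kl_nu
            - kl_pi)
```

Working in the log domain avoids underflow when the potentials are large, since each factor of the form $(a/b)^{-\lambda a}$ becomes the term $-\lambda\, a\log(a/b)$.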

Remark 3 (Inference with the HFPD-OT hyperprior, $\mathsf{S}^{o}(\pi|K)$).

The normalizing constant of the HFPD-OT hyperprior (30) is not available in closed form. A full study of its numerical approximation will be the subject of future work.

The marginal distribution of $\pi_{1:k,1:l}\in\Delta_{k\times l-1}$, being the sub-matrix of $\pi$ associated with the contracts, $\pi_{ij}$, $1\leq i\leq k<m$ and $1\leq j\leq l<n$, is

\[
\mathsf{S}^{o}(\pi_{1:k,1:l}|K)=\int\displaylimits_{(1-w_{kl})\Delta_{mn-kl-1}}\mathsf{S}^{o}(\pi|K)\,d\pi_{\backslash(1:k,1:l)}, \tag{31}
\]

where $w_{kl}\equiv\sum_{j=1}^{l}\sum_{i=1}^{k}\pi_{ij}$, and $\pi_{\backslash(1:k,1:l)}$ denotes the complement of $\pi_{1:k,1:l}$ in $\pi$. In particular, the marginal distribution of $\pi_{k,l}\in(0,1)$, i.e. of the $(k,l)$th random contract (the normalized mass, or probability, transported from the $k$th source node to the $l$th target node), is

\[
\mathsf{S}^{o}(\pi_{kl}|K)=\int\displaylimits_{(1-\pi_{kl})\Delta_{mn-2}}\mathsf{S}^{o}(\pi|K)\,d\pi_{\backslash(k,l)}. \tag{32}
\]


Finally, the HFPD-optimal full conditional distribution of the $(k,l)$th contract, having fixed all the others at specific probabilities, $\pi_{0}{}_{\backslash(k,l)}$, is

\[
\mathsf{S}^{o}(\pi_{k,l}|\pi_{0}{}_{\backslash(k,l)},K)\propto\mathsf{S}^{o}(\pi_{k,l},\pi_{0}{}_{\backslash(k,l)}|K)\;\chi_{(0,1-c_{kl})}(\pi_{kl}), \tag{33}
\]

where $c_{kl}\equiv\underbrace{\sum_{j=1}^{n}\sum_{i=1}^{m}}_{(i,j)\notin\{(k,l),(m,n)\}}\pi_{0}{}_{i,j}$.

4.1.1 Illustration in the $m=n=2$ case

To gain further insight into the parametric HFPD-OT hyperprior, $\mathsf{S}^{o}(\pi|K)$ (30), we explore its location and shape in the $m=n=2$ case. Then, $\mathsf{S}^{o}(\pi_{11},\pi_{12},\pi_{21}|K)$ has support in the three-dimensional simplex, i.e. $(\pi_{11},\pi_{12},\pi_{21})\in\mathbbm{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})\equiv\Delta_{3}$ (Figure 5). We assume that $\boldsymbol{\lambda}^{o}\gg\boldsymbol{\lambda}_{\mathsf{I}}$, which corresponds to the knowledge-dominated regime [Jeffreys, 1939] in which the ideal in (16) is diffuse in comparison with the $K$-dependent modulating terms in (17). In this case, (30) specializes to:

\[
\begin{split}
\mathsf{S}^{o}(\pi_{11},\pi_{12},\pi_{21}|K)\propto{}&\Bigl(\frac{\pi_{11}+\pi_{12}}{\mu_{0,1}}\Bigr)^{-\lambda_{1}^{o}(\pi_{11}+\pi_{12})}\Bigl(\frac{1-\pi_{11}-\pi_{12}}{1-\mu_{0,1}}\Bigr)^{-\lambda_{1}^{o}(1-\pi_{11}-\pi_{12})}\Bigl(\frac{\pi_{11}+\pi_{21}}{\nu_{0,1}}\Bigr)^{-\lambda_{2}^{o}(\pi_{11}+\pi_{21})}\\
&\times\Bigl(\frac{1-\pi_{11}-\pi_{21}}{1-\nu_{0,1}}\Bigr)^{-\lambda_{2}^{o}(1-\pi_{11}-\pi_{21})}\Bigl(\frac{\pi_{11}}{\pi_{\mathsf{I},11}}\Bigr)^{-\pi_{11}}\Bigl(\frac{\pi_{12}}{\pi_{\mathsf{I},12}}\Bigr)^{-\pi_{12}}\Bigl(\frac{\pi_{21}}{\pi_{\mathsf{I},21}}\Bigr)^{-\pi_{21}}\\
&\times\Bigl(\frac{1-\pi_{11}-\pi_{12}-\pi_{21}}{1-\pi_{\mathsf{I},11}-\pi_{\mathsf{I},12}-\pi_{\mathsf{I},21}}\Bigr)^{-(1-\pi_{11}-\pi_{12}-\pi_{21})}
\end{split}
\tag{34}
\]
Figure 5: Schematic of an uncertain transport plan in the $\Delta_{3}$ simplex, annotating the corresponding nominal (i.e. prior-specified) row and column marginals. The $(2,2)$ entry (i.e. contract) is necessarily $\pi_{22}=1-\pi_{11}-\pi_{12}-\pi_{21}$.

The parameters of (34) are $\mu_{0,1}\in(0,1)$, $\nu_{0,1}\in(0,1)$, $(\pi_{\mathsf{I},11},\pi_{\mathsf{I},12},\pi_{\mathsf{I},21})\in\Delta_{3}$ and the Kantorovitch potentials, $\boldsymbol{\lambda}^{o}\succeq\boldsymbol{0}$. The purpose of the following simulations is to study the influence of the Kantorovitch potentials, $\boldsymbol{\lambda}^{o}$ (20), and the nominal marginals, $\mu_{0}$ and $\nu_{0}$, on the location and shape of the hyperprior. For ease of visualization (in $\Delta_{2}$), we focus primarily on the bivariate marginal distribution (31) (see footnote 4), i.e. the hyperprior concentrated on the two contracts forming the first row of the uncertain transport plan (Figure 5):

Footnote 4: All integrals in this section are computed using Gaussian quadrature, yielding results with an average integration error of $\approx 1.46\times 10^{-8}$.

\[
\mathsf{S}^{o}(\pi_{11},\pi_{12}|K)\propto\int_{0}^{1-\pi_{11}-\pi_{12}}\mathsf{S}^{o}(\pi_{11},\pi_{12},\pi_{21}|K)\,d\pi_{21} \tag{35}
\]
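The marginalization in (35) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' code: it swaps the Gaussian quadrature used in the paper for a plain midpoint rule (which never evaluates the integrand at the boundary of the simplex), it neglects $\boldsymbol{\lambda}_{\mathsf{I}}$ as in (34), and the function names and the `n_nodes` parameter are hypothetical.

```python
import math

def plogp(p, q):
    """p * log(p / q), with 0 * log(0) taken as 0."""
    return 0.0 if p <= 0.0 else p * math.log(p / q)

def joint_unnorm(p11, p12, p21, mu01, nu01, pi_ideal, lam):
    """Unnormalized joint density (34); pi_ideal = (pI11, pI12, pI21)."""
    p22 = 1.0 - p11 - p12 - p21
    pI22 = 1.0 - sum(pi_ideal)
    mu1, nu1 = p11 + p12, p11 + p21   # first row / first column mass
    log_s = -lam[0] * (plogp(mu1, mu01) + plogp(1.0 - mu1, 1.0 - mu01))
    log_s -= lam[1] * (plogp(nu1, nu01) + plogp(1.0 - nu1, 1.0 - nu01))
    log_s -= sum(plogp(p, q)
                 for p, q in zip((p11, p12, p21, p22), (*pi_ideal, pI22)))
    return math.exp(log_s)

def marginal_unnorm(p11, p12, mu01, nu01, pi_ideal, lam, n_nodes=400):
    """Midpoint-rule approximation of (35): integrate pi_21 over (0, 1 - p11 - p12)."""
    upper = 1.0 - p11 - p12
    if upper <= 0.0:
        return 0.0
    h = upper / n_nodes
    return h * sum(
        joint_unnorm(p11, p12, (k + 0.5) * h, mu01, nu01, pi_ideal, lam)
        for k in range(n_nodes)
    )
```

Evaluating `marginal_unnorm` on a grid over $\Delta_{2}$ gives an unnormalized surface whose contours can be plotted directly, since the missing normalizing constant only rescales the levels.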
Shape parameters:

The cost matrix (4) and nominal marginals are respectively set to the following values:

\[
\mathsf{C}\equiv\begin{bmatrix}0&1\\ 1&0\end{bmatrix},\;\;\;(\mu_{0},\nu_{0})\equiv\left\{\begin{bmatrix}0.3\\ 0.7\end{bmatrix},\begin{bmatrix}0.8\\ 0.2\end{bmatrix}\right\}.
\]

For now, we fix the smoothness parameter at $\epsilon=1$; its influence on the hyperprior is studied separately below. We examine the influence of the Kantorovitch potentials, $\boldsymbol{\lambda}^{o}$, on the shape of the hyperprior, by varying their values as follows: $\boldsymbol{\lambda}^{o}\in\{0.05,10,100\}^{2}$.
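As a worked check of the ideal design under these settings, and assuming that the Gibbs kernel is normalized over all four entries (the normalization is an assumption here, since only proportionality is stated), the cost matrix above with $\epsilon=1$ gives:

\[
\pi_{\mathsf{I}}=\frac{1}{2+2e^{-1}}\begin{bmatrix}1&e^{-1}\\ e^{-1}&1\end{bmatrix}\approx\begin{bmatrix}0.3655&0.1345\\ 0.1345&0.3655\end{bmatrix}.
\]

Under this assumption the ideal places most of its mass on the zero-cost diagonal contracts, and its own marginals are uniform, $(0.5,0.5)$, which differ from the nominal marginals chosen above.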

As discussed earlier, these potentials, through their connection to the KLD radii, $(\eta,\zeta)$, quantify the uncertainty in the marginals and induce two asymptotic learning modes. The first is attained when $\boldsymbol{\lambda}^{o}\rightarrow\boldsymbol{0}$, and coincides with the non-specification of the marginals, and the absence of effective learning. The second is attained when $\boldsymbol{\lambda}^{o}\rightarrow\infty$, i.e. when there is perfect specification of the marginals. The visualizations in Figure 6, which shows the contour plots of the marginal hyperprior, $\mathsf{S}^{o}(\pi_{11},\pi_{12}|K)$, for the chosen values of $\boldsymbol{\lambda}^{o}$, illustrate this concentration behaviour as we progress from the first to the second mode. As the potentials increase, the contours gradually concentrate on a thin statistical manifold, namely $\mathbb{\Pi}(\mu_{0},\nu_{0})$. In addition to the marginal hyperprior, we show the first row, $(\pi_{11},\pi_{12})$, of the expected transport plan, $\hat{\pi}_{\mathsf{S}}$ (7) (red dot). The latter is obtained by averaging samples drawn from the joint hyperprior: $\pi^{(k)}\sim\mathsf{S}^{o}(\pi_{11},\pi_{12},\pi_{21}|K)$. The blue dot, on the other hand, corresponds to the first row of the EOT plan, $\pi^{o}(x_{i},y_{j}|K)$ (5), computed for the nominal marginals, $(\mu_{0},\nu_{0})$, using the Sinkhorn-Knopp algorithm [Cuturi, 2013] (and is, of course, invariant to $\boldsymbol{\lambda}^{o}$). The expected transport plan gradually converges towards the OT plan, as the support of the marginal hyperprior contracts towards $\mathbb{\Pi}(\mu_{0},\nu_{0})$ when $\boldsymbol{\lambda}^{o}\rightarrow\infty$, which is consistent with the Laplace-Bernstein concentration theorem.
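For reference, the EOT plan that defines the blue dot can be reproduced with a few lines of Sinkhorn-Knopp scaling. The sketch below is a textbook version of the iteration in pure Python, not the authors' implementation; the function name `sinkhorn` and the fixed iteration count `n_iter` are illustrative choices.

```python
import math

def sinkhorn(cost, mu0, nu0, eps=1.0, n_iter=500):
    """Entropic OT plan via Sinkhorn-Knopp scaling of the Gibbs kernel."""
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    m, n = len(mu0), len(nu0)
    u, v = [1.0] * m, [1.0] * n
    for _ in range(n_iter):
        # Rescale rows to match mu0, then columns to match nu0.
        u = [mu0[i] / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [nu0[j] / sum(kernel[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n)] for i in range(m)]

# Settings of this illustration: C = [[0, 1], [1, 0]], eps = 1.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.3, 0.7], [0.8, 0.2])
first_row = plan[0]  # (pi_11, pi_12), i.e. the coordinates of the blue dot
```

Because the iteration depends only on the cost, $\epsilon$ and the nominal marginals, the resulting point is the same in every panel of Figure 6, whatever the value of $\boldsymbol{\lambda}^{o}$.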

Figure 6: Contour plots of the bivariate marginal hyperprior, $\mathsf{S}^{o}(\pi_{11},\pi_{12}|K)$, defined over the 2D simplex, $\Delta_{2}$, for various values of the Kantorovitch potentials, $\boldsymbol{\lambda}^{o}$, and for fixed nominal marginals, $(\mu_{0},\nu_{0})$, and base-level ideal design. The red dots correspond to the expected value of the first row, $(\pi_{11},\pi_{12})$, of the uncertain transport plan. We also show, via the blue dots, the first row of the conventional EOT plan, for $(\mu_{0},\nu_{0})$.
Location parameters:

The nominal marginals, $(\mu_{0},\nu_{0})$, play the role of location parameters for the hyperprior. To illustrate this, we fix the Kantorovitch potentials and the smoothness parameter, respectively, to default values, $\boldsymbol{\lambda}^{o}=(1,1)$ and $\epsilon=1$, and vary the nominal marginals as follows:

\[
(\mu_{0},\nu_{0})\in\left\{\begin{bmatrix}(0.9,0.1)\\ (0.1,0.9)\end{bmatrix}^{\intercal},\begin{bmatrix}(0.5,0.5)\\ (0.5,0.5)\end{bmatrix}^{\intercal},\begin{bmatrix}(0.1,0.9)\\ (0.9,0.1)\end{bmatrix}^{\intercal}\right\}. \tag{36}
\]


For each pair of the nominal marginals in (36), we show in Figure 7 the contour plot of the marginal hyperprior, $\mathsf{S}^{o}(\pi_{11},\pi_{12}|K)$. Moreover, we plot the first row of the expected transport plan, $\hat{\pi}_{\mathsf{S}}$, in red and the EOT plan, $\pi^{o}$, in blue. The location of the mode is clearly influenced by the nominal marginals, $(\mu_{0},\nu_{0})$, and more precisely, by their symmetry and skewness. The expected plan, $\hat{\pi}_{\mathsf{S}}$ (7), is attracted by the mode of the marginal hyperprior; the optimal plan, on the other hand, initially has a low probability under the marginal hyperprior but contracts gradually towards the mode.

Figure 7: Marginal hyperprior, $\mathsf{S}^{o}(\pi_{11},\pi_{12}|K)$, for fixed Kantorovitch potentials, $\boldsymbol{\lambda}^{o}$, and various values of the nominal marginals, $(\mu_{0},\nu_{0})$. The red and blue dots correspond to the first row of the expected transport plan, and of the EOT plan, respectively.
Influence of the ideal hyperprior:

Finally, we explore the influence of the ideal design (9), and, more precisely, its smoothness parameter, $\epsilon$, which enters at the base level of the ideal specification (4). We hold the nominal marginals, $(\mu_{0},\nu_{0})$, constant, as indicated. Varying $\epsilon\in\{0.1,0.5,10\}$ shows (Figure 8) that this parameter affects the location of the hyperprior, $\mathsf{S}^{o}(\pi_{11},\pi_{12}|K)$.

Figure 8: Contour plots of the marginal hyperprior, $\mathsf{S}^{o}(\pi_{11},\pi_{12}|K)$, for various values of the smoothness parameter, $\epsilon$. The red and blue dots correspond to the first row of the expected transport plan, and of the EOT plan, respectively.

4.2 Stochastic approximation of the optimal Kantorovitch potentials

We now focus on the derivation of the optimal Kantorovitch potentials, $\boldsymbol{\lambda}^{o}$. This requires processing the knowledge constraints, $(\eta,\zeta)$, in the hyperprior, by solving the dual program (18). To this end, we leverage a combination of second-order optimization and MCMC techniques.

Computing $\boldsymbol{\lambda}^{o}$ by means of the dual program in (18) is a critical step in the design of the optimal hyperprior, $\mathsf{S}^{o}(\pi|K)$ (30). However, deriving the exact values of the potentials in high-dimensional settings is not trivial, as it requires manipulating the intractable normalizing constant (19). The methodology proposed herein approximates these potentials using a combination of quasi-Newton optimization [Nocedal and Wright, 2006], [Nesterov, 2018] and Hamiltonian Monte Carlo (HMC) [Betancourt, 2017], thus circumventing the need to evaluate $\mathsf{N}(\boldsymbol{\lambda})$. In particular, HMC provides a rigorous and efficient framework for sampling in high-dimensional settings: compared to other MCMC techniques, the number of gradient evaluations in HMC is less sensitive to the dimension of the problem [Mangoubi and Smith, 2019], making it a convenient choice when generating random transport plans, $\pi^{(k)}\sim\mathsf{S}^{o}$.

As proved in (18), the optimal Kantorovitch potentials read as follows:

$$\boldsymbol{\lambda}^{o}=\operatorname*{argmin}_{\boldsymbol{\lambda}\succeq 0}\left\{\boldsymbol{\lambda}^{\intercal}\boldsymbol{\theta}-\log\left(\mathsf{N}(\boldsymbol{\lambda})\right)\right\}. \tag{37}$$

Let $\varrho(\boldsymbol{\lambda})\equiv\boldsymbol{\lambda}^{\intercal}\boldsymbol{\theta}-\log\left(\mathsf{N}(\boldsymbol{\lambda})\right)$ denote the optimization objective in (37). Its gradient vector can be written conveniently using the following expectation:

$$\nabla_{\boldsymbol{\lambda}}\varrho(\boldsymbol{\lambda})=\boldsymbol{\theta}-\mathsf{E}_{\mathsf{S}}\bigl[\mathsf{R}(\pi)\bigr] \tag{38}$$

We define $s_{t}$ and $n_{t}$, the differentials of the Kantorovitch potentials and of their gradients, respectively, as follows:

$$s_{t}\equiv\boldsymbol{\lambda}_{t+1}-\boldsymbol{\lambda}_{t}$$

and

$$n_{t}\equiv\nabla\varrho(\boldsymbol{\lambda}_{t+1})-\nabla\varrho(\boldsymbol{\lambda}_{t})$$

where $t>0$ is the iteration index of the quasi-Newton scheme. The recursive approximation of the inverse Hessian can be written as follows [Nocedal and Wright, 2006]:

$$\mathsf{H}_{t+1}=(\boldsymbol{\mathsf{I}}-\varsigma_{t}s_{t}n_{t}^{\intercal})\,\mathsf{H}_{t}\,(\boldsymbol{\mathsf{I}}-\varsigma_{t}n_{t}s_{t}^{\intercal})+\varsigma_{t}s_{t}s_{t}^{\intercal},\qquad\varsigma_{t}\equiv\frac{1}{n_{t}^{\intercal}s_{t}} \tag{39}$$

where $\boldsymbol{\mathsf{I}}$ denotes the identity matrix. We note that the inverse Hessian approximation $\mathsf{H}_{t}$ depends only on the stochastic gradients $\nabla\varrho(\boldsymbol{\lambda}_{t})$ (38). Thus, we avoid stability issues when dealing with ill-conditioned stochastic inverse Hessian approximations, as is the case with high-variance MC samplers.
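As an illustration only, the update (39) can be sketched in a few lines of NumPy. The function name `bfgs_inverse_update` and the curvature safeguard (skipping the update when $n_{t}^{\intercal}s_{t}$ is not positive) are assumptions introduced here and are not part of the original method. The toy check uses an exact quadratic, for which $n_{t}=\mathsf{A}s_{t}$, and verifies the secant condition $\mathsf{H}_{t+1}n_{t}=s_{t}$.

```python
import numpy as np

def bfgs_inverse_update(H, s, n, min_curvature=1e-12):
    """Return H_{t+1} from H_t, s_t and n_t, following (39)."""
    curvature = float(n @ s)
    if curvature <= min_curvature:
        # Assumed safeguard (not in the source): keep H_t when n^T s is not positive.
        return H
    sigma = 1.0 / curvature
    identity = np.eye(s.size)
    left = identity - sigma * np.outer(s, n)
    right = identity - sigma * np.outer(n, s)
    return left @ H @ right + sigma * np.outer(s, s)

# Toy check on a quadratic objective with Hessian A, so that n = A s exactly.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
s = np.array([0.5, -0.2])
n = A @ s
H_next = bfgs_inverse_update(np.eye(2), s, n)
```

With noisy Monte Carlo gradients, the product $n_{t}^{\intercal}s_{t}$ can become very small or negative, which is why a safeguard of this kind is commonly added in practice.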

Once computed, the gradient and the inverse Hessian are plugged into the usual BFGS iterative updates [Nocedal and Wright, 2006]:

$$\boldsymbol{\lambda}_{t+1}\leftarrow\boldsymbol{\lambda}_{t}-\rho_{t}\mathsf{H}_{t}\nabla_{\boldsymbol{\lambda}}\varrho(\boldsymbol{\lambda}_{t}) \tag{40}$$

where $\rho_{t}>0$ is the step size at the $t^{\text{th}}$ iteration in the search direction given by:

$$\mathsf{d}(\boldsymbol{\lambda}_{t})\equiv-\mathsf{H}_{t}\nabla_{\boldsymbol{\lambda}}\varrho(\boldsymbol{\lambda}_{t}) \tag{41}$$

The step size $\rho_{t}$ should be adapted carefully to ensure convergence to the global minimum $\boldsymbol{\lambda}^{o}$. It is usually computed by solving an auxiliary line search problem, using techniques such as backtracking line search (BTLS) [Nesterov, 2018]. However, most line search techniques require the evaluation of the objective $\varrho(\boldsymbol{\lambda})$ at each step. To avoid explicit function evaluations, we propose a simple local approximation that estimates the position of the minimum along the search line (41), based solely on two gradient evaluations [Snyman, 2005].

More precisely, the optimal step size that yields sufficient decrease in the search direction (41) can be found by solving the following problem:

$$\rho_{t}^{*}=\operatorname*{argmin}_{\rho\in[0,1]}\varrho(\boldsymbol{\lambda}_{t}+\rho\;\mathsf{d}(\boldsymbol{\lambda}_{t})) \tag{42}$$

Assuming that $\varrho$ is locally quadratic at $\boldsymbol{\lambda}_{t}$, it follows that solving (42) reduces to finding $\rho$ that satisfies:

$$\varrho(\boldsymbol{\lambda}_{t}+\rho\;\mathsf{d}(\boldsymbol{\lambda}_{t}))=\varrho(\boldsymbol{\lambda}_{t}) \tag{43}$$

which yields the following optimal step size:

$$\rho_{t}^{*}=\frac{-\mathsf{d}(\boldsymbol{\lambda}_{t})^{\intercal}\nabla\varrho(\boldsymbol{\lambda}_{t})}{\mathsf{d}(\boldsymbol{\lambda}_{t})^{\intercal}\nabla^{2}\varrho(\boldsymbol{\lambda}_{t})\,\mathsf{d}(\boldsymbol{\lambda}_{t})} \tag{44}$$
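As a brief sketch of where (44) comes from, one can expand the objective along the search line under the local quadratic assumption and set its derivative with respect to $\rho$ to zero. The shorthand $\phi(\rho)$ and the abbreviation $\mathsf{d}\equiv\mathsf{d}(\boldsymbol{\lambda}_{t})$ are introduced here for illustration only.

```latex
\begin{aligned}
\phi(\rho) &\equiv \varrho(\boldsymbol{\lambda}_{t}+\rho\,\mathsf{d})
  \approx \varrho(\boldsymbol{\lambda}_{t})
  + \rho\,\mathsf{d}^{\intercal}\nabla\varrho(\boldsymbol{\lambda}_{t})
  + \tfrac{1}{2}\rho^{2}\,\mathsf{d}^{\intercal}\nabla^{2}\varrho(\boldsymbol{\lambda}_{t})\,\mathsf{d}, \\
\phi'(\rho) &= \mathsf{d}^{\intercal}\nabla\varrho(\boldsymbol{\lambda}_{t})
  + \rho\,\mathsf{d}^{\intercal}\nabla^{2}\varrho(\boldsymbol{\lambda}_{t})\,\mathsf{d} = 0
  \;\Longrightarrow\;
  \rho = \frac{-\mathsf{d}^{\intercal}\nabla\varrho(\boldsymbol{\lambda}_{t})}
              {\mathsf{d}^{\intercal}\nabla^{2}\varrho(\boldsymbol{\lambda}_{t})\,\mathsf{d}}.
\end{aligned}
```

If $\mathsf{H}_{t}$ is positive definite, the numerator equals $\nabla\varrho(\boldsymbol{\lambda}_{t})^{\intercal}\mathsf{H}_{t}\nabla\varrho(\boldsymbol{\lambda}_{t})>0$, so the resulting step is positive whenever the curvature along $\mathsf{d}$ is positive.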

Finally, by a second-order Taylor expansion at $\boldsymbol{\lambda}_{t}$ and $\boldsymbol{\lambda}_{t}+\mathsf{d}(\boldsymbol{\lambda}_{t})$, the denominator in (44) can be computed using two gradient estimates, as follows:

$$\mathsf{d}(\boldsymbol{\lambda}_{t})^{\intercal}\nabla^{2}\varrho(\boldsymbol{\lambda}_{t})\,\mathsf{d}(\boldsymbol{\lambda}_{t})\approx\mathsf{d}(\boldsymbol{\lambda}_{t})^{\intercal}\left[\nabla\varrho(\boldsymbol{\lambda}_{t}+\mathsf{d}(\boldsymbol{\lambda}_{t}))-\nabla\varrho(\boldsymbol{\lambda}_{t})\right] \tag{45}$$
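A minimal sketch of the two-gradient step size, combining (44) and (45), is given below. The names `two_gradient_step` and `grad_fn` are hypothetical, the fallback to a unit step when the estimated curvature is not positive is an assumption of this sketch, and the clipping to $[0,1]$ mirrors the constraint in (42). On an exactly quadratic objective the estimate in (45) is exact, which gives a simple sanity check.

```python
import numpy as np

def two_gradient_step(grad_fn, lam, d):
    """Step size along d from (44), with the curvature term estimated as in (45)."""
    g0 = grad_fn(lam)                     # gradient at lambda_t
    g1 = grad_fn(lam + d)                 # gradient at the trial point lambda_t + d
    curvature = float(d @ (g1 - g0))      # approximates d^T Hessian d, cf. (45)
    if curvature <= 0.0:
        return 1.0                        # assumed fallback, not taken from the source
    rho = -float(d @ g0) / curvature
    return float(np.clip(rho, 0.0, 1.0))  # keep rho in [0, 1], as in (42)

# Toy quadratic: objective 0.5 * lam^T A lam - b^T lam, with gradient A lam - b.
A = np.diag([2.0, 4.0])
b = np.array([1.0, 1.0])
grad_fn = lambda lam: A @ lam - b
lam = np.zeros(2)
d = -grad_fn(lam)                         # search direction (41) with H_t = I
rho = two_gradient_step(grad_fn, lam, d)
```

In this toy case the exact minimizer along the search line is $\rho=1/3$, which the two-gradient estimate recovers without evaluating the objective itself.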

What remains is to compute the gradient terms, which can be estimated using HMC. If $n_{\mathsf{samp}}>0$ is the number of independent realizations $\pi^{(i)}\sim\mathsf{S}(\pi|K)$, then the expectation in (38) can be approximated as follows:

$$\mathsf{E}_{\mathsf{S}}\bigl[\mathsf{R}(\pi)\bigr]\approx\frac{1}{n_{\mathsf{samp}}}\sum_{i=1}^{n_{\mathsf{samp}}}\mathsf{R}(\pi^{(i)}) \tag{46}$$
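The estimator (46), plugged into (38), can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the constraint function `marginal_klds` is a hypothetical stand-in for $\mathsf{R}(\pi)$, the value of `theta` is chosen arbitrarily, and flat Dirichlet draws replace the HMC samples purely to keep the example self-contained.

```python
import numpy as np

def kld(p, q):
    """KL divergence between two strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def marginal_klds(plan, mu0, nu0):
    """Hypothetical R(pi): KLDs of the plan's marginals from the nominal ones."""
    return np.array([kld(plan.sum(axis=1), mu0), kld(plan.sum(axis=0), nu0)])

def mc_gradient(theta, plans, constraint_fn):
    """theta minus the sample mean of R over the drawn plans, cf. (38) and (46)."""
    r_mean = np.mean([constraint_fn(p) for p in plans], axis=0)
    return np.asarray(theta, dtype=float) - r_mean

rng = np.random.default_rng(0)
m, n, n_samp = 3, 4, 500
mu0 = np.full(m, 1.0 / m)
nu0 = np.full(n, 1.0 / n)
# Stand-in for HMC draws (assumption): flat Dirichlet samples on the (m*n)-simplex.
plans = rng.dirichlet(np.ones(m * n), size=n_samp).reshape(n_samp, m, n)
theta = np.array([2.0, 2.0])              # arbitrary illustrative value
grad_hat = mc_gradient(theta, plans, lambda p: marginal_klds(p, mu0, nu0))
```

Because consecutive HMC draws are correlated in practice, the variance of this estimator depends on the effective sample size rather than on $n_{\mathsf{samp}}$ alone.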

At each iteration $t$, the error (stopping criterion) is measured by means of the Newton decrement, which corresponds to the inverse Hessian norm of the gradient. This quantity provides a good indication of the proximity to the optimal Kantorovitch potentials:

$$\mathsf{err}\equiv\nabla_{\boldsymbol{\lambda}}\varrho(\boldsymbol{\lambda}_{t})^{\intercal}\,\nabla_{\boldsymbol{\lambda}}^{2}\varrho(\boldsymbol{\lambda}_{t})^{-1}\,\nabla_{\boldsymbol{\lambda}}\varrho(\boldsymbol{\lambda}_{t}) \tag{47}$$

The optimal potentials $\boldsymbol{\lambda}^{o}$ are then plugged into (17) and the optimal hyperprior can be used to generate random transport plans, by means of another HMC sampler.

Input: nominal marginals $(\mu_{0},\nu_{0})$, KLD radii $(\eta,\zeta)$, target precision $\tau$, base-level ideal design $\pi_{\mathsf{I}}$, hierarchical ideal design $\mathsf{S}_{\mathsf{I}}$, number of samples $n_{\mathsf{samp}}$
Result: $\boldsymbol{\lambda}^{o}$
1 Initialization: $t=1$, $\boldsymbol{\lambda}_{t}\succeq 0$, $\rho_{t}=1$, $\mathsf{H}_{t}=\boldsymbol{\mathsf{I}}$, $\mathsf{err}=\infty$ ;
2 while $\tau<\mathsf{err}$ do
3       Sample $\{\pi^{(l)}_{t}\}_{l=1}^{n_{\mathsf{samp}}}\sim\mathsf{S}(\pi|K_{-\boldsymbol{\lambda}},\boldsymbol{\lambda}_{t})$ $\triangleright$ HMC sampler. $K_{-\boldsymbol{\lambda}}$ denotes all parameters in the knowledge set $K$, except $\boldsymbol{\lambda}$ ;
4       Estimate $\mathsf{E}_{\mathsf{S}(\pi|K_{-\boldsymbol{\lambda}},\boldsymbol{\lambda}_{t})}\left[\mathsf{R}(\pi)\right]$ ;
5       Estimate $\nabla_{\boldsymbol{\lambda}}\varrho(\boldsymbol{\lambda}_{t})$ ;
6       Compute $\check{\boldsymbol{\lambda}}_{t+1}\leftarrow\boldsymbol{\lambda}_{t}-\mathsf{H}_{t}\nabla\varrho(\boldsymbol{\lambda}_{t})$ ;
7       Sample $\{\check{\pi}^{(l)}_{t+1}\}_{l=1}^{n_{\mathsf{samp}}}\sim\mathsf{S}(\pi|K_{-\boldsymbol{\lambda}},\check{\boldsymbol{\lambda}}_{t+1})$ ;
8       Estimate $\nabla_{\boldsymbol{\lambda}}\varrho(\check{\boldsymbol{\lambda}}_{t+1})$ ;
9       Compute $\rho^{*}_{t}$ ;
10       Update $\boldsymbol{\lambda}_{t+1}\leftarrow\boldsymbol{\lambda}_{t}-\rho^{*}_{t}\mathsf{H}_{t}\nabla\varrho(\boldsymbol{\lambda}_{t})$ ;
11       Compute $s_{t}$, $n_{t}$ and $\varsigma_{t}$ ;
12       Update $\mathsf{H}_{t+1}\leftarrow(\boldsymbol{\mathsf{I}}-\varsigma_{t}s_{t}n_{t}^{\intercal})\mathsf{H}_{t}(\boldsymbol{\mathsf{I}}-\varsigma_{t}n_{t}s_{t}^{\intercal})+\varsigma_{t}s_{t}s_{t}^{\intercal}$ ;
13       Update $\mathsf{err}$ ;
14       Update $t\leftarrow t+1$
return $\boldsymbol{\lambda}_{t}$
Algorithm 1 Approximation of the Kantorovitch potentials
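The loop of Algorithm 1 can be sketched compactly in Python, under several simplifying assumptions that are not part of the original method: the HMC-based gradient estimates are replaced by a generic gradient oracle `grad_fn` (a hypothetical name), the error (47) is evaluated with $\mathsf{H}_{t}$ in place of the exact inverse Hessian, the constraint $\boldsymbol{\lambda}\succeq 0$ is not enforced, and updates of $\mathsf{H}_{t}$ are skipped when the curvature $n_{t}^{\intercal}s_{t}$ is not positive.

```python
import numpy as np

def quasi_newton_potentials(grad_fn, lam0, tol=1e-10, max_iter=100):
    """Quasi-Newton iteration with the two-gradient step size of (44)-(45)."""
    lam = np.asarray(lam0, dtype=float)
    identity = np.eye(lam.size)
    H = identity.copy()                         # H_1 = I
    g = grad_fn(lam)                            # gradient at lambda_t (lines 3-5)
    for _ in range(max_iter):
        err = float(g @ H @ g)                  # (47), with H_t as inverse Hessian
        if err <= tol:
            break
        d = -H @ g                              # search direction (41)
        g_trial = grad_fn(lam + d)              # gradient at the trial point (lines 6-8)
        curvature = float(d @ (g_trial - g))    # (45)
        if curvature > 0.0:
            rho = float(np.clip(-(d @ g) / curvature, 0.0, 1.0))   # (44)
        else:
            rho = 1.0                           # assumed fallback
        lam_next = lam + rho * d                # line 10
        g_next = grad_fn(lam_next)
        s, n = lam_next - lam, g_next - g       # line 11
        if float(n @ s) > 1e-12:                # assumed curvature safeguard
            sigma = 1.0 / float(n @ s)
            H = ((identity - sigma * np.outer(s, n)) @ H
                 @ (identity - sigma * np.outer(n, s))
                 + sigma * np.outer(s, s))      # line 12, i.e. (39)
        lam, g = lam_next, g_next
    return lam

# Toy quadratic with minimizer A^{-1} b = (0.2, 0.4), using exact gradients.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
lam_hat = quasi_newton_potentials(lambda lam: A @ lam - b, np.zeros(2))
```

With stochastic gradients the stopping test becomes noisy, so a cap on the number of iterations, such as `max_iter` above, is a reasonable practical addition.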
Remark 4.

Computational complexity. In Algorithm 1, we replace each approximation of the normalising constant, $\mathsf{N}(\boldsymbol{\lambda})$ (19), with two gradient approximations. Therefore, the overall computational complexity is mainly driven by the sampling operations in lines 3 and 7 of the algorithm, whose complexity is, in turn, contingent upon the number of gradient evaluations used in the leapfrog integrator of the HMC sampler [Betancourt, 2017]. Under certain regularity conditions, this number is of order $\mathcal{O}(\sqrt{mn})$ [Mangoubi and Smith, 2019]. Though these regularity conditions are not fully satisfied here (see Remark 5), this provides us with a good lower bound on the computational complexity. Using a mean-field variational Bayes method at each iteration of the quasi-Newton method, which assumes that all the parameters (i.e. contracts), $\pi_{i,j}$, are independent, would result in a time complexity that is linear in the number of parameters, that is $\mathcal{O}(mn)$.

Remark 5.

On HMC mixing properties. It is worth noting that the main convergence results of HMC, when sampling from a log-concave density, $\mathsf{e}^{-f}$, require strongly convex and Lipschitz smooth (i.e. Lipschitz $\nabla f$) potential functions, $f$ [Chen and Vempala, 2022]. However, the KLD is not Lipschitz smooth, so these theoretical convergence results are not guaranteed in our setting. This results in a longer integration time and biased estimators, especially when $(\eta,\zeta)\rightarrow(0,0)$. For the time being, we will use HMC while carefully tuning its main parameters (integrator step size, adaptation step, etc.), and will explore specialized samplers in a separate work.

5 HFPD-OT for Algorithmic fairness in market matching

The goal of algorithmic fairness is to detect and mitigate algorithmic biases induced by automated decision-making systems [Barocas et al., 2023]. This is a compelling setting for HFPD-OT, since we can benefit from randomized transport plans to elicit fair transport strategies in the presence of uncertainty. Note that OT for fairness has already been proposed in other works (see [Gordaliza et al., 2019] and references therein), with the focus being on notions of data repair and learning fair models. In contrast, we are concerned, here, with fair OT, whose purpose is to design transport plans that are fair per se. The literature on fair OT is sparse: in [Hughes and Chen, 2021], the authors address the fair OT problem by proposing a dynamic and distributed fair OT algorithm. In this manuscript, we propose a different approach that leverages randomized policies, which are induced naturally by the HFPD-OT setting.

To appreciate the implications for fair OT of the randomization and diversity allowed by HFPD-OT, we study the problem of fair market matching [Galichon, 2021], [Echenique et al., 2024], and more precisely the question of worker-job matching, in which the nominal marginals, $\mu_{0}$ and $\nu_{0}$, are estimates of the distributions of workers and jobs, respectively. An agent $x_{i}\in\mathbb{\Omega}_{X}$ represents a category of workers or skills, while an agent $y_{j}\in\mathbb{\Omega}_{Y}$ is a job opportunity or a company. In particular, we study vertically-differentiated agents: workers in one category may exhibit skills not available in other categories. Similarly, some companies may differ in their size or may have unique production technologies. [Footnote 5: This is in contrast to horizontally differentiated agents, where some hierarchy may exist between agents.] A contract $\pi_{i,j}$ seeks to match some of the workers in category $x_{i}$ with some of the job opportunities offered by $y_{j}$.

Our purpose is to study the following question: Can randomized transport plans elicit long-term fair matching strategies in a worker-job matching problem, for agents as well as for individual contracts? Our notion of fairness is asymptotic, in the sense that fairness is achieved in the long run. This is in contrast to the static (i.e. invariant) designs of classical OT, which may, indeed, satisfy a standard fairness metric based on the ensemble of contracts on $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, but which, unfortunately, harm the same individual agents or contracts, because of:

  • (i) misspecification of the marginals for some of the agents in $\mathbb{\Omega}_{X}$ or $\mathbb{\Omega}_{Y}$; and/or

  • (ii) an invariant and uneven distribution of mass across the contracts, $\pi_{i,j}^{o}$.

Before addressing the problem of fair labour market matching (Section 5.4), we review the fairness-related concept of diversity.

5.1 Simulation study

We consider the following setting:

  • $m\equiv n\equiv d\equiv 20$

  • $\epsilon\equiv 10^{-3}$

  • $\mathsf{C}_{i,j}\equiv\|x_{i}-y_{j}\|_{2}^{2},\;\;(i,j)\in[\![m]\!]\times[\![n]\!]$

  • $\eta\equiv 2$,  $\zeta\equiv 2$

  • $\lambda_{\mathsf{I},1}\equiv 0.5$,  $\lambda_{\mathsf{I},2}\equiv 0.5$

  • We simulate the nominal worker and job distributions as $\mu_{0}\sim\mathcal{tN}(2,5)$ and $\nu_{0}\sim\mathcal{tN}(6,3)$, respectively, where $\mathcal{tN}(a,b)$ denotes the truncated Gaussian distribution with positive support, mean $a$ and variance $b$.

  • To sample from the hyperprior, $\mathsf{S}^{o}(\pi|K)$ (30), we leverage the Hamiltonian Monte Carlo (HMC) module available in TensorFlow Probability (version 0.24.0; https://www.tensorflow.org/probability), with the following configuration:

    • Number of burn-in steps: $8000$

    • Number of adaptation steps: $0.8\times\text{number of burn-in steps}$

    • Target acceptance probability (fixed): $0.6$

    • The length traveled by the leapfrog integrator is adjusted using a No U-Turn Sampler (NUTS) [Hoffman and Gelman, 2011].

    • The step size is optimized using a dual averaging policy [Hoffman and Gelman, 2011].

    • The sampler is compiled using XLA (Accelerated Linear Algebra).

    • The optimal Kantorovitch potentials (37) are computed using Algorithm 1.

  • The base-level EOT model (3) is computed using the POT library [Flamary et al., 2021].

5.2 Quantifying diversity in HFPD-OT

Our definition of long-term fairness—to follow—relies on the notion of diversity, which we quantify using the following diversity index:

Definition (Diversity index). Let $m\times n$ be the dimension of the parametric random transport plan, $\pi\sim\mathsf{S}^{o}(\pi|K)$ (30). The 1-diversity index (or perplexity score [Jelinek et al., 1977]) associated with $\mathsf{S}^{o}$ is:

$$\mathsf{D}(\mathsf{S}^{o}(\pi|K))\equiv\mathsf{E}_{\mathsf{S}^{o}}\left[\exp\bigl(\mathsf{H}(\pi)\bigr)\right], \tag{48}$$

where $\mathsf{H}(\pi)$ denotes the entropy of $\pi$:

$$\mathsf{H}(\pi)\equiv-\sum_{i=1}^{m}\sum_{j=1}^{n}\pi_{i,j}\log(\pi_{i,j}) \tag{49}$$
(a) Diversity index, $\mathsf{D}(\cdot)$, computed for different values of the Kantorovitch potentials. The average diversity index attained by HFPD-OT (red dots) remains greater than that of the EOT policy (red line), even when the latter is computed using a relatively high smoothing parameter, $\epsilon=0.1$.
(b) Fairness for agents illustrated by computing the mean diversity index of the conditional transport plan $\pi(.|Y=y_{0})$ for five different companies $(C_{1},\dots,C_{5})$. The Kantorovitch potentials and the smoothing parameter are respectively fixed to $(\lambda_{1}^{o},\lambda_{2}^{o})=(10,10)$ and $\epsilon=0.1$. Here again, a high diversity index means that each company is matched to a far more diverse set of skills and workers than would be possible with highly-smoothed EOT policies.
Figure 9: Comparative study of the diversity, $\mathsf{D}(\cdot)$, induced by HFPD-OT random matching policies and fixed EOT matching policies. Error bars correspond to the $95\%$ confidence interval over 100 Monte Carlo experiments in HFPD-OT.

In Figure 9(a), we compute and graph $\mathsf{D}(\cdot)$ for different values of the Kantorovitch potentials (37), and we compare the diversity of random HFPD-OT matching policies to that of the base-level EOT policy (3). While increasing the Kantorovitch potentials decreases the diversity, it remains substantially higher than that of the EOT policy, even when the smoothness parameter is fixed at a relatively high value, $\epsilon=0.1$. In practical terms, a higher $\mathsf{D}(\cdot)$ ensures that a more diverse set of skills is allocated to each company, in expectation. Similarly, workers are expected to have access to a more diverse set of job opportunities. We use this insight in the sequel to formalize the meanings of diversity and fairness, both for agents (Definition 5.3) and for contracts (Definition 5.4).
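As a rough illustration of how the diversity index (48) might be estimated by Monte Carlo, the following sketch averages $\exp(\mathsf{H}(\pi))$ over a list of sampled plans. The function names are hypothetical, and the sampler is only a stand-in (a flat Dirichlet draw), not the optimal hyperprior $\mathsf{S}^{o}(\pi|K)$ studied in this paper; any sampler returning normalized $m\times n$ plans could be substituted.

```python
import math
import random


def entropy(plan):
    """Entropy H(pi) of a plan given as a list of rows; 0*log(0) is taken as 0."""
    return -sum(p * math.log(p) for row in plan for p in row if p > 0.0)


def diversity_index(plans):
    """Monte Carlo estimate of D = E[exp(H(pi))] from a list of sampled plans."""
    return sum(math.exp(entropy(plan)) for plan in plans) / len(plans)


def sample_stand_in_plan(m, n, rng):
    """Hypothetical stand-in sampler: a flat Dirichlet draw arranged as m x n.

    Assumption for illustration only; this is NOT the hyperprior S^o(pi|K).
    """
    gammas = [rng.gammavariate(1.0, 1.0) for _ in range(m * n)]
    total = sum(gammas)
    return [[gammas[i * n + j] / total for j in range(n)] for i in range(m)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
plans = [sample_stand_in_plan(3, 4, rng) for _ in range(100)]
d_hat = diversity_index(plans)  # lies between 1 (one active contract) and m*n
```

By construction, the index equals $mn$ for the uniform plan and $1$ for a plan concentrated on a single contract, which gives a convenient scale for reading the values plotted in Figure 9(a).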

Remark 6.

One might argue that the smoothness parameter, $\epsilon>0$, in base-level EOT (3) can be used to induce some level of diversity for fair OT (i.e. objective (ii) in Section 5). However, it does not address objective (i). Note that the randomness in HFPD-OT is informed, since it emerges from modelling the uncertainty in the marginals, whereas the smoothness in EOT is mainly a computational convenience that is not informed by a mathematical model of uncertainty.

5.3 Long-term fairness for agents through randomization

We first discuss the notion of fairness for agents (groups of workers and companies, in our application) enabled by a randomized transport strategy and propose the following definition.

{definition}

Fairness for agents via diversified transport plans

A transport policy fulfills fairness for agents if:

  1. It acknowledges that marginals (the supply and demand) may not have been fairly measured and incorporates this knowledge in the design of the transport policy.

  2. It allows asymptotically for a diversified allocation of resources. This diversification should be proportional to the uncertainty in the marginals.

Underestimating the supply of a category of workers can produce a matching policy in which all workers in that category are unfairly assigned to closer companies (in the sense of the cost $\mathsf{C}$). Accounting for uncertainty in the supply, however, would allow, in expectation, for a more diverse mix of skills to be transferred to companies. To illustrate this point further, we analyze the diversity of workers matched to companies and compute the mean diversity index of the random conditional transport policy $\pi(\cdot|Y=y_{0})$ associated with each company, $y_{0}\in\mathbb{\Omega}_{Y}$. Figure 9(b) shows that the diversity of skills allowed by HFPD-OT remains consistently higher than that of the base-level EOT, thus allowing each company $y_{0}$ to benefit from a more diverse set of skills.
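The per-company computation described above can be sketched as follows, under the assumption that a plan is stored as a list of rows indexed by worker category, with columns indexed by company. The function names and the toy numbers are illustrative only and are not taken from the experiments of this paper.

```python
import math


def conditional_plan(plan, j):
    """Column j of the plan, renormalized: pi(x | Y = y_j)."""
    column = [row[j] for row in plan]
    mass = sum(column)
    if mass <= 0.0:
        raise ValueError("company j receives no mass; the conditional is undefined")
    return [p / mass for p in column]


def conditional_diversity(plan, j):
    """exp(H) of the conditional plan pi(. | Y = y_j) for a single realized plan."""
    cond = conditional_plan(plan, j)
    h = -sum(p * math.log(p) for p in cond if p > 0.0)
    return math.exp(h)


def mean_conditional_diversity(plans, j):
    """Average of the conditional diversity for company j over sampled plans."""
    return sum(conditional_diversity(plan, j) for plan in plans) / len(plans)


# Toy 2 x 2 plan (illustrative numbers only): company 0 draws equally from both
# worker categories, whereas company 1 draws from a single category.
toy_plan = [[0.25, 0.5],
            [0.25, 0.0]]
d0 = conditional_diversity(toy_plan, 0)
d1 = conditional_diversity(toy_plan, 1)
```

In this toy case, company 0 has a conditional diversity of 2 (two equally weighted categories) and company 1 a conditional diversity of 1, which is the kind of contrast reported per company in Figure 9(b).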

5.4 Long-term fairness for contracts through randomization

In our worker-company matching problem, as in many other transport problems, contracts correspond to a physical infrastructure, deployed to match resources to demand (agencies, recruitment processes, crowd-sourcing labour market platforms, etc.). By design, the OT model yields a sparse transport strategy where the transport burden is supported by a small number of contracts, and though the base-level EOT may allow for smoother, i.e. more diverse transport strategies, this diversity does not emerge from a proper mathematical modelling of uncertainty (Remark 7). In contrast, randomized HFPD-OT strategies allow the activation of a more diverse set of contracts, yielding a fairer utilization of the transport infrastructure. In this regard, HFPD-OT is closely related to maximum diversity problems [Martí et al., 2022].

To formalize the previous point, we start by introducing the notion of eligible contracts:

$$\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)\equiv\bigl\{\pi_{i,j},\;(i,j)\in[\![m]\!]\times[\![n]\!]\;|\;\pi\in\mathsf{supp}(\mathsf{S}^{o})\;\text{and}\;\mathsf{E}_{\mathsf{S}^{o}}\left[\mathbb{1}(\pi_{i,j}\geq\upsilon)\right]>0\bigr\}.$$ (50)

Here, $\upsilon>0$ is an activation threshold, imposed by design constraints (technical specifications, design requirements, etc.). Eligible contracts are those with a positive probability of being active under the hyperprior, $\mathsf{S}^{o}(\pi|K)$. The set $\mathbb{\Pi}_{\mathsf{E}}$ is better understood through its asymptotic behaviour:

  • In the absence of any constraint on the marginals, $\mathbb{\Pi}_{\mathsf{E}}$ is fully determined by the base-level and hierarchical ideal designs (23), and:

    $$\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)\xrightarrow{\eta\rightarrow\infty,\,\zeta\rightarrow\infty}\bigl\{\pi_{i,j},\;(i,j)\in[\![m]\!]\times[\![n]\!]\;|\;\pi\in\mathsf{supp}(\tilde{\mathsf{S}})\;\text{and}\;\mathsf{E}_{\tilde{\mathsf{S}}}\left[\mathbb{1}(\pi_{i,j}\geq\upsilon)\right]>0\bigr\}.$$

    In particular, if the base-level and hierarchical ideal designs are chosen to be uninformative, it follows that

    $$\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)\xrightarrow{\eta\rightarrow\infty,\,\zeta\rightarrow\infty}\bigl\{\pi_{i,j},\;(i,j)\in[\![m]\!]\times[\![n]\!]\;|\;\mathsf{E}_{\mathcal{U}}\left[\mathbb{1}(\pi_{i,j}\geq\upsilon)\right]>0\bigr\}.$$

  • In the case of crisp marginals (i.e. no marginal uncertainties), $\mathbb{\Pi}_{\mathsf{E}}$ contracts to a subset of $\mathbb{\Pi}(\mu_{0},\nu_{0})$ (2.2):

    $$\begin{aligned}\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)\xrightarrow{\eta\rightarrow 0,\,\zeta\rightarrow 0}\;&\bigl\{\pi_{i,j},\;(i,j)\in[\![m]\!]\times[\![n]\!]\;|\;\pi\in\mathbb{\Pi}(\mu_{0},\nu_{0})\;\text{and}\;\mathsf{E}_{\mathsf{S}^{o}}\left[\mathbb{1}(\pi_{i,j}\geq\upsilon)\right]>0\bigr\}\\&\subset\mathbb{\Pi}(\mu_{0},\nu_{0}).\end{aligned}$$

We use $\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)$ to introduce our definition of fairness for contracts.

{definition}

Fairness for contracts via diversified transport plans
A random transport plan, $\pi\sim\mathsf{S}^{o}(\pi|K)$, achieves fairness for contracts if it distributes the transport burden over all eligible contracts in $\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)$. For the purpose of illustration, we fix the optimal Kantorovitch potentials (20) to arbitrarily small values, $\lambda^{o}_{1}=\lambda^{o}_{2}=0.05$ (or, equivalently, large uncertainty radii $(\eta,\zeta)$), and the activation threshold to $\upsilon=2\times 10^{-2}$. Both the base-level and hierarchical ideal designs (9) are chosen to be uniform. We generate a sequence of relative frequency maps, each providing estimates of the probabilities that the respective contracts, $\pi_{i,j}\in\mathbb{\Pi}_{\mathsf{E}}(\eta,\zeta,\upsilon)$, are active. We compare these to the base-level EOT matching policy (Figure 10(a)), which, being oblivious to the uncertainty in the marginals, yields a sparse transport policy and thus fails to achieve fairness for contracts (Definition 5.4).
In contrast, the random HFPD-OT matching policies enable a greater diversity by ensuring that more of the contracts are active, as shown in Figures 10(b), 10(c) and 10(d). These are the estimated activation probability maps, averaged over $N\in\{10,50,100\}$ randomly sampled transport plans, $\pi^{(i)}\sim\mathsf{S}^{o}(\pi|K)$, for $i\in[\![N]\!]$. As $N\rightarrow\infty$, these activation estimates converge to the ergodic limit, in which all eligible contracts have the same probability of being active.
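A relative frequency map of the kind described above can be sketched as follows, assuming that the $N$ sampled plans are already available as nested lists. The function names and the two toy plans are hypothetical, and the sampler for $\mathsf{S}^{o}(\pi|K)$ is deliberately left out of the sketch.

```python
def activation_frequencies(plans, threshold):
    """Fraction of sampled plans in which each contract satisfies pi_ij >= threshold."""
    m, n = len(plans[0]), len(plans[0][0])
    counts = [[0] * n for _ in range(m)]
    for plan in plans:
        for i in range(m):
            for j in range(n):
                if plan[i][j] >= threshold:
                    counts[i][j] += 1
    return [[count / len(plans) for count in row] for row in counts]


def empirically_eligible(frequencies):
    """Index pairs (i, j) that were active in at least one sampled plan."""
    return [(i, j)
            for i, row in enumerate(frequencies)
            for j, freq in enumerate(row)
            if freq > 0.0]


# Two toy 2 x 2 plans (illustrative numbers only) and the threshold used in the text.
toy_plans = [
    [[0.50, 0.01], [0.00, 0.49]],
    [[0.02, 0.48], [0.50, 0.00]],
]
freq_map = activation_frequencies(toy_plans, threshold=2e-2)
eligible = empirically_eligible(freq_map)
```

Note that `empirically_eligible` can only under-approximate the set in (50) for finite $N$, since a contract with a small but positive activation probability may not be active in any of the sampled plans.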

(a) EOT plan computed between nominal marginals $\mu_{0}$ and $\nu_{0}$.
(b) Estimated probabilities of active contracts, $N=10$ realized plans.
(c) Estimated probabilities of active contracts, $N=50$ realized plans.
(d) Estimated probabilities of active contracts, $N=100$ realized plans.
Figure 10: Comparison of the diversity of contracts induced by the conventional base-level EOT solution vs HFPD-OT. Figure (a): The base-level EOT plan, with the smoothness parameter fixed to $\epsilon=10^{-3}$, induces a sparse policy, and therefore does not fairly distribute the burden of transport across all eligible contracts, $\pi_{i,j}$. Figures (b), (c), (d): random HFPD-OT policies induce a long-term (i.e. ergodically) fair regime, where the burden of transport is distributed across a larger set of contracts. Each entry in the relative frequency maps shows the estimated probability of activation of the corresponding contract, $\pi_{i,j}$. In the limit of $N\rightarrow\infty$ randomly realized matching policies, $\pi^{(i)}\sim\mathsf{S}^{o}(\pi|K)$, the map of estimated probabilities of active contracts converges to a fair regime, where all eligible contracts equally support the transport burden.
Remark 7.

Another way to appreciate fairness for contracts induced by randomized HFPD-OT plans is to study the random marginal cost, $c_{i,j}$, associated with the contract $\pi_{i,j}$ (Figure 1(b)):

$$c_{i,j}\equiv\pi_{i,j}\mathsf{C}_{i,j}\;\;,\;\;\pi\sim\mathsf{S}^{o}$$ (51)

Recall that the squared 2-Kantorovitch distance,

$$\mathsf{KD}_{2}^{2}(\mu_{0},\nu_{0})\equiv\min_{\pi\in\mathbb{\Pi}(\mu_{0},\nu_{0})}\sum_{i,j}\mathsf{C}_{i,j}\pi_{i,j},$$ (52)

is the minimum expected transport cost between $\mu_{0}$ and $\nu_{0}$, for the Euclidean cost function, $\mathsf{C}$ (Section 5.1) [Villani, 2008]. The base-level OT objective in (52) yields a fixed optimal solution, where the cost $c_{i,j}$ is immutable. Consequently, the transport burden is supported by the same set of contracts. Let $\pi_{i_{0},j_{0}}$ be one such contract where:

$$c_{i_{0},j_{0}}>\mathsf{KD}_{2}^{2}(\mu_{0},\nu_{0}),\;\;(i_{0},j_{0})\in[\![m]\!]\times[\![n]\!]$$ (53)

On the other hand, in HFPD-OT, and by virtue of the random nature of $c_{i_{0},j_{0}}$, we can write the following Markov inequality:

$$\Pr\left[c_{i_{0},j_{0}}\geq\mathsf{KD}_{2}^{2}(\mu_{0},\nu_{0})\right]\leq\frac{\mathsf{E}_{\pi\sim\mathsf{S}^{o}}\left[c_{i_{0},j_{0}}\right]}{\mathsf{KD}_{2}^{2}(\mu_{0},\nu_{0})}$$ (54)

Hence, this probability upper bound depends on the ratio of the expected marginal transport cost associated with the contract, $\pi_{i_{0},j_{0}}$ (51), to the squared 2-Kantorovitch distance between the nominal marginals (52). Essentially, it provides an upper bound on the probability of a fairness-related proposition (Definition 5.4). Insights such as these may be used to establish operating conditions that are conducive to fairness. Such statistical handles on transport fairness are, of course, unavailable in conventional base-level OT.
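The two sides of the Markov inequality (54) can be compared empirically along the following lines. This is a minimal sketch under stated assumptions: the sampled values of $\pi_{i_{0},j_{0}}$, the cost entry and the value of the squared 2-Kantorovitch distance are all supplied by the user (the numbers below are purely illustrative), and the function name is hypothetical. Computing $\mathsf{KD}_{2}^{2}(\mu_{0},\nu_{0})$ itself requires an OT solver, which is outside the scope of the sketch.

```python
def markov_check(plan_entries, cost_entry, kd_squared):
    """Compare the empirical exceedance frequency with the Markov bound in (54).

    plan_entries: sampled values of pi_{i0,j0} (nonnegative), one per sampled plan
    cost_entry:   the cost C_{i0,j0} (nonnegative)
    kd_squared:   a positive value standing in for KD_2^2(mu_0, nu_0)
    """
    if kd_squared <= 0.0:
        raise ValueError("the bound requires a strictly positive threshold")
    # Random marginal costs c_{i0,j0} = pi_{i0,j0} * C_{i0,j0}, as in (51).
    costs = [p * cost_entry for p in plan_entries]
    frequency = sum(1 for c in costs if c >= kd_squared) / len(costs)
    bound = (sum(costs) / len(costs)) / kd_squared
    return frequency, bound


# Illustrative inputs only; none of these values come from the paper's experiments.
frequency, bound = markov_check([0.1, 0.2, 0.4, 0.0], cost_entry=2.0, kd_squared=0.5)
```

Because the sampled costs are nonnegative, the empirical frequency never exceeds the empirical bound; a bound above one is simply uninformative for that contract.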

6 Conclusions and next steps

This paper recasts the optimal transport problem into a broader class of fully probabilistic design and generalized Bayesian inference techniques. In this new formalism, the transport plan is no longer regarded as a crisp, deterministic object, but is modeled as a random (i.e. uncertain) distribution in a hierarchical Bayesian setting. This is in clear contrast with the existing, certainty-equivalence-based OT paradigm. In this way, we augment the conventional base-level (i.e. deterministic) OT framework with the necessary tools to reason about uncertainty and design robust transport algorithms. In this new hierarchical setting, the object of interest is no longer the optimal transport plan, which may not even exist—since the marginals are themselves noisy, uncertain realizations of some underlying stochastic process—but is rather the optimal hyperprior, which is effectively a generative model over the set of transport plans.

We now recall some key results on HFPD-OT, obtained in this paper:

  • The functional form of the optimal hyperprior has been characterized in both the non-parametric and parametric settings. Importantly, we proved that the HFPD-OT setting is a generalization of the classical EOT in that the optimal transport plan can be recovered asymptotically when uncertainty in the marginals decreases.

  • Considering the parametric setting, we proposed an algorithm to approximate the Kantorovitch potentials and described some of the inferential properties of the hyperprior, highlighting its shape and location parameters.

  • To illustrate the importance of HFPD-OT, we studied the problem of algorithmic fairness as it arises in fair market matching problems. First, we explored the role of randomization and diversification in eliciting fairer transport policies for agents, that is, for specific categories of workers and the companies which need their skills. Second, we investigated the role of randomization in eliciting fair matching policies for individual contracts between agents, by allowing the distribution of the transport burden across a larger set of contracts.

There remain important open questions to be studied and improvements to be implemented in subsequent work. The stochastic algorithms leveraged here enable a first approximation of the optimal hyperprior, but better samplers can be derived. Interestingly, sampling from the hyperprior may require new MCMC techniques that leverage the geometry of the support of $\mathsf{S}^{o}(\pi|K)$. Moreover, although the HFPD-OT application covered in this paper concerns algorithmic fairness, we contend that the set of possible applications is broader: randomized policies indeed play an important role in a range of problems related to generalizability and robustness in machine learning. Finally, a notable contribution of this paper has been to extend duality results from the classical setting in OT to the hierarchical framework of HFPD-OT. However, key theoretical results in base-level deterministic OT, mainly those related to its geometry ([Gangbo and McCann, 1996], [Villani, 2008], etc.), need careful consideration within the extended framework of HFPD-OT.

Acknowledgement

This work has been supported by the European Union’s Horizon Europe research and innovation programme, under grant agreement no. 101070568. It has also been supported by the European Union’s competitive HORIZON-MSCA-2021-DN-01 (Marie Sklodowska-Curie Doctoral Networks) programme, under grant agreement no. 101073508. The authors also acknowledge the support of Innovate UK in underwriting both of the above grants, under the Horizon Europe Guarantee.

7 Appendix: proof of strong duality in Theorem 3.1 (step 3)

The following additional mathematical definitions are required, supplementing the preliminaries in Section 2.2.

  • We assume henceforth that, besides being compact, $\mathbb{\Omega}_{X}$ and $\mathbb{\Omega}_{Y}$ are Hausdorff spaces. This separation property guarantees uniqueness of limits of sequences.

  • From compactness of $\mathbb{\Omega}_{X}$ and $\mathbb{\Omega}_{Y}$, it follows, by the Riesz-Markov-Kakutani Theorem [Folland, 1999], that the topological dual of $\mathbb{C}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$, the set of continuous functions on $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$, is the set of Radon measures with support in $\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}$. This also implies that $\mathbb{C}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ is a Banach space. Thus, by the Banach-Alaoglu Theorem, $\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ is compact in the weak-* topology [Billingsley, 1999].

  • The previous compactness result allows us to again invoke the Riesz-Markov-Kakutani representation Theorem, which states that the topological dual of $\mathbb{C}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ is the hierarchical space of Radon measures with support in $\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$. We denote this dual space by $\mathbb{S}$. The canonical duality pairing reads as follows [Folland, 1999]:

    $$\langle f,\mathsf{S}\rangle\equiv\int_{\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}f\,d\mathsf{S}$$ (55)

    with $f\in\mathbb{C}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ and $\mathsf{S}\in\mathbb{S}$. Later in the proof, we will constrain $\mathbb{S}$ to the set of hierarchical (probability) distributions.

  • If $\mathsf{O}:\mathbb{S}\rightarrow\mathbb{R}^{p}$ is a linear map, its adjoint is defined as $\mathsf{O}^{*}\colon\mathbb{R}^{p}\rightarrow\mathbb{C}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ such that:

    $$\langle\mathsf{O}(\mathsf{S}),z\rangle=\langle\mathsf{S},\mathsf{O}^{*}(z)\rangle$$ (56)

    for $\mathsf{S}\in\mathbb{S}$ and $z\in\mathbb{R}^{p}$.

  • $f^{*}$ denotes the Legendre-Fenchel transform of $f$ defined in $\mathbb{C}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$. It is given by:

    $$f^{*}(u)\equiv\sup_{v\in\mathbb{C}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))}\bigl(\langle u,v\rangle-f(v)\bigr)$$ (57)

  • $\mathsf{dom}(h)$ denotes the effective domain of the function $h\in\mathbb{C}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$, defined as: $\mathsf{dom}(h)\equiv\{\pi\in\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})\;|\;h(\pi)<\infty\}$.

  • Our proof relies on the notion of decomposable spaces, as originally defined in Theorem 1 of [Rockafellar, 1971]. A space is decomposable if it is stable under bounded alterations over sets of finite measure.

  • Let $\mathbb{L}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ denote the set of integrable functions defined on $\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$. $\mathbb{L}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ is decomposable, since it satisfies the following conditions [Rockafellar, 1971]:

    • $\mathbb{L}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ contains every bounded measurable function defined on $\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$.

    • If $h\in\mathbb{L}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ and $\mathbb{l}\in\mathcal{F}_{\mathbb{\Omega}_{\mathsf{H}}}$ is an arbitrary set of finite measure in $\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$ (3), then $\mathbb{L}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ contains $\chi_{\mathbb{l}}\boldsymbol{\cdot}h$, where $\boldsymbol{\cdot}$ denotes the pointwise product between the indicator function $\chi_{\mathbb{l}}$ of $\mathbb{l}$ and the function $h$.

  • The characteristic function of a (convex) set $\mathbb{A}$ is the convex function:

    $$\mathbb{1}_{\mathbb{A}}(x)\equiv\begin{cases}0 & \text{if } x\in\mathbb{A},\\ +\infty & \text{otherwise.}\end{cases}$$

For the sake of completeness, we recall the main duality theorem [Rockafellar, 1974] in the general setting, before specializing it to our problem later in the proof:

Theorem (Fenchel-Rockafellar). Let $(E,E^{*})$ and $(F,F^{*})$ be two topologically paired spaces. Let $\mathsf{O}\colon E\rightarrow F$ be a continuous linear operator and $\mathsf{O}^{*}\colon F^{*}\rightarrow E^{*}$ its adjoint. Let $f$ and $g$ be two lower semi-continuous and proper convex functions defined on $E$ and $F$, respectively. If the following qualification condition is satisfied: $\exists\;y^{*}\in\mathsf{dom}(g^{*})$ s.t. $f^{*}$ is continuous at $\mathsf{O}^{*}(y^{*})$, then:

$$\max_{x\in E}-f(-x)-g(\mathsf{O}(x))=\min_{y^{*}\in F^{*}}f^{*}(\mathsf{O}^{*}(y^{*}))+g^{*}(y^{*}) \tag{58}$$
Proof.

Consider the primal problem in (11). Using Fubini's theorem and the Bayesian hierarchical modelling consistency condition stated in (7), it is easy to show that this original problem can be formulated equivalently, over the set of hyperpriors $\mathbb{S}$, as follows:

$$(P):\quad\mathsf{S}^{o}\in\operatorname*{argmin}_{\mathsf{S}\in\mathbb{S}}\left\{\mathsf{D}_{\mathsf{KL}}\bigl(\mathsf{S}(\pi|K)\,||\,\tilde{\mathsf{S}}(\pi|K)\bigr)\right\}$$

subject to:

$$\left\{\begin{aligned}\mathsf{E}_{\mathsf{S}}(\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0}))&\leq\eta\\ \mathsf{E}_{\mathsf{S}}(\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0}))&\leq\zeta\end{aligned}\right.$$

where $\tilde{\mathsf{S}}$ is defined in (16). The constraints involve the following linear map:

$$\mathsf{I}(\mathsf{S})\equiv\int_{\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\mathsf{S}(\pi|K)\,d\mathcal{L}(\pi)$$

besides our usual moment constraints:

$$\mathsf{O}_{1}(\mathsf{S})\equiv\mathsf{E}_{\mathsf{S}}(\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})),\qquad\mathsf{O}_{2}(\mathsf{S})\equiv\mathsf{E}_{\mathsf{S}}(\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0}))$$

For convenience, we denote by $\bar{\mathsf{O}}$ the linear map given by:

$$\bar{\mathsf{O}}(\mathsf{S})\equiv(\mathsf{O}_{1}(\mathsf{S}),\mathsf{O}_{2}(\mathsf{S}),\mathsf{I}(\mathsf{S}))\in\mathbb{R}^{3} \tag{59}$$

As usual, we can use the characteristic function to encode the constraints directly in the objective of $(P)$, yielding the following equivalent unconstrained problem:

$$(P^{\prime}):\quad\mathsf{S}^{o}\in\operatorname*{argmin}_{\mathsf{S}}\left\{\mathsf{D}_{\mathsf{KL}}\bigl(\mathsf{S}(\pi|K)\,||\,\tilde{\mathsf{S}}(\pi|K)\bigr)+g_{0}(\bar{\mathsf{O}}(\mathsf{S}))\right\} \tag{60}$$

where we define $g_{0}$ as follows:

$$g_{0}(z_{0},z_{1},z_{2})\equiv\mathbb{1}_{[0,\eta]}(z_{0})+\mathbb{1}_{[0,\zeta]}(z_{1})+\mathbb{1}_{\{1\}}(z_{2}),\qquad(z_{0},z_{1},z_{2})\in\mathbb{R}^{3}$$

We begin by deriving the adjoint of $\bar{\mathsf{O}}(\cdot)$ and the Legendre-Fenchel conjugates of $g_{0}(\cdot)$ and $\mathsf{D}_{\mathsf{KL}}(\cdot||\cdot)$, respectively. Applying the definition of the adjoint (56) to the map in (59), it is straightforward to show that $\bar{\mathsf{O}}^{*}$ is given by:

$$\bar{\mathsf{O}}^{*}(\lambda_{1},\lambda_{2},\lambda_{3})=\lambda_{1}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})+\lambda_{2}\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})+\lambda_{3},\qquad(\lambda_{1},\lambda_{2},\lambda_{3})\in\mathbb{R}^{3}$$

Moreover, applying the definition of the Legendre-Fenchel transform (57) yields the following conjugate of $g_{0}$:

$$g_{0}^{*}(\lambda_{1},\lambda_{2},\lambda_{3})=\lambda_{1}\eta+\lambda_{2}\zeta+\lambda_{3}$$
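As a quick check of the expression for $g_{0}^{*}$, each indicator term can be conjugated separately. The sketch below is not part of the original argument; it makes explicit the sign assumption $\lambda_{1},\lambda_{2}\geq 0$ under which the linear form holds (the nonnegativity of the optimal multipliers is obtained at the end of the proof):

```latex
\begin{aligned}
g_{0}^{*}(\lambda_{1},\lambda_{2},\lambda_{3})
 &= \sup_{z_{0}\in[0,\eta]}\lambda_{1}z_{0}
  + \sup_{z_{1}\in[0,\zeta]}\lambda_{2}z_{1}
  + \sup_{z_{2}\in\{1\}}\lambda_{3}z_{2} \\
 &= \eta\max(\lambda_{1},0)+\zeta\max(\lambda_{2},0)+\lambda_{3} \\
 &= \lambda_{1}\eta+\lambda_{2}\zeta+\lambda_{3}
  \quad\text{whenever } \lambda_{1},\lambda_{2}\geq 0.
\end{aligned}
```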

We now turn our attention to the conjugate of $\mathsf{D}_{\mathsf{KL}}(\cdot||\cdot)$. To this end, we first consider the following integral functional [Rockafellar, 1971]:

\begin{align}
\mathcal{I}_{\mathsf{f}}\colon\mathbb{C}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))&\longrightarrow\mathbb{R} \tag{61}\\
u&\longmapsto\int_{\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\mathsf{f}(\pi,u(\pi))\,d\mathcal{L}(\pi) \tag{62}
\end{align}

where:

$$\mathsf{f}(\pi,u(\pi))\equiv\tilde{\mathsf{S}}(\pi|K)\exp\bigl(u(\pi)-1\bigr)$$

$\mathsf{f}(\pi,\cdot)$ is clearly an integrable, proper and convex function. As we saw earlier, the space $\mathbb{L}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))$ is decomposable. Therefore, by Theorem 2 in [Rockafellar, 1971], we can take the Legendre-Fenchel transform of $\mathcal{I}_{\mathsf{f}}$ through the integral sign and write:

$$\mathcal{I}^{*}_{\mathsf{f}}(\mathsf{S})=\mathcal{I}_{\mathsf{f}^{*}}(\mathsf{S})\equiv\int_{\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\mathsf{f}^{*}(\pi,\mathsf{S}(\pi|K))\,d\mathcal{L}(\pi) \tag{63}$$

$\mathsf{f}^{*}$ is obtained using again the definition of the Legendre-Fenchel transform (57):

$$\begin{split}\mathsf{f}^{*}(\pi,\mathsf{S}(\pi|K))&\equiv\sup_{u\in\mathbb{C}(\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y}))}\bigl\{u(\pi)\mathsf{S}(\pi|K)-\mathsf{f}(\pi,u(\pi))\bigr\}\\&=\mathsf{S}(\pi|K)\log\Bigl(\frac{\mathsf{S}(\pi|K)}{\tilde{\mathsf{S}}(\pi|K)}\Bigr)\end{split}$$
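The closed form of the pointwise conjugate can be checked numerically. The following minimal sketch fixes one plan $\pi$, writes `q` for the value $\tilde{\mathsf{S}}(\pi|K)>0$ and `s` for $\mathsf{S}(\pi|K)>0$, and compares a brute-force supremum over a grid of scalar values `t` (standing in for $u(\pi)$) with $s\log(s/q)$. The function names, grid bounds and test values are assumptions made for illustration only.

```python
import math

def f(q, t):
    """Integrand f(pi, t) = q * exp(t - 1), with q standing in for S~(pi|K) > 0."""
    return q * math.exp(t - 1.0)

def conjugate_closed_form(q, s):
    """Claimed conjugate s * log(s / q), valid for s > 0 and q > 0."""
    return s * math.log(s / q)

def conjugate_numeric(q, s, lo=-20.0, hi=20.0, n=200_001):
    """Brute-force sup over a uniform grid of t of the concave map t*s - f(q, t)."""
    step = (hi - lo) / (n - 1)
    best = -math.inf
    for i in range(n):
        t = lo + i * step
        best = max(best, t * s - f(q, t))
    return best

# Hypothetical (q, s) pairs; the maximizer t = 1 + log(s / q) lies inside the grid.
for q, s in [(0.3, 0.7), (1.0, 1.0), (2.0, 0.5)]:
    gap = abs(conjugate_numeric(q, s) - conjugate_closed_form(q, s))
    print(f"q={q}, s={s}, |numeric - closed form| = {gap:.2e}")
```

The grid search is deliberately naive: the objective is concave in `t`, so any one-dimensional optimizer would do, but a grid keeps the check free of external dependencies.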

It follows that:

$$\mathcal{I}^{*}_{\mathsf{f}}(\mathsf{S})=\mathsf{D}_{\mathsf{KL}}(\mathsf{S}||\tilde{\mathsf{S}})$$

There exists at least one hyperprior $\mathsf{S}\in\mathbb{S}$ s.t. $\mathsf{f}^{*}$ is an integrable function of $\pi$ (consider for instance $\mathsf{S}=\tilde{\mathsf{S}}$). It follows, by Theorem 1 in [Rockafellar, 1971], that $\mathcal{I}_{\mathsf{f}}$ is a well-defined convex functional. Thus, the conjugacy operator acts as an involution on $\mathcal{I}_{\mathsf{f}}$, yielding:

$$\mathsf{D}_{\mathsf{KL}}^{*}=\mathcal{I}_{\mathsf{f}}^{**}=\mathcal{I}_{\mathsf{f}}$$

Going back to our main Theorem in (7), it is obvious that $\mathsf{D}_{\mathsf{KL}}(\cdot||\cdot)$ and $g_{0}(\cdot)$ are lower semicontinuous, proper and convex. Furthermore, $\mathsf{D}_{\mathsf{KL}}^{*}(\cdot||\cdot)$ is continuous everywhere w.r.t. the uniform norm (Theorem 4 in [Rockafellar, 1974]). It follows that strong duality holds, i.e. the optimal values of the primal and dual problems are equal, the dual reading as follows:

$$(D):\quad\sup_{(\lambda_{1},\lambda_{2},\lambda_{3})\in\mathbb{R}^{3}}\left\{-\int_{\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\tilde{\mathsf{S}}(\pi|K)\exp\bigl(\bar{\mathsf{O}}^{*}(-\lambda_{1},-\lambda_{2},-\lambda_{3})-1\bigr)\,d\mathcal{L}(\pi)-\lambda_{1}\eta-\lambda_{2}\zeta-\lambda_{3}\right\} \tag{64}$$

One can further simplify the previous result by maximizing (64) w.r.t. $\lambda_{3}$ for fixed $(\lambda_{1},\lambda_{2})$, yielding the following value for $\lambda^{o}_{3}$:

$$\lambda^{o}_{3}=\log\left(\int_{\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}\tilde{\mathsf{S}}(\pi|K)\exp\bigl(-\lambda_{1}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})-\lambda_{2}\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})-1\bigr)\,d\mathcal{L}(\pi)\right) \tag{65}$$

By substituting $\lambda^{o}_{3}$ back into (64), we obtain (18).
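The substitution can be sketched explicitly. Here $Z(\lambda_{1},\lambda_{2})$ is a shorthand introduced only for this sketch (it is not notation from the original): it is the integral in (65) without the constant $-1$ in the exponent, so that $\lambda^{o}_{3}=\log Z(\lambda_{1},\lambda_{2})-1$. The resulting expression should agree with (18) up to notation:

```latex
\begin{aligned}
Z(\lambda_{1},\lambda_{2})
 &\equiv \int_{\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})}
   \tilde{\mathsf{S}}(\pi|K)
   \exp\bigl(-\lambda_{1}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})
             -\lambda_{2}\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\bigr)\,d\mathcal{L}(\pi), \\
-e^{-\lambda^{o}_{3}}\,e^{-1}\,Z(\lambda_{1},\lambda_{2})
 -\lambda_{1}\eta-\lambda_{2}\zeta-\lambda^{o}_{3}
 &= -1-\lambda_{1}\eta-\lambda_{2}\zeta-\bigl(\log Z(\lambda_{1},\lambda_{2})-1\bigr) \\
 &= -\lambda_{1}\eta-\lambda_{2}\zeta-\log Z(\lambda_{1},\lambda_{2}).
\end{aligned}
```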

The optimality condition:

$$0\in\partial\mathsf{D}_{\mathsf{KL}}(\mathsf{S}(\pi|K))+\partial g_{0}\bigl(\bar{\mathsf{O}}(\mathsf{S})\bigr)$$

implies that the primal and dual optimal solutions should satisfy the following extremality conditions [Rockafellar, 1967]:

$$\left\{\begin{aligned}&\mathsf{S}^{o}\in\partial\mathcal{I}_{\mathsf{f}}(-\mathsf{O}^{*}(\lambda_{1},\lambda_{2}))\quad\mathcal{L}\text{-a.e.}\\&(-\lambda_{1}^{o},-\lambda_{2}^{o})\in\partial g_{0}\bigl(\mathsf{O}_{1}(\mathsf{S}),\mathsf{O}_{2}(\mathsf{S})\bigr)\end{aligned}\right.$$

$\mathcal{I}_{\mathsf{f}}$ being differentiable everywhere, its sub-differential reduces to the usual gradient, leading to the same optimal hyperprior derived earlier using information processing arguments (17):

$$\mathsf{S}^{o}\propto\exp\bigl(-\lambda_{1}^{o}\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})\bigr)\,\tilde{\mathsf{S}}(\pi|K)\,\exp\bigl(-\lambda_{2}^{o}\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\bigr)\quad\mathcal{L}\text{-a.e.}$$
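To illustrate the form of the optimal hyperprior, the following minimal sketch evaluates the exponential tilting on a finite, hypothetical set of candidate plans. The discretization, the function names, the toy plans and the values of the multipliers are all assumptions made for illustration; the result above is stated over the whole set $\mathbb{P}(\mathbb{\Omega}_{X}\times\mathbb{\Omega}_{Y})$, and the multipliers would in practice come from solving the dual.

```python
import math

def kl(p, q):
    """Discrete KL divergence D_KL(p || q), with the convention 0 * log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def marginals(plan):
    """Row sums (mu) and column sums (nu) of a plan given as a list of rows."""
    mu = [sum(row) for row in plan]
    nu = [sum(col) for col in zip(*plan)]
    return mu, nu

def tilted_hyperprior(plans, base, mu0, nu0, lam1, lam2):
    """Weights proportional to base[i] * exp(-lam1*KL(mu_i||mu0) - lam2*KL(nu_i||nu0))."""
    weights = []
    for plan, b in zip(plans, base):
        mu, nu = marginals(plan)
        weights.append(b * math.exp(-lam1 * kl(mu, mu0) - lam2 * kl(nu, nu0)))
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy data: three 2x2 plans, a uniform base hyperprior,
# uniform nominal marginals, and arbitrarily chosen multipliers.
plans = [
    [[0.25, 0.25], [0.25, 0.25]],  # marginals match mu0 and nu0
    [[0.40, 0.10], [0.10, 0.40]],  # marginals match mu0 and nu0
    [[0.60, 0.20], [0.10, 0.10]],  # marginals deviate from mu0 and nu0
]
base = [1.0 / 3.0] * 3
mu0 = [0.5, 0.5]
nu0 = [0.5, 0.5]
s_opt = tilted_hyperprior(plans, base, mu0, nu0, lam1=2.0, lam2=2.0)
print(s_opt)
```

In this toy setting the first two plans keep equal weight, since their marginals coincide with the nominal ones, while the third plan is down-weighted in proportion to the divergence of its marginals.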

On the other hand, noting that the sub-differential of the indicator function $g_{0}$ is the normal cone $\bar{\mathsf{N}}_{\mathbb{Q}}\bigl(\mathsf{O}_{1}(\mathsf{S}),\mathsf{O}_{2}(\mathsf{S})\bigr)$, defined as follows:

$$\bar{\mathsf{N}}_{\mathbb{Q}}\bigl(\mathsf{O}_{1}(\mathsf{S}),\mathsf{O}_{2}(\mathsf{S})\bigr)\equiv\left\{v\in\mathbb{R}^{2}\;\;|\;\;v^{\intercal}\left[\boldsymbol{x}-\begin{bmatrix}\mathsf{O}_{1}(\mathsf{S})\\ \mathsf{O}_{2}(\mathsf{S})\end{bmatrix}\right]\preceq 0,\;\forall\boldsymbol{x}\in[0,\eta]\times[0,\zeta]\right\} \tag{66}$$

the following optimality conditions are obtained, for the special choice of $\boldsymbol{x}=(\eta,\zeta)$ plugged in (66):

$$\left\{\begin{aligned}\lambda_{1}^{o}\Bigl(\eta-\mathsf{E}_{\mathsf{S}}\bigl(\mathsf{D}_{\mathsf{KL}}(\mu||\mu_{0})\bigr)\Bigr)&\geq 0\\ \lambda_{2}^{o}\Bigl(\zeta-\mathsf{E}_{\mathsf{S}}\bigl(\mathsf{D}_{\mathsf{KL}}(\nu||\nu_{0})\bigr)\Bigr)&\geq 0\end{aligned}\right.$$

Thus: $\boldsymbol{\lambda}\equiv(\lambda_{1}^{o},\lambda_{2}^{o})\succeq 0$. ∎
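The sign condition on the multipliers can be illustrated numerically. The following minimal sketch assumes a finite, hypothetical set of plans summarized by their divergences `a[i]`, a single moment constraint (the proof has two), and a bisection search that is a stand-in chosen here, not the authors' algorithm. It returns a zero multiplier when the base weights already satisfy the bound, and otherwise the positive multiplier at which the tilted mean meets the bound.

```python
import math

def tilted_mean(a, base, lam):
    """Mean of a under weights proportional to base[i] * exp(-lam * a[i])."""
    w = [b * math.exp(-lam * ai) for ai, b in zip(a, base)]
    return sum(wi * ai for wi, ai in zip(w, a)) / sum(w)

def solve_multiplier(a, base, eta, iters=200):
    """Smallest lam >= 0 with tilted_mean(a, base, lam) <= eta.

    Assumes eta > min(a), so that the bound is attainable by tilting.
    """
    if tilted_mean(a, base, 0.0) <= eta:
        return 0.0  # constraint inactive: no tilting needed
    lo, hi = 0.0, 1.0
    while tilted_mean(a, base, hi) > eta:  # grow the bracket
        hi *= 2.0
    for _ in range(iters):  # bisection: the tilted mean decreases in lam
        mid = 0.5 * (lo + hi)
        if tilted_mean(a, base, mid) > eta:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical divergences of three candidate plans and a uniform base hyperprior.
a = [0.0, 0.1, 0.5]
base = [1.0 / 3.0] * 3
lam_active = solve_multiplier(a, base, eta=0.1)    # base mean 0.2 exceeds 0.1
lam_inactive = solve_multiplier(a, base, eta=0.3)  # base mean 0.2 is within 0.3
print(lam_active, tilted_mean(a, base, lam_active), lam_inactive)
```

Bisection is used because the tilted mean is monotonically decreasing in the multiplier (its derivative is minus the variance under the tilted weights), so no gradient information or step size needs tuning.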

References

  • Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223. PMLR, 06–11 Aug 2017.
  • Courty et al. [2017] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017. doi: 10.1109/TPAMI.2016.2615921.
  • Mathon et al. [2014] Benjamin Mathon, François Cayre, Patrick Bas, and Benoit Macq. Optimal transport for secure spread-spectrum watermarking of still images. IEEE Transactions on Image Processing, 23:1694–1705, 2014. doi: 10.1109/TIP.2014.2305873.
  • Guerreiro et al. [2023] Nuno M. Guerreiro, Pierre Colombo, Pablo Piantanida, and André F. T. Martins. Optimal transport for unsupervised hallucination detection in neural machine translation, 2023.
  • Galichon [2016] Alfred Galichon. Optimal Transport Methods in Economics. Princeton University Press, 2016.
  • Saumier et al. [2015] Louis-Philippe Saumier, Boualem Khouider, and Martial Agueh. Optimal transport for particle image velocimetry: Real data and postprocessing algorithms. SIAM Journal on Applied Mathematics, 75(6):2495–2514, 2015. ISSN 00361399.
  • El Moselhy and Marzouk [2012] Tarek A. El Moselhy and Youssef M. Marzouk. Bayesian inference with optimal maps. Journal of Computational Physics, 231(23):7815–7850, 2012. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2012.07.022.
  • Villani [2008] Cédric Villani. Optimal Transport: Old and New. Springer, 2008.
  • Peyré and Cuturi [2019] Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • Ben-Tal et al. [2009] Aharon Ben-Tal, Laurent Ghaoui, and Arkadi Nemirovski. Robust Optimization. 08 2009. ISBN 9781400831050. doi: 10.1515/9781400831050.
  • Sklar [1959] Abe Sklar. Fonctions de répartition à n dimensions et leurs marges. pages 229–231. Publications de l’Institut de Statistique de l’Université de Paris, 8, 1959.
  • Goodman [1953] Leo Goodman. Ecological regressions and behavior of individuals. American Sociological Review, 18:663, 1953.
  • Wakefield [2004] Jon Wakefield. Ecological inference for 2x2 tables (with discussion). Journal of the Royal Statistical Society Series A, 167:385–445, 08 2004. doi: 10.1111/j.1467-985x.2004.02046.x.
  • Frogner and Poggio [2019] Charlie Frogner and Tomaso Poggio. Fast and flexible inference of joint distributions from their marginals. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2002–2011. PMLR, 09–15 Jun 2019.
  • Mallasto et al. [2021] Anton Mallasto, Markus Heinonen, and Samuel Kaski. Bayesian inference for optimal transport with stochastic cost. In Proceedings of The 13th Asian Conference on Machine Learning, volume 157 of Proceedings of Machine Learning Research, pages 1601–1616. PMLR, 2021.
  • Kárný and Kroupa [2012] Miroslav Kárný and Tomáš Kroupa. Axiomatisation of fully probabilistic design. Information Sciences, 186(1):105–113, 2012. ISSN 0020-0255. doi: https://doi.org/10.1016/j.ins.2011.09.018.
  • Séjourné et al. [2023] Thibault Séjourné, Gabriel Peyré, and François-Xavier Vialard. Chapter 12 - unbalanced optimal transport, from theory to numerics. In Emmanuel Trélat and Enrique Zuazua, editors, Numerical Control: Part B, volume 24 of Handbook of Numerical Analysis, pages 407–471. Elsevier, 2023. doi: https://doi.org/10.1016/bs.hna.2022.11.003.
  • Cuturi [2013] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C.J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
  • Quinn et al. [2025] Anthony Quinn, Sarah Boufelja Yacobi, Martin Corless, and Robert Shorten. Fully probabilistic design for optimal transport. Communications in Optimization Theory, 2025. To appear.
  • Jeffreys [1939] Harold Jeffreys. Theory of Probability. Clarendon Press, Oxford, England, 1939.
  • Carlier et al. [2017] Guillaume Carlier, Vincent Duval, Gabriel Peyré, and Bernhard Schmitzer. Convergence of entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1385–1418, 2017. doi: 10.1137/15M1050264.
  • Quinn et al. [2016] Anthony Quinn, Miroslav Kárný, and Tatiana V. Guy. Fully probabilistic design of hierarchical Bayesian models. Information Sciences, 369:532–547, 2016. ISSN 0020-0255. doi: https://doi.org/10.1016/j.ins.2016.07.035.
  • Bissiri et al. [2016] Pier Giovanni Bissiri, Chris Holmes, and Stephen Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78(5):1103–1130, feb 2016. doi: 10.1111/rssb.12158.
  • Savage [1971] Leonard J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971. doi: 10.1080/01621459.1971.10482346.
  • Rockafellar [1967] Ralph Tyrrell Rockafellar. Duality and stability in extremum problems involving convex functions. Pacific Journal of Mathematics, 21(1):167 – 187, 1967.
  • Kracík and Kárný [2005] Jan Kracík and Miroslav Kárný. Merging of data knowledge in Bayesian estimation. In International Conference on Informatics in Control, Automation and Robotics, volume 2, pages 229–232, 2005.
  • Kolmogorov and Sarmanov [1960] Andrey Nikolaevich Kolmogorov and Oleg Vasilévich Sarmanov. The work of S. N. Bernshtein on the theory of probability. Theory of Probability & Its Applications, 5(2):197–203, 1960. doi: 10.1137/1105017.
  • Delahaye et al. [2019] Daniel Delahaye, Supatcha Chaimatanan, and Marcel Mongeau. Simulated Annealing: From Basics to Applications, pages 1–35. Springer International Publishing, Cham, 2019. ISBN 978-3-319-91086-4. doi: 10.1007/978-3-319-91086-4_1.
  • Quinn [2012] Anthony Quinn. Recursive inference for inverse problems using variational Bayes methodology. In 1st International ICST Workshop on New Computational Methods for Inverse Problems. ACM, 2012.
  • Nocedal and Wright [2006] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, 2e edition, 2006.
  • Nesterov [2018] Yurii Nesterov. Lectures on Convex Optimization. Springer Publishing Company, Incorporated, 2nd edition, 2018. ISBN 3319915770.
  • Betancourt [2017] Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv: Methodology, 2017.
  • Mangoubi and Smith [2019] Oren Mangoubi and Aaron Smith. Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions 2: Numerical integrators. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 586–595. PMLR, 16–18 Apr 2019.
  • Snyman [2005] Jan Snyman. A gradient-only line search method for the conjugate gradient method applied to constrained optimization problems with severe noise in the objective function. International Journal for Numerical Methods in Engineering, 62:72 – 82, 01 2005. doi: 10.1002/nme.1189.
  • Chen and Vempala [2022] Zongchen Chen and Santosh S. Vempala. Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. Theory of Computing, 18(9):1–18, 2022. doi: 10.4086/toc.2022.v018a009.
  • Barocas et al. [2023] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and machine learning: Limitations and opportunities. MIT press, 2023.
  • Gordaliza et al. [2019] Paula Gordaliza, Eustasio Del Barrio, Gamboa Fabrice, and Jean-Michel Loubes. Obtaining fairness using optimal transport theory. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2357–2365. PMLR, 09–15 Jun 2019.
  • Hughes and Chen [2021] Jason Hughes and Juntao Chen. Fair and distributed dynamic optimal transport for resource allocation over networks. In 2021 55th Annual Conference on Information Sciences and Systems (CISS), pages 1–6, 2021. doi: 10.1109/CISS50987.2021.9400236.
  • Galichon [2021] Alfred Galichon. The unreasonable effectiveness of optimal transport in economics. arXiv preprint arXiv:2107.04700, 2021.
  • Echenique et al. [2024] Federico Echenique, Joseph Root, and Fedor Sandomirskiy. Stable matching as transportation. In Proceedings of the 25th ACM Conference on Economics and Computation, EC ’24, page 418, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400707049. doi: 10.1145/3670865.3673585.
  • Hoffman and Gelman [2011] Matthew D. Hoffman and Andrew Gelman. The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. Journal of Machine Learning Research, 15:1593–1623, 2011.
  • Flamary et al. [2021] Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. POT: Python optimal transport. J. Mach. Learn. Res., 22(1), January 2021. ISSN 1532-4435.
  • Jelinek et al. [1977] Frederick Jelinek, Robert L. Mercer, Lalit R. Bahl, and Janet M. Baker. Perplexity—a measure of the difficulty of speech recognition tasks. Journal of the Acoustical Society of America, 62, 1977.
  • Martí et al. [2022] Rafael Martí, Anna Martínez-Gavara, Sergio Pérez-Peló, and Jesús Sánchez-Oro. A review on discrete diversity and dispersion maximization from an or perspective. European Journal of Operational Research, 299(3):795–813, 2022. ISSN 0377-2217. doi: https://doi.org/10.1016/j.ejor.2021.07.044.
  • Gangbo and McCann [1996] Wilfrid Gangbo and Robert J. McCann. The geometry of optimal transportation. Acta Mathematica, 177(2):113 – 161, 1996. doi: 10.1007/BF02392620.
  • Folland [1999] Gerald Budge Folland. Real Analysis : Modern Techniques and Their Applications. Wiley, New York, 1999.
  • Billingsley [1999] Patrick Billingsley. Convergence of probability measures. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons Inc., New York, second edition, 1999. ISBN 0-471-19745-9. A Wiley-Interscience Publication.
  • Rockafellar [1971] Ralph Tyrrell Rockafellar. Integrals which are convex functionals. II. Pacific Journal of Mathematics, 39(2):439 – 469, 1971.
  • Rockafellar [1974] Ralph Tyrrell Rockafellar. Conjugate duality and optimization. Society for Industrial and Applied Mathematics, 1974.