Finite-Sample Convergence Bounds for Trust Region Policy Optimization in Mean-Field Games

Antonio Ocello    Daniil Tiapkin    Lorenzo Mancini    Mathieu Laurière    Eric Moulines
Abstract

We introduce Mean-Field Trust Region Policy Optimization (MF-TRPO), a novel algorithm designed to compute approximate Nash equilibria for ergodic Mean-Field Games (MFG) in finite state-action spaces. Building on the well-established performance of TRPO in the reinforcement learning (RL) setting, we extend its methodology to the MFG framework, leveraging its stability and robustness in policy optimization. Under standard assumptions in the MFG literature, we provide a rigorous analysis of MF-TRPO, establishing theoretical guarantees on its convergence. Our results cover both the exact formulation of the algorithm and its sample-based counterpart, where we derive high-probability guarantees and finite sample complexity. This work advances MFG optimization by bridging RL techniques with mean-field decision-making, offering a theoretically grounded approach to solving complex multi-agent problems.

Keywords: Mean-Field Games, Trust Region Policy Optimization, Reinforcement Learning, Finite Sample Complexity, Nash Equilibrium, Sample-Based Algorithms, Multi-Agent Systems

1 Introduction

In an increasingly interconnected world, autonomous systems capable of adaptive decision-making have become indispensable. Their range of applications is vast and continuously expanding, spanning from autonomous driving (see, e.g., Shalev-Shwartz et al., 2016) to energy market control (Samvelyan et al., 2019), and from traffic light optimization (Wiering et al., 2000) to advanced robotic systems (Matignon et al., 2007; Kober et al., 2013).

A powerful framework to model these adaptive decision agents is Multi-Agent Reinforcement Learning (MARL), which enables agents to learn optimal strategies through interaction with both the environment and other agents (see, e.g., Zhang et al., 2021; Gronauer & Diepold, 2022). However, MARL faces two major challenges: scalability and non-stationarity. As the number of agents increases, the joint state-action space grows exponentially, making learning computationally prohibitive. Additionally, because all agents are simultaneously updating their policies, the environment becomes non-stationary from the perspective of each individual agent, severely hindering convergence and stability. To address these issues, several techniques have been proposed, such as Centralized Training with Decentralized Execution (CTDE) (Foerster et al., 2018) and attention-based approaches (Iqbal & Sha, 2019). While these methods improve stability and learning efficiency, they come at a significant computational cost, which limits their scalability in large-scale systems.

Under the assumptions of homogeneity and anonymity, complex multi-agent games can be effectively approximated using Mean-Field Games (MFG). MFG provide an asymptotic approximation of exchangeable particle systems as the number of agents grows. Originally introduced by Lasry & Lions (2006a, b, 2007) and Huang et al. (2003, 2005, 2006a, 2006b), MFG replace direct agent-to-agent interactions with a representative agent interacting with the statistical distribution of the population. Due to their analytical tractability and broad applicability, MFG have been widely adopted across various domains, including economic modeling (Bassière et al., 2024), finance (Lavigne & Tankov, 2023; Carmona et al., 2013), public health dynamics (Doncel et al., 2022), and energy storage (Alasseur et al., 2020).

In particular, Mean-Field Reinforcement Learning (MFRL) arises as the scaling limit of many MARL problems, positioning itself at the intersection of MFG and Reinforcement Learning (RL). In this setting, RL techniques are employed to solve equilibrium problems in large-scale multi-agent systems. At the core of this framework, each agent optimizes its objective while treating the mean-field distribution as fixed. This structure closely resembles CTDE in the MARL setting, but its computational cost is significantly reduced due to the mean-field approximation. In turn, the distribution evolves dynamically based on the collective behavior of all agents. This formulation extends the classical notion of Nash equilibrium to the Mean-Field Nash Equilibrium (MFNE), where equilibrium emerges from the interaction between individual decision-making and population-wide updates.

MFNE provide an adequate approximation for large-scale multi-agent interactions, achieving an $\widetilde{O}(1/\sqrt{N})$-approximate Nash equilibrium in the corresponding $N$-player game (Cardaliaguet et al., 2019; Fischer & Silva, 2021; Flandoli et al., 2022). This approximation drastically reduces the complexity of analyzing strategic interactions in large populations, establishing MFG as a scalable and computationally efficient framework for real-world applications.

Related works.

Recent work in MFG has explored the use of proximal methods due to their stability and empirical performance. Notably, Pérolat et al. (2022); Perrin et al. (2022) analyze Online Mirror Descent (OMD) from a model-specific perspective, while Yardim et al. (2023) extend this direction with a model-free approach. However, their analysis does not address population updates, operating in a restricted no-manipulation regime. These works have demonstrated the effectiveness of proximal methods in MFG, which is consistent with our research direction.

We extend this line of research by establishing finite-sample complexity guarantees, providing a rigorous theoretical framework that ensures provable efficiency in solving MFG. Moreover, we explicitly incorporate the role of monotonicity in stabilizing population dynamics, a key aspect of the ergodic structure in MFG, and relax overly restrictive assumptions, such as uniformly bounded-away-from-zero policies or absolute continuity with respect to the uniform distribution. With this more flexible framework, we obtain refined learning guarantees under weaker assumptions, achieving an $\widetilde{O}(1/L)$ convergence rate in the optimization problem with improved sample efficiency, thus broadening the applicability of these methods. For an additional discussion on related work, we refer to Appendix A.

Contributions.

We propose Exact MF-TRPO and Sample-Based MF-TRPO, trust-region-based algorithms for computing approximate MFNE in the MFRL setting. Our method combines the structure of ergodic MFG with the stability of trust-region optimization to enable theoretically grounded learning in multi-agent environments. Our key contributions are:

  1. Theoretical analysis of Exact MF-TRPO with a bound on the exploitability of the learned policies, quantifying the achieved $\varepsilon_K$-MFNE after $K$ iterations.

  2. A sample-based variant, Sample-Based MF-TRPO, with finite-sample complexity guarantees under the $\nu$-paradigm, requiring at most $\widetilde{O}(1/\varepsilon^{6})$ environment interactions to reach an $\varepsilon$-MFNE.

  3. Numerical experiments validating the efficiency and effectiveness of our approach in representative MFRL settings.

2 Framework

We use the discounted formulation as an approximation to the ergodic setting, a standard approach in the literature (see, e.g., Laurière et al., 2022). This methodology allows us to build on the well-established theoretical framework of discounted RL while capturing the long-term behavior of the ergodic formulation in a stepwise fashion. In particular, we adopt the Mean-Field Markov Decision Process (MF-MDP) framework, a natural extension of the infinite-horizon discounted MDP commonly studied in RL (Sutton & Barto, 2018a). This adaptation preserves the computational tractability and convergence properties of the discounted problem while approximating the stationary dynamics of the ergodic setting. The MF-MDP framework thus serves as a bridge between classical RL and MFG models and provides a structured approach to policy optimization that is valid for both finite horizons and stationary regimes. A detailed discussion of this approach and its theoretical foundations can be found in Appendix B.

Notations.

For a finite set $\mathcal{X}$, let $\mathcal{P}(\mathcal{X})$ denote the set of probability distributions over $\mathcal{X}$. A finite MF-MDP is a tuple $(\mathcal{S},\mathcal{A},\mathsf{P},\mathsf{r},\gamma)$, where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is a finite action space, $\mathsf{P}\colon\mathcal{S}\times\mathcal{A}\times\mathcal{P}(\mathcal{S})\to\mathcal{P}(\mathcal{S})$ is the transition function, $\mathsf{r}\colon\mathcal{S}\times\mathcal{A}\times\mathcal{P}(\mathcal{S})\to\mathbb{R}$ is the reward function, and $\gamma\in[0,1)$ is the discount factor. Since we consider probability distributions over a finite state space of size $|\mathcal{S}|$, they can be identified with vectors in $\mathbb{R}^{|\mathcal{S}|}$; thus, we define the inner product $\langle\cdot,\cdot\rangle$ and the Euclidean norm $\|\cdot\|_{2}$ accordingly.

We assume that $\mathsf{r}$ is continuous, thus bounded, and denote by $\|\mathsf{r}\|_{\infty}$ its upper bound, i.e., $\mathsf{r}(s,a,\mu)\in[0,\|\mathsf{r}\|_{\infty}]$ for $(s,a,\mu)\in\mathcal{S}\times\mathcal{A}\times\mathcal{P}(\mathcal{S})$. Given a policy $\pi\colon\mathcal{S}\to\mathcal{P}(\mathcal{A})$ and a population profile $\mu\in\mathcal{P}(\mathcal{S})$, we define the transition operator $\mathsf{P}_{\mu}^{\pi}$ as the transition matrix induced by the probability kernel, where actions are sampled as $a\sim\pi(s)$ under the mean-field parameter $\mu$, i.e.,

$$
\begin{aligned}
\mathsf{P}_{\mu}^{\pi}(s,s') &= \sum_{a\in\mathcal{A}}\pi(a|s)\,\mathsf{P}(s'|s,a,\mu)\;, && \text{for } s,s'\in\mathcal{S}\;,\\
\mu\mathsf{P}_{\mu}^{\pi}(s) &= \sum_{s_{0}\in\mathcal{S}}\mu(s_{0})\,\mathsf{P}_{\mu}^{\pi}(s_{0},s)\;, && \text{for } s\in\mathcal{S}\;.
\end{aligned}
\tag{1}
$$
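The two operations in (1) can be illustrated with a minimal sketch. The function names `induced_transition` and `push_forward`, the callable signature `kernel(s, a, mu)`, and the two-state `toy_kernel` are assumptions made for this illustration only; they are not part of the paper.

```python
def induced_transition(policy, kernel, mu):
    """P_mu^pi(s, s') = sum_a pi(a|s) * P(s'|s, a, mu), as in Eq. (1)."""
    n_states = len(policy)
    n_actions = len(policy[0])
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            next_dist = kernel(s, a, mu)  # list of length n_states
            for s_next in range(n_states):
                matrix[s][s_next] += policy[s][a] * next_dist[s_next]
    return matrix


def push_forward(mu, matrix):
    """(mu P)(s) = sum_{s0} mu(s0) * P(s0, s)."""
    n_states = len(mu)
    return [sum(mu[s0] * matrix[s0][s] for s0 in range(n_states))
            for s in range(n_states)]


def toy_kernel(s, a, mu):
    """Hypothetical kernel: action 0 stays, action 1 switches state.

    The move reaches its target less often when the target is crowded.
    """
    target = s if a == 0 else 1 - s
    p_success = 1.0 - 0.5 * mu[target]
    dist = [0.0, 0.0]
    dist[target] += p_success
    dist[1 - target] += 1.0 - p_success
    return dist


uniform_policy = [[0.5, 0.5], [0.5, 0.5]]
mu = [0.5, 0.5]
P = induced_transition(uniform_policy, toy_kernel, mu)
next_mu = push_forward(mu, P)
```

In this toy example the uniform population is left unchanged by one push-forward under the uniform policy, so `next_mu` coincides with `mu`.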

Denote by $\lambda_{\pi,\mu}$ the stationary distribution of the Markov chain $\mathsf{P}_{\mu}^{\pi}$. Let $\Pi$ be the set of policies, i.e., the set of functions from $\mathcal{S}$ to $\mathcal{P}(\mathcal{A})$. For $s\in\mathcal{S}$ and $\pi,\pi'\in\Pi$, the Kullback–Leibler (KL) divergence between the two distributions $\pi(\cdot|s)$ and $\pi'(\cdot|s)$ is defined as $\mathrm{KL}(\pi(\cdot|s)\|\pi'(\cdot|s))=\sum_{a\in\mathcal{A}}\pi(a|s)\log(\pi(a|s)/\pi'(a|s))$ if $\pi(\cdot|s)$ is absolutely continuous with respect to $\pi'(\cdot|s)$, and $\mathrm{KL}(\pi(\cdot|s)\|\pi'(\cdot|s))=\infty$ otherwise. We use the notation $\widetilde{O}(\cdot)$ to hide polylogarithmic factors in the asymptotic complexity.
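As a small numerical illustration of the two objects just defined, the sketch below approximates a stationary distribution by power iteration and evaluates the KL divergence between two action distributions. Power iteration is a choice made here for simplicity, and it assumes the chain is ergodic; the function names and the example matrix are hypothetical.

```python
import math


def stationary_distribution(matrix, n_iter=10_000, tol=1e-12):
    """Approximate lambda with lambda = lambda P by repeated push-forward.

    Assumes the chain is ergodic so that the iteration converges.
    """
    n = len(matrix)
    lam = [1.0 / n] * n
    for _ in range(n_iter):
        new = [sum(lam[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(new, lam)) < tol:
            return new
        lam = new
    return lam


def kl_divergence(p, q):
    """KL(p || q) with 0 log 0 = 0; infinite if p is not absolutely
    continuous with respect to q."""
    total = 0.0
    for p_a, q_a in zip(p, q):
        if p_a == 0.0:
            continue
        if q_a == 0.0:
            return math.inf
        total += p_a * math.log(p_a / q_a)
    return total


# Hypothetical two-state chain; its stationary law is (0.75, 0.25).
P = [[0.9, 0.1], [0.3, 0.7]]
lam = stationary_distribution(P)

kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])   # zero
kl_inf = kl_divergence([0.5, 0.5], [1.0, 0.0])    # infinite
```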

Problem formulation.

As in Laurière et al. (2022), the discounted stationary problem within a mean-field interaction setting is designed to approximate the $N$-player game in the ergodic regime, as $\gamma\to 1$. In this setting, a representative agent in the mean-field approximation seeks to maximize the expected discounted sum of rewards while interacting with the population distribution $\mu$ under a policy $\pi$. The objective function is given by

$$
J^{\operatorname{MFG}}(\pi,\mu,\xi) := \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\left[\mathsf{r}(s_{t},a_{t},\mu)+\eta\log\big(\pi(a_{t}|s_{t})\big)\right]\right].
\tag{2}
$$

Given an initial state $s_{0}\sim\xi$, actions are sampled at each time step as $a_{t}\sim\pi(\cdot\mid s_{t})$, with state transitions governed by the kernel $\mathsf{P}$, i.e., $s_{t+1}\sim\mathsf{P}(\cdot\mid s_{t},a_{t},\mu)$. We consider an entropy-regularized variant of the classical MFG problem, where $\eta$ denotes the entropy regularization parameter.

As MFG are an $\widetilde{O}(1/\sqrt{N})$-approximation of the $N$-player game, it is natural to introduce an additional regularization term. If this extra bias remains within the order of the existing approximation error, model fidelity is thus preserved. Regularization enhances solution stability, which is particularly beneficial given the inherent nonlinearity of MFG optimization. This is especially advantageous in RL settings, where small perturbations in the value function or policy updates can otherwise lead to erratic behavior.

The value function is defined as

$$
V^{\operatorname{MFG}}(\mu,\xi) := \max_{\pi\in\Pi}J^{\operatorname{MFG}}(\pi,\mu,\xi)\;.
\tag{3}
$$

With a slight abuse of notation, we use $V^{\operatorname{MFG}}(\mu,s)$ (resp. $V^{\operatorname{MFG}}(\mu)$ and $J^{\operatorname{MFG}}(\pi,\mu)$) to denote $V^{\operatorname{MFG}}(\mu,\delta_{s})$ for $s\in\mathcal{S}$ (resp. $V^{\operatorname{MFG}}(\mu,\mu)$ and $J^{\operatorname{MFG}}(\pi,\mu,\mu)$).

Furthermore, let $Q_{\mu}^{\pi}$ denote the regularized $Q$-function, defined as

$$
Q_{\mu}^{\pi}(s,a) = \mathsf{r}(s,a,\mu)+\gamma\sum_{s'\in\mathcal{S}}\mathsf{P}(s'|s,a,\mu)\cdot J^{\operatorname{MFG}}(\pi,\mu,s')\;.
\tag{4}
$$
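For a fixed population profile $\mu$, the objective (2) started from each state and the $Q$-function (4) can be computed in a small tabular example by iterating the corresponding Bellman recursion. The sketch below is only illustrative: it takes the per-step reward exactly as written in (2), assumes the reward and kernel have already been evaluated at $\mu$ and stored as nested lists, and uses hypothetical names throughout.

```python
import math


def evaluate_policy(policy, reward, kernel, gamma, eta, n_iter=2000):
    """Iterate the Bellman equation implied by Eq. (2) for a fixed mu.

    reward[s][a] and kernel[s][a][s'] are assumed to be already evaluated
    at the mean-field parameter mu. Returns (J, Q) with Q as in Eq. (4).
    """
    n_states, n_actions = len(reward), len(reward[0])
    # Expected one-step regularized reward; 0 * log 0 is treated as 0.
    r_pi = [sum(p * (reward[s][a] + eta * math.log(p))
                for a, p in enumerate(policy[s]) if p > 0.0)
            for s in range(n_states)]
    J = [0.0] * n_states
    for _ in range(n_iter):
        J = [r_pi[s] + gamma * sum(
                 policy[s][a] * kernel[s][a][s2] * J[s2]
                 for a in range(n_actions) for s2 in range(n_states))
             for s in range(n_states)]
    Q = [[reward[s][a] + gamma * sum(kernel[s][a][s2] * J[s2]
                                     for s2 in range(n_states))
          for a in range(n_actions)]
         for s in range(n_states)]
    return J, Q


# Hypothetical example: action a moves deterministically to state a.
reward = [[1.0, 1.0], [1.0, 1.0]]
kernel = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
policy = [[0.5, 0.5], [0.5, 0.5]]
gamma, eta = 0.5, 0.1
J, Q = evaluate_policy(policy, reward, kernel, gamma, eta)
```

With these definitions, the values satisfy $J(s)=\sum_{a}\pi(a|s)\,[Q_{\mu}^{\pi}(s,a)+\eta\log\pi(a|s)]$, which gives a quick consistency check on the output.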

We denote by $\pi_{\mu}$ the optimal policy of the optimization problem $V^{\operatorname{MFG}}(\mu)$, which is unique in entropy-regularized RL (see, e.g., Haarnoja et al., 2017; Geist et al., 2019).

Define $\mathsf{d}_{\xi,\mu}^{\pi}$ (resp. $\mathsf{d}_{s,\mu}^{\pi}$ and $\mathsf{d}_{\mu}^{\pi}$) as the occupation measure of the process $(s_{t})_{t}$ induced by the policy $\pi$, under the population distribution $\mu$ and initial distribution $\xi$ (resp. $\delta_{s}$ and $\mu$), i.e.,

$$
\mathsf{d}_{\xi,\mu}^{\pi}(s,a) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathsf{P}\left((s_{t},a_{t})=(s,a)\right)\;,
\tag{5}
$$

with $s_{0}\sim\xi$, $a_{t}\sim\pi(\cdot|s_{t})$, and $s_{t+1}\sim\mathsf{P}(\cdot|s_{t},a_{t},\mu)$, and $\overline{\mathsf{d}}_{\mu,\xi}^{\pi}$ its spatial marginal, i.e.,

$$
\overline{\mathsf{d}}_{\mu,\xi}^{\pi}(s) = \sum_{a\in\mathcal{A}}\mathsf{d}_{\xi,\mu}^{\pi}(s,a)\;.
\tag{6}
$$
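The series in (5) and its marginal (6) can be approximated by truncation, using the fact that the law of $(s_{t},a_{t})$ factorizes as the law of $s_{t}$ times $\pi(a|s_{t})$. The following is a minimal sketch under that observation; the function name, the truncation length, and the example chain are assumptions made for illustration.

```python
def occupation_measure(xi, policy, matrix, gamma, n_terms=2000):
    """Truncated series for Eqs. (5)-(6).

    matrix is the induced transition matrix P_mu^pi; xi is the initial law.
    Returns (d, d_bar) where d[s][a] = d_bar[s] * policy[s][a].
    """
    n = len(xi)
    state_law = list(xi)            # law of s_t, starting at t = 0
    d_bar = [0.0] * n
    weight = 1.0 - gamma            # (1 - gamma) * gamma^t
    for _ in range(n_terms):
        for s in range(n):
            d_bar[s] += weight * state_law[s]
        state_law = [sum(state_law[i] * matrix[i][j] for i in range(n))
                     for j in range(n)]
        weight *= gamma
    d = [[d_bar[s] * p for p in policy[s]] for s in range(n)]
    return d, d_bar


# Hypothetical chain whose stationary law is (0.75, 0.25).
P_pi = [[0.9, 0.1], [0.3, 0.7]]
policy = [[0.5, 0.5], [0.5, 0.5]]
d, d_bar = occupation_measure([0.75, 0.25], policy, P_pi, gamma=0.9)
```

When the initial law is already stationary for the induced chain, as in this example, the spatial marginal `d_bar` reproduces it up to truncation error.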

Interactions with the environment.

In the RL literature, various paradigms have been proposed to structure the interaction between an agent and its environment, influencing how data is collected and utilized for policy updates. In our setting, two fundamental actions can be performed: reset and step. The reset action initializes the environment by sampling a new state from the distribution $\nu$, effectively allowing the agent to restart from a fresh initial condition. The step action, on the other hand, takes a chosen action $a$ and the mean-field distribution profile $\mu$, and progresses the environment based on the current state $s$. Specifically, it samples the next state from the transition kernel $\mathsf{P}(\cdot|s,a,\mu)$ and updates the environment accordingly. This interaction model aligns with previous works (Kakade, 2003; Shani et al., 2020) and provides an intermediate assumption in the spectrum of data access models in RL. It is therefore weaker than assuming full access to the true model or a generative model, where arbitrary state-action pairs can be queried, but stronger than the setting where no restarts are allowed, restricting exploration to trajectories induced by the current policy. This ensures reliable state exploration and promotes stable convergence.
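The reset and step actions described above can be pictured as a two-method simulator interface. The class below is a hypothetical sketch, not the interface used in the paper's experiments: the class name `MeanFieldEnv`, the callable signature `kernel(s, a, mu)`, and the seeding scheme are all assumptions.

```python
import random


class MeanFieldEnv:
    """Hypothetical reset/step simulator for a finite MF-MDP."""

    def __init__(self, kernel, nu, seed=0):
        self.kernel = kernel        # kernel(s, a, mu) -> list of probabilities
        self.nu = nu                # restart distribution over states
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        """Sample a fresh state from nu."""
        states = range(len(self.nu))
        self.state = self.rng.choices(states, weights=self.nu)[0]
        return self.state

    def step(self, action, mu):
        """Sample the next state from P(. | s, a, mu) and move there."""
        probs = self.kernel(self.state, action, mu)
        self.state = self.rng.choices(range(len(probs)), weights=probs)[0]
        return self.state


def switch_kernel(s, a, mu):
    """Toy deterministic kernel: action a moves to state a."""
    return [1.0, 0.0] if a == 0 else [0.0, 1.0]


env = MeanFieldEnv(switch_kernel, nu=[1.0, 0.0])
s0 = env.reset()
s1 = env.step(1, [0.5, 0.5])
s2 = env.step(0, [0.5, 0.5])
```

The agent never queries an arbitrary state-action pair here: it can only restart from $\nu$ or advance from the current state, which is what distinguishes this access model from a generative model.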

Nash Equilibrium.

Our work aims to compute an MFNE. A Nash equilibrium (NE), a fundamental concept in game theory, represents a stable state in which no player can improve her payoff by unilaterally changing her strategy, i.e., each player reacts optimally to the strategies of the others.

In the MFG framework, as the number of agents approaches infinity, this concept extends to a situation in which each agent optimally adapts its strategy to the behavior of the collective population. This leads to a mean-field fixed-point condition in which individual decisions shape and are shaped by the evolving population distribution. This formulation makes MFG a powerful tool for analyzing large-scale strategic interactions.

Definition 2.1 (MFNE).

A pair $(\pi_{\star},\mu_{\star})$ is said to be an MFNE if it satisfies the following two conditions:

  1. (Optimality under the population dynamics) The policy $\pi_{\star}$ is an optimal solution to (3), given the population evolution $\mu_{\star}$, i.e.,

     $$J^{\operatorname{MFG}}(\pi_{\star},\mu_{\star},\mu_{\star})=\max_{\pi\in\Pi}J^{\operatorname{MFG}}(\pi,\mu_{\star},\mu_{\star})\;.$$

  2. (Consistency of the population evolution) The population distribution $\mu_{\star}$ is a fixed point of the mean-field evolution equation

     $$\mu_{\star}=\mu_{\star}\mathsf{P}_{\mu_{\star}}^{\pi_{\star}}\;.$$

In this definition, the first condition ensures that no individual agent can improve their long-term objective by unilaterally deviating from the equilibrium policy $\pi_{\star}$. Given the equilibrium population distribution $\mu_{\star}$, this optimality condition guarantees that the policy $\pi_{\star}$ remains the best possible strategy for each agent, thereby ensuring individual rationality.

The second condition enforces consistency in population dynamics: when all agents follow the equilibrium policy $\pi_{\star}$, the resulting population distribution remains stable over time. This fixed-point property ensures the system’s long-term evolution does not deviate from the equilibrium state.

This notion of equilibrium, introduced by Nash (1950) and extended to the mean-field setting by Lasry & Lions (2006a, b, 2007); Huang et al. (2003, 2005, 2006a), is fundamental in game theory. In these works, the authors generalize the results of Nash (1951), proving the existence of at least one NE using the Brouwer Fixed-Point Theorem. This non-constructive result holds under mild assumptions, such as the continuity of payoff functions and the compactness of strategy spaces, forming a cornerstone for analyzing strategic interactions. However, explicitly computing such equilibria remains challenging, particularly in the multi-player regime (Austrin et al., 2011).

Exploitability.

A key metric for evaluating the deviation from a NE is exploitability (Laurière et al., 2022). Formally, in the mean-field setting, it is defined as:

$$
\phi(\pi,\mu) := \max_{\pi'\in\Pi}J^{\operatorname{MFG}}\left(\pi',\lambda_{\pi,\mu},\lambda_{\pi,\mu}\right)-J^{\operatorname{MFG}}\left(\pi,\lambda_{\pi,\mu},\lambda_{\pi,\mu}\right)\;.
\tag{7}
$$

This quantity measures the potential improvement an individual agent could achieve by deviating unilaterally from the learned policy π𝜋\piitalic_π, given as mean-field parameter the stationary distribution λπ,μsubscript𝜆𝜋𝜇\lambda_{\pi,\mu}italic_λ start_POSTSUBSCRIPT italic_π , italic_μ end_POSTSUBSCRIPT.

Definition 2.2.

A pair $(\pi_{\varepsilon},\mu_{\varepsilon})$ is said to be an $\varepsilon$-MFNE if its exploitability is bounded by $\varepsilon$, i.e., $\phi(\pi_{\varepsilon},\mu_{\varepsilon})\leq\varepsilon$.
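To make the definition concrete, the following is a minimal sketch of how exploitability could be evaluated on a small finite model. It rests on assumptions that are not taken from the paper: the objective is treated as the unregularized discounted value $J^{\operatorname{MFG}}(\pi,\xi,\mu)=\xi(\mathrm{I}-\gamma\mathsf{P}_{\mu}^{\pi})^{-1}\mathsf{r}_{\mu}^{\pi}$, the inner maximization is solved by plain value iteration, and the names `P_fn`, `r_fn` and the toy model are hypothetical.

```python
import numpy as np

def induced_chain(P, r, pi):
    """State kernel P^pi (S x S) and expected reward r^pi (S,) under policy pi."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, r)
    return P_pi, r_pi

def stationary_distribution(P_pi):
    """Solve lam = lam P_pi with sum(lam) = 1 (unique under a unichain assumption)."""
    S = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(S), np.ones((1, S))])
    b = np.zeros(S + 1)
    b[-1] = 1.0
    lam, *_ = np.linalg.lstsq(A, b, rcond=None)
    return lam

def policy_value(P, r, pi, gamma):
    """V^pi = (I - gamma P^pi)^{-1} r^pi."""
    P_pi, r_pi = induced_chain(P, r, pi)
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

def optimal_value(P, r, gamma, iters=2000):
    """Unregularized value iteration for the MDP with a frozen mean field."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = np.max(r + gamma * (P @ V), axis=1)
    return V

def exploitability(P_fn, r_fn, pi, mu, gamma):
    """phi(pi, mu) as in (7): lambda_{pi,mu} is both start law and mean field."""
    P_pi, _ = induced_chain(P_fn(mu), r_fn(mu), pi)
    lam = stationary_distribution(P_pi)
    P, r = P_fn(lam), r_fn(lam)
    return lam @ optimal_value(P, r, gamma) - lam @ policy_value(P, r, pi, gamma)

# Toy model (hypothetical): 3 states, 2 actions, crowd-averse reward.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P0 = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
r0 = rng.uniform(size=(S, A))
P_fn = lambda mu: P0                          # kernel independent of mu here
r_fn = lambda mu: r0 - mu[:, None]            # reward decreases with crowding
pi_unif = np.full((S, A), 1.0 / A)
phi = exploitability(P_fn, r_fn, pi_unif, np.full(S, 1.0 / S), gamma)
print(f"exploitability of the uniform policy: {phi:.4f}")
```

Because the optimal value dominates the value of any fixed policy state by state, the returned quantity is nonnegative up to numerical error; a pair is an $\varepsilon$-MFNE in the sense of Definition 2.2 exactly when this number is at most $\varepsilon$. With the entropic regularization used in the paper, the inner maximization would be a soft (regularized) one instead.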

3 Assumptions

We outline the key assumptions that guarantee the well-posedness and stability of the MFG problem, providing a foundation for deriving finite-sample complexity bounds for the proposed algorithm.

A- 1.

Let $\mathsf{r}$ (resp. $\mathsf{P}$) be Lipschitz continuous with respect to $\mu$, with Lipschitz constant $L_{\mu}^{\mathsf{r}}$ (resp. $L_{\mu}^{\mathsf{P}}$), i.e., for $s,s'\in\mathcal{S}$, $a\in\mathcal{A}$, and $\mu,\mu'\in\mathcal{P}(\mathcal{S})$, we have

$$\begin{aligned}\left|\mathsf{r}(s,a,\mu)-\mathsf{r}(s,a,\mu')\right| &\leq L_{\mu}^{\mathsf{r}}\left\|\mu-\mu'\right\|_{2}\;,\\ \left|\mathsf{P}(s'|s,a,\mu)-\mathsf{P}(s'|s,a,\mu')\right| &\leq L_{\mu}^{\mathsf{P}}\left\|\mu-\mu'\right\|_{2}\;.\end{aligned}$$

Lipschitz continuity of the MDP parameters ensures that small perturbations of the mean-field distribution $\mu$ lead to proportionally small changes in the rewards and transition dynamics. This assumption is well established in the RL literature (Asadi et al., 2018; Le Lan et al., 2021), and it facilitates the derivation of meaningful error bounds and convergence rates in policy optimization. From this, we establish Lipschitz continuity of the optimal policies w.r.t. the mean-field parameter as follows (cf. Corollary E.4).

Proposition 3.1.

Suppose Assumption 1 holds. Then, there exists a constant $C_{\pi,\mu}\geq 0$ such that, for $\mu,\mu'\in\mathcal{P}(\mathcal{S})$,

$$\sup_{s\in\mathcal{S}}\left\|\pi_{\mu}(\cdot|s)-\pi_{\mu'}(\cdot|s)\right\|_{\mathrm{TV}}^{2}\leq C_{\pi,\mu}\left\|\mu-\mu'\right\|_{2}\;,$$

where $\pi_{\mu}$ is the optimal policy associated with the mean-field distribution $\mu$.

This result connects the structural properties of the MFG framework with the behavior of the associated optimal policies, forming a key theoretical foundation for the analysis of the proposed algorithms.

A- 2.

There exist an integer $M\geq 1$ and a real number $C_{\operatorname{op,MFG}}<1$ such that, for all $\mu,\mu'\in\mathcal{P}(\mathcal{S})$, we have

$$\left\langle\mu-\mu',\;\mu\Big(\mathsf{P}_{\mu}^{\pi_{\mu}}\Big)^{M}-\mu'\Big(\mathsf{P}_{\mu'}^{\pi_{\mu'}}\Big)^{M}\right\rangle\leq C_{\operatorname{op,MFG}}\left\|\mu-\mu'\right\|^{2}\;.$$

This monotonicity condition on the mean-field update is employed in various forms throughout the MFG literature (Angiuli et al., 2021, 2023; Yardim et al., 2023). Originally introduced in the foundational works by Huang et al. (2003, 2005, 2006a, 2006b), it ensures that iterative updates to the population distribution progressively align the system with the NE.

This condition extends the classical contractivity condition, which corresponds to $M=1$ (see, e.g., Guo et al., 2019). The condition might fail for $M=1$ but hold for some $M>1$, reflecting the combined effect of the Lipschitz continuity of the regularized best response and the ergodicity of the Markov reward process $\mathsf{P}_{\mu}^{\pi_{\mu}}$. An in-depth discussion of this condition and its implications is provided in Appendix E.4. As the regularized best response admits a unique optimizer $\pi_{\mu}$, the previous condition ensures the uniqueness of the fixed point of the associated operator, i.e.,

$$\mu_{\star}=\mu_{\star}\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\;, \tag{8}$$

with $\mu_{\star}$ the unique fixed point of this operator. This implies that, in this setting, we obtain the uniqueness of the MFNE, corresponding to the pair $(\pi_{\mu_{\star}},\mu_{\star})$.
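The uniqueness claim can be checked in a few lines. The sketch below assumes that the norm in Assumption 2 is the Euclidean norm induced by the inner product, and writes $\mu_{1},\mu_{2}$ for two hypothetical solutions of (8); iterating (8) $M$ times gives the first line.

```latex
\begin{align*}
&\mu_{i}\big(\mathsf{P}_{\mu_{i}}^{\pi_{\mu_{i}}}\big)^{M} = \mu_{i},
  \qquad i\in\{1,2\}, \\
&\|\mu_{1}-\mu_{2}\|_{2}^{2}
  = \Big\langle \mu_{1}-\mu_{2},\;
      \mu_{1}\big(\mathsf{P}_{\mu_{1}}^{\pi_{\mu_{1}}}\big)^{M}
      - \mu_{2}\big(\mathsf{P}_{\mu_{2}}^{\pi_{\mu_{2}}}\big)^{M} \Big\rangle
  \leq C_{\operatorname{op,MFG}}\,\|\mu_{1}-\mu_{2}\|_{2}^{2}, \\
&\big(1-C_{\operatorname{op,MFG}}\big)\,\|\mu_{1}-\mu_{2}\|_{2}^{2} \leq 0
  \;\Longrightarrow\; \mu_{1}=\mu_{2},
  \qquad\text{since } C_{\operatorname{op,MFG}}<1 .
\end{align*}
```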

A- 3.

The MF-MDP $(\mathcal{S},\mathcal{A},\mathsf{P},\mathsf{r},\gamma)$ is unichain: for every fixed $\mu\in\mathcal{P}(\mathcal{S})$, every stationary policy $\pi$ induces a unique stationary distribution $\lambda_{\pi,\mu}$ over the state space, satisfying, for all $s\in\mathcal{S}$, the following equation:

$$\lambda_{\pi,\mu}(s)=\sum_{(s',a')\in\mathcal{S}\times\mathcal{A}}\mathsf{P}(s\mid s',a',\mu)\,\pi(a'\mid s')\,\lambda_{\pi,\mu}(s')\;.$$

Moreover, there exist constants $\rho\in(0,1]$ and $C_{\operatorname{Erg}}>0$ such that the following uniform mixing property holds for all $\xi,\mu\in\mathcal{P}(\mathcal{S})$ and $t\geq 0$:

$$\left\|\xi\left(\mathsf{P}_{\mu}^{\pi_{\mu}}\right)^{t}-\lambda_{\pi_{\mu},\mu}\right\|_{\mathrm{TV}}\leq C_{\operatorname{Erg}}\,\rho^{t}\;. \tag{9}$$

The unichain property eliminates ambiguities in the initial population distribution caused by multiple recurrent classes, which could otherwise complicate value function evaluation and policy improvement steps. As highlighted in Shani et al. (2020), optimal policies are defined only within the recurrent states of the Markov reward process at equilibrium. Moreover, the ergodicity property naturally aligns with the regularization framework, as it promotes exploration within the Markov chain, leading to a more robust and stable optimization landscape.
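As an illustration of the stationarity equation and of the mixing bound (9), the following sketch computes the stationary distribution of a small chain and tracks the total-variation gap along the iterates $\xi(\mathsf{P}_{\mu}^{\pi_{\mu}})^{t}$. The kernel is a hypothetical toy example, built as an even mixture of a random kernel and the uniform kernel so that (9) provably holds with $C_{\operatorname{Erg}}=1$ and $\rho=1/2$; total variation is taken here as half the $\ell_{1}$ distance, which may differ from the paper's normalization by a factor of two.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Solve lam = lam P_pi together with sum(lam) = 1 by least squares."""
    S = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(S), np.ones((1, S))])
    b = np.zeros(S + 1)
    b[-1] = 1.0
    lam, *_ = np.linalg.lstsq(A, b, rcond=None)
    return lam

def tv(p, q):
    """Total-variation distance, taken as half the l1 distance."""
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(1)
S, T = 4, 40
R = rng.dirichlet(np.ones(S), size=S)   # arbitrary row-stochastic kernel
P_pi = 0.5 * R + 0.5 / S                # mixing with the uniform kernel
lam = stationary_distribution(P_pi)

xi = np.eye(S)[0]                       # start from a point mass
gaps = []
for t in range(T):
    gaps.append(tv(xi, lam))            # gap after t transitions
    xi = xi @ P_pi

print("stationary distribution:", np.round(lam, 4))
print("TV gap after 10 steps:", gaps[10])
```

The uniform component cancels on any difference of two probability vectors, so each transition shrinks the gap by at least a factor of two; a general unichain kernel would only satisfy (9) with model-dependent constants.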

A- 4.

We have that

$$\sup_{\mu\in\mathcal{P}(\mathcal{S})}\sup_{s\in\mathcal{S}}\left|\frac{\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}(s)}{\nu(s)}\right|<\infty\;.$$

The structured interaction with the environment in the considered RL paradigm requires this concentration property of the occupation measure with respect to the reset distribution $\nu$. This property enhances sample efficiency and aids in the accurate estimation of key quantities, such as the value function.
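For intuition, the ratio in Assumption 4 can be evaluated numerically for one fixed mean field. The sketch below assumes that the occupation measure is the normalized discounted one, $\overline{\mathsf{d}}_{\xi,\mu}^{\pi}=(1-\gamma)\,\xi(\mathrm{I}-\gamma\mathsf{P}_{\mu}^{\pi})^{-1}$; this definition, the toy kernel, and the reset distribution `nu` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def discounted_occupancy(xi, P_pi, gamma):
    """Row vector (1 - gamma) * xi (I - gamma P_pi)^{-1}."""
    S = P_pi.shape[0]
    # Solve x (I - gamma P_pi) = xi via the transposed linear system.
    x = np.linalg.solve((np.eye(S) - gamma * P_pi).T, xi)
    return (1.0 - gamma) * x

rng = np.random.default_rng(2)
S, gamma = 5, 0.9
P_pi = rng.dirichlet(np.ones(S), size=S)   # hypothetical induced kernel
mu = np.full(S, 1.0 / S)                   # start law, playing the role of mu
nu = rng.dirichlet(5.0 * np.ones(S))       # reset law, full support almost surely

d = discounted_occupancy(mu, P_pi, gamma)
ratio = np.max(d / nu)
print("max_s d(s) / nu(s) =", ratio)
```

The assumption asks for this ratio to stay bounded uniformly over all $\mu$; on a finite state space this is automatic as soon as $\nu$ charges every state, and the ratio is always at least one because both vectors are probability distributions.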

4 Exact algorithm: Exact MF-TRPO

We now introduce Exact MF-TRPO to solve the optimization problem (3). Note that the entropic regularization used in Exact MF-TRPO is particularly well-suited for TRPO (see, e.g., Shani et al., 2020), as the proximal policy update fully exploits the entropy term, enabling closed-form softmax updates in terms of the $Q$-functions. This property enhances convergence by ensuring smoother and more stable policy iterations.

Although (3) is not a convex optimization problem, the adaptive nature of the regularization term allows us to use mirror descent techniques from convex analysis to establish strong convergence guarantees. In particular, we derive finite-sample complexity bounds showing that Exact MF-TRPO converges at a rate of $\widetilde{O}(1/L)$.

TRPO.

For a fixed mean-field population distribution $\mu$, Exact TRPO$(\mu)$ provides a reliable approximation of the value function. The convergence of the algorithm is explicitly influenced by the regularization parameter $\eta$, which determines the optimal step size $1/(\eta(\ell+2))$.

Algorithm 1 Exact TRPO$(\mu)$
1:  Initialize: $\pi_{0}$ is the uniform policy.
2:  Input: $L$.
3:  for $\ell\in[L]$ do
4:     $J^{\operatorname{MFG}}(\pi_{\ell},\mu,\mu)\leftarrow\mu(\mathrm{I}-\gamma\mathsf{P}_{\mu}^{\pi_{\ell}})^{-1}\mathsf{r}_{\mu}^{\pi_{\ell}}$
5:     $\mathcal{S}_{\mathsf{d}_{\nu,\mu}^{\pi_{\ell}}}:=\{s\in\mathcal{S}:\mathsf{d}_{\nu,\mu}^{\pi_{\ell}}(s)>0\}$
6:     for $s\in\mathcal{S}_{\mathsf{d}_{\nu,\mu}^{\pi_{\ell}}}$ do
7:        for $a\in\mathcal{A}$ do
8:           $Q_{\pi_{\ell},\mu}(s,a)\leftarrow\mathsf{r}(s,a,\mu)+\gamma\sum_{s'}\mathsf{P}(s'|s,a,\mu)\,J^{\operatorname{MFG}}(\pi_{\ell},\mu,s')$
9:        end for
10:        $\pi_{\ell+1}(a|s)\leftarrow\texttt{PolicyUpdate}(\pi_{\ell},Q_{\pi_{\ell},\mu};\ell)(a,s)$
11:     end for
12:  end for
13:  Output: $\pi_{L}$.

The proposed algorithm relies on the subroutine PolicyUpdate, a key step in refining the policy at each iteration. This update mechanism is tied to the regularization scheme employed in the mirror ascent formulation of TRPO. Specifically, with entropic regularization, as shown in Beck (2017), the policy update admits a closed-form solution in the form of a softmax function. This structure ensures that the updated policy remains within the probability simplex without requiring additional projections. This fosters smooth and stable learning while preventing overly aggressive updates that could destabilize the optimization process.

This policy update leverages a function $Q\colon\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ and a policy $\pi\in\Pi$ at iteration $\ell$, leading to an explicit update rule given by

$$\texttt{PolicyUpdate}(\pi,Q;\ell)(a,s)=\frac{\pi(a|s)\exp\left(\alpha_{\ell}\left(Q(s,a)-\eta\log\pi(a|s)\right)\right)}{\sum_{a'\in\mathcal{A}}\pi(a'|s)\exp\left(\alpha_{\ell}\left(Q(s,a')-\eta\log\pi(a'|s)\right)\right)}\;, \tag{10}$$

with learning rate $\alpha_{\ell}=1/(\eta(\ell+2))$.

Shani et al. (2020) establish error bounds for the approximation of value functions that can be directly generalized to our setting. Building on these results, we quantify the gap between the value function of the computed policy and the optimal value function under the given mean-field distribution. This allows us to control the policy itself and to derive policy guarantees.
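The closed-form update (10) is straightforward to implement in log-space. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the function name `policy_update`, the indexing $\ell=0,\dots,L-1$, and the frozen random array `Q` are hypothetical choices (in Exact TRPO the $Q$-function is recomputed at every iteration).

```python
import numpy as np

def policy_update(pi, Q, ell, eta):
    """One application of (10) with alpha_ell = 1 / (eta * (ell + 2)).

    pi, Q: arrays of shape (S, A); pi must have strictly positive entries.
    """
    alpha = 1.0 / (eta * (ell + 2))
    # Log of the numerator in (10): log pi + alpha * (Q - eta * log pi).
    logits = (1.0 - alpha * eta) * np.log(pi) + alpha * Q
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the exponentials
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

def softmax(x):
    """Row-wise softmax."""
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
S, A, eta, L = 3, 4, 0.5, 50
Q = rng.normal(size=(S, A))          # frozen Q, for illustration only
pi = np.full((S, A), 1.0 / A)        # uniform initial policy
for ell in range(L):                 # assumed indexing: ell = 0, ..., L-1
    pi = policy_update(pi, Q, ell, eta)

# With a frozen Q, the iterates have the closed form softmax(L/(L+1) * Q/eta).
print(np.max(np.abs(pi - softmax(L / (L + 1) * Q / eta))))
```

With this step size, the exponent $1-\alpha_{\ell}\eta=(\ell+1)/(\ell+2)$ damps the previous log-policy, so for a frozen $Q$ the iterates approach the soft-max policy proportional to $\exp(Q/\eta)$ at rate $1/L$, and every iterate stays in the interior of the simplex without any projection.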

Proposition 4.1.

Suppose that Assumptions 1 and 3 hold. Let $\{\pi_{\ell}\}_{\ell=0}^{L}$ be the sequence generated by the Exact TRPO$(\mu)$ algorithm. Then, there exists a constant $C>0$ such that, for all $L\geq 1$, we have

$$\mathbb{E}_{s\sim\mu}\left[\left\|\pi_{L}(\cdot|s)-\pi_{\mu}(\cdot|s)\right\|_{\mathrm{TV}}^{2}\right]\leq C\,\frac{\log L}{L}\;.$$

MF-TRPO.

We now present Exact MF-TRPO, which iteratively solves the MFG problem (3) by updating the population distribution $\mu$ using the output of Exact TRPO$(\mu)$. This approach assumes direct access to the transition kernel and reward function, eliminating the need for stochastic approximation in policy updates.

Algorithm 2 Exact MF-TRPO
1:  Input: Initial distribution $\mu_{0}$, $K$.
2:  Initialize: Initial policy $\pi_{0}$ is the uniform policy.
3:  for $k\in[K]$ do
4:     $\pi_{k}\leftarrow$ Exact TRPO$(\mu_{k-1})$
5:     $\mu_{k}\leftarrow\mu_{k-1}+\beta_{k}\left(\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{k}}\right)^{M}-\mu_{k-1}\right)$ # population update
6:  end for
7:  Output: $\mu_{K}$.
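The outer loop of Algorithm 2 can be sketched as follows. This is a hedged illustration: `best_response` is a hypothetical callable standing in for Exact TRPO$(\mu)$, `P_fn` is a hypothetical map from a mean field to a kernel of shape $(|\mathcal{S}|,|\mathcal{A}|,|\mathcal{S}|)$, and the toy instance ignores the dependence on $\mu$ so that the limit can be verified directly.

```python
import numpy as np

def population_update(mu, P_pi, beta, M=1):
    """mu + beta * (mu P_pi^M - mu): a convex combination when 0 <= beta <= 1."""
    target = mu @ np.linalg.matrix_power(P_pi, M)
    return mu + beta * (target - mu)

def exact_mf_trpo(mu0, P_fn, best_response, betas, M=1):
    """Outer loop of Algorithm 2; best_response(mu) stands in for Exact TRPO(mu)."""
    mu = np.asarray(mu0, dtype=float)
    for beta in betas:
        pi = best_response(mu)                         # policy, shape (S, A)
        P_pi = np.einsum("sa,sat->st", pi, P_fn(mu))   # kernel induced by pi
        mu = population_update(mu, P_pi, beta, M)
    return mu

# Toy instance (hypothetical): kernel and best response do not depend on mu.
rng = np.random.default_rng(0)
S, A = 4, 2
P0 = 0.5 * rng.dirichlet(np.ones(S), size=(S, A)) + 0.5 / S
P_fn = lambda mu: P0
best_response = lambda mu: np.full((S, A), 1.0 / A)
mu_K = exact_mf_trpo(np.eye(S)[0], P_fn, best_response, betas=[0.5] * 200, M=2)
print("mu_K =", np.round(mu_K, 4))
```

For step sizes in $[0,1]$ the update is a convex combination of two probability vectors, so the iterates remain on the simplex without projection; in this toy instance they converge to the stationary distribution of the induced kernel, which is the fixed point (8) when nothing depends on $\mu$.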

The analysis of Exact MF-TRPO is instrumental in understanding the performance of its sample-based counterpart: it provides precise theoretical guarantees on convergence rates without the additional complexity of sampling-induced errors. Building on this deterministic setup, we then focus on the convergence behavior of the broader Sample-Based MF-TRPO framework. We use the label “informal” to avoid overloading the main text with technical assumptions (e.g., on learning rates), which are rigorously stated in Appendix C.

Proposition 4.2 (informal).

Suppose that Assumptions 1, 2, and 3 hold. Then, there exists a constant $C>0$ such that the sequence $\{\mu_{k}\}_{k\geq 0}$ generated by Exact MF-TRPO satisfies

$$\left\|\mu_{k}-\mu_{\star}\right\|_{2}^{2}\leq 2\exp\left(-\frac{\tau}{2}\sum_{j=1}^{k}\beta_{j}\right)+C\,\frac{\log(L)}{L}\;,$$

for $k\geq 1$, with $\tau:=1-C_{\operatorname{op,MFG}}$.

All constants appearing in our results are explicitly defined and detailed in Appendix C, where we also provide complete proofs supporting our theoretical guarantees.

This convergence result demonstrates the effectiveness of Exact MF-TRPO in tackling the challenges inherent in the non-linear and non-gradient structure of MFNE. Below, we summarize the key insights derived from this result:

  • Convergence. The first term of the bound gives an exponential rate of convergence to the equilibrium population distribution $\mu_{\star}$, while the second term captures the finite-sample bias in the best-response computation.

  • Learning-Rate Constraints. The theorem provides explicit constraints on the step size $\beta_{k}$ (cf. condition (17)), ensuring stability and preventing oscillations or divergence in the optimization process.

  • Explicit Dependence on Model Parameters. All constants in the convergence bound are fully characterized in terms of the structural parameters of the model (cf. Appendix C).

  • Controlled Policy Learning Bias. The bias introduced by policy updates in Exact TRPO, bounded by $\log(L)/L$ (Proposition 4.2), remains controlled throughout the iterative process. This ensures algorithmic stability, even with approximations in policy optimization.
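To see what the exponential term in Proposition 4.2 means in practice, consider a constant step size $\beta_{j}\equiv\beta$ and a target tolerance $\varepsilon_{0}\in(0,2)$. Both are illustrative assumptions: whether a constant step size is admissible depends on condition (17), which is not reproduced here.

```latex
\begin{align*}
\sum_{j=1}^{k}\beta_{j} = k\beta
\quad\Longrightarrow\quad
2\exp\Big(-\frac{\tau}{2}\,k\beta\Big) \leq \varepsilon_{0}
\;\Longleftrightarrow\;
k \geq \frac{2}{\tau\beta}\,\log\frac{2}{\varepsilon_{0}} .
\end{align*}
% Example: tau = 0.5, beta = 0.1, eps_0 = 10^{-2} gives
% k >= 40 * log(200), i.e. about 211.9, so k = 212 outer iterations suffice
% for the first term; the second term is driven by L alone.
```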

With these results, we can now explicitly quantify the parameter $\varepsilon$ corresponding to the proximity of our obtained solution with respect to the MFNE.

Corollary 4.3.

Suppose that Assumptions 1, 2, and 3 hold. Let $\mu_{K}$ (resp. $\pi_{K+1}$) be the output of Exact MF-TRPO (resp. Exact TRPO$(\mu_{K})$). Then, there exists a constant $C>0$ such that $(\pi_{K+1},\mu_{K})$ is an $\varepsilon_{K}$-MFNE, with

$$\varepsilon_{K}=C\exp\left(-\frac{\tau}{4}\sum_{j=1}^{K}\beta_{j}\right)+C\sqrt{\frac{\log(L)}{L}}\;.$$
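The exponents in $\varepsilon_{K}$ are consistent with Proposition 4.2 once square roots are taken. The computation below is only a consistency check, not the proof from Appendix C: it assumes that the exploitability is controlled, up to constants, by $\|\mu_{K}-\mu_{\star}\|_{2}$ and by the total-variation error of the policy, and it uses $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ for $a,b\geq 0$.

```latex
\begin{align*}
\|\mu_{K}-\mu_{\star}\|_{2}
 &\leq \sqrt{\,2\exp\Big(-\frac{\tau}{2}\sum_{j=1}^{K}\beta_{j}\Big)
        + C\,\frac{\log(L)}{L}\,} \\
 &\leq \sqrt{2}\,\exp\Big(-\frac{\tau}{4}\sum_{j=1}^{K}\beta_{j}\Big)
        + \sqrt{C}\,\sqrt{\frac{\log(L)}{L}} .
\end{align*}
```

Halving the exponent from $\tau/2$ to $\tau/4$ and replacing $\log(L)/L$ by its square root is exactly the change observed between the two statements.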

5 Stochastic approximation: Sample-Based MF-TRPO

We introduce Sample-Based MF-TRPO, a model-free variant of the previous algorithm designed to operate without explicit knowledge of the environment’s dynamics or the reward function. Sample-Based MF-TRPO uses sampled trajectories to estimate the required updates, making it more applicable to real-world scenarios.

TRPO.

First, we adapt Exact TRPO to estimate policy updates in a data-driven manner. This approach leverages sampled trajectories to approximate the policy gradient, providing quantitative bounds on the proximity of the best response. Additionally, this TRPO formulation is particularly well-suited to the considered oracle-based framework, incorporating the $\nu$-restart modeling. This structure ensures robust exploration while maintaining stability in policy updates, aligning naturally with the trust-region optimization paradigm.

Algorithm 3 Sample-Based TRPO$(\mu)$
1:  Initialize: $\pi_{0}(\cdot|s)=\mathcal{U}(\mathcal{A})$ for $s\in\mathcal{S}$.
2:  Input: $I_{\ell}$, $T_{\ell}$, $\epsilon$, $\delta>0$, $L$.
3:  for $\ell\in[L]$ do
4:     $S_{\ell}^{I_{\ell}}=\{\}$
5:     $Q_{\hat{\pi}_{\ell},\mu}(s,a)=0$, $n_{\ell}(s,a)=0$, for any $(s,a)$
6:     for $i=1,\dots,I_{\ell}$ do
7:        Sample $s_{i}\sim\overline{\mathsf{d}}_{\nu,\mu}^{\hat{\pi}_{\ell}}(\cdot)$, $a_{i}\sim\mathcal{U}(\mathcal{A})$
8:        $Q_{\hat{\pi}_{\ell},\mu}(s_{i},a_{i},i)\leftarrow\mathsf{r}(s_{i},a_{i},\mu)+\sum_{t=1}^{T_{\ell}}\gamma^{t}\,\mathbb{E}_{s_{t}\sim\delta_{s_{i}}(\mathsf{P}_{\mu}^{\hat{\pi}_{\ell}})^{t},\,a_{t}\sim\hat{\pi}_{\ell}(\cdot|s_{t})}\left[\mathsf{r}(s_{t},a_{t},\mu)+\eta\log\big(\hat{\pi}_{\ell}(a_{t}|s_{t})\big)\right]$
9:        $Q_{\hat{\pi}_{\ell},\mu}(s_{i},a_{i})\leftarrow Q_{\hat{\pi}_{\ell},\mu}(s_{i},a_{i})+Q_{\hat{\pi}_{\ell},\mu}(s_{i},a_{i},i)$
10:        $n_{\ell}(s_{i},a_{i})\leftarrow n_{\ell}(s_{i},a_{i})+1$
11:        $S_{\ell}^{I_{\ell}}=S_{\ell}^{I_{\ell}}\cup\{s_{i}\}$
12:     end for
13:     for $s\in S_{\ell}^{I_{\ell}}$ do
14:        for $a\in\mathcal{A}$ do
15:           $Q_{\hat{\pi}_{\ell},\mu}(s,a)\leftarrow\frac{|\mathcal{A}|\,Q_{\hat{\pi}_{\ell},\mu}(s,a)}{\sum_{a^{\prime}\in\mathcal{A}}n_{\ell}(s,a^{\prime})}$
16:        end for
17:        $\hat{\pi}_{\ell+1}(a|s)\leftarrow\texttt{PolicyUpdate}(\hat{\pi}_{\ell},Q_{\hat{\pi}_{\ell},\mu};\ell)(a,s)$
18:     end for
19:  end for
20:  Output: $\hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L}$.
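The aggregation in lines 5 and 9 to 15 of Algorithm 3 can be illustrated with a short Python sketch. It is not the authors' implementation: the function name `normalize_q_estimates` and the `(state, action, return)` tuple layout are assumptions made here, and the per-sample return estimates of line 8 are taken as given inputs. Because actions are drawn uniformly, dividing the accumulated sum by the total visit count of a state and multiplying by $|\mathcal{A}|$ acts as an importance weight.

```python
import numpy as np


def normalize_q_estimates(samples, n_states, n_actions):
    """Aggregate per-sample return estimates (lines 5 and 9-15 of Algorithm 3).

    samples: iterable of (s_i, a_i, q_i), where q_i plays the role of the
    truncated return estimate computed in line 8 (assumed given here).
    Returns the normalized Q table and a boolean mask of visited states.
    """
    q = np.zeros((n_states, n_actions))   # line 5: Q(s, a) = 0
    n = np.zeros((n_states, n_actions))   # line 5: n(s, a) = 0
    for s, a, q_i in samples:
        q[s, a] += q_i                    # line 9: accumulate the estimate
        n[s, a] += 1                      # line 10: count the visit
    visits = n.sum(axis=1)                # sum over a' of n(s, a')
    visited = visits > 0                  # states collected in line 11
    # line 15: rescale by |A| / (total visits of s), only for visited states
    q[visited] = n_actions * q[visited] / visits[visited, None]
    return q, visited
```

When every action of a state is sampled exactly once, the rescaling returns the raw estimates unchanged; unvisited states keep a zero row and are skipped by the policy update loop.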

The inherent stochasticity in the updates prevents us from establishing sample complexity bounds on the last iterate of the algorithm. However, by leveraging an averaging scheme, implemented through a dedicated subroutine (cf. Remark D.1), we mitigate this variability and provide clear quantitative bounds on the desired gap. Specifically, the policy we focus on is the uniform mixture $\hat{\pi}^{\texttt{Unif},\mu}_{L}$ over the first $L+1$ policies. This averaging scheme is standard in the RL literature and, in the unregularized case, satisfies the identity

$$\frac{1}{L+1}\sum_{\ell=0}^{L}J^{\operatorname{MFG}}(\hat{\pi}_{\ell},\mu,\mu)=J^{\operatorname{MFG}}(\hat{\pi}^{\texttt{Unif},\mu}_{L},\mu,\mu)\;. \tag{11}$$

While we do not have a statistically feasible expression for this mixture policy, it is straightforward to sample from $\hat{\pi}^{\texttt{Unif},\mu}_{L}$, using Uniform-Mixture$(\{\hat{\pi}_{\ell}\}_{\ell=0,\dots,L})$, as discussed in Remark D.1.
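A minimal Python sketch of one way to sample from such a mixture is given below. It assumes the usual trajectory-level construction (draw an index uniformly once per trajectory, then follow the corresponding policy throughout), which is the construction under which averaging identities of the form (11) hold; the exact subroutine of Remark D.1 is not reproduced here, and the class name and tabular policy format are hypothetical.

```python
import numpy as np


class UniformMixturePolicy:
    """Trajectory-level uniform mixture over a list of tabular policies.

    Each policy is an array of shape (n_states, n_actions) whose rows sum
    to one. An index is drawn uniformly at the start of each trajectory and
    kept fixed until the next call to reset().
    """

    def __init__(self, policies, seed=0):
        self.policies = [np.asarray(p, dtype=float) for p in policies]
        self.rng = np.random.default_rng(seed)
        self.index = None

    def reset(self):
        # Draw the mixture component once per trajectory.
        self.index = int(self.rng.integers(len(self.policies)))
        return self.index

    def act(self, state):
        # Sample an action from the component selected at reset().
        if self.index is None:
            self.reset()
        probs = self.policies[self.index][state]
        return int(self.rng.choice(len(probs), p=probs))
```

Re-drawing the index at every step would instead define a state-wise averaged policy, which in general induces a different state distribution.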

Proposition 5.1.

Suppose Assumptions 1, 2, 3, and 4 hold. Fix $\epsilon,\delta>0$. Let $\hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L}$ be the output of Sample-Based TRPO$(\mu)$, over $L$ iterations.

Then, there exists $C>0$ such that the following holds with probability greater than $1-\delta$

$$\mathbb{E}_{s\sim\mu}\left[\left\|\hat{\pi}^{\texttt{Unif},\mu}_{L}(\cdot|s)-\pi_{\mu}(\cdot|s)\right\|_{\mathrm{TV}}^{2}\right]\leq C\left(\frac{\log L}{L}+\epsilon\right)\;.$$

All the constants appearing in our results are explicitly defined and detailed in Appendix D, where we also provide the complete proofs supporting our theoretical guarantees.
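For tabular policies, the left-hand side of the bound in Proposition 5.1 can be evaluated directly. The sketch below is only illustrative: the function name is hypothetical, and it assumes the convention that the total variation norm equals half the $\ell_1$ distance, which may differ from the paper's convention by a constant factor.

```python
import numpy as np


def expected_squared_tv(pi_hat, pi_ref, mu):
    """Compute E_{s ~ mu}[ ||pi_hat(.|s) - pi_ref(.|s)||_TV^2 ].

    pi_hat, pi_ref: arrays of shape (n_states, n_actions), rows sum to one.
    mu: array of shape (n_states,), a probability vector over states.
    Assumes TV = 0.5 * L1 distance (a convention chosen for this sketch).
    """
    pi_hat = np.asarray(pi_hat, dtype=float)
    pi_ref = np.asarray(pi_ref, dtype=float)
    mu = np.asarray(mu, dtype=float)
    tv = 0.5 * np.abs(pi_hat - pi_ref).sum(axis=1)  # per-state TV distance
    return float(np.dot(mu, tv ** 2))               # weight by mu and sum
```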

MF-TRPO.

We present a version of Sample-Based MF-TRPO, with the full algorithm and detailed implementation provided in Appendix D.3.

One key aspect of the algorithm is the initialization step. Unlike in a generative model paradigm, access to a state $s\sim\mu$ is only available through a subroutine initialized at the restart distribution $\nu$. For further details on this procedure, we refer to Section D.2.

The theoretical foundation of the convergence guarantees of this algorithm relies on deriving high-probability estimates (cf. Proposition D.3). This is a crucial step in sample complexity bounds and is achieved using a martingale-based argument (Harvey et al., 2019). By leveraging concentration inequalities for martingales, the analysis ensures that the error in estimating key quantities, such as the policy value and state distributions, remains controlled with high probability throughout the learning process.

Algorithm 4 Sample-Based MF-TRPO (informal)
1:  Input: Initial distribution $\mu_{0}$, $K$.
2:  Initialize: Initial policy $\pi_{0}$ is the uniform policy.
3:  for $k\in[K]$ do
4:     $\hat{\pi}_{k}\leftarrow$ Sample-Based TRPO$(\hat{\mu}_{k-1})$.
5:     for $p\in[P]$ do
6:        Initialize $s_{0,p,k}\sim\hat{\mu}_{k-1}$.
7:        for $m\in[M]$ do
8:           $s_{m,p,k}\sim\mathsf{P}_{\hat{\pi}_{k}}^{\hat{\mu}_{k-1}}(\cdot|s_{m-1,p,k})$
9:        end for
10:        $\hat{\zeta}_{k,p}\leftarrow\mathsf{1}_{\{s_{M,p,k}\}}(\cdot)$.
11:     end for
12:     $\hat{\zeta}_{k}\leftarrow\frac{1}{P}\sum_{p=1}^{P}\hat{\zeta}_{k,p}$.
13:     $\hat{\mu}_{k}\leftarrow\hat{\mu}_{k-1}+\beta_{k}\left(\hat{\zeta}_{k}-\hat{\mu}_{k-1}\right)$
14:  end for
15:  Output: $\mu_{K}$.
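One outer iteration of Algorithm 4 (lines 5 to 13) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of Appendix D.3: `sample_next_state` is a hypothetical callable standing in for one draw from the transition kernel induced by $\hat{\pi}_{k}$ and $\hat{\mu}_{k-1}$, and the initial state is drawn directly from $\hat{\mu}_{k-1}$, whereas the paper obtains such samples through a subroutine started at the restart distribution.

```python
import numpy as np


def mean_field_update(mu_prev, sample_next_state, beta, n_particles, n_steps, rng):
    """One population update: mu_k = mu_{k-1} + beta * (zeta_k - mu_{k-1}).

    mu_prev: probability vector over states (the previous estimate).
    sample_next_state(s, rng): hypothetical sampler returning the next state
        under the current policy and mean field.
    n_particles, n_steps: the quantities P and M of Algorithm 4.
    """
    mu_prev = np.asarray(mu_prev, dtype=float)
    n_states = len(mu_prev)
    zeta = np.zeros(n_states)
    for _ in range(n_particles):
        s = int(rng.choice(n_states, p=mu_prev))  # line 6: initial state
        for _ in range(n_steps):                  # lines 7-9: M transitions
            s = sample_next_state(s, rng)
        zeta[s] += 1.0                            # line 10: indicator of s_M
    zeta /= n_particles                           # line 12: average over P
    return mu_prev + beta * (zeta - mu_prev)      # line 13: damped update
```

For a step size in $[0,1]$ the update is a convex combination of two probability vectors, so the new estimate remains a probability vector.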

Proposition 5.2 (informal).

Suppose Assumptions 1, 2, 3, and 4 hold. Fix $\epsilon,\delta>0$. Then, there exists a constant $C>0$ such that the sequence $\{\mu_{k}\}_{k\geq 0}$ generated by Sample-Based MF-TRPO satisfies, for $k\geq 1$,

$$\left\|\mu_{k}-\mu_{\star}\right\|_{2}^{2}\leq 2\exp\left(-\frac{\tau}{2}\sum_{j=1}^{k}\beta_{j}\right)+C\frac{\log(L)}{L}+C\epsilon\;.$$

All the constants appearing in our results are explicitly defined and detailed in Appendix D, where we also provide the complete proofs supporting our theoretical guarantees.
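To see how the three terms of the bound in Proposition 5.2 trade off, the right-hand side can be evaluated numerically. The sketch below uses placeholder values: the constants $C$ and $\tau$ are defined in the paper's appendix and are not known here, and the constant step size used in the second helper is an assumption made purely for illustration, not the paper's schedule.

```python
import math


def mf_trpo_bound(betas, tau, L, eps, C):
    """Right-hand side of Proposition 5.2 for step sizes beta_1, ..., beta_k."""
    contraction = 2.0 * math.exp(-0.5 * tau * sum(betas))  # outer-loop term
    inner = C * math.log(L) / L                            # inner TRPO term
    return contraction + inner + C * eps                   # plus sampling error


def outer_iterations_for(target, tau, beta):
    """Smallest k with 2 * exp(-tau * beta * k / 2) <= target.

    Assumes a constant step size beta (an illustrative choice only).
    """
    return math.ceil(2.0 * math.log(2.0 / target) / (tau * beta))
```

For example, `outer_iterations_for(0.01, tau=0.5, beta=0.1)` gives the number of outer iterations after which the exponential term drops below 0.01 under these hypothetical values; the remaining two terms are then controlled by $L$ and $\epsilon$ alone.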

The convergence analysis of Sample-Based MF-TRPO in the model-free setting shows that the desiderata described in Section 4, which drive the convergence of its exact counterpart Exact MF-TRPO (proximal updates, entropic regularization, and trust-region optimization), remain intact. Although the algorithm is model-agnostic and requires no explicit knowledge of the environment’s transition dynamics or reward structure, its ability to estimate the MFNE is preserved. It builds on trajectory sampling to iteratively refine policy updates while maintaining stability and efficiency.

This result entails that the stochastic error in the estimation of the mean-field distribution at each iteration does not compound throughout the iterative process. By leveraging concentration inequalities and high-probability guarantees, the cumulative impact of these estimation errors remains controlled, preventing divergence or instability. As a result, Sample-Based MF-TRPO retains strong convergence guarantees, making it a practical approach for solving MFG in a data-driven manner.

Moreover, we can also bound the exploitability as follows.

Corollary 5.3.

Suppose that Assumptions 1, 2, 3, and 4 hold. Fix $\epsilon,\delta>0$. Let $\hat{\mu}_{K}$ (resp. $\hat{\pi}^{\texttt{Unif},\hat{\mu}_{K}}_{L}$) be the output of Sample-Based MF-TRPO (resp. the iterates of Sample-Based TRPO$(\hat{\mu}_{k})$). Then, there exists a constant $C>0$ such that

$$\phi(\hat{\pi}^{\texttt{Unif},\hat{\mu}_{K}}_{L},\hat{\mu}_{k})\leq C\left(\exp\left(-\frac{\tau}{4}\sum_{j=1}^{K}\beta_{j}\right)+\sqrt{\frac{\log(L)}{L}}+\epsilon\right)\;.$$

The complexity analysis demonstrates that the proposed Sample-Based MF-TRPO algorithm achieves a computational cost scaling as $\widetilde{O}(1/\varepsilon^{6})$ to reach an $\varepsilon$-MFNE (cf. Remark D.7). This scaling emerges naturally from two principal contributions: the inner loop performing the policy optimization via Sample-Based TRPO, and the outer population distribution update. These results align well with established convergence bounds in Shani et al. (2020), emphasizing a balanced trade-off between computational efficiency and the precision required to approximate the MFNE.

6 Numerical Experiments

We present here the results of the numerical experiments obtained with the Sample-Based MF-TRPO algorithm. The environment considered is a Grid-Based Crowd Modeling game where, from a given initial distribution, agents are tasked with moving through a grid, avoiding both static obstacles and potential overcrowding. A representative player’s state corresponds to her position within the grid, and at every time step, she can choose to move in any direction or stay in place. The reward structure imposes a small penalty for movement, offers a slight incentive for staying, and discourages agents from entering overcrowded areas. In addition, agents are encouraged to move toward a designated target, that is,

$$\tilde{\mathsf{r}}(s,a,\mu)=\mathsf{r}(s,a,\mu)+\max\left\{0.3-0.1\cdot d(s,s_{\text{target}});\;0\right\},$$

where $d(\cdot,\cdot)$ is the $\ell_{1}$-distance between the corresponding states, and the crowd reward is defined as

$$\mathsf{r}(s,a,\mu)=-\kappa\log(\mu(s))+\Gamma(a)\;,$$

where $\Gamma(a)=0.2\cdot\mathsf{1}_{\{a=\text{Stay}\}}-0.2\cdot\mathsf{1}_{\{a\neq\text{Stay}\}}$, with $\mathsf{1}$ being the indicator function and $\kappa$ being a crowd-aversion parameter. We refer to Appendix F for additional details and experimental results related to the Exact MF-TRPO algorithm.

The environment used here is a $5\times 5$ grid featuring three walls located at coordinates $(1,2)$, $(2,2)$, and $(3,2)$. The point of interest is located in the bottom-right corner of the grid, and all the players start clustered in the top-left corner. Figure 1 shows that the exploitability of the Sample-Based MF-TRPO algorithm converges after a few iterations, matching the theoretical predictions. Figure 2 illustrates the progression of the mean field distribution at three different time steps during the learning phase: the players progressively learn to distribute themselves around the point of interest while preserving spread over the whole state space.
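The reward defined above can be written as a short Python sketch. Several details are assumptions made for illustration and are not specified in the text: states are encoded as zero-indexed `(row, column)` pairs, the bottom-right target is taken to be `(4, 4)`, the action label `"Stay"` is a string, and the local density `mu_s` is assumed strictly positive so that the logarithm is finite.

```python
import math

STAY = "Stay"  # assumed label for the action that keeps the agent in place


def crowd_reward(mu_s, action, kappa):
    """r(s, a, mu) = -kappa * log(mu(s)) + Gamma(a), with mu_s = mu(s) > 0."""
    gamma = 0.2 if action == STAY else -0.2  # bonus for staying, cost for moving
    return -kappa * math.log(mu_s) + gamma


def shaped_reward(state, action, mu_s, kappa, target=(4, 4)):
    """Crowd reward plus the target bonus max(0.3 - 0.1 * d(s, s_target), 0)."""
    # l1 distance between grid positions (row, column)
    d = abs(state[0] - target[0]) + abs(state[1] - target[1])
    return crowd_reward(mu_s, action, kappa) + max(0.3 - 0.1 * d, 0.0)
```

With these definitions the target bonus vanishes at $\ell_{1}$-distance three or more, so only cells close to the point of interest are directly rewarded.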

Figure 1: Exploitability achieved by the Sample-Based MF-TRPO algorithm in the $5\times 5$ Grid-Based Crowd Modeling game with the bottom-right corner being a point of interest. The left plot corresponds to $\eta=0.05$, and the right to $\eta=0.3$, with results averaged over $10$ and $3$ random seeds, respectively.

Figure 2: Evolution of the mean field distribution for $\eta=0.05$ in the $5\times 5$ Grid-Based Crowd Modeling game with the bottom-right corner being a point of interest. From left to right: step 0, step 10 and step 200.

7 Conclusion

In this work, we introduced Exact MF-TRPO, a novel algorithm for computing MFNE in ergodic MFG. By leveraging the trust-region policy optimization framework, we established explicit non-asymptotic convergence guarantees, demonstrating that Exact MF-TRPO inherits the $\widetilde{O}(1/L)$ rate from TRPO, ensuring efficient learning in structured multi-agent systems.

To bridge the gap between theoretical guarantees and practical applicability, we further developed Sample-Based MF-TRPO, a model-free variant that estimates policy updates solely from sampled trajectories, under the $\nu$-restart RL paradigm. Using concentration inequalities, we provided finite-sample complexity bounds for this algorithm, proving convergence under more relaxed assumptions compared to recent literature. Moreover, we showed that the total number of calls to the environment scales as $\widetilde{O}(1/\varepsilon^{6})$, consistent with standard RL results (see, e.g., Shani et al., 2020). This result highlights the potential of RL techniques for scalable and data-driven MFG solutions.

Overall, our work contributes to the growing intersection of MFG and RL, providing both theoretical insights and algorithmic advancements. Future directions include extending these methods to more general MFG settings, such as those with continuous state spaces, and exploring adaptive sampling techniques to further improve efficiency in real-world applications.

Acknowledgements

The work of A.O. and E.M. was funded by the European Union (ERC-2022-SYG-OCEAN-101071601). Views and opinions expressed are however those of the author only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. The work of L.M. and E.M. has been supported by Technology Innovation Institute (TII), project Fed2Learn. The work of D.T. has been supported by the Paris Île-de-France Région in the framework of DIM AI4IDF.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

  • Achdou & Capuzzo-Dolcetta (2010) Achdou, Y. and Capuzzo-Dolcetta, I. Mean field games: numerical methods. SIAM Journal on Numerical Analysis, 48(3):1136–1162, 2010.
  • Achdou & Porretta (2016) Achdou, Y. and Porretta, A. Convergence of a finite difference scheme to weak solutions of the system of partial differential equations arising in mean field games. SIAM Journal on Numerical Analysis, 54(1):161–186, 2016.
  • Achdou et al. (2012) Achdou, Y., Camilli, F., and Capuzzo-Dolcetta, I. Mean field games: numerical methods for the planning problem. SIAM Journal on Control and Optimization, 50(1):77–109, 2012.
  • Achdou et al. (2020) Achdou, Y., Cardaliaguet, P., Delarue, F., Porretta, A., Santambrogio, F., Achdou, Y., and Laurière, M. Mean field games and applications: Numerical aspects. Mean Field Games: Cetraro, Italy 2019, pp.  249–307, 2020.
  • Agarwal et al. (2020) Agarwal, A., Kakade, S., and Yang, L. F. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pp.  67–83. PMLR, 2020.
  • Alasseur et al. (2020) Alasseur, C., Ben Taher, I., and Matoussi, A. An extended mean field game for storage in smart grids. Journal of Optimization Theory and Applications, 184:644–670, 2020.
  • Algumaei et al. (2023) Algumaei, T., Solozabal, R., Alami, R., Hacid, H., Debbah, M., and Takáč, M. Regularization of the policy updates for stabilizing mean field games. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp.  361–372. Springer, 2023.
  • Anahtarci et al. (2023a) Anahtarci, B., Kariksiz, C. D., and Saldi, N. Learning mean-field games with discounted and average costs. Journal of Machine Learning Research, 24(17):1–59, 2023a.
  • Anahtarci et al. (2023b) Anahtarci, B., Kariksiz, C. D., and Saldi, N. Q-learning in regularized mean-field games. Dynamic Games and Applications, 13(1):89–117, 2023b.
  • Anand et al. (2024) Anand, E., Karmarkar, I., and Qu, G. Mean-field sampling for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:2412.00661, 2024.
  • Angiuli et al. (2021) Angiuli, A., Fouque, J.-P., and Laurière, M. Unified reinforcement q-learning for mean field game and control problems. arXiv preprint arXiv:2006.13912, 2021.
  • Angiuli et al. (2023) Angiuli, A., Fouque, J.-P., Laurière, M., and Zhang, M. Convergence of multi-scale reinforcement Q-learning algorithms for mean field game and control problems. arXiv preprint arXiv:2312.06659, 2023.
  • Arapostathis et al. (2017) Arapostathis, A., Biswas, A., and Carroll, J. On solutions of mean field games with ergodic cost. Journal de Mathématiques Pures et Appliquées, 107(2):205–251, 2017.
  • Asadi et al. (2018) Asadi, K., Misra, D., and Littman, M. Lipschitz continuity in model-based reinforcement learning. In International Conference on Machine Learning, pp.  264–273. PMLR, 2018.
  • Austrin et al. (2011) Austrin, P., Braverman, M., and Chlamtáč, E. Inapproximability of np-complete variants of nash equilibrium. In International Workshop on Approximation Algorithms for Combinatorial Optimization, pp.  13–25. Springer, 2011.
  • Azar et al. (2013) Azar, G. M., Munos, R., and Kappen, H. J. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91:325–349, 2013.
  • Bardi & Priuli (2014) Bardi, M. and Priuli, F. S. Linear-quadratic n-person and mean-field games with ergodic cost. SIAM Journal on Control and Optimization, 52(5):3022–3052, 2014.
  • Bassière et al. (2024) Bassière, A., Dumitrescu, R., and Tankov, P. A mean-field game model of electricity market dynamics. In Quantitative Energy Finance: Recent Trends and Developments, pp.  181–219. Springer, 2024.
  • Becherer & Hesse (2024) Becherer, D. and Hesse, S. Common noise by random measures: Mean-field equilibria for competitive investment and hedging. arXiv preprint arXiv:2408.01175, 2024.
  • Beck (2017) Beck, A. First-order methods in optimization. SIAM, 2017.
  • Bhandari & Russo (2024) Bhandari, J. and Russo, D. Global optimality guarantees for policy gradient methods. Operations Research, 2024.
  • Braun et al. (2011) Braun, D. A., Ortega, P. A., Theodorou, E., and Schaal, S. Path integral control and bounded rationality. In 2011 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL), pp.  202–209. IEEE, 2011.
  • Cardaliaguet et al. (2019) Cardaliaguet, P., Delarue, F., Lasry, J.-M., and Lions, P.-L. The master equation and the convergence problem in mean field games:(ams-201). Princeton University Press, 2019.
  • Carmona & Laurière (2021) Carmona, R. and Laurière, M. Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games i: The ergodic case. SIAM Journal on Numerical Analysis, 59(3):1455–1485, 2021.
  • Carmona et al. (2013) Carmona, R., Fouque, J.-P., and Sun, L.-H. Mean field games and systemic risk. arXiv preprint arXiv:1308.2172, 2013.
  • Carmona et al. (2018) Carmona, R., Delarue, F., et al. Probabilistic theory of mean field games with applications I-II. Springer, 2018.
  • Chassagneux et al. (2019) Chassagneux, J.-F., Crisan, D., and Delarue, F. Numerical method for fbsdes of mckean–vlasov type. The Annals of Applied Probability, 29(3):1640–1684, 2019.
  • Cover (1999) Cover, T. M. Elements of information theory. John Wiley & Sons, 1999.
  • Cui & Koeppl (2021) Cui, K. and Koeppl, H. Approximately solving mean field games via entropy-regularized deep reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp.  1909–1917. PMLR, 2021.
  • Cui & Koeppl (2022) Cui, K. and Koeppl, H. Learning graphon mean field games and approximate nash equilibria. In International Conference on Learning Representations, 2022.
  • Doncel et al. (2022) Doncel, J., Gast, N., and Gaujal, B. A mean field game analysis of sir dynamics with vaccination. Probability in the Engineering and Informational Sciences, 36(2):482–499, 2022.
  • Espinosa & Touzi (2015) Espinosa, G.-E. and Touzi, N. Optimal investment under relative performance concerns. Mathematical Finance, 25(2):221–257, 2015.
  • Fischer & Silva (2021) Fischer, M. and Silva, F. J. On the asymptotic nature of first order mean field games. Applied Mathematics & Optimization, 84:2327–2357, 2021.
  • Flandoli et al. (2022) Flandoli, F., Ghio, M., and Livieri, G. N-player games and mean field games of moderate interactions. Applied Mathematics & Optimization, 85(3):38, 2022.
  • Foerster et al. (2018) Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. Proceedings of the AAAI conference on artificial intelligence, 32(1), 2018.
  • Fort et al. (2011) Fort, G., Moulines, E., and Priouret, P. Convergence of adaptive and interacting markov chain monte carlo algorithms. The Annals of Statistics, 39(6):3262–3289, 2011.
  • Fox et al. (2016) Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016.
  • Geist et al. (2019) Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized markov decision processes. In International Conference on Machine Learning, pp.  2160–2169. PMLR, 2019.
  • Geist et al. (2022) Geist, M., Pérolat, J., Laurière, M., Elie, R., Perrin, S., Bachem, O., Munos, R., and Pietquin, O. Concave utility reinforcement learning: the mean-field game viewpoint, 2022. URL https://arxiv.org/abs/2106.03787.
  • Germain et al. (2022) Germain, M., Mikael, J., and Warin, X. Numerical resolution of mckean-vlasov fbsdes using neural networks. Methodology and Computing in Applied Probability, 24(4):2557–2586, 2022.
  • Gronauer & Diepold (2022) Gronauer, S. and Diepold, K. Multi-agent deep reinforcement learning: a survey. Artificial Intelligence Review, 55(2):895–943, 2022.
  • Guo et al. (2019) Guo, X., Hu, A., Xu, R., and Zhang, J. Learning mean-field games. Advances in neural information processing systems, 32, 2019.
  • Guo et al. (2023) Guo, X., Hu, A., Xu, R., and Zhang, J. A general framework for learning mean-field games. Mathematics of Operations Research, 48(2):656–686, 2023.
  • Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International conference on machine learning, pp.  1352–1361. PMLR, 2017.
  • Harvey et al. (2019) Harvey, N. J., Liaw, C., Plan, Y., and Randhawa, S. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pp.  1579–1613. PMLR, 2019.
  • Howard (1960) Howard, R. A. Dynamic programming and markov processes. John Wiley, 1960.
  • Howard & Matheson (1972) Howard, R. A. and Matheson, J. E. Risk-sensitive markov decision processes. Management science, 18(7):356–369, 1972.
  • Huang et al. (2003) Huang, M., Caines, P. E., and Malhamé, R. P. Individual and mass behaviour in large population stochastic wireless power control problems: centralized and nash equilibrium solutions. In 42nd IEEE international conference on decision and control (IEEE cat. No. 03CH37475), volume 1, pp.  98–103. IEEE, 2003.
  • Huang et al. (2005) Huang, M., Malhamé, R. P., and Caines, P. E. Nash equilibria for large-population linear stochastic systems of weakly coupled agents. In Analysis, control and optimization of complex dynamic systems, pp.  215–252. Springer, 2005.
  • Huang et al. (2006a) Huang, M., Malhamé, R. P., and Caines, P. E. Large population stochastic dynamic games: closed-loop mckean-vlasov systems and the nash certainty equivalence principle. Commun. Inf. Syst., 6(1):221–252, 2006a.
  • Huang et al. (2006b) Huang, M., Malhamé, R. P., and Caines, P. E. Nash certainty equivalence in large population stochastic dynamic games: Connections with the physics of interacting particle systems. In Proceedings of the 45th IEEE Conference on Decision and Control, pp.  4921–4926. IEEE, 2006b.
  • Iqbal & Sha (2019) Iqbal, S. and Sha, F. Actor-attention-critic for multi-agent reinforcement learning. In International conference on machine learning, pp.  2961–2970. PMLR, 2019.
  • Kakade & Langford (2002) Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pp.  267–274, 2002.
  • Kakade (2003) Kakade, S. M. On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom), 2003.
  • Kearns & Singh (1998) Kearns, M. and Singh, S. Finite-sample convergence rates for q-learning and indirect algorithms. Advances in neural information processing systems, 11, 1998.
  • Kober et al. (2013) Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
  • Lacker & Zariphopoulou (2019) Lacker, D. and Zariphopoulou, T. Mean field and n-agent games for optimal investment under relative performance criteria. Mathematical Finance, 29(4):1003–1038, 2019.
  • Lasry & Lions (2006a) Lasry, J.-M. and Lions, P.-L. Jeux à champ moyen. i–le cas stationnaire. Comptes Rendus Mathématique, 343(9):619–625, 2006a.
  • Lasry & Lions (2006b) Lasry, J.-M. and Lions, P.-L. Jeux à champ moyen. ii–horizon fini et contrôle optimal. Comptes Rendus. Mathématique, 343(10):679–684, 2006b.
  • Lasry & Lions (2007) Lasry, J.-M. and Lions, P.-L. Mean field games. Japanese journal of mathematics, 2(1):229–260, 2007.
  • Laurière et al. (2022) Laurière, M., Perrin, S., Geist, M., and Pietquin, O. Learning mean field games: A survey. arXiv preprint arXiv:2205.12944, pp.  19–49, 2022.
  • Lavigne & Tankov (2023) Lavigne, P. and Tankov, P. Decarbonization of financial markets: a mean-field game approach. arXiv preprint arXiv:2301.09163, 2023.
  • Le Lan et al. (2021) Le Lan, C., Bellemare, M. G., and Castro, P. S. Metrics and continuity in reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):8261–8269, 2021.
  • Mao et al. (2022) Mao, W., Qiu, H., Wang, C., Franke, H., Kalbarczyk, Z., Iyer, R., and Basar, T. A mean-field game approach to cloud resource management with function approximation. Advances in Neural Information Processing Systems, 35:36243–36258, 2022.
  • Marcus et al. (1997) Marcus, S. I., Fernández-Gaucherand, E., Hernández-Hernandez, D., Coraluppi, S., and Fard, P. Risk sensitive markov decision processes. In Systems and control in the twenty-first century, pp.  263–279. Springer, 1997.
  • Matignon et al. (2007) Matignon, L., Laurent, G. J., and Le Fort-Piat, N. Hysteretic q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.  64–69. IEEE, 2007.
  • Mei et al. (2020) Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. On the global convergence rates of softmax policy gradient methods. In International conference on machine learning, pp.  6820–6829. PMLR, 2020.
  • Mnih (2016) Mnih, V. Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning, pp.  1928–1937, 2016.
  • Nash (1950) Nash, J. Equilibrium points in n-person games. Proceedings of the national academy of sciences, 36(1):48–49, 1950.
  • Nash (1951) Nash, J. Non-cooperative games. Annals of Mathematics, 54(2):286–295, 1951.
  • Neu et al. (2017) Neu, G., Jonsson, A., and Gómez, V. A unified view of entropy-regularized markov decision processes. arXiv preprint arXiv:1705.07798, 2017.
  • O’Donoghue et al. (2017) O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. Combining policy gradient and q-learning. 5th International Conference on Learning Representations, 2017.
  • Pérolat et al. (2022) Pérolat, J., Perrin, S., Elie, R., Laurière, M., Piliouras, G., Geist, M., Tuyls, K., and Pietquin, O. Scaling mean field games by online mirror descent. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pp.  1028–1037, 2022.
  • Perrin et al. (2020) Perrin, S., Pérolat, J., Laurière, M., Geist, M., Elie, R., and Pietquin, O. Fictitious play for mean field games: Continuous time analysis and applications. Advances in neural information processing systems, 33:13199–13213, 2020.
  • Perrin et al. (2021) Perrin, S., Laurière, M., Pérolat, J., Geist, M., Élie, R., and Pietquin, O. Mean field games flock! the reinforcement learning way. arXiv preprint arXiv:2105.07933, 2021.
  • Perrin et al. (2022) Perrin, S., Laurière, M., Pérolat, J., Élie, R., Geist, M., and Pietquin, O. Generalization in mean field games by learning master policies. Proceedings of the AAAI Conference on Artificial Intelligence, 36(9):9413–9421, 2022.
  • Peters et al. (2010) Peters, J., Mulling, K., and Altun, Y. Relative entropy policy search. Proceedings of the AAAI Conference on Artificial Intelligence, 24(1):1607–1612, 2010.
  • Puterman & Shin (1978) Puterman, M. L. and Shin, M. C. Modified policy iteration algorithms for discounted markov decision problems. Management Science, 24(11):1127–1137, 1978.
  • Ruszczyński (2010) Ruszczyński, A. Risk-averse dynamic programming for markov decision processes. Mathematical programming, 125:235–261, 2010.
  • Saldi (2020) Saldi, N. Discrete-time average-cost mean-field games on polish spaces. Turkish Journal of Mathematics, 44(2):463–480, 2020.
  • Samvelyan et al. (2019) Samvelyan, M., Rashid, T., De Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G., Hung, C.-M., Torr, P. H., Foerster, J., and Whiteson, S. The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019.
  • Scherrer & Geist (2014) Scherrer, B. and Geist, M. Local policy search in a convex space and conservative policy iteration as boosted policy search. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014. Proceedings, Part III 14, pp.  35–50. Springer, 2014.
  • Schulman et al. (2015) Schulman, J., Levine, S., Moritz, P., Jordan, M., and Abbeel, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37, pp.  1889–1897, 2015.
  • Shalev-Shwartz et al. (2016) Shalev-Shwartz, S., Shammah, S., and Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
  • Shani et al. (2020) Shani, L., Efroni, Y., and Mannor, S. Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):5668–5675, 2020.
  • Sidford et al. (2018) Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. Near-optimal time and sample complexities for solving markov decision processes with a generative model. Advances in Neural Information Processing Systems, 31, 2018.
  • Sutton & Barto (2018a) Sutton, R. and Barto, A. G. Reinforcement learning: An introduction. SIAM Rev, 6(2):423, 2018a.
  • Sutton & Barto (2018b) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018b.
  • Szepesvári (2022) Szepesvári, C. Algorithms for reinforcement learning. Springer nature, 2022.
  • Tangpi & Zhou (2024) Tangpi, L. and Zhou, X. Optimal investment in a large population of competitive and heterogeneous agents. Finance and Stochastics, 28(2):497–551, 2024.
  • Weinan et al. (2017) Weinan, E., Han, J., and Jentzen, A. Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 4(5):349–380, 2017.
  • Wiering et al. (2000) Wiering, M. A. et al. Multi-agent reinforcement learning for traffic light control. In Machine Learning: Proceedings of the Seventeenth International Conference (ICML’2000), pp.  1151–1158, 2000.
  • Williams & Peng (1991) Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
  • Yardim et al. (2023) Yardim, B., Cayci, S., Geist, M., and He, N. Policy mirror ascent for efficient and independent learning in mean field games. In International Conference on Machine Learning, pp.  39722–39754. PMLR, 2023.
  • Zaman et al. (2023) Zaman, M. A. U., Koppel, A., Bhatt, S., and Basar, T. Oracle-free reinforcement learning in mean-field games along a single sample path. In International Conference on Artificial Intelligence and Statistics, pp.  10178–10206. PMLR, 2023.
  • Zhang et al. (2021) Zhang, K., Yang, Z., and Başar, T. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of reinforcement learning and control, pp.  321–384, 2021.
  • Ziebart (2010) Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.
  • Ziebart et al. (2010) Ziebart, B. D., Bagnell, J. A., and Dey, A. K. Modeling interaction via the principle of maximum causal entropy. International Conference on Machine Learning (ICML), pp.  1255–1262, 2010.

Appendix

In the appendix, we provide a detailed exposition of the theoretical foundations and technical results supporting our main contributions. In Appendix A, we give a broad description of related work in MFG and MFRL, and in Appendix B, we lay out the fundamental framework of the MFG problem we aim to study. This section introduces the ergodic MFG formulation and its connection to the discounted setting. We also discuss the role of entropic regularization in RL and MFG, emphasizing its impact on stability and the approximation of Nash equilibria. In Section B.4, we present the key assumptions we use in the proofs. In Appendix C, we present the exact TRPO-based algorithm for solving the MFG problem. This section provides a precise formulation of the exact method and establishes its theoretical convergence guarantees. The adaptation of TRPO to the MFG setting is examined in detail, leveraging the structure of entropy-regularized RL. We present formal results on convergence rates and error bounds, ensuring the effectiveness and reliability of this method in computing approximate MFNE. In Appendix D, we extend our analysis to the sample-based version of the algorithm. Here, we derive global sample complexity results and analyze the statistical error introduced by sampling. We establish high-probability bounds on the approximation error at each iteration. Next, in Appendix E, we provide additional proofs and auxiliary results that support the theoretical analysis conducted in the previous sections. These supplementary results play a crucial role in rigorously validating the convergence and stability properties of our proposed algorithms. Finally, in Appendix F, we provide additional experimental details as well as experiments on exact versions of the algorithm.

Appendix A Related works

Except for the well-known Linear-Quadratic (LQ) case, where explicit solutions can be derived analytically or through simple ordinary differential equations, computing MFNE numerically remains a challenging and active research area. A vast body of literature has focused on addressing the computational complexity of these models, leading to three major methodological approaches.

The first class of methods relies on PDE approximations, leveraging the classical formulation of MFG through the Hamilton–Jacobi–Bellman equation coupled with the Fokker–Planck–Kolmogorov equation (Achdou & Capuzzo-Dolcetta, 2010; Achdou et al., 2012; Achdou & Porretta, 2016; Achdou et al., 2020). While mathematically elegant, these methods suffer from the well-documented curse of dimensionality, as solving PDEs numerically becomes intractable in high-dimensional state spaces.

The second approach leverages deep learning techniques with neural networks to approximate equilibrium solutions. These methods exploit function approximation to bypass explicit PDE resolution, making them a promising alternative in high-dimensional settings (Weinan et al., 2017; Chassagneux et al., 2019; Germain et al., 2022). However, they lack rigorous convergence guarantees. Additionally, these methods struggle to capture model specifications directly from data, limiting their adaptability in real-world applications.

The third category of numerical methods integrates RL techniques into the MFG framework, leading to two primary subcategories. The first subcategory employs RL as a solver for a given MFG model, using value-based or policy-based RL techniques to approximate Nash equilibria (Pérolat et al., 2022; Perrin et al., 2022). These approaches have achieved state-of-the-art performance in various settings and have been successfully deployed in large-scale simulations.

The second subcategory focuses on developing model-free RL algorithms for solving MFG, often incorporating regularization techniques to enhance stability (Cui & Koeppl, 2021; Perrin et al., 2021). While these methods show promising empirical performance, theoretical guarantees on finite-sample complexity remain limited, particularly for model-based RL approaches.

RL for MFG.

Among model-free RL approaches, we identify three main categories: value function-based methods, actor-critic methods—which combine value-based and policy-based approaches—and policy-oriented methods. As in classical RL, proximal methods have emerged as state-of-the-art techniques across a wide range of tasks, both in model-specific and model-agnostic settings.

Several studies have explored $Q$-learning-based algorithms for MFG, establishing theoretical convergence guarantees (Angiuli et al., 2021, 2023; Guo et al., 2019; Anahtarci et al., 2023b). However, these approaches rely on stringent assumptions that are often difficult to verify in practice. Moreover, $Q$-learning operates within the generative model oracle framework of RL, which assumes full query access to state-action transitions. In contrast, the restart oracle assumption provides a significantly weaker paradigm, offering a more practical and adaptable alternative for real-world learning settings.

Two-timescale updates have been employed in, e.g., Zaman et al. (2023) and Mao et al. (2022), where policy updates operate on a faster timescale, while the population distribution evolves more slowly based on model-based estimates of state dynamics. However, these approaches introduce significant challenges in both theoretical analysis and practical implementation. Their convergence often relies on strong assumptions that are difficult to verify, particularly in non-stationary environments.

In the MFG setting, proximal methods have been adopted for their stability properties and strong empirical performance. Among these approaches, we highlight the work of Pérolat et al. (2022), where the authors investigate Online Mirror Descent (OMD) from a model-specific perspective. Yardim et al. (2023) build on this direction, developing a model-free analysis. Their approach, however, does not consider the crucial aspect of population updates, placing itself in the no-manipulation regime. In contrast, our work explicitly discusses how the monotonicity assumption can stabilize the population evolution up to a certain threshold, a perspective that aligns naturally with the ergodic nature of the Markov reward process once a policy has been selected. This additional consideration allows us to provide a more refined analysis of the learning dynamics in MFG, ensuring a more structured and well-posed approach to policy optimization.

Furthermore, Yardim et al. (2023) impose stringent assumptions by requiring that policies remain uniformly bounded away from zero by a fixed constant, effectively enforcing an overly rigid form of regularization. This constraint exceeds the controlled bias typically introduced by entropic regularization. Notably, a similar assumption appears in Angiuli et al. (2021, 2023), where the Markov reward process is required to be absolutely continuous with respect to the uniform distribution over the state-action space, a significantly stronger condition than our unichain assumption.

By adopting a more flexible framework, we relax these restrictive conditions while improving sample complexity, particularly in terms of the number of required trajectories. Our analysis still achieves an $\widetilde{O}(1/L)$ error bound in trajectory estimates but under significantly milder assumptions.

An interesting direction for future work would be to connect our framework with that of Anand et al. (2024), which focuses on mean-field control. While their setting assumes cooperative agents optimizing a common objective, our work addresses non-cooperative mean-field games. Despite this key difference, it would be valuable to explore how our analytical tools—particularly for handling weak structural assumptions—could generalize their approach to broader settings.

Appendix B Framework

B.1 Ergodic MFG problem

The ergodic problem focuses on optimizing the long-term average performance of a stochastic system over an infinite time horizon. In contrast to finite time horizon problems, the emphasis is on stability and efficiency over time, making it essential for operations such as energy systems, financial markets and resource management. The goal is to find stationary policies that balance short-term costs with long-term gains. This approach is robust to uncertainties and ensures consistent performance despite stochastic disturbances.

Ergodic MFG were first studied in continuous time and space (see, e.g., Bardi & Priuli, 2014; Arapostathis et al., 2017; Carmona & Laurière, 2021). In the context of discrete-time MFG, the ergodic setting has been studied by Saldi (2020) under the terminology of average-cost MFG. Anahtarci et al. (2023a) proposed a learning algorithm based on $Q$-learning and analyzed its convergence and sample complexity using a strict contraction argument.

Similarly, in Guo et al. (2023), the authors address the problem of evolving mean-field parameters, focusing on dynamically adjusting the population distribution over time. This differs from our approach, where we aim to learn policies without requiring explicit control or manipulation of the mean-field distribution at each step.

B.2 From ergodic MFG to discounted formulation

In this article, we focus on an ergodic equilibrium problem within the framework of mean-field games. This problem is traditionally defined as the unique solution resulting from the optimization of a long-term average cost function

$$J^{\operatorname{MFG}}_{\rm erg}(\underline{\pi},\mu,\xi) := \liminf_{T\to\infty}\frac{1}{T}\mathbb{E}\left[\sum_{t=0}^{T}\mathsf{r}(s_{t},a_{t},\mu_{t})\middle|\begin{subarray}{c}s_{0}\sim\xi,~a_{t}\sim\pi_{t}(\cdot|s_{t}),~\mu_{t+1}=\mu_{t}\mathsf{P}_{\pi_{t},\mu_{t}}\\ s_{t+1}\sim\mathsf{P}(\cdot|s_{t},a_{t},\mu_{t})\end{subarray}\right]\;,$$

for $\underline{\pi}=(\pi_{t})_{t=0}^{\infty}$. We seek Nash equilibria with respect to this cost function, i.e., a tuple $(\underline{\pi^{\star}},\mu,\xi)$ such that

  • $J^{\operatorname{MFG}}_{\rm erg}(\underline{\pi^{\star}},\mu,\xi)=\max_{\underline{\pi}}J^{\operatorname{MFG}}_{\rm erg}(\underline{\pi},\mu,\xi)$;

  • $\mathcal{L}(s_{t})=\mu_{t}$.

Such a problem is independent of the initial condition $\xi$ and is stationary. Therefore, we can consider only constant sequences $\underline{\pi}=(\pi_{t})_{t=0}^{\infty}$ with $\pi_{t}=\pi$ for any $t\geq 0$ and some $\pi\in\Pi$.
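The consistency condition in the second bullet can be illustrated with a small numerical sketch. For a fixed stationary policy, the population flow $\mu_{t+1}=\mu_{t}\mathsf{P}_{\pi,\mu_{t}}$ is iterated until it reaches a fixed point, i.e., a population that reproduces itself under the induced dynamics. The two-state kernel `transition`, the policy `pi`, and all helper names below are hypothetical choices made for illustration only; they are not taken from the paper, and convergence of this naive iteration is a property of the toy example rather than a general guarantee.

```python
# Toy sketch: iterate mu_{t+1} = mu_t P_{pi, mu_t} for a fixed stationary policy.
# The kernel below is a made-up example, not the model studied in the paper.

def transition(s, a, mu):
    """Hypothetical kernel P(.|s, a, mu) on states {0, 1}.

    Action 1 tries to move to the other state, but the move is less likely
    to succeed when the target state is crowded (a congestion effect).
    """
    other = 1 - s
    if a == 0:
        p_move = 0.1
    else:
        p_move = 0.9 * (1.0 - 0.5 * mu[other])
    probs = [0.0, 0.0]
    probs[other] = p_move
    probs[s] = 1.0 - p_move
    return probs


def induced_kernel(pi, mu):
    """State-to-state matrix P_{pi, mu}[s][s2] = sum_a pi(a|s) P(s2|s, a, mu)."""
    n = len(mu)
    P = [[0.0] * n for _ in range(n)]
    for s in range(n):
        for a, pa in enumerate(pi[s]):
            row = transition(s, a, mu)
            for s2 in range(n):
                P[s][s2] += pa * row[s2]
    return P


def population_step(mu, pi):
    """One step of the mean-field flow: mu_{t+1} = mu_t P_{pi, mu_t}."""
    P = induced_kernel(pi, mu)
    n = len(mu)
    return [sum(mu[s] * P[s][s2] for s in range(n)) for s2 in range(n)]


def stationary_population(pi, mu0, tol=1e-12, max_iter=10_000):
    """Iterate the flow until mu stops moving (a fixed point mu = mu P_{pi, mu})."""
    mu = list(mu0)
    for _ in range(max_iter):
        nxt = population_step(mu, pi)
        if max(abs(x - y) for x, y in zip(mu, nxt)) < tol:
            return nxt
        mu = nxt
    return mu


pi = [[0.3, 0.7], [0.6, 0.4]]  # pi[s][a], a stationary policy (illustrative)
mu_star = stationary_population(pi, [1.0, 0.0])
print("stationary population:", mu_star)
```

In this toy model the fixed point does not depend on the starting population, which mirrors the independence from the initial condition noted above.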

Moreover, given the stationarity of both the policy and the problem, we can further restrict ourselves to time-invariant policies and reformulate the problem as follows.

$$J^{\operatorname{MFG}}_{\rm erg,stat}(\pi,\mu) := \liminf_{T\to\infty}\frac{1}{T}\mathbb{E}\left[\sum_{t=0}^{T}\mathsf{r}(s_{t},a_{t},\mu)\middle|\begin{subarray}{c}s_{0}\sim\mu,~a_{t}\sim\pi_{t}(\cdot|s_{t}),\\ s_{t+1}\sim\mathsf{P}(\cdot|s_{t},a_{t},\mu)\end{subarray}\right]\;.$$

Moreover, as also noted in Carmona et al. (Chapter 7.1, Volume 1, 2018), we have that

$$J^{\operatorname{MFG}}_{\rm erg,stat}(\pi,\mu) := \lim_{\gamma\to 1}(1-\gamma)\,J^{\operatorname{MFG}}_{\gamma}(\pi,\mu,\mu)\;,$$

with

$$J^{\operatorname{MFG}}_{\gamma}(\pi,\mu,\xi)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\mathsf{r}(s_{t},a_{t},\mu)\middle|\begin{subarray}{c}s_{0}\sim\xi,~a_{t}\sim\pi_{t}(\cdot|s_{t}),\\ s_{t+1}\sim\mathsf{P}(\cdot|s_{t},a_{t},\mu)\end{subarray}\right]\;.$$

For this reason, for $\gamma$ close to 1, we can consider the problem as formulated in Section 2. This is also the consideration behind the formulation of the ergodic problem presented in Laurière et al. (Remark 8, 2022).
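The link between the two criteria can be checked numerically on a toy chain. The sketch below freezes the mean-field term, so that a fixed policy induces a plain Markov reward process, and compares the long-run average reward with the discounted return rescaled by the standard Abelian factor $(1-\gamma)$. The matrix `P`, the reward vector `r`, and the helper names are illustrative assumptions, not quantities from the paper.

```python
# Toy check of the relation between ergodic and discounted rewards.
# P and r are made-up numbers: with the mean-field term mu frozen, they stand
# for the state kernel and expected reward induced by a fixed policy.

P = [[0.9, 0.1, 0.0],
     [0.2, 0.5, 0.3],
     [0.1, 0.0, 0.9]]
r = [1.0, 0.0, 2.0]


def stationary_distribution(P, n_iter=2_000):
    """Power iteration mu <- mu P, started from the uniform distribution."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(n_iter):
        mu = [sum(mu[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return mu


def discounted_values(P, r, gamma, tol=1e-10, max_iter=200_000):
    """Solve V = r + gamma P V by fixed-point iteration (unnormalized return)."""
    n = len(P)
    V = [0.0] * n
    for _ in range(max_iter):
        new = [r[s] + gamma * sum(P[s][s2] * V[s2] for s2 in range(n))
               for s in range(n)]
        done = max(abs(x - y) for x, y in zip(new, V)) < tol
        V = new
        if done:
            break
    return V


def normalized_return(P, r, xi, gamma):
    """(1 - gamma) * J_gamma for the initial distribution xi."""
    V = discounted_values(P, r, gamma)
    return (1.0 - gamma) * sum(xi[s] * V[s] for s in range(len(xi)))


mu = stationary_distribution(P)
avg_reward = sum(mu[s] * r[s] for s in range(len(r)))
xi = [0.0, 1.0, 0.0]  # a non-stationary start, for comparison

for gamma in (0.9, 0.99, 0.999):
    print(gamma,
          normalized_return(P, r, mu, gamma),   # start from stationary mu
          normalized_return(P, r, xi, gamma))   # start from xi
print("ergodic average:", avg_reward)
```

When the chain starts from its stationary distribution, the rescaled discounted return matches the ergodic average for every $\gamma$ up to solver tolerance. From another start, the gap shrinks as $\gamma$ approaches 1, which is the sense in which a discount factor close to 1 approximates the ergodic problem.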

B.3 Regularization

Entropy regularization in RL.

Entropy regularization has been a prominent concept across various fields, including RL (Sutton & Barto, 2018b; Szepesvári, 2022). In dynamic programming and RL contexts, entropy-regularized Bellman equations and corresponding algorithms have been extensively studied to address key challenges. These include inducing safe exploration (Fox et al., 2016) and designing risk-sensitive policies (Howard & Matheson, 1972; Marcus et al., 1997; Ruszczyński, 2010). Additionally, these methods have been employed to model behaviors of imperfect decision-makers, as demonstrated by Ziebart et al. (2010); Ziebart (2010); Braun et al. (2011).

Beyond dynamic programming approaches, direct policy search methods have emerged as a powerful alternative for optimizing entropy-regularized objectives. These methods, which aim to drive safe online exploration in unknown Markov decision processes, have been explored in works such as Williams & Peng (1991); Peters et al. (2010); Schulman et al. (2015); Mnih (2016); O’Donoghue et al. (2017). Notably, state-of-the-art RL methods, including those by Mnih (2016); Schulman et al. (2015), leverage entropy-regularized policy search to balance exploration and exploitation effectively, highlighting the central role of regularization in achieving robust and safe learning.

Regularization, particularly the entropic one, has been extensively studied in the theoretical literature. In Neu et al. (2017), the authors provide a comprehensive analysis of mirror descent methods for RL, highlighting how regularization influences policy optimization and convergence properties. Similarly, Geist et al. (2019) formalize the theoretical impact of entropy regularization, demonstrating its role in stabilizing policy updates and improving exploration.

Regularization in RL for MFG.

In the inherently non-linear MFG setting, stabilizing policy updates is essential to ensuring convergence and preventing oscillatory behavior. In this setting, the underlying dynamics involve the interplay of numerous agents and necessitate a precise balance between individual and collective objectives. Regularization plays a key role by smoothing the cost landscape, mitigating instability, and creating well-conditioned optimization problems, ultimately leading to more reliable and efficient learning dynamics.

Moreover, MFG serve as approximations of the $N$-player game problem in MARL, leveraging assumptions like anonymity and homogeneity to simplify the otherwise intractable dynamics of joint policy updates in large-scale systems. Introducing regularization into the MFG framework not only enhances stability but is also theoretically justified, as the additional approximation error introduced by regularization is comparable to the inherent $\mathcal{O}(1/\sqrt{N})$ error of the MFG approximation itself.

In the context of RL for MFG, regularized policies have also been used to facilitate the convergence of learning algorithms. Cui & Koeppl (2021) show that the strict contraction property used by several other works fails to hold in general. The authors then studied a modified MFG with an entropy-regularized reward and showed that, for a sufficiently large degree of regularization, policy-iteration type RL algorithms can be shown to converge using contraction techniques. A similar approach has been used by Anahtarci et al. (2023b) to prove convergence of a $Q$-learning algorithm, and by Yardim et al. (2023) to prove the convergence of policies learned by independent learners in a regularized MFG. On the empirical side, policy regularization has also been used by Algumaei et al. (2023) through an algorithm relying on proximal policy optimization (PPO).

B.4 Discussion on the assumptions

Lipschitz property of the MDP parameters.

Assumption 1 of Lipschitz continuity on the parameters of the MDP, namely the reward function $\mathsf{r}$ and the transition probability matrix $\mathsf{P}$, implies that the MDP does not change abruptly with respect to the state or action. This ensures that small perturbations in these variables lead to correspondingly small changes in the MDP's dynamics.

This assumption is well-established in the RL literature (see, e.g., Asadi et al., 2018; Le Lan et al., 2021) and serves several critical purposes. First, it ensures the smoothness of value functions, which is essential for the stability of iterative optimization methods. Second, it enables the derivation of meaningful error bounds and convergence rates, as shown in foundational works on policy optimization.

Moreover, as the reward function $\mathsf{r}$ is defined on a compact domain, since $\mathcal{S}$ and $\mathcal{A}$ are finite, it is guaranteed that $\left\|\mathsf{r}\right\|_{\infty}<\infty$. A common practice in the RL literature is to normalize $\mathsf{r}$, i.e., to set $\left\|\mathsf{r}\right\|_{\infty}=1$, without loss of generality (Mei et al., 2020). This normalization simplifies expressions and makes algorithms scale-independent. However, in this work, we retain $\left\|\mathsf{r}\right\|_{\infty}$ explicitly in our analysis to emphasize the clear dependence of all constants and bounds on the magnitude of the reward function. This approach ensures transparency in how the properties of $\mathsf{r}$ influence the theoretical results and practical performance.

Unique recurrence property of the Markov reward process.

To ensure the well-posedness of the TRPO algorithm and derive meaningful performance bounds, we impose the unichain Assumption 3. This assumption guarantees that the Markov chain induced by any admissible policy has a single recurrent class, potentially accompanied by a set of transient states. Such a property ensures that the long-term behavior of the Markov chain is well-defined, with a unique stationary distribution for each policy.

The unichain property plays a pivotal role in stabilizing the analysis of RL algorithms, particularly TRPO, as highlighted in Neu et al. (2017). It eliminates ambiguities on the initial population distribution arising from multiple recurrent classes, which could otherwise complicate the evaluation of value functions and policy improvement steps. As shown in Puterman & Shin (1978), this condition is satisfied if all policies induce an irreducible and aperiodic Markov chain. Moreover, in the regularized setting, the regularization term helps to satisfy it. Therefore, for any mean-field population profile $\mu$, the Markov chain $\mathsf{P}_{\mu}^{\pi_{\mu}}$ is irreducible and aperiodic, which implies the existence of a unique stationary distribution $\lambda_{\pi_{\mu},\mu}$ and establishes the foundation for the mixing property (9).
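As an illustration of the uniqueness statement above, the stationary distribution of a fixed irreducible and aperiodic kernel can be obtained by solving a small linear system. The following is a minimal sketch, assuming a row-stochastic matrix `P` that stands in for $\mathsf{P}_{\mu}^{\pi_{\mu}}$ at a fixed $\mu$; the function name and the toy kernel are hypothetical and not taken from the paper.

```python
import numpy as np

def stationary_distribution(P):
    """Solve lam @ P = lam with sum(lam) = 1 for a row-stochastic matrix P.

    Assumes the chain has a single recurrent class, so the solution is unique.
    """
    n = P.shape[0]
    # Stack the balance equations (P^T - I) lam = 0 with the normalization row.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    lam, *_ = np.linalg.lstsq(A, b, rcond=None)
    return lam

# Hypothetical two-state kernel, used only for illustration.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
lam = stationary_distribution(P)
```

For this toy kernel the balance equation $0.1\,\lambda_0 = 0.5\,\lambda_1$ gives $\lambda = (5/6, 1/6)$, which is what the solver returns up to floating-point error.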

The ergodicity assumption has been explored in various forms by different authors in the MFG literature. Notably, Angiuli et al. (2021, 2023) impose the condition that the induced Markov chain is aperiodic and absolutely continuous with respect to the uniform distribution over the state space. While this guarantees strong mixing properties, it is a highly restrictive assumption, as it effectively enforces immediate communication between all states, which is often unrealistic in practical applications. Instead, we generalize this assumption by adopting a standard ergodicity condition widely used in RL literature (Mei et al., 2020). This approach maintains the necessary stability properties while allowing for more realistic transition dynamics, ensuring broader applicability in complex multi-agent systems.

Finite concentration of the occupation measure.

In this paper, we adopt an RL paradigm where access to the environment is structured through a $\nu$-restart, ensuring that each learning episode begins from a well-defined initial state distribution. The assumption on the concentration of the occupation measure is crucial for the convergence of the Sample-Based TRPO algorithm, ensuring that the estimation of the policy update remains stable and accurate over successive iterations. To carry out this estimation, the algorithm operates in an episodic setting, where each episode begins by drawing an initial state $s_{0}$ from the restart distribution $\nu$, followed by collecting a trajectory $(s_{0},\mathsf{r}_{0},s_{1},\mathsf{r}_{1},\dots)$ under the current policy $\pi_{k}$. This episodic structure allows the algorithm to interact with the MDP in a controlled manner, facilitating the estimation of quantities like the value function $J^{\operatorname{MFG}}$.

This approach builds on the seminal work of Kakade (2003), which introduced the notion of a $\nu$-restart model as an intermediary assumption in RL. The $\nu$-restart model is weaker than having direct access to the true model or a generative model (Kearns & Singh, 1998; Azar et al., 2013; Sidford et al., 2018; Agarwal et al., 2020), as it does not require full knowledge of the transition kernel or the reward function. At the same time, it is stronger than the unrestricted case where no restarts are allowed, ensuring that the algorithm can sample states from a well-defined initial distribution at the start of each episode. This controlled interaction with the environment is crucial for accurately estimating value functions and gradients in the Sample-Based TRPO algorithm, ultimately enabling convergence guarantees.

The supremum in Assumption 4, often referred to as the concentrability coefficient, plays a critical role in the theoretical analysis of policy search algorithms. This concept was initially highlighted in the foundational work of Kakade & Langford (2002) and has since been extensively studied in the RL literature.
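The episodic interaction described above can be sketched as follows. This is a minimal illustration, assuming a tabular MDP at a fixed mean-field parameter $\mu$, so that the kernel `P[s, a]` and reward `r[s, a]` no longer depend on $\mu$; the function name, array layout, truncation `horizon`, and toy numbers are all hypothetical and not part of the paper's algorithm.

```python
import numpy as np

def sample_episode(nu, policy, P, r, horizon, rng):
    """Draw s_0 ~ nu, then roll out `horizon` steps under `policy`.

    nu:     (S,)      restart distribution
    policy: (S, A)    policy[s] is a distribution over actions
    P:      (S, A, S) transition kernel at a fixed mean-field parameter
    r:      (S, A)    reward at a fixed mean-field parameter
    """
    n_states, n_actions = policy.shape
    s = int(rng.choice(n_states, p=nu))          # nu-restart
    trajectory = []
    for _ in range(horizon):
        a = int(rng.choice(n_actions, p=policy[s]))
        trajectory.append((s, a, float(r[s, a])))
        s = int(rng.choice(n_states, p=P[s, a]))  # next state
    return trajectory

# Hypothetical toy problem with two states and two actions.
rng = np.random.default_rng(0)
nu = np.array([1.0, 0.0])
policy = np.array([[0.5, 0.5], [0.2, 0.8]])
P = np.array([[[0.9, 0.1], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
episode = sample_episode(nu, policy, P, r, horizon=5, rng=rng)
```

Note that the sampler only queries the environment along the trajectory: it never reads the full kernel for states it does not visit, which is the sense in which a $\nu$-restart is weaker than a generative model.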

One of the reasons the concentrability coefficient has garnered attention is its frequent appearance in the analysis of approximate policy iteration schemes. Research by Scherrer & Geist (2014) and Bhandari & Russo (2024) shows that the concentrability coefficient often governs error propagation during learning. In essence, it provides bounds on how errors in approximating value functions or policies propagate through successive updates.

Appendix C Exact algorithms

Various algorithms have been proposed in the literature to address the exact MFG problem in scenarios where the MDP kernel and the reward function are fully accessible. In this case, the value function and a best response can be computed using dynamic programming and backward induction. This approach has been used, e.g., by Perrin et al. (2020) and Pérolat et al. (2022) to implement Fictitious Play (FP) and Online Mirror Descent (OMD), respectively. Cui & Koeppl (2022) presented an exact fixed point algorithm for graphon games. Angiuli et al. (2023) analyzed the convergence of a model-specific multi-scale algorithm for MFG.

This line of research often stems from the classical control theory and optimization frameworks, tailored to solve specific MFG problems with high precision. These methods focus on the exact representation of the MFG dynamics, providing critical insights into the equilibrium behavior of large-agent systems. In this work, we propose a novel adaptation of the TRPO algorithm, building on the robust framework of Shani et al. (2020). Our adaptation incorporates key elements of the MFG structure, leveraging entropic regularization and mean-field population dynamics. Moreover, we establish finite sample complexity results for this algorithm, demonstrating its theoretical convergence properties and its practical applicability in solving the ergodic MFG problem under a finite state-action setting.

C.1 TRPO - exact formulation

TRPO, inherently structured as a mirror descent method, proves particularly well-suited for entropy-regularized settings. This framework benefits from a significant simplification: the policy update admits a closed-form solution (Beck, 2017), expressed in terms of the $Q$-function associated with the current policy. By recasting the optimization problem in terms of $Q$-function computation, the algorithm focuses on the essential dynamics of the system, effectively bypassing the need for direct policy optimization over a high-dimensional space.

This closed-form update leverages the softmax form of the policy, a direct consequence of entropy regularization. The softmax structure ensures that the updated policies remain strictly in the interior of the probability simplex $\mathcal{P}(\mathcal{S})$, avoiding deterministic solutions. This property not only facilitates numerical stability but also aligns with the theoretical foundations of the regularized problem. The use of first-order conditions becomes feasible and efficient, as the regularization term enforces a smooth, convex optimization landscape.

Moreover, the reward function's linear dependence on the policy pairs seamlessly with the coercive nature of the entropy-regularized optimization problem. The coercivity guarantees that the optimal policies minimize the objective within the confines of the simplex, effectively balancing exploration and exploitation. This alignment between the problem structure and the algorithm's mechanics underscores the power of TRPO in achieving convergence while maintaining theoretical guarantees in entropy-regularized MFG settings. By translating the original optimization problem into $Q$-function evaluations, the algorithm provides a practical yet robust pathway to finding approximate Nash equilibria in complex systems.

In the case where the mean-field population distribution parameter $\mu$ is fixed, the algorithm Exact TRPO$(\mu)$ provides a robust approximation to the value function. By iteratively updating the policy using trust region optimization techniques, the algorithm ensures convergence rates that explicitly depend on the regularization parameter $\eta$. Given a fixed $\eta$, the learning rate $1/(\eta(\ell+2))$ is optimally chosen to balance stability and efficiency in the policy updates. The following result establishes the error bounds for the value function approximation, highlighting the role of entropic regularization in convergence guarantees. Specifically, the bounds quantify the discrepancy between the value function induced by the computed policy and the optimal value function for the given mean-field population profile.
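To make the softmax structure concrete, the following sketch implements one common closed form of an entropy-regularized mirror step, $\pi_{\rm new}(a|s)\propto\pi(a|s)^{1-\eta t}\exp\!\big(t\,Q(s,a)\big)$. This specific formula, the reward-maximization sign convention, the function name, and the toy arrays are assumptions made for illustration; the exact update used by Exact TRPO$(\mu)$ should be taken from the algorithm's definition and from Shani et al. (2020).

```python
import numpy as np

def regularized_mirror_step(pi, Q, eta, t):
    """One entropy-regularized mirror step in closed (softmax) form.

    pi:  (S, A) current policy with strictly positive entries
    Q:   (S, A) Q-function of the current policy (assumed given)
    eta: regularization parameter, t: step size
    Assumed update: pi_new(a|s) proportional to pi(a|s)^(1 - eta*t) * exp(t * Q(s, a)).
    """
    logits = (1.0 - eta * t) * np.log(pi) + t * Q
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the exponential
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)

# Hypothetical toy inputs: uniform initial policy, arbitrary Q-values.
pi0 = np.full((2, 3), 1.0 / 3.0)
Q = np.array([[1.0, 0.0, -1.0],
              [0.0, 2.0, 0.0]])
eta, ell = 0.5, 0
t = 1.0 / (eta * (ell + 2))  # a step size of the form 1 / (eta * (ell + 2))
pi1 = regularized_mirror_step(pi0, Q, eta, t)
```

Because the new policy is an exponential of finite logits, every action keeps strictly positive probability, which is the interior property discussed above.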

Theorem C.1 (Theorem 16 in Shani et al. (2020)).

Fix the initial distribution $\mu\in\mathcal{P}(\mathcal{S})$. Let $\{\pi_{\ell}\}_{\ell=0}^{L}$ be the sequence generated by the Exact TRPO$(\mu)$ algorithm. Then, there exists a constant $C_{\operatorname{TRPO,0}}^{\prime}>0$ such that

$$J^{\operatorname{MFG}}(\pi_{\mu},\mu,\mu)-J^{\operatorname{MFG}}(\pi_{L},\mu,\mu)\leq C_{\operatorname{TRPO,0}}^{\prime}\,\frac{\left(\left\|\mathsf{r}\right\|_{\infty}+\eta^{2}\log^{2}|\mathcal{A}|\right)}{\eta(1-\gamma)^{3}}\cdot\frac{\log L}{L}\;. \tag{12}$$
Corollary C.2.

Let $\{\pi_{\ell}\}_{\ell=0}^{L}$ be the sequence generated by the Exact TRPO$(\mu)$ algorithm. Then, we have that

$$\sum_{s\in\mathcal{S}}\left\|\hat{\pi}_{L}(\cdot|s)-\pi_{\mu}(\cdot|s)\right\|_{\mathrm{TV}}^{2}\,\mu(s)\leq C_{\operatorname{TRPO,0}}\cdot\frac{\log L}{L}\;, \tag{13}$$

with

$$C_{\operatorname{TRPO,0}}:=C_{\operatorname{TRPO,0}}^{\prime}\,\frac{2}{\eta(1-\gamma)}\cdot\frac{\left(\left\|\mathsf{r}\right\|_{\infty}+\eta^{2}\log^{2}|\mathcal{A}|\right)}{\eta(1-\gamma)^{3}}\;.$$
Proof.

By Proposition E.2, which relates the total variation distance between a policy $\pi$ and the optimal policy $\pi_{\mu}$ to the corresponding difference in their value functions, we obtain

$$\sum_{s\in\mathcal{S}}\left\|\hat{\pi}_{L}(\cdot|s)-\pi_{\mu}(\cdot|s)\right\|_{\mathrm{TV}}^{2}\,\mu(s)\leq\frac{2}{\eta(1-\gamma)}\left(J^{\operatorname{MFG}}\left(\pi_{\mu},\mu,\mu\right)-\frac{1}{L+1}\sum_{\ell=0}^{L}J^{\operatorname{MFG}}\left(\pi_{\ell},\mu,\mu\right)\right)\;.$$

Therefore, using (12), we get (13). ∎

C.2 Exact algorithm

We now analyze Algorithm 5, which is exact and does not involve any approximation. We provide a convergence result for this algorithm in the tabular setting.

Algorithm 5 ExactAlgo
  Input: $M$.
  Initialize: $\mu_{0}$.
  for $k\in[K]$ do
     $\mu_{k}:=\mu_{k-1}+\beta_{k}\left(\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_{k-1}\right)$.   # Update population distribution
  end for
  Output: $\mu_{K}$.
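The update in Algorithm 5 is a damped fixed-point iteration on the population distribution, which can be sketched in a few lines. The following is a minimal illustration, assuming a user-supplied callable `kernel(mu)` that returns the row-stochastic matrix $\mathsf{P}_{\mu}^{\pi_{\mu}}$, i.e., that already embeds the computation of the regularized optimal policy $\pi_{\mu}$. The callable, the toy kernel, and the constant step size are hypothetical choices made only so that the example runs.

```python
import numpy as np

def exact_algo(kernel, mu0, betas, M):
    """Damped fixed-point iteration on the population distribution.

    kernel: callable mu -> (S, S) row-stochastic matrix (stands in for P_mu^{pi_mu})
    mu0:    (S,) initial distribution
    betas:  iterable of step sizes beta_1, ..., beta_K
    M:      number of kernel applications per update
    """
    mu = np.asarray(mu0, dtype=float)
    for beta in betas:
        target = mu @ np.linalg.matrix_power(kernel(mu), M)
        mu = mu + beta * (target - mu)  # convex combination when 0 <= beta <= 1
    return mu

def toy_kernel(mu):
    # Hypothetical two-state kernel with a mild dependence on mu.
    eps = 0.1 * mu[0]
    return np.array([[0.7 - eps, 0.3 + eps],
                     [0.4, 0.6]])

mu_K = exact_algo(toy_kernel, mu0=[0.5, 0.5], betas=[0.5] * 200, M=3)
```

In this toy example the dependence of the kernel on $\mu$ is weak, so the iteration contracts quickly and `mu_K` is numerically stationary for its own kernel, that is, `mu_K @ toy_kernel(mu_K)` coincides with `mu_K`.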
Proposition C.3.

Suppose that Assumptions 1, 2, and 3 hold. Assume that, for any $k\geq 0$,

$$\beta_{k}<\frac{\tau}{\tau-C_{\operatorname{op,MFG}}+C_{\pi,\mu}^{2}C_{\operatorname{Erg,M}}^{2}}\;, \tag{14}$$

with $\tau:=1-C_{\operatorname{op,MFG}}$. Then, the sequence $\{\mu_{k}\}_{k\geq 0}$ defined in ExactAlgo satisfies

$$\left\|\mu_{k}-\mu_{\star}\right\|_{2}^{2}\leq\left(1-\tau\beta_{k}\right)\left\|\mu_{k-1}-\mu_{\star}\right\|_{2}^{2}\;. \tag{15}$$

Moreover, if the step-sizes $\beta_{k}$ satisfy

$$\sum_{k=0}^{\infty}\beta_{k}=\infty\;, \tag{16}$$

we have that the exact algorithm ExactAlgo converges to the optimal policy in the tabular setting.
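To see how (15) and (16) combine, the one-step contraction can be unrolled. The following is a short sketch of this standard argument, assuming $\tau>0$ and $0\leq\tau\beta_{k}\leq 1$ for all $k$, and using only the elementary inequality $1-x\leq e^{-x}$:

$$\left\|\mu_{K}-\mu_{\star}\right\|_{2}^{2}\leq\prod_{k=1}^{K}\left(1-\tau\beta_{k}\right)\left\|\mu_{0}-\mu_{\star}\right\|_{2}^{2}\leq\exp\left(-\tau\sum_{k=1}^{K}\beta_{k}\right)\left\|\mu_{0}-\mu_{\star}\right\|_{2}^{2}\;.$$

Under (16) the exponent diverges to $-\infty$ as $K\to\infty$, so the right-hand side vanishes and $\mu_{K}\to\mu_{\star}$. For a constant step size $\beta_{k}=\beta$, the same bound gives the geometric rate $(1-\tau\beta)^{K}$.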

Proof.

We focus on the convergence of the sequence $\mu_{k}$ toward $\mu_{\star}$. Recall that $\mu_{\star}$ is the fixed point (8). From this condition, we then have that

$$\mu_{\star}=\mu_{\star}\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}=\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\;.$$

We then have that

$$\begin{aligned}
\left\|\mu_{k}-\mu_{\star}\right\|_{2}^{2}={}&\left\|\mu_{k-1}-\mu_{\star}+\beta_{k}\left\{\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_{k-1}\right\}\right\|_{2}^{2}\\
={}&\left\|\mu_{k-1}-\mu_{\star}+\beta_{k}\left\{\left(\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_{k-1}\right)-\left(\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}-\mu_{\star}\right)\right\}\right\|_{2}^{2}\\
={}&\left\|(1-\beta_{k})\left(\mu_{k-1}-\mu_{\star}\right)+\beta_{k}\left\{\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\}\right\|_{2}^{2}\\
={}&(1-\beta_{k})^{2}\left\|\mu_{k-1}-\mu_{\star}\right\|_{2}^{2}+2(1-\beta_{k})\beta_{k}\left\langle\mu_{k-1}-\mu_{\star},\,\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\rangle\\
&+\beta_{k}^{2}\left\|\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\|_{2}^{2}\;.
\end{aligned}$$

Applying Assumption 2 and Corollary E.4, together with Lemma E.1, the previous equality implies that

\[
\left\|\mu_k-\mu_\star\right\|_2^2 \leq \left[(1-\beta_k)\left(1+(2C_{\operatorname{op,MFG}}-1)\beta_k\right)+\beta_k^2 C_{\pi,\mu}^2 C_{\operatorname{Erg,M}}^2\right]\left\|\mu_{k-1}-\mu_\star\right\|_2^2\;.
\]
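To make the structure of the bracketed factor explicit, it can be expanded in powers of $\beta_k$. The identity below is plain algebra on the displayed factor and uses no symbols beyond those already introduced; the remark about when the factor is below $1$ is an elementary consequence of it and is offered only as a reading aid, the precise step-size condition being the one stated in (14).

\[
(1-\beta_k)\left(1+(2C_{\operatorname{op,MFG}}-1)\beta_k\right)+\beta_k^2 C_{\pi,\mu}^2 C_{\operatorname{Erg,M}}^2
= 1-2\left(1-C_{\operatorname{op,MFG}}\right)\beta_k+\beta_k^2\left(1-2C_{\operatorname{op,MFG}}+C_{\pi,\mu}^2 C_{\operatorname{Erg,M}}^2\right).
\]

In particular, for $\beta_k>0$ the factor is strictly smaller than $1$ exactly when $\beta_k\left(1-2C_{\operatorname{op,MFG}}+C_{\pi,\mu}^2 C_{\operatorname{Erg,M}}^2\right)<2\left(1-C_{\operatorname{op,MFG}}\right)$, which holds for all sufficiently small $\beta_k$ as soon as $C_{\operatorname{op,MFG}}<1$.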

Since $\beta_k$ satisfies (14), we obtain (15), which is a contraction inequality. Combined with (16), this implies that the sequence $\mu_k$ converges to $\mu_\star$ exponentially fast in the cumulative step size $\sum_{j\leq k}\beta_j$, i.e.,

\[
\left\|\mu_k-\mu_\star\right\|_2^2 \leq \prod_{j=1}^{k}\left(1-\tau\beta_j\right)\left\|\mu_0-\mu_\star\right\|_2^2 \leq \exp\left(-\tau\sum_{j=1}^{k}\beta_j\right)\left\|\mu_0-\mu_\star\right\|_2^2\;.
\]

The rate of convergence is determined by the step size $\beta_k$. This concludes the proof. ∎
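As an illustration of the damped update analyzed in this proof, $\mu_k=(1-\beta_k)\mu_{k-1}+\beta_k\,\mu_{k-1}\bigl(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\bigr)^{M}$, the following Python sketch runs it on a toy problem. Everything specific here is an assumption made for illustration and is not taken from the paper: the 3-state kernel `P0`, the mixing weight `lam`, the helper names, and the fact that the policy is folded into the kernel instead of being computed by a best-response step. For this toy kernel the fixed point is the stationary distribution of `P0`, so the squared error can be tracked exactly.

```python
import numpy as np

# Hypothetical 3-state row-stochastic base kernel (illustration only).
P0 = np.array([[0.5, 0.3, 0.2],
               [0.2, 0.6, 0.2],
               [0.1, 0.3, 0.6]])

def toy_kernel(mu, lam=0.3):
    """Toy mean-field kernel: mix P0 with a kernel whose rows all equal mu."""
    return (1.0 - lam) * P0 + lam * np.tile(mu, (len(mu), 1))

def damped_iteration(mu0, betas, M=2):
    """Run mu_k = (1 - beta_k) mu_{k-1} + beta_k mu_{k-1} P(mu_{k-1})^M."""
    mu = np.asarray(mu0, dtype=float)
    path = [mu.copy()]
    for beta in betas:
        P_M = np.linalg.matrix_power(toy_kernel(mu), M)
        mu = (1.0 - beta) * mu + beta * (mu @ P_M)
        path.append(mu.copy())
    return np.array(path)

def stationary(P):
    """Stationary distribution of P: left eigenvector for eigenvalue 1."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

mu_star = stationary(P0)  # fixed point of the toy update
path = damped_iteration([1.0, 0.0, 0.0], betas=[0.5] * 60)
sq_err = np.sum((path - mu_star) ** 2, axis=1)  # ||mu_k - mu_star||_2^2
print(sq_err[[0, 10, 30, 60]])
```

With a constant step size the printed squared errors shrink geometrically, which is the qualitative behavior the contraction inequality predicts; the toy does not attempt to reproduce the constants of the proposition.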

Remark C.4.

The exact algorithm ExactAlgo is a simplified version of the algorithm considered in this paper. It involves no approximation and is therefore deterministic, so its convergence rate is in line with deterministic optimization. Indeed, to reach a precision of $\varepsilon$, we need $k$ to be of order

  • $\log\left(\left\|\mu_0-\mu_\star\right\|_2^2/\varepsilon\right)\left(1-2C_{\operatorname{op,MFG}}+C_{\pi,\mu}^2 C_{\operatorname{Erg,M}}^2\right)$, if $\beta_k$ is constant, equal to some $\gamma>0$ such that (14) holds;

  • $\left\|\mu_0-\mu_\star\right\|_2^2/\varepsilon~\exp\left(1-2C_{\operatorname{op,MFG}}+C_{\pi,\mu}^2 C_{\operatorname{Erg,M}}^2\right)$, if $\beta_k=C/k$ for some $C>0$ such that (14) holds.
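The effect of the step-size schedule can also be read off numerically from the bound $\exp\bigl(-\tau\sum_{j\leq k}\beta_j\bigr)\left\|\mu_0-\mu_\star\right\|_2^2$ obtained in the proof above. The sketch below finds the smallest $k$ for which this bound drops below $\varepsilon$. The function name `iterations_needed` and all numerical values (`tau`, `d0`, `eps`, and the schedule constants) are hypothetical choices for illustration, and the code does not check condition (14).

```python
import math

def iterations_needed(d0, eps, tau, step, max_iter=10**7):
    """Smallest k with exp(-tau * sum_{j<=k} step(j)) * d0 <= eps, else None.

    d0 stands for ||mu_0 - mu_star||_2^2 and step(j) for beta_j.
    """
    # The bound is below eps once the cumulative step size reaches this target.
    target = math.log(d0 / eps) / tau
    total = 0.0
    for k in range(1, max_iter + 1):
        total += step(k)
        if total >= target:
            return k
    return None

# Hypothetical constants, chosen only so that the loop stays short.
tau, d0, eps = 0.8, 1.0, 0.05

k_const = iterations_needed(d0, eps, tau, lambda k: 0.5)      # beta_k = gamma
k_harm = iterations_needed(d0, eps, tau, lambda k: 0.5 / k)   # beta_k = C / k
print(k_const, k_harm)
```

Under this bound alone, a constant step $\gamma$ needs about $\log(\left\|\mu_0-\mu_\star\right\|_2^2/\varepsilon)/(\tau\gamma)$ iterations, whereas $\beta_k=C/k$ makes the cumulative step size grow only like $C\log k$, so the required $k$ grows polynomially in $1/\varepsilon$ instead of logarithmically.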

We now analyze the convergence of the algorithm with approximation, i.e., Algorithm 2.

Theorem C.5.

Suppose that Assumptions 1 and 2 hold. Assume that the step sizes satisfy

\[
\beta_k<b_0:=\frac{\tau}{4C_{\pi,\mu}^2 C_{\operatorname{Erg,M}}^2+2\tau}\;,\qquad\text{for } k\geq 1\;, \tag{17}
\]

Then, the exact algorithm Exact MF-TRPO converges to the optimal policy in the tabular setting. In particular, we have that

\[
\left\|\mu_k-\mu_\star\right\|_2^2 \leq \exp\left(-\frac{\tau}{2}\sum_{j=1}^{k}\beta_j\right)\left\|\mu_0-\mu_\star\right\|_2^2+\frac{2C_{\operatorname{MF,0}}}{\tau}\cdot\frac{\log(L)}{L}\;, \tag{18}
\]

with

\[
\begin{aligned}
\tau &:= 1-C_{\operatorname{op,MFG}}\;,\\
C_{\operatorname{MF,0}} &:= \frac{2+b_0}{\tau}\cdot C_{\operatorname{Erg,M}}\,C_{\operatorname{TRPO,0}}\;.
\end{aligned}
\]
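To see how the two terms of (18) trade off, the bound can be evaluated directly from the constants in the statement. The following Python sketch does so; the function names `step_cap` and `bound_18` and the numerical constants in the usage lines are hypothetical and serve only as an illustration, since the values of $C_{\operatorname{op,MFG}}$, $C_{\pi,\mu}$, $C_{\operatorname{Erg,M}}$ and $C_{\operatorname{TRPO,0}}$ depend on the game at hand.

```python
import math

def step_cap(tau, c_pi_mu, c_erg):
    """b_0 = tau / (4 C_{pi,mu}^2 C_{Erg,M}^2 + 2 tau), the cap in (17)."""
    return tau / (4.0 * c_pi_mu**2 * c_erg**2 + 2.0 * tau)

def bound_18(d0, betas, L, tau, c_pi_mu, c_erg, c_trpo):
    """Return the two terms on the right-hand side of (18).

    d0 stands for ||mu_0 - mu_star||_2^2 and betas for (beta_1, ..., beta_k).
    """
    b0 = step_cap(tau, c_pi_mu, c_erg)
    if any(b >= b0 for b in betas):
        raise ValueError("every step size must be strictly below b_0")
    c_mf0 = (2.0 + b0) / tau * c_erg * c_trpo              # C_{MF,0}
    transient = math.exp(-0.5 * tau * sum(betas)) * d0      # decays with k
    floor = 2.0 * c_mf0 / tau * math.log(L) / L             # decays with L only
    return transient, floor

# Hypothetical constants: tau = 0.5 and all other constants equal to 1.
transient, floor = bound_18(d0=1.0, betas=[0.05] * 400, L=10**4,
                            tau=0.5, c_pi_mu=1.0, c_erg=1.0, c_trpo=1.0)
print(transient, floor)
```

The first term vanishes as the cumulative step size grows, while the second is a floor that does not depend on $k$ and can only be lowered by increasing $L$.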
Proof.

We focus on the convergence of the sequence $\mu_k\to\mu_\star$, with $\mu_\star$ as in (8). Denote by $\pi_k$ the output of Exact TRPO$(\mu_k)$ at each step. We then have that

\begin{align*}
\left\|\mu_k-\mu_\star\right\|_2^2
={}& \left\|\mu_{k-1}-\mu_\star+\beta_k\left\{\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_k}\right)^{M}-\mu_{k-1}\right\}\right\|_2^2 \\
={}& \Bigl\|\mu_{k-1}-\mu_\star+\beta_k\left(\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_k}\right)^{M}-\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}\right) \\
&\quad +\beta_k\left(\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_{k-1}\right)\Bigr\|_2^2 \\
={}& \Bigl\|\mu_{k-1}-\mu_\star+\beta_k\left(\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_k}\right)^{M}-\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}\right) \\
&\quad +\beta_k\left(\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_{k-1}\right)-\beta_k\left(\mu_\star\left(\mathsf{P}_{\mu_\star}^{\pi_{\mu_\star}}\right)^{M}-\mu_\star\right)\Bigr\|_2^2 \\
={}& \Bigl\|(1-\beta_k)\left(\mu_{k-1}-\mu_\star\right)+\beta_k\left(\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_k}\right)^{M}-\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}\right) \\
&\quad +\beta_k\left(\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_\star\left(\mathsf{P}_{\mu_\star}^{\pi_{\mu_\star}}\right)^{M}\right)\Bigr\|_2^2 \\
={}& (1-\beta_k)^2\left\|\mu_{k-1}-\mu_\star\right\|_2^2+2(1-\beta_k)\beta_k\left\langle\mu_{k-1}-\mu_\star,\ \mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_\star\left(\mathsf{P}_{\mu_\star}^{\pi_{\mu_\star}}\right)^{M}\right\rangle \\
&+\beta_k^2\left\|\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_\star\left(\mathsf{P}_{\mu_\star}^{\pi_{\mu_\star}}\right)^{M}\right\|_2^2 \\
&+2(1-\beta_k)\beta_k\left\langle\mu_{k-1}-\mu_\star,\ \mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_k}\right)^{M}-\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}\right\rangle \\
&+\beta_k^2\left\|\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_k}\right)^{M}-\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}\right\|_2^2 \\
&+2\beta_k^2\left\langle\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_k}\right)^{M}-\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M},\ \mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_\star\left(\mathsf{P}_{\mu_\star}^{\pi_{\mu_\star}}\right)^{M}\right\rangle\;.
\end{align*}

Applying Assumption 2 and Corollary E.4 together with Lemma E.1, and following the same lines as in the proof of Proposition C.3, the previous equality implies that

\left\|\mu_k-\mu_\star\right\|_2^2 \leq \left[(1-\beta_k)\left(1+(1-2\tau)\beta_k\right)+\beta_k^2 C_{\pi,\mu}^2 C_{\operatorname{Erg,M}}^2\right]\left\|\mu_{k-1}-\mu_\star\right\|_2^2
\quad +2(1-\beta_k)\beta_k\left\langle\mu_{k-1}-\mu_\star,\ \mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_k}\right)^{M}-\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}\right\rangle
\quad +\beta_k^2\left\|\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_k}\right)^{M}-\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}\right\|_2^2
\quad +2\beta_k^2\left\langle\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_k}\right)^{M}-\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M},\ \mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}-\mu_\star\left(\mathsf{P}_{\mu_\star}^{\pi_{\mu_\star}}\right)^{M}\right\rangle
( sansserif_P start_POSTSUBSCRIPT italic_μ start_POSTSUBSCRIPT ⋆ end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT italic_μ start_POSTSUBSCRIPT ⋆ end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ⟩
=:absent:\displaystyle=:= : [(1βk)(1+(12τ)βk)+βk2Cπ,μ2CErg,M2]μk1μ22delimited-[]1subscript𝛽𝑘112𝜏subscript𝛽𝑘superscriptsubscript𝛽𝑘2superscriptsubscript𝐶𝜋𝜇2superscriptsubscript𝐶ErgM2superscriptsubscriptnormsubscript𝜇𝑘1subscript𝜇22\displaystyle\left[(1-\beta_{k})\left(1+(1-2\tau)\beta_{k}\right)+\beta_{k}^{2% }C_{\pi,\mu}^{2}C_{\operatorname{Erg,M}}^{2}\right]\left\|\mu_{k-1}-\mu_{\star% }\right\|_{{2}}^{2}[ ( 1 - italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ( 1 + ( 1 - 2 italic_τ ) italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) + italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_π , italic_μ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT roman_Erg , roman_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ] ∥ italic_μ start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT - italic_μ start_POSTSUBSCRIPT ⋆ end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
+2βk(1βk)𝐓1+βk2𝐓2+2βk2𝐓3.2subscript𝛽𝑘1subscript𝛽𝑘subscript𝐓1superscriptsubscript𝛽𝑘2subscript𝐓22superscriptsubscript𝛽𝑘2subscript𝐓3\displaystyle+2\beta_{k}(1-\beta_{k})\mathbf{T}_{1}+\beta_{k}^{2}\mathbf{T}_{2% }+2\beta_{k}^{2}\mathbf{T}_{3}\;.+ 2 italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( 1 - italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) bold_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + 2 italic_β start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_T start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT .

We now study the terms $\mathbf{T}_1$, $\mathbf{T}_2$, and $\mathbf{T}_3$. Using Young's inequality, we get that

\begin{align*}
\left|\mathbf{T}_1\right| &\leq \frac{\tau}{2}\left\|\mu_{k-1}-\mu_\star\right\|_2^2 + \frac{1}{2\tau}\left\|\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_k}\right)^{M} - \mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}\right\|_2^2 \\
&\leq \frac{\tau}{2}\left\|\mu_{k-1}-\mu_\star\right\|_2^2 + \frac{1}{2\tau}\mathbf{T}_2 \;,
\end{align*}

and, using Lemma E.1 and Corollary E.4,

\begin{align*}
\left|\mathbf{T}_3\right| &\leq \frac{1}{2}\left\|\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M} - \mu_\star\left(\mathsf{P}_{\mu_\star}^{\pi_{\mu_\star}}\right)^{M}\right\|_2^2 + \frac{1}{2}\left\|\mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_k}\right)^{M} - \mu_{k-1}\left(\mathsf{P}_{\mu_{k-1}}^{\pi_{\mu_{k-1}}}\right)^{M}\right\|_2^2 \\
&\leq \frac{1}{2} C_{\pi,\mu}^2 C_{\operatorname{Erg,M}}^2 \left\|\mu_{k-1}-\mu_\star\right\|_2^2 + \frac{1}{2}\mathbf{T}_2 \;.
\end{align*}
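As a quick sanity check on the weighted form of Young's inequality used for $\mathbf{T}_1$ and $\mathbf{T}_3$, namely $|\langle a,b\rangle| \leq \frac{\tau}{2}\|a\|_2^2 + \frac{1}{2\tau}\|b\|_2^2$ for any $\tau>0$, the following sketch tests it on random vectors. The helper names (`inner`, `young_gap`) and the sampled ranges are illustrative assumptions, not part of the paper.

```python
import random


def inner(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def young_gap(a, b, tau):
    """Return tau/2*|a|^2 + 1/(2*tau)*|b|^2 - |<a, b>|.

    Young's inequality states that this quantity is nonnegative for tau > 0.
    """
    return 0.5 * tau * inner(a, a) + inner(b, b) / (2.0 * tau) - abs(inner(a, b))


# Hypothetical random test cases; dimensions and ranges are arbitrary choices.
rng = random.Random(0)
gaps = []
for _ in range(1000):
    dim = rng.randint(1, 6)
    a = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    tau = rng.uniform(0.05, 1.0)
    gaps.append(young_gap(a, b, tau))

min_gap = min(gaps)  # smallest slack observed over all samples
```

The bound for $\mathbf{T}_1$ corresponds to a general weight $\tau$, while the bound for $\mathbf{T}_3$ is the special case $\tau = 1$.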

Since $\tau<1$ from Assumption 2 and $\beta_k$ satisfies (17), a straightforward computation shows that

\begin{equation*}
(1-\beta_k)\left(1+(1-\tau)\beta_k\right) + 2\beta_k^2 C_{\pi,\mu}^2 C_{\operatorname{Erg,M}}^2 - \frac{\tau}{2}\beta_k \leq \left(1-\frac{\tau}{2}\beta_k\right) \;.
\end{equation*}

Moreover, applying Lemma E.1 together with Theorem C.1 on the performance of TRPO, we have

\begin{align*}
\mathbf{T}_2 &\leq C_{\operatorname{Erg,M}}\left(\sum_{s\in\mathcal{S}} \mu_{k-1}^2(s)\left\|\pi_k(\cdot|s)-\pi_{\mu_{k-1}}(\cdot|s)\right\|_{\mathrm{TV}}^2\right) \\
&\leq C_{\operatorname{Erg,M}}\left(\sum_{s\in\mathcal{S}} \mu_{k-1}(s)\left\|\pi_k(\cdot|s)-\pi_{\mu_{k-1}}(\cdot|s)\right\|_{\mathrm{TV}}^2\right) \\
&\leq C_{\operatorname{Erg,M}}\left(J^{\operatorname{MFG}}(\pi_{\mu_{k-1}},\mu_{k-1},\mu_{k-1}) - J^{\operatorname{MFG}}(\pi_k,\mu_{k-1},\mu_{k-1})\right) \leq C_{\operatorname{Erg,M}}\, 2C_{\operatorname{TRPO,0}}\frac{\log(L)}{L} \;.
\end{align*}

Therefore, combining the previous inequalities, we have that

\begin{equation}
\begin{split}
\left\|\mu_k-\mu_\star\right\|_2^2 \leq{}& \left(1-\frac{\tau}{2}\beta_k\right)\left\|\mu_{k-1}-\mu_\star\right\|_2^2 + \beta_k\left(\frac{1-\beta_k}{\tau}+\frac{3\beta_k}{2}\right)\cdot 2C_{\operatorname{Erg,M}}C_{\operatorname{TRPO,0}}\,\frac{\log(L)}{L} \\
\leq{}& \left(1-\frac{\tau}{2}\beta_k\right)\left\|\mu_{k-1}-\mu_\star\right\|_2^2 + \beta_k\frac{2+b_0}{\tau}\cdot C_{\operatorname{Erg,M}}C_{\operatorname{TRPO,0}}\,\frac{\log(L)}{L} \\
\leq{}& \left(1-\frac{\tau}{2}\beta_k\right)\left\|\mu_{k-1}-\mu_\star\right\|_2^2 + \beta_k C_{\operatorname{MF,0}}\,\frac{\log(L)}{L} \;.
\end{split}
\tag{19}
\end{equation}

Unrolling the recursion (19), we obtain

\begin{align*}
\left\|\mu_k-\mu_\star\right\|_2^2 \leq{}& \prod_{j=1}^{k}\left(1-\frac{\tau}{2}\beta_j\right)\left\|\mu_0-\mu_\star\right\|_2^2 + C_{\operatorname{MF,0}}\sum_{j=1}^{k}\frac{\log(L)}{L}\beta_j\prod_{\ell=j+1}^{k}\left(1-\frac{\tau}{2}\beta_\ell\right) \\
\leq{}& \exp\left(-\frac{\tau}{2}\sum_{j=1}^{k}\beta_j\right)\left\|\mu_0-\mu_\star\right\|_2^2 + C_{\operatorname{MF,0}}\,\frac{\log(L)}{L}\sum_{j=1}^{k}\beta_j\prod_{\ell=j+1}^{k}\left(1-\frac{\tau}{2}\beta_\ell\right) \;.
\end{align*}

Note that the second term on the r.h.s. of the previous inequality is a telescoping sum, since each summand can be rewritten as

\begin{equation*}
\beta_j\prod_{\ell=j+1}^{k}\left(1-\frac{\tau}{2}\beta_\ell\right) = \frac{2}{\tau}\left[\prod_{\ell=j+1}^{k}\left(1-\frac{\tau}{2}\beta_\ell\right) - \prod_{\ell=j}^{k}\left(1-\frac{\tau}{2}\beta_\ell\right)\right] \;.
\end{equation*}
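The telescoping identity can be checked numerically. The sketch below compares the weighted sum $\sum_{j=1}^{k}\beta_j\prod_{\ell=j+1}^{k}(1-\frac{\tau}{2}\beta_\ell)$ with its telescoped value $\frac{2}{\tau}\bigl[1-\prod_{\ell=1}^{k}(1-\frac{\tau}{2}\beta_\ell)\bigr]$, which is at most $2/\tau$ whenever each factor lies in $[0,1]$. The step sizes and the value of $\tau$ are arbitrary illustrative choices and are not claimed to satisfy condition (17).

```python
import random


def tail_product(beta, tau, start, k):
    """prod_{l=start}^{k} (1 - tau/2 * beta[l]), 1-based; empty product is 1."""
    p = 1.0
    for l in range(start, k + 1):
        p *= 1.0 - 0.5 * tau * beta[l]
    return p


def weighted_sum(beta, tau, k):
    """Left-hand side: sum_j beta_j * prod_{l=j+1}^{k} (1 - tau/2 * beta_l)."""
    return sum(beta[j] * tail_product(beta, tau, j + 1, k) for j in range(1, k + 1))


def telescoped(beta, tau, k):
    """Right-hand side after telescoping: (2/tau) * (1 - prod_{l=1}^{k} (...))."""
    return (2.0 / tau) * (1.0 - tail_product(beta, tau, 1, k))


# Hypothetical values; beta[0] is a placeholder so that indices start at 1.
rng = random.Random(1)
tau, k = 0.4, 25
beta = [0.0] + [rng.uniform(0.0, 1.0) for _ in range(k)]
lhs = weighted_sum(beta, tau, k)
rhs = telescoped(beta, tau, k)
```

Here `lhs` and `rhs` agree up to floating-point error, and both stay below `2 / tau`, which is where the factor $2/\tau$ in the final bound comes from.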

Therefore, we get (18). ∎
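To illustrate the argument, one can iterate the equality case of recursion (19) and compare it with the closed-form envelope $\exp\bigl(-\frac{\tau}{2}\sum_{j\leq k}\beta_j\bigr)\|\mu_0-\mu_\star\|_2^2 + \frac{2C_{\operatorname{MF,0}}}{\tau}\frac{\log L}{L}$ that the unrolling and telescoping steps produce. This is only a sketch: the constants (`tau`, `c_mf`, `L`, the constant step size, the initial squared distance) are made-up values and the function names are hypothetical.

```python
import math


def unroll(a0, betas, tau, c):
    """Iterate a_k = (1 - tau/2 * beta_k) * a_{k-1} + beta_k * c.

    This is the equality case of the recursion, with c playing the role of
    C_MF0 * log(L) / L and a_k the squared distance to the equilibrium.
    """
    a = a0
    traj = [a]
    for b in betas:
        a = (1.0 - 0.5 * tau * b) * a + b * c
        traj.append(a)
    return traj


def closed_form_bound(a0, betas, tau, c):
    """Envelope exp(-tau/2 * sum_{j<=k} beta_j) * a0 + 2*c/tau for k = 0, 1, ..."""
    bounds = [a0 + 2.0 * c / tau]
    partial = 0.0
    for b in betas:
        partial += b
        bounds.append(math.exp(-0.5 * tau * partial) * a0 + 2.0 * c / tau)
    return bounds


# Hypothetical constants chosen only for illustration.
tau, c_mf, L = 0.5, 2.0, 1000
c = c_mf * math.log(L) / L
betas = [0.2] * 50
traj = unroll(2.0, betas, tau, c)
bounds = closed_form_bound(2.0, betas, tau, c)
```

In this toy run every entry of `traj` stays below the corresponding entry of `bounds`: the first term decays geometrically in $\sum_j \beta_j$, while the residual floor scales like $\log(L)/L$.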

$\varepsilon$-MFNE.

With the theoretical foundations established in Theorem C.1 and Theorem C.5, we can now quantify how close the output of the proposed algorithm is to the MFNE. Specifically, we show that Exact MF-TRPO achieves an $\varepsilon$-MFNE, where the approximation error $\varepsilon$ is explicitly quantified as follows.

Corollary C.6.

Suppose that Assumptions 1, 2, and 3 hold. Assume that, for any $k\geq 0$, the learning rate $\beta_k$ satisfies (17). Let $\mu_k$ (resp. $\pi_{k+1}$) be the output of Exact MF-TRPO (resp. Exact TRPO$(\mu_k)$). Then, $(\pi_{k+1},\mu_k)$ is an $\varepsilon_k$-MFNE, with

\begin{align*}
\varepsilon_k &:= \delta_{\text{NE},1,k} + C_{\phi}\left(\left(1+C_{\operatorname{Erg,\infty}}\left(1+C_{\pi,\mu}\right)\right)\sqrt{\delta_{\text{NE},2,k}} + \frac{2C_{\operatorname{Erg,\infty}}\sqrt{|\mathcal{S}|}}{\eta(1-\gamma)}\cdot\sqrt{\delta_{\text{NE},1,k}}\right) \;, \\
\delta_{\text{NE},1,k} &= C_{\operatorname{TRPO,0}}^{\prime}\,\frac{\left(\left\|\mathsf{r}\right\|_{\infty}+\eta^2\log^2|\mathcal{A}|\right)}{\eta(1-\gamma)^3}\cdot\frac{\log L}{L} \;, \\
\delta_{\text{NE},2,k} &= \exp\left(-\frac{\tau}{2}\sum_{j=1}^{k}\beta_j\right)\left\|\mu_0-\mu_\star\right\|_2 + \frac{2C_{\operatorname{MF,0}}}{\tau}\cdot\frac{\log(L)}{L} \;.
\end{align*}
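To see how the three quantities in the statement combine, the following sketch evaluates $\delta_{\text{NE},1,k}$, $\delta_{\text{NE},2,k}$, and $\varepsilon_k$ exactly as written above. All numerical values of the constants are placeholders invented for illustration, the parameter names are hypothetical, and $\log^2|\mathcal{A}|$ is read as $(\log|\mathcal{A}|)^2$.

```python
import math


def delta_ne_1(L, c_trpo_prime, r_sup, eta, gamma, n_actions):
    """delta_{NE,1,k}: TRPO optimality-gap term, of order log(L)/L."""
    num = r_sup + eta**2 * math.log(n_actions) ** 2
    return c_trpo_prime * num / (eta * (1.0 - gamma) ** 3) * math.log(L) / L


def delta_ne_2(L, betas, tau, dist0, c_mf):
    """delta_{NE,2,k}: mean-field term, with dist0 standing for |mu_0 - mu_*|_2."""
    decay = math.exp(-0.5 * tau * sum(betas))
    return decay * dist0 + 2.0 * c_mf / tau * math.log(L) / L


def epsilon_k(d1, d2, c_phi, c_erg, c_pi_mu, n_states, eta, gamma):
    """epsilon_k assembled from delta_{NE,1,k} (d1) and delta_{NE,2,k} (d2)."""
    mean_field_part = (1.0 + c_erg * (1.0 + c_pi_mu)) * math.sqrt(d2)
    policy_part = 2.0 * c_erg * math.sqrt(n_states) / (eta * (1.0 - gamma)) * math.sqrt(d1)
    return d1 + c_phi * (mean_field_part + policy_part)


def eps_for(L, k):
    """Evaluate epsilon_k for placeholder constants (not taken from the paper)."""
    d1 = delta_ne_1(L, c_trpo_prime=1.0, r_sup=1.0, eta=0.1, gamma=0.9, n_actions=4)
    d2 = delta_ne_2(L, betas=[0.2] * k, tau=0.5, dist0=1.0, c_mf=2.0)
    return epsilon_k(d1, d2, c_phi=1.0, c_erg=1.0, c_pi_mu=1.0,
                     n_states=10, eta=0.1, gamma=0.9)
```

Under these placeholder constants, increasing either the number of TRPO iterations `L` or the number of outer iterations `k` decreases `eps_for(L, k)`. Because of the square roots, the dependence on `L` is of order $\sqrt{\log(L)/L}$ rather than $\log(L)/L$.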
Proof.

From Proposition E.5, the exploitability of a policy $\pi$ and a mean-field parameter $\mu$ can be bounded by the optimality gap of $\pi$ w.r.t. the value function $J^{\operatorname{MFG}}(\cdot,\mu,\mu)$ and the distance between $\mu$ and the stationary distribution $\lambda_{\pi,\mu}$.

From Theorem C.1, we have that

\begin{equation}
\max_{\pi\in\Pi} J^{\operatorname{MFG}}(\pi,\mu_k,\mu_k) - J^{\operatorname{MFG}}(\pi_k,\mu_k,\mu_k) \leq \delta_{\text{NE},1,k} \;.
\tag{20}
\end{equation}

On the other hand, since $(\pi_{\mu_\star},\mu_\star)$ is an MFNE, we have $\lambda_{\pi_{\mu_\star},\mu_\star}=\mu_\star$, and therefore

\begin{equation*}
\mu_k-\lambda_{\pi_{k+1},\mu_k} = \left(\mu_k-\mu_\star\right) + \left(\lambda_{\pi_{\mu_\star},\mu_\star}-\lambda_{\pi_{\mu_k},\mu_k}\right) + \left(\lambda_{\pi_{\mu_k},\mu_k}-\lambda_{\pi_{k+1},\mu_k}\right) \;.
\end{equation*}

Then, applying Lemma E.1, together with Corollary E.4, we obtain

\begin{equation*}
\left\|\lambda_{\pi_{\mu_\star},\mu_\star}-\lambda_{\pi_{\mu_k},\mu_k}\right\|_2 \leq C_{\operatorname{Erg,\infty}}\left(1+C_{\pi,\mu}\right)\left\|\mu_\star-\mu_k\right\|_2 \;.
\end{equation*}

Moreover, from Theorem C.1 on the performance of Exact TRPO, together with Lemma E.1 and Proposition E.2, we have

\begin{align*}
\left\|\lambda_{\pi_{\mu_k},\mu_k}-\lambda_{\pi_{k+1},\mu_k}\right\|_2 &\leq C_{\operatorname{Erg,\infty}}\sum_{s\in\mathcal{S}}\mu_k(s)\left\|\pi_{k+1}(\cdot|s)-\pi_{\mu_k}(\cdot|s)\right\|_{\mathrm{TV}} \\
&\leq \frac{2C_{\operatorname{Erg,\infty}}\sqrt{|\mathcal{S}|}}{\eta(1-\gamma)}\sqrt{J^{\operatorname{MFG}}(\pi_{\mu_k},\mu_k,\mu_k)-J^{\operatorname{MFG}}(\pi_{k+1},\mu_k,\mu_k)} \leq \frac{2C_{\operatorname{Erg,\infty}}\sqrt{|\mathcal{S}|}}{\eta(1-\gamma)}\cdot\sqrt{\delta_{\text{NE},1,k}} \;.
\end{align*}

Using the triangle inequality, together with Theorem C.5, we can bound $\left\|\mu_{k}-\lambda_{\pi_{k+1},\mu_{k}}\right\|_{2}$ as

$$\begin{aligned}
\left\|\mu_{k}-\lambda_{\pi_{k+1},\mu_{k}}\right\|_{2}
&\leq\left(1+C_{\operatorname{Erg,\infty}}\left(1+C_{\pi,\mu}\right)\right)\left\|\mu_{\star}-\mu_{k}\right\|_{2}+\frac{2C_{\operatorname{Erg,\infty}}\sqrt{|\mathcal{S}|}}{\eta(1-\gamma)}\cdot\sqrt{\delta_{\text{NE},1,k}}\\
&\leq\left(1+C_{\operatorname{Erg,\infty}}\left(1+C_{\pi,\mu}\right)\right)\sqrt{\delta_{\text{NE},2,k}}+\frac{2C_{\operatorname{Erg,\infty}}\sqrt{|\mathcal{S}|}}{\eta(1-\gamma)}\cdot\sqrt{\delta_{\text{NE},1,k}}\;.
\end{aligned}$$

Using the last inequality and (20), together with Proposition E.5, we have that

$$\phi(\pi_{k+1},\mu_{k})\leq\delta_{\text{NE},1,k}+C_{\phi}\left(\left(1+C_{\operatorname{Erg,\infty}}\left(1+C_{\pi,\mu}\right)\right)\sqrt{\delta_{\text{NE},2,k}}+\frac{2C_{\operatorname{Erg,\infty}}\sqrt{|\mathcal{S}|}}{\eta(1-\gamma)}\cdot\sqrt{\delta_{\text{NE},1,k}}\right)=\varepsilon_{k}\;.$$

Therefore, $(\mu_{k},\pi_{k+1})$ is an $\varepsilon_{k}$-MFNE as defined in Definition 2.1. ∎
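To make the dependence of $\varepsilon_{k}$ on the two error terms concrete, the following minimal sketch evaluates the right-hand side of the last display. The function name and argument names are illustrative and not part of the original analysis, and the numerical values of the constants $C_{\phi}$, $C_{\operatorname{Erg,\infty}}$ and $C_{\pi,\mu}$ must be supplied by the user, since the text only asserts their existence.

```python
import math


def epsilon_k(delta_ne1, delta_ne2, c_phi, c_erg, c_pi_mu, n_states, eta, gamma):
    """Evaluate the bound defining epsilon_k (all constants are user-supplied)."""
    # Term controlled by the population error delta_{NE,2,k}.
    population_term = (1.0 + c_erg * (1.0 + c_pi_mu)) * math.sqrt(delta_ne2)
    # Term controlled by the policy optimization error delta_{NE,1,k}.
    policy_term = (
        2.0 * c_erg * math.sqrt(n_states) / (eta * (1.0 - gamma))
    ) * math.sqrt(delta_ne1)
    return delta_ne1 + c_phi * (population_term + policy_term)
```

Because both errors enter through square roots, halving $\varepsilon_{k}$ requires reducing $\delta_{\text{NE},1,k}$ and $\delta_{\text{NE},2,k}$ by roughly a factor of four when the square-root terms dominate.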

Remark C.7.

From this corollary, it follows directly that to achieve an $\varepsilon$-MFNE, the required number of inner policy updates $L$ and outer population updates $K$ must satisfy the following scaling conditions:

$$L\in\widetilde{O}\left(1/\varepsilon^{2}\right)\quad\text{and}\quad K\in\widetilde{O}\left(\log(1/\varepsilon)\right).$$

This implies that the sample complexity of the proposed algorithm scales polynomially in $1/\varepsilon$ with respect to the inner optimization steps and logarithmically with respect to the outer mean-field updates. This confirms the efficiency of our approach, ensuring that even for small values of $\varepsilon$, convergence to an approximate MFNE remains computationally feasible.

Appendix D Model free algorithms

Model-free approaches play a fundamental role in developing model-agnostic algorithms capable of autonomously adapting to diverse and evolving environments. In the MFG context, various models have been proposed across different domains (Perrin et al., 2021; Yardim et al., 2023), and recent efforts have explored data-driven methodologies to enhance their applicability. Following this line of research, we introduce Sample-Based MF-TRPO, a model-free approach tailored for MFG problems. By leveraging RL techniques with scalable sample-based updates, our method contributes to the growing body of work on data-driven MFG solutions, providing finite-sample complexity guarantees in this setting. This framework further aligns MFG with model-free learning paradigms, broadening their potential for real-world deployment in complex decision-making environments.

D.1 TRPO - sample-based formulation

In the framework established by Shani et al. (2020), it is important to note that the sample-based algorithm does not provide a last-iterate sample complexity guarantee. This contrasts with the exact algorithm, where the policy improvement lemma (Lemma 15, Shani et al., 2020) serves as a foundation for analyzing the convergence properties of the last iterate. In the exact update setting, this guarantee is analogous to Howard’s lemma (Howard, 1960). However, the presence of sampling errors in the sample-based setting hinders the attainment of such guarantees, necessitating a more refined approach when designing and analyzing RL algorithms for MFG.

To address this limitation, it becomes essential to consider strategies that go beyond a direct application of Theorem 5 in Shani et al. (2020). This theorem nevertheless provides a framework for analyzing the uniform mixture of the policies generated during the iterative procedure, rather than relying solely on the last iterate. By shifting focus to this mixture policy, we can generalize the theoretical guarantees of the algorithm, a property inherent to the MFG setting.

Additionally, the connection between the value function and the policy space plays a crucial role. Proposition E.2 ensures that the gap in value functions directly bounds the differences between policies. This property provides a pathway to refine the policy improvement process and derive meaningful finite-sample complexity guarantees. By combining these insights, we can propose a robust methodology where the sample-based algorithm achieves convergence with high probability, utilizing uniform mixture policies to overcome the challenges posed by the lack of last-iterate guarantees.

Overall, this refinement introduces a smarter utilization of the sample-based algorithm, emphasizing the role of averaging in mitigating the variability and uncertainty inherent in sample-based methods. This approach not only aligns with the theoretical underpinnings of convex optimization but also strengthens the practical applicability of RL algorithms in MFG, delivering finite-sample complexity results with rigorous probabilistic guarantees.

Algorithm 6 Sample-Based TRPO$(\mu)$
1:  Initialize: $\hat{\pi}_{0}(\cdot|s)=\mathcal{U}(\mathcal{A})$ for any $s\in\mathcal{S}$.
2:  Input: $\epsilon$, $\delta>0$, $L$.
3:  for $\ell\in[L]$ do
4:     $S_{\ell}^{I_{\ell}}=\{\}$, $\forall s,a$, $Q_{\hat{\pi}_{\ell},\mu}(s,a)=0$, $n_{\ell}(s,a)=0$
5:     $I_{\ell}\geq\frac{|\mathcal{A}|^{2}\left(\left\|\mathsf{r}\right\|_{\infty}^{2}+\eta^{2}\log^{2}|\mathcal{A}|\right)\left(|\mathcal{S}|\log 2|\mathcal{A}|+\log\frac{1}{\delta}\right)}{(1-\gamma)^{2}\epsilon^{2}}$ # Sample Trajectories
6:     $T_{\ell}\geq\frac{1}{1-\gamma}\log\left(\frac{\epsilon}{|\mathcal{A}|\left(\left\|\mathsf{r}\right\|_{\infty}+\eta\log|\mathcal{A}|\right)}\right)$ # Rollout horizon
7:     for $i=1,\dots,I_{\ell}$ do
8:        Sample $s_{i}\sim\overline{\mathsf{d}}_{\nu,\mu}^{\hat{\pi}_{\ell}}(\cdot)$, $a_{i}\sim\mathcal{U}(\mathcal{A})$
9:        $Q_{\hat{\pi}_{\ell},\mu}(s_{i},a_{i},i)\leftarrow\mathsf{r}(s_{i},a_{i},\mu)+\sum_{t=1}^{T_{\ell}}\gamma^{t}\,\mathbb{E}_{s_{t}\sim\delta_{s_{i}}(\mathsf{P}_{\mu}^{\hat{\pi}_{\ell}})^{t},\,a_{t}\sim\hat{\pi}_{\ell}(\cdot|s_{t})}\left[\mathsf{r}(s_{t},a_{t},\mu)+\eta\log\big(\hat{\pi}_{\ell}(a_{t}|s_{t})\big)\right]$ # Truncated rollout
10:        $Q_{\hat{\pi}_{\ell},\mu}(s_{i},a_{i})\leftarrow Q_{\hat{\pi}_{\ell},\mu}(s_{i},a_{i})+Q_{\hat{\pi}_{\ell},\mu}(s_{i},a_{i},i)$
11:        $n_{\ell}(s_{i},a_{i})\leftarrow n_{\ell}(s_{i},a_{i})+1$
12:        $S_{\ell}^{I_{\ell}}=S_{\ell}^{I_{\ell}}\cup\{s_{i}\}$
13:     end for
14:     for $s\in S_{\ell}^{I_{\ell}}$ do
15:        for $a\in\mathcal{A}$ do
16:           $Q_{\hat{\pi}_{\ell},\mu}(s,a)\leftarrow\frac{|\mathcal{A}|\,Q_{\hat{\pi}_{\ell},\mu}(s,a)}{\sum_{a^{\prime}\in\mathcal{A}}n_{\ell}(s,a^{\prime})}$
17:        end for
18:        $\hat{\pi}_{\ell+1}(a|s)\leftarrow\frac{\hat{\pi}_{\ell}(a|s)\exp\left(\frac{1}{\eta(\ell+2)}\left(Q_{\hat{\pi}_{\ell},\mu}(s,a)-\eta\log\hat{\pi}_{\ell}(a|s)\right)\right)}{\sum_{a^{\prime}\in\mathcal{A}}\hat{\pi}_{\ell}(a^{\prime}|s)\exp\left(\frac{1}{\eta(\ell+2)}\left(Q_{\hat{\pi}_{\ell},\mu}(s,a^{\prime})-\eta\log\hat{\pi}_{\ell}(a^{\prime}|s)\right)\right)}$
19:     end for
20:  end for
21:  Output: $\hat{\pi}^{\texttt{Unif},\mu}_{L}$.
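Steps 16 and 18 of the algorithm above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: policies and accumulated returns are stored as dense arrays of shape (number of states, number of actions), the function names `normalize_q` and `policy_update` are hypothetical, and the update is carried out in log space for numerical stability, which is an implementation choice not discussed in the text.

```python
import numpy as np


def normalize_q(q_sum, counts):
    """Step 16: rescale accumulated rollout returns by |A| / (visits to s)."""
    n_actions = q_sum.shape[1]
    visits = counts.sum(axis=1, keepdims=True)  # sum over a' of n(s, a')
    q_hat = np.zeros_like(q_sum, dtype=float)
    visited = visits[:, 0] > 0                  # states that were sampled
    q_hat[visited] = n_actions * q_sum[visited] / visits[visited]
    return q_hat, visited


def policy_update(pi, q_hat, visited, eta, ell):
    """Step 18: exponentiated update with step size 1 / (eta * (ell + 2))."""
    step = 1.0 / (eta * (ell + 2))
    log_pi = np.log(pi)                         # requires strictly positive pi
    logits = log_pi + step * (q_hat - eta * log_pi)
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    new_pi = np.exp(logits)
    new_pi /= new_pi.sum(axis=1, keepdims=True)
    out = pi.copy()
    out[visited] = new_pi[visited]              # unvisited states are unchanged
    return out
```

Starting from the uniform policy keeps every entry strictly positive, so the logarithm in `policy_update` is always well defined along the iterations.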
Remark D.1.

For a fixed $\mu$, the output of Sample-Based TRPO$(\mu)$ is the uniform mixture policy $\hat{\pi}^{\texttt{Unif},\mu}_{L}$. This policy is such that, in the unregularized case, we have (11). It consists of a mixture of $\hat{\pi}_{\ell}$, for $\ell=0,\dots,L$. The following dedicated subroutine achieves the sampling process in a computationally efficient manner, without performing a direct mixture at every decision step.

Algorithm 7 Uniform-Mixture$(\{\hat{\pi}_{\ell}\}_{\ell=0,\dots,L})$
1:  Input: $\{\hat{\pi}_{\ell}\}_{\ell=0,\dots,L}$.
2:  Draw a random variable $\hat{\ell}\sim\mathcal{U}(\{0,1,\dots,L\})$.
3:  Output: $\hat{\pi}_{\hat{\ell}}$

This subroutine is designed to sample from the mixture policy efficiently, without explicitly computing an arithmetic average at the sampling level, particularly when it is used within the inner loop of Sample-Based TRPO$(\mu)$. Moreover, given that the number of iterations $L$ is fixed beforehand, the procedure begins by drawing a random variable $\hat{\ell}$ uniformly from the set $\{0,1,\dots,L\}$. Once $\hat{\ell}$ is selected, the sampling step follows the policy $\hat{\pi}_{\hat{\ell}}$. This approach ensures that the selected action is drawn from the policy $\hat{\pi}^{\texttt{Unif},\mu}_{L}$ without incurring unnecessary computational overhead during execution.
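The index-drawing procedure can be sketched as follows. This is an illustrative sketch that assumes tabular policies stored as NumPy arrays of shape (number of states, number of actions). The helper names are hypothetical, and the choice to draw the index once and reuse the selected iterate for subsequent actions reflects one reading of the description above.

```python
import numpy as np


def draw_mixture_component(policies, rng):
    """Pick one iterate uniformly at random from pi_0, ..., pi_L."""
    ell_hat = rng.integers(0, len(policies))  # uniform on {0, ..., L}
    return ell_hat, policies[ell_hat]


def sample_action(policy, state, rng):
    """Draw an action from the selected tabular policy at `state`."""
    return rng.choice(policy.shape[1], p=policy[state])
```

Only a single integer draw is needed up front, so the per-step cost is that of sampling from one tabular policy rather than averaging $L+1$ action distributions.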

In particular, due to the regularization term, we have that the following inequality holds:

$$\frac{1}{L+1}\sum_{\ell=0}^{L}J^{\operatorname{MFG}}(\hat{\pi}_{\ell},\mu,\mu)\leq J^{\operatorname{MFG}}(\hat{\pi}^{\texttt{Unif},\mu}_{L},\mu,\mu)\;.\tag{21}$$

In the absence of regularization, the objective is linear in the occupancy measure, which allows for exact equalities when considering mixtures of policies. However, once the entropic regularization term is introduced, the objective becomes concave in the occupancy measure (see, e.g., Neu et al. (2017) for a proof). As a result, we only obtain the previous inequality rather than (11) when averaging over iterates, as in the relation involving the mixture policy.
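The effect of concavity can be checked numerically in the simplest possible case. The sketch below is only an illustration under strong assumptions: it uses a single state, so that the occupancy measure coincides with the policy, and it takes the objective to be expected reward plus $\eta$ times the Shannon entropy. The reward vector and the three policies are arbitrary made-up values, and the general multi-state statement relies on the argument cited above rather than on this toy check.

```python
import numpy as np


def regularized_value(pi, reward, eta):
    """Single-state objective: expected reward plus eta times Shannon entropy."""
    pi = np.asarray(pi, dtype=float)
    entropy = -np.sum(pi * np.log(pi))
    return float(pi @ reward + eta * entropy)


reward = np.array([1.0, 0.0, 0.5])           # hypothetical rewards
iterates = np.array([[0.7, 0.2, 0.1],        # hypothetical policy iterates
                     [0.1, 0.8, 0.1],
                     [0.2, 0.2, 0.6]])
mixture = iterates.mean(axis=0)              # uniform mixture of the iterates

average_of_values = np.mean(
    [regularized_value(p, reward, eta=0.5) for p in iterates]
)
value_of_mixture = regularized_value(mixture, reward, eta=0.5)
gap = value_of_mixture - average_of_values   # nonnegative by concavity of entropy
```

Setting `eta=0.0` in the same computation makes the gap vanish up to rounding, which mirrors the exact equality available in the unregularized case.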

Theorem D.2 (Based on Theorem 5 in Shani et al. (2020)).

Suppose Assumption 4 holds. Fix $\epsilon,\delta>0$. Let $\{\hat{\pi}_{\ell}\}_{\ell\geq 0}$ be the sequence generated by Sample-Based TRPO$(\mu)$, using

$$I_{\ell}\geq\frac{|\mathcal{A}|^{2}\left(\left\|\mathsf{r}\right\|_{\infty}^{2}+\eta^{2}\log^{2}|\mathcal{A}|\right)\left(|\mathcal{S}|\log 2|\mathcal{A}|+\log\frac{1}{\delta}\right)}{(1-\gamma)^{2}\epsilon^{2}}$$

trajectories in each iteration and a rollout up to time $T_{\ell}$ with

$$T_{\ell}\geq\frac{1}{1-\gamma}\log\left(\frac{\epsilon}{|\mathcal{A}|\left(\left\|\mathsf{r}\right\|_{\infty}+\eta\log|\mathcal{A}|\right)}\right)\;.$$

Then, there exists $C_{\operatorname{TRPO,1}}^{\prime}>0$ such that for all $L\geq 1$, the following holds with probability greater than $1-\delta$

$$\begin{aligned}
&J^{\operatorname{MFG}}\left(\pi_{\mu},\mu,\mu\right)-J^{\operatorname{MFG}}(\hat{\pi}^{\texttt{Unif},\mu}_{L},\mu,\mu)\\
&\leq J^{\operatorname{MFG}}\left(\pi_{\mu},\mu,\mu\right)-\frac{1}{L+1}\sum_{\ell=0}^{L}J^{\operatorname{MFG}}\left(\hat{\pi}_{\ell},\mu,\mu\right)\\
&\leq C_{\operatorname{TRPO,1}}^{\prime}\left(\frac{\left(\left\|\mathsf{r}\right\|_{\infty}^{2}+\eta^{2}\log^{2}|\mathcal{A}|\right)|\mathcal{A}|^{2}\log L}{\eta(1-\gamma)^{3}(L+1)}+\frac{\epsilon}{(1-\gamma)^{2}}\left\|\frac{\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}}{\nu}\right\|_{\infty}\right)\;.
\end{aligned}\tag{22}$$
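For a sense of scale, the per-iteration trajectory budget $I_{\ell}$ required by the theorem can be evaluated numerically. The sketch below is a hedged illustration: the function name is hypothetical, `r_max` stands for $\|\mathsf{r}\|_{\infty}$, and the term $\log 2|\mathcal{A}|$ is read as $\log(2|\mathcal{A}|)$, which is an assumption about the intended grouping.

```python
import math


def min_trajectories(n_states, n_actions, r_max, eta, gamma, eps, delta):
    """Smallest integer satisfying the lower bound on I_ell in the theorem."""
    # Squared magnitude of the regularized reward signal.
    scale = r_max ** 2 + (eta * math.log(n_actions)) ** 2
    # Union-bound term: |S| log(2|A|) + log(1/delta).
    union = n_states * math.log(2 * n_actions) + math.log(1.0 / delta)
    bound = n_actions ** 2 * scale * union / ((1.0 - gamma) ** 2 * eps ** 2)
    return math.ceil(bound)
```

The budget grows as $1/\epsilon^{2}$ and $1/(1-\gamma)^{2}$, so tightening the accuracy or lengthening the effective horizon both increase the number of rollouts quadratically.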
Proof.

The proof of this result is based on the proof of Shani et al. (Theorem 5, 2020). The main difference is that we are considering the uniform mixture of the policies generated during the iterative procedure.

Applying Shani et al. (Lemma 19, 2020), we get

$$
\begin{aligned}
\frac{1-\gamma}{\eta(\ell+2)}&\left(J^{\operatorname{MFG}}\left(\pi_{\mu},\mu,\mu\right)-J^{\operatorname{MFG}}\left(\hat{\pi}_{\ell},\mu,\mu\right)\right)\\
&\leq\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\left(\left(1-\frac{1}{\ell+2}\right)\mathbf{D}_{\Omega}(\pi_{\mu},\hat{\pi}_{\ell})-\mathbf{D}_{\Omega}(\pi_{\mu},\hat{\pi}_{\ell+1})\right)+\frac{h^{2}(\ell)}{2\eta^{2}(\ell+2)^{2}}+\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\epsilon_{\ell}\\
&\leq\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\left(\frac{\ell+1}{\ell+2}\mathbf{D}_{\Omega}(\pi_{\mu},\hat{\pi}_{\ell})-\mathbf{D}_{\Omega}(\pi_{\mu},\hat{\pi}_{\ell+1})\right)+\frac{h^{2}(L)}{2\eta^{2}(\ell+2)^{2}}+\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\epsilon_{\ell}\;,
\end{aligned}
$$

with

$$
\begin{aligned}
\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\epsilon_{\ell}&=\sum_{s\in\mathcal{S}}\frac{\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}(s)}{I_{\ell}\,\overline{\mathsf{d}}_{\nu,\mu}^{\pi_{\mu}}(s)}\sum_{i=1}^{I_{\ell}}\mathbb{1}_{\{s=s_{i}\}}\sum_{a\in\mathcal{A}}\bigg(\frac{1}{\eta(\ell+2)}\big(|\mathcal{A}|Q_{\hat{\pi}_{\ell},\mu}(s,a,i)-\eta\left(1+\log\hat{\pi}_{\ell}(a|s)\right)\big)\\
&\qquad\qquad\qquad-\log\hat{\pi}_{\ell+1}(a|s)+\log\hat{\pi}_{\ell}(a|s)\bigg)\left(\hat{\pi}_{\ell+1}(a|s)-\pi_{\mu}(a|s)\right),\\
h(\ell)&=(1+8\eta)\frac{\left\|\mathsf{r}\right\|_{\infty}+\eta\log|\mathcal{A}|}{1-\gamma}\log\ell\;,
\end{aligned}
$$

where the second inequality uses that $h$ is a non-decreasing function. Multiplying both sides by $\eta(\ell+2)$, summing from $\ell=0$ to $L$, and using the linearity of expectation, we get

$$
\begin{aligned}
(1-\gamma)&\sum_{\ell=0}^{L}\left(J^{\operatorname{MFG}}\left(\pi_{\mu},\mu,\mu\right)-J^{\operatorname{MFG}}\left(\hat{\pi}_{\ell},\mu,\mu\right)\right)\\
&\leq\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\big(\mathbf{D}_{\Omega}(\pi_{\mu},\pi_{0})-(L+2)\mathbf{D}_{\Omega}(\pi_{\mu},\pi_{L+1})\big)+\sum_{\ell=0}^{L}\frac{h^{2}(L)}{2\eta(\ell+2)}+\sum_{\ell=0}^{L}\eta(\ell+2)\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\epsilon_{\ell}\\
&\leq\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\mathbf{D}_{\Omega}(\pi_{\mu},\pi_{0})+\sum_{\ell=0}^{L}\frac{h^{2}(L)}{2\eta(\ell+2)}+\sum_{\ell=0}^{L}\eta(\ell+2)\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\epsilon_{\ell}\\
&\leq\log|\mathcal{A}|+\sum_{\ell=0}^{L}\frac{h^{2}(L)}{2\eta(\ell+2)}+\sum_{\ell=0}^{L}\eta(\ell+2)\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\epsilon_{\ell}\;,
\end{aligned}
$$

with the occupancy measure $\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}$ defined as in (6), where the second relation holds by the positivity of the Bregman distance, and the third relation by Shani et al. (Lemma 28, 2020) for uniformly initialized $\pi_{0}$. Dividing by $1-\gamma$ and bounding $\sum_{\ell=0}^{L}\frac{1}{\ell+2}$ by a constant multiple of $\log L$, we get

$$
\sum_{\ell=0}^{L}\left(J^{\operatorname{MFG}}\left(\pi_{\mu},\mu,\mu\right)-J^{\operatorname{MFG}}\left(\hat{\pi}_{\ell},\mu,\mu\right)\right)\leq\frac{\log|\mathcal{A}|}{1-\gamma}+C_{\operatorname{TRPO},1}^{\prime}\frac{h^{2}(L)\log L}{\eta(1-\gamma)}+\frac{1}{1-\gamma}\sum_{\ell=0}^{L}\eta(\ell+2)\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\epsilon_{\ell}\;.
$$
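For reference, the two elementary facts behind the preceding displays can be written out as follows. This is a sketch in generic notation: $a_{\ell}$ is a placeholder (not a symbol from the original) standing for the Bregman term at iteration $\ell$, and the integral comparison is the standard bound on a harmonic sum; the exact constant absorbed into $C_{\operatorname{TRPO},1}^{\prime}$ is not specified here.

```latex
% Telescoping: the weights (\ell+1) and (\ell+2) cancel between consecutive terms.
\sum_{\ell=0}^{L}\big((\ell+1)\,a_{\ell}-(\ell+2)\,a_{\ell+1}\big)
  = a_{0}-(L+2)\,a_{L+1},
\qquad
% Harmonic sum: compare each term 1/m with the integral of 1/x over [m-1, m].
\sum_{\ell=0}^{L}\frac{1}{\ell+2}
  \leq \int_{1}^{L+2}\frac{\mathrm{d}x}{x}
  = \log(L+2).
```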

Dividing by $L+1$, we obtain

$$
\begin{aligned}
\frac{1}{L+1}&\sum_{\ell=0}^{L}\left(J^{\operatorname{MFG}}\left(\pi_{\mu},\mu,\mu\right)-J^{\operatorname{MFG}}\left(\hat{\pi}_{\ell},\mu,\mu\right)\right)\\
&\leq\frac{\log|\mathcal{A}|}{(1-\gamma)(L+1)}+C_{\operatorname{TRPO},1}^{\prime}\frac{h^{2}(L)\log L}{\eta(1-\gamma)(L+1)}+\frac{1}{(1-\gamma)(L+1)}\sum_{\ell=0}^{L}\eta(\ell+2)\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\epsilon_{\ell}\;.
\end{aligned}
$$

Plugging in Shani et al. (Lemma 22 and Lemma 23, 2020), we get that for any $(\epsilon,\delta)$, if the number of trajectories in the $\ell$-th iteration satisfies

$$
I_{\ell}\geq\frac{8|\mathcal{A}|^{2}\left(\left\|\mathsf{r}\right\|_{\infty}^{2}+\eta^{2}\log^{2}|\mathcal{A}|\right)}{\epsilon^{2}(1-\gamma)^{2}}\left(|\mathcal{S}|\log 2|\mathcal{A}|+\log\frac{\pi^{2}(\ell+1)^{2}}{6\delta}\right)\;,
$$

and the rollout is performed up to time $T_{\ell}$ with

$$
T_{\ell}\geq\frac{1}{1-\gamma}\log\left(\frac{\epsilon}{|\mathcal{A}|\left(\left\|\mathsf{r}\right\|_{\infty}+\eta\log|\mathcal{A}|\right)}\right)\;,
$$

then with probability at least $1-\delta$,

$$
\begin{aligned}
\frac{1}{L+1}&\sum_{\ell=0}^{L}\left(J^{\operatorname{MFG}}\left(\pi_{\mu},\mu,\mu\right)-J^{\operatorname{MFG}}\left(\hat{\pi}_{\ell},\mu,\mu\right)\right)\\
&\leq\frac{\log|\mathcal{A}|}{(1-\gamma)(L+1)}+C_{\operatorname{TRPO},1}^{\prime}\frac{h^{2}(L)\log L}{\eta(1-\gamma)(L+1)}+\frac{1}{(1-\gamma)(L+1)}\sum_{\ell=0}^{L}\eta(\ell+2)\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}\epsilon_{\ell}+C_{\operatorname{TRPO},1}^{\prime}\frac{\epsilon}{(1-\gamma)^{2}}\left\|\frac{\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}}{\nu}\right\|_{\infty},
\end{aligned}
$$

where we used Assumption 4 to bound the last term. Thus, combining this with (21), we obtain that

$$
\begin{aligned}
J^{\operatorname{MFG}}\left(\pi_{\mu},\mu,\mu\right)&-J^{\operatorname{MFG}}(\hat{\pi}^{\texttt{Unif},\mu}_{L},\mu,\mu)\\
&\leq\frac{1}{L+1}\sum_{\ell=0}^{L}\left(J^{\operatorname{MFG}}\left(\pi_{\mu},\mu,\mu\right)-J^{\operatorname{MFG}}\left(\hat{\pi}_{\ell},\mu,\mu\right)\right)\\
&\leq C_{\operatorname{TRPO},1}^{\prime}\left(\frac{\left(\left\|\mathsf{r}\right\|_{\infty}^{2}+\eta^{2}\log^{2}|\mathcal{A}|\right)|\mathcal{A}|^{2}\log L}{\eta(1-\gamma)^{3}(L+1)}+\frac{\epsilon}{(1-\gamma)^{2}}\left\|\frac{\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}}{\nu}\right\|_{\infty}\right)\;.
\end{aligned}
$$
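To get a feel for the sample sizes involved, the condition on $I_{\ell}$ used in the proof can be evaluated numerically. The following is a minimal sketch and not part of the original analysis: the function name and argument names are hypothetical, and $\log 2|\mathcal{A}|$ is assumed to mean $\log(2|\mathcal{A}|)$.

```python
import math


def trajectories_lower_bound(n_states, n_actions, r_inf, eta, gamma, eps, delta, ell):
    """Right-hand side of the condition on I_ell (a real number, not rounded).

    n_states, n_actions: sizes of the state and action spaces.
    r_inf: sup-norm of the reward; eta: regularization parameter.
    gamma: discount factor; eps, delta: accuracy and confidence levels.
    ell: index of the TRPO iteration.
    """
    scale = 8 * n_actions**2 * (r_inf**2 + eta**2 * math.log(n_actions) ** 2)
    scale /= eps**2 * (1 - gamma) ** 2
    # Assumption: "log 2|A|" is read as log(2 * |A|).
    union_term = n_states * math.log(2 * n_actions)
    union_term += math.log(math.pi**2 * (ell + 1) ** 2 / (6 * delta))
    return scale * union_term


params = dict(n_states=5, n_actions=3, r_inf=1.0, eta=0.1, gamma=0.9, delta=0.05)
bound_first = trajectories_lower_bound(eps=0.1, ell=0, **params)
bound_finer = trajectories_lower_bound(eps=0.05, ell=0, **params)
bound_later = trajectories_lower_bound(eps=0.1, ell=10, **params)
```

The bound scales as $\epsilon^{-2}$ and grows only logarithmically in the iteration index $\ell$ and in $1/\delta$.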

D.2 Initialization step in the sample-based algorithm

The initialization step of the Sample-Based MF-TRPO algorithm presents particular challenges due to the limited operations allowed, specifically the reset and action operations, as described in Section 2. During each iteration of the algorithm, the initial state must be sampled from the distribution $\hat{\mu}_{k-1}$.

As detailed in Section 5, the distribution update in the algorithm follows the iterative rule:

$$
\hat{\mu}_{k}\leftarrow\hat{\mu}_{k-1}+\beta_{k}\left(\hat{\zeta}_{k}-\hat{\mu}_{k-1}\right),
$$

where $\hat{\zeta}_{k}$ is the output of a single iteration of the Sample-Based MF-TRPO algorithm.

Consequently, at iteration $k$, the distribution $\hat{\mu}_{k-1}$ is an estimator of

$$
\prod_{j=1}^{k-1}(1-\beta_{j})\,\nu+\sum_{\ell=1}^{k-1}\beta_{\ell}\prod_{j=\ell+1}^{k-1}(1-\beta_{j})\,\nu\left(\mathsf{P}_{\hat{\mu}_{1}}^{\hat{\pi}_{1}}\right)^{M}\cdots\left(\mathsf{P}_{\hat{\mu}_{\ell}}^{\hat{\pi}_{\ell}}\right)^{M}\;,
$$

since the update $\hat{\zeta}_{\ell}$ at iteration $\ell$ of Sample-Based MF-TRPO is an unbiased estimator of the product distribution $\nu(\mathsf{P}_{\hat{\mu}_{1}}^{\hat{\pi}_{1}})^{M}\cdots(\mathsf{P}_{\hat{\mu}_{\ell}}^{\hat{\pi}_{\ell}})^{M}$.
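The mixture representation above can be checked numerically by unrolling the update rule. The sketch below is an illustration under stated assumptions: the helper names `damped_update` and `mixture_weights` are hypothetical, distributions are plain Python lists over a finite state space, and the vectors in `zetas` are made-up stand-ins for the estimates $\hat{\zeta}_{\ell}$.

```python
def damped_update(mu, zeta, beta):
    """One step of mu <- mu + beta * (zeta - mu), componentwise."""
    return [m + beta * (z - m) for m, z in zip(mu, zeta)]


def mixture_weights(betas):
    """Weights (w_0, ..., w_n) obtained by unrolling the update n = len(betas) times.

    w_0 = prod_j (1 - beta_j) multiplies the initial distribution nu, and
    w_l = beta_l * prod_{j > l} (1 - beta_j) multiplies the l-th update.
    """
    n = len(betas)
    weights = []
    for level in range(n + 1):
        w = 1.0 if level == 0 else betas[level - 1]
        for j in range(level, n):  # betas[j] stores beta_{j+1}
            w *= 1.0 - betas[j]
        weights.append(w)
    return weights


# Illustrative data (hypothetical values).
nu = [0.5, 0.5, 0.0]
zetas = [[0.2, 0.3, 0.5], [0.1, 0.1, 0.8], [0.6, 0.2, 0.2]]
betas = [0.5, 0.3, 0.2]

# Iterate the recursion.
mu = nu
for zeta, beta in zip(zetas, betas):
    mu = damped_update(mu, zeta, beta)

# Rebuild the same vector as an explicit mixture of nu and the updates.
weights = mixture_weights(betas)
components = [nu] + zetas
mixture = [sum(w * c[s] for w, c in zip(weights, components)) for s in range(len(nu))]
```

Since the weights are nonnegative and sum to one whenever each $\beta_{j}\in[0,1]$, the iterate remains a probability vector.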

To correctly initialize the environment to a state $s$ such that $s\sim\hat{\mu}_{k}$, the following subroutine is applied:

  1. Sampling a Level. Define the categorical random variable ${\rm Cat}_{k}$ that takes values in the discrete space $\{0,1,\dots,k\}$ with probabilities given by

    $$\left(\prod_{j=1}^{k}(1-\beta_{j}),\ \beta_{1}\prod_{j=2}^{k}(1-\beta_{j}),\ \dots,\ \beta_{k-1}(1-\beta_{k}),\ \beta_{k}\right),$$

    i.e., with the convention $\beta_{0}=1$,

    $$\mathbb{P}\left({\rm Cat}_{k}=\ell\right)=\beta_{\ell}\prod_{j=\ell+1}^{k}(1-\beta_{j})\;.$$

  2. Selecting a Level. Draw a sample $\hat{\ell}\sim{\rm Cat}_{k}$.

  3. Rollout Procedure. Starting from an initial state sampled as $s^{\rm init}_{0,p,k}\sim\nu$, execute a rollout of the Markov transition kernels up to level $\hat{\ell}$, with $\hat{\mu}_{k-1}$ being the particle approximation of $\nu(\mathsf{P}_{\hat{\mu}_{1}}^{\hat{\pi}_{1}})^{M}\cdots(\mathsf{P}_{\hat{\mu}_{\ell}}^{\hat{\pi}_{\ell}})^{M}$.
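The three steps above can be sketched as follows. This is only an illustration under simplifying assumptions: the function names are hypothetical, each level $j$ is represented by a tabular row-stochastic matrix `kernels[j-1]` standing in for $\mathsf{P}_{\hat{\mu}_{j}}^{\hat{\pi}_{j}}$ (in the sample-based setting these transitions would instead come from the environment's action operation), `n_steps` plays the role of $M$, and Python's standard `random` module supplies the randomness.

```python
import random


def level_probabilities(betas):
    """Probabilities of Cat_k on {0, ..., k} for betas = [beta_1, ..., beta_k]."""
    k = len(betas)
    probs = []
    for level in range(k + 1):
        p = 1.0 if level == 0 else betas[level - 1]  # convention beta_0 = 1
        for j in range(level, k):  # betas[j] stores beta_{j+1}
            p *= 1.0 - betas[j]
        probs.append(p)
    return probs


def sample_initial_state(nu, kernels, betas, n_steps, rng):
    """Sample a level, then roll out n_steps transitions per level from s ~ nu."""
    states = range(len(nu))
    # Steps 1-2: draw the level from the categorical distribution.
    levels = range(len(betas) + 1)
    level = rng.choices(levels, weights=level_probabilities(betas))[0]
    # Step 3: start from nu and apply the kernels of levels 1, ..., level.
    state = rng.choices(states, weights=nu)[0]
    for j in range(level):
        for _ in range(n_steps):
            state = rng.choices(states, weights=kernels[j][state])[0]
    return level, state


# Illustrative two-state example (hypothetical values).
rng = random.Random(0)
nu = [1.0, 0.0]
kernels = [
    [[0.9, 0.1], [0.2, 0.8]],  # stand-in for the level-1 kernel
    [[0.5, 0.5], [0.5, 0.5]],  # stand-in for the level-2 kernel
]
betas = [0.5, 0.25]
probs = level_probabilities(betas)
level, state = sample_initial_state(nu, kernels, betas, n_steps=3, rng=rng)
```

Drawing the level first and rolling out afterwards avoids storing the intermediate distributions explicitly: only the reset to $\nu$ and forward transitions are needed.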

D.3 Sample-Based Algorithm and High-Probability Estimates

Transitioning from exact computations to a sample-based setting, we introduce estimators for the key quantities involved in the learning process. These estimators leverage sampled trajectories to approximate the necessary expectations while maintaining computational efficiency.

To ensure the reliability of these approximations, we establish high-probability error bounds by leveraging concentration inequalities. This allows us to rigorously assess the performance of the algorithm, providing quantitative guarantees on the estimation error and its impact on the overall convergence rate. Through this probabilistic framework, we ensure that the sample-based algorithm retains stability and efficiency despite the inherent stochasticity.

Algorithm 8 Sample-Based MF-TRPO
1:  Input: $K$.
2:  Initialize: Initial policy $\pi_0(\cdot|s) = \mathcal{U}(\mathcal{A})$, for any $s \in \mathcal{S}$. Initial distribution $\mu_0 = \nu$.
3:  for $k \in [K]$ do
4:     $\hat{\pi}_k \leftarrow$ Sample-Based TRPO$(\hat{\mu}_{k-1})$. # Update of the policy.
5:     for $p \in [P]$ do
6:        $\hat{\ell} \sim \mathrm{Cat}_k$. # Sampling a level.
7:        Sample $s^{\rm init}_{0,p,k} \sim \nu$.
8:        for $\ell \in [\hat{\ell}-1]$ do
9:           for $m \in [M]$ do
10:              Sample $s^{\rm init}_{m+(\ell-1)M,p,k} \sim \mathsf{P}_{\hat{\mu}_\ell}^{\hat{\pi}_\ell}(\cdot \,|\, s^{\rm init}_{(m-1)+(\ell-1)M,p,k})$. # Rollout from the MDP for level $\ell$.
11:           end for
12:        end for
13:        Initialize $s_{0,p,k} = s^{\rm init}_{\hat{\ell}M,p,k}$. # Initialization.
14:        for $m \in [M]$ do
15:           Sample $s_{m,p,k} \sim \sum_{a\in\mathcal{A}} \mathsf{P}(\cdot \,|\, s_{m-1,p,k}, a, \hat{\mu}_{k-1})\, \hat{\pi}_k(a \,|\, s_{m-1,p,k})$. # Rollout from the MDP.
16:        end for
17:        $\hat{\zeta}_{k,p} \leftarrow \mathsf{1}_{\{s_{M,p,k}\}}(\cdot)$.
18:     end for
19:     $\hat{\zeta}_k \leftarrow \frac{1}{P}\sum_{p=1}^{P} \hat{\zeta}_{k,p}$.
20:     $\hat{\mu}_k \leftarrow \hat{\mu}_{k-1} + \beta_k\left(\hat{\zeta}_k - \hat{\mu}_{k-1}\right)$. # Update population distribution.
21:  end for
22:  Output: $\mu_K$.
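Lines 14 to 20 of the algorithm above (the $M$-step rollout, the indicator estimator, and the population update) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the name `population_update` is hypothetical, `kernel` is an assumed row-stochastic matrix standing in for the policy-averaged kernel $\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k}$, and the level sampling and warm-up rollout (lines 6 to 13) are replaced by a given array of starting states.

```python
import numpy as np

def population_update(mu_prev, start_states, kernel, M, beta, rng):
    """Return mu_prev + beta * (zeta_hat - mu_prev) for one outer iteration.

    kernel: (S, S) row-stochastic matrix (assumed stand-in for the
            policy-averaged transition kernel at this iteration).
    start_states: the P starting states s_{0,p,k}, taken as given here.
    """
    n_states = mu_prev.size
    states = np.asarray(start_states, dtype=int).copy()
    for _ in range(M):  # M-step rollout of every particle
        states = np.array([rng.choice(n_states, p=kernel[s]) for s in states])
    # Empirical mean of the indicators of the final states: zeta_hat.
    zeta_hat = np.bincount(states, minlength=n_states) / states.size
    return mu_prev + beta * (zeta_hat - mu_prev)
```

Since $\hat{\zeta}_k$ is a probability vector and $\beta_k \in [0,1]$, the returned vector is again a probability vector.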

Examining the Sample-Based MF-TRPO algorithm, we observe that two approximations are introduced in the learning process. First, the policy update is performed through Sample-Based TRPO, whose high-probability finite-sample analysis is established in Theorem D.2. This result ensures that, with high probability, the policy iterates remain well-controlled throughout the optimization process. Second, in order to analyze the evolution of the mean-field population distribution, we need a similar high-probability bound on the estimation of the term $\hat{\mu}_{k-1}\,(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k})^M$, which represents the transition dynamics under the estimated policy.

The unbiased estimator of this term uses the trajectories $\{s_{m,p,k-1}\}_{m=0}^{M}$ and is given by the empirical mean of $\hat{\zeta}_{k,p} = \mathsf{1}_{\{s_{M,p,k}\}}(\cdot)$. Note that each component of this vector follows a Bernoulli distribution and is centered at the corresponding component of $\hat{\mu}_{k-1}\,(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k})^M$. We therefore define the martingale difference term $\epsilon_k$ as

$$\epsilon_k = \frac{1}{P}\sum_{p=1}^{P} \hat{\zeta}_{k,p} - \hat{\mu}_{k-1}\,(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k})^M\;,$$

with $s_{M,p,k-1}$ defined as in Sample-Based MF-TRPO.
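To make the role of $\epsilon_k$ explicit, the population update of Sample-Based MF-TRPO can be rewritten as a perturbed version of the exact recursion. The display below is only a rearrangement of the update rule and of the definition of $\epsilon_k$ given above, using $\hat{\zeta}_k = \frac{1}{P}\sum_{p=1}^{P}\hat{\zeta}_{k,p}$; it is not an additional result.

```latex
\begin{align*}
\hat{\mu}_k
  &= \hat{\mu}_{k-1} + \beta_k\big(\hat{\zeta}_k - \hat{\mu}_{k-1}\big) \\
  &= (1-\beta_k)\,\hat{\mu}_{k-1}
     + \beta_k\,\hat{\mu}_{k-1}\,\big(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k}\big)^{M}
     + \beta_k\,\epsilon_k \;.
\end{align*}
```

In this form, the sampling noise enters each iteration only through the additive term $\beta_k \epsilon_k$.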

To control the error $\epsilon_k$, we first derive Proposition D.3, a preliminary concentration result that bounds, with high probability, the error incurred in a single iteration of the algorithm. Specifically, it shows that the estimation error at each iteration is a bounded martingale increment whose norm falls below any prescribed tolerance once enough trajectories are sampled. This result is pivotal: it ensures that the errors introduced in each iteration are controlled and do not diverge as the algorithm progresses, and it lays the foundation for a rigorous convergence analysis of Sample-Based MF-TRPO.

Proposition D.3.

For any $\epsilon > 0$ and $\delta > 0$, if the number of trajectories in the $k$-th iteration satisfies

$$P \geq \frac{64}{\epsilon^2}\log\frac{2}{\delta}\;,$$

then, with probability at least $1-\delta$, the following holds:

$$\left\|\epsilon_k\right\|_2 \leq \epsilon\;.$$
Proof.

From the previous considerations, $\hat{\zeta}_k$ is unbiased and $\epsilon_k$ is a martingale difference. Moreover, note that $\hat{\zeta}_{k,p} - \hat{\mu}_{k-1}\,(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k})^M$ is a bounded vector, i.e.,

$$\begin{aligned}
\left\|\hat{\zeta}_{k,p} - \hat{\mu}_{k-1}\,(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k})^M\right\|_2
&\leq 2\left\|\hat{\zeta}_{k,p} - \hat{\mu}_{k-1}\,(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k})^M\right\|_1 \\
&= 2\sum_{s\in\mathcal{S}} \left|\hat{\zeta}_{k,p}(s) - \hat{\mu}_{k-1}\,(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k})^M(s)\right| \\
&\leq 2\left(\sum_{s\in\mathcal{S}} \hat{\zeta}_{k,p}(s) + \sum_{s\in\mathcal{S}} \hat{\mu}_{k-1}\,(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k})^M(s)\right) = 4\;.
\end{aligned}$$

Moreover, using Jensen's inequality, we have that

$$\left\|\epsilon_k\right\|_2 = \left\|\frac{1}{P}\sum_{p=1}^{P} \hat{\zeta}_{k,p} - \hat{\mu}_{k-1}\,(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k})^M\right\|_2 \leq \frac{1}{P}\sum_{p=1}^{P} \left\|\hat{\zeta}_{k,p} - \hat{\mu}_{k-1}\,(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k})^M\right\|_2 \leq 4\;.$$

Therefore, $\epsilon_k$ is a bounded martingale difference. To show that the increment is small with high probability, we use Hoeffding's inequality. Write $\epsilon_k = \frac{1}{P}\sum_{p=1}^{P} \epsilon_{k,p}$, where $\epsilon_{k,p} = \hat{\zeta}_{k,p} - \hat{\mu}_{k-1}\,(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k})^M$. Then, for any $\epsilon > 0$, we have

$$\begin{aligned}
\mathbb{P}\left(\frac{1}{P}\sum_{p=0}^{P}\sum_{s\in\mathcal{S}} \left|\epsilon_{k,p}(s) - \hat{\mu}_{k-1}\,(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k})^M(s)\right| \geq \frac{\epsilon}{4}\right)
&= \mathbb{P}\left(\frac{1}{P}\sum_{p=0}^{P} \left\|\epsilon_{k,p} - \hat{\mu}_{k-1}\,\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k}\right)^M\right\|_1 \geq \frac{\epsilon}{4}\right) \\
&\leq 2\exp\left(-\frac{P\epsilon^2}{64}\right) =: \delta\;.
\end{aligned}$$

This consideration is a special case of the Generalized Freedman inequality as presented in Harvey et al. (2019). The inequality provides sharp high-probability bounds for the sum of bounded, dependent random variables. In our case, the formulation is simplified due to the presence of a uniform bound on the variables we aim to control.

Therefore, in order to guarantee that

$$\begin{aligned}
\left\|\epsilon_k - \hat{\mu}_{k-1}\,\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k}\right)^M\right\|_2
&\leq \frac{4}{P}\sum_{p=0}^{P} \left\|\epsilon_{k,p} - \hat{\mu}_{k-1}\,\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_k}\right)^M\right\|_1 \\
&\leq \epsilon\;,
\end{aligned}$$

we need the number of trajectories $P$ to be at least

$$P \geq \frac{64}{\epsilon^2}\log\frac{2}{\delta}\;.$$
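The trajectory requirement is obtained by solving $2\exp(-P\epsilon^2/64) = \delta$ for $P$. A small Python helper makes this inversion concrete; it is a sketch for illustration only, the names `min_trajectories` and `hoeffding_tail` are hypothetical, and restricting $\delta$ to $(0,1)$ is an assumption made here so that $\delta$ reads as a failure probability.

```python
import math

def hoeffding_tail(P, eps):
    """Right-hand side of the tail bound: 2 * exp(-P * eps^2 / 64)."""
    return 2.0 * math.exp(-P * eps**2 / 64.0)

def min_trajectories(eps, delta):
    """Smallest integer P with P >= (64 / eps^2) * log(2 / delta)."""
    if eps <= 0.0 or not (0.0 < delta < 1.0):
        raise ValueError("need eps > 0 and 0 < delta < 1 (assumed range)")
    return math.ceil(64.0 / eps**2 * math.log(2.0 / delta))
```

The required $P$ scales as $\epsilon^{-2}$ in the accuracy and only logarithmically in $1/\delta$, so halving $\epsilon$ roughly quadruples the number of trajectories.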

D.4 Convergence of Sample-Based MF-TRPO

We now extend the exact analysis of Exact MF-TRPO to its sample-based counterpart, establishing global sample complexity bounds. While the exact algorithm benefits from having full knowledge of the transition kernel and reward function, the sample-based version introduces additional approximation errors due to finite sampling. We quantify these errors and derive high-probability guarantees on the convergence of the algorithm. This requires adapting the theoretical tools developed in the exact setting to account for trajectory-based estimations and ensuring that the resulting policy updates remain stable despite stochastic approximations.

Theorem D.4.

Suppose that Assumptions 1, 2, 3, and 4 hold. Assume further that

$$\beta_k < b_1 := \frac{\tau}{6\,C_{\pi,\mu}^2\,C_{\operatorname{Erg,M}}^2 + 2\tau}\;, \qquad \text{for } k \geq 1\;. \tag{23}$$

For any $\epsilon > 0$ and $\delta > 0$, if the number of trajectories in each iteration of Sample-Based MF-TRPO satisfies

$$P \geq \frac{64}{\epsilon^2}\log\frac{2}{\delta}\;, \tag{24}$$

the number of iterations in each epoch of Sample-Based TRPO satisfies

$$I_\ell \geq \frac{|\mathcal{A}|^2\left(\left\|\mathsf{r}\right\|_\infty + \eta^2\log^2|\mathcal{A}|\right)\left(|\mathcal{S}|\log 2|\mathcal{A}| + \log\frac{1}{\delta}\right)}{(1-\gamma)^2\epsilon^2}\;, \tag{25}$$

and the rollout is performed up to time $T_\ell$ with

$$T_\ell \geq \frac{1}{1-\gamma}\log\left(\frac{\epsilon}{|\mathcal{A}|\left(\left\|\mathsf{r}\right\|_\infty + \eta\log|\mathcal{A}|\right)}\right)\;, \tag{26}$$

then, with probability at least $1-\delta$, we have that

$$\left\|\hat{\mu}_K - \mu_\star\right\|_2^2 \leq \exp\left(-\frac{\tau}{2}\sum_{j=1}^{k}\beta_j\right)\left\|\mu_0 - \mu_\star\right\|_2^2 + \frac{2C_{\operatorname{MF,1}}}{\tau}\,\frac{\log(L)}{L} + \frac{2C_{\operatorname{MF,2}}}{\tau}\,\epsilon\;, \tag{27}$$

with

$$\begin{aligned}
\tau &:= 1 - C_{\operatorname{op,MFG}}\;, \\
C_{\operatorname{MF,1}} &:= C_{\operatorname{Erg,M}} \cdot C_{\operatorname{TRPO,1}} \cdot \frac{2+b_1}{\tau}\;, \\
C_{\operatorname{MF,2}} &:= \frac{2+b_1}{\tau}\big(C_{\operatorname{Erg,M}}\, C_{\operatorname{TRPO,2}} + 1\big)\;, \\
C_{\operatorname{TRPO,1}} &:= \frac{2C_{\operatorname{TRPO,1}}'}{\eta(1-\gamma)} \cdot \frac{\left(\left\|\mathsf{r}\right\|_\infty^2 + \eta^2\log^2|\mathcal{A}|\right)|\mathcal{A}|^2\log L}{\eta(1-\gamma)^3 L}\;, \\
C_{\operatorname{TRPO,2}} &:= \frac{2C_{\operatorname{TRPO,1}}'}{\eta(1-\gamma)} \cdot \frac{1}{(1-\gamma)^2}\left\|\frac{\overline{\mathsf{d}}_{\mu,\mu}^{\pi_\mu}}{\nu}\right\|_\infty\;,
\end{aligned}$$

with $C_{\operatorname{TRPO,1}}^{\prime}$ the constant coming from Theorem D.2.
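To make the dependence of these constants on the problem parameters concrete, the following minimal Python sketch evaluates the four definitions above. The function and argument names are illustrative assumptions, not notation from the paper: `c_trpo1_prime` stands for $C_{\operatorname{TRPO,1}}^{\prime}$, `r_sup` for $\|\mathsf{r}\|_{\infty}$, `n_actions` for $|\mathcal{A}|$, and `conc_coeff` for the supremum norm of the ratio $\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}/\nu$. All numerical values in the usage lines are arbitrary placeholders.

```python
import math


def trpo_constants(c_trpo1_prime, eta, gamma, r_sup, n_actions, L, conc_coeff):
    """Evaluate C_TRPO,1 and C_TRPO,2 from the displayed definitions.

    All argument names are hypothetical; they only mirror the symbols
    appearing in the formulas.
    """
    # Common prefactor 2 C'_{TRPO,1} / (eta (1 - gamma)).
    prefactor = 2.0 * c_trpo1_prime / (eta * (1.0 - gamma))
    log_actions = math.log(n_actions)
    numerator = (r_sup ** 2 + eta ** 2 * log_actions ** 2) * n_actions ** 2 * math.log(L)
    c_trpo1 = prefactor * numerator / (eta * (1.0 - gamma) ** 3 * L)
    c_trpo2 = prefactor * conc_coeff / (1.0 - gamma) ** 2
    return c_trpo1, c_trpo2


def mf_constants(c_erg_m, c_trpo1, c_trpo2, b1, tau):
    """Evaluate C_MF,1 and C_MF,2 from the displayed definitions."""
    scale = (2.0 + b1) / tau
    c_mf1 = c_erg_m * c_trpo1 * scale
    c_mf2 = scale * (c_erg_m * c_trpo2 + 1.0)
    return c_mf1, c_mf2


if __name__ == "__main__":
    # Placeholder values chosen only to exercise the formulas.
    c1, c2 = trpo_constants(1.0, 1.0, 0.5, 1.0, 2, 100, 3.0)
    print(mf_constants(2.0, c1, c2, 1.0, 0.5))
```

Under these definitions, $C_{\operatorname{TRPO,1}}$ scales as $\log L / L$ in $L$, whereas $C_{\operatorname{TRPO,2}}$ does not depend on $L$.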

Proof.

We focus on the convergence of the sequence $\hat{\mu}_{k}\to\mu_{\star}$, with $\mu_{\star}$ as in (8). Denote by $\hat{\pi}_{k}$ (resp. $\hat{\zeta}_{k}$) the output of Sample-Based TRPO$(\hat{\mu}_{k})$ at each step (resp. the estimator used in the update of $\hat{\mu}_{k}$ in Sample-Based MF-TRPO). We then have that

\begin{align*}
\left\|\hat{\mu}_{k}-\mu_{\star}\right\|_{2}^{2}
={}& \left\|\hat{\mu}_{k-1}-\mu_{\star}+\beta_{k}\left(\hat{\zeta}_{k}-\hat{\mu}_{k-1}\right)\right\|_{2}^{2} \\
={}& \left\|\hat{\mu}_{k-1}-\mu_{\star}+\beta_{k}\left(\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right)+\beta_{k}\left(\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\right)\right\|_{2}^{2} \\
={}& \Big\|\hat{\mu}_{k-1}-\mu_{\star}+\beta_{k}\left(\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right) \\
&\quad+\beta_{k}\left(\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right) \\
&\quad+\beta_{k}\left(\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\hat{\mu}_{k-1}\right)\Big\|_{2}^{2} \\
={}& \Big\|\hat{\mu}_{k-1}-\mu_{\star}+\beta_{k}\left(\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right) \\
&\quad+\beta_{k}\left(\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right) \\
&\quad+\beta_{k}\left(\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\hat{\mu}_{k-1}\right)-\beta_{k}\left(\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}-\mu_{\star}\right)\Big\|_{2}^{2} \\
={}& \Big\|(1-\beta_{k})\left(\hat{\mu}_{k-1}-\mu_{\star}\right)+\beta_{k}\left(\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right) \\
&\quad+\beta_{k}\left(\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right) \\
&\quad+\beta_{k}\left(\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right)\Big\|_{2}^{2} \\
={}& (1-\beta_{k})^{2}\left\|\hat{\mu}_{k-1}-\mu_{\star}\right\|_{2}^{2} \\
&+2(1-\beta_{k})\beta_{k}\left\langle\hat{\mu}_{k-1}-\mu_{\star},\,\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right\rangle \\
&+\beta_{k}^{2}\left\|\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right\|_{2}^{2} \\
&+2\beta_{k}^{2}\left\langle\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M},\,\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right\rangle \\
&+2\beta_{k}^{2}\left\langle\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M},\,\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\rangle \\
&+2(1-\beta_{k})\beta_{k}\left\langle\hat{\mu}_{k-1}-\mu_{\star},\,\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\rangle \\
&+\beta_{k}^{2}\left\|\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\|_{2}^{2} \\
&+2(1-\beta_{k})\beta_{k}\left\langle\hat{\mu}_{k-1}-\mu_{\star},\,\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right\rangle \\
&+\beta_{k}^{2}\left\|\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right\|_{2}^{2} \\
&+2\beta_{k}^{2}\left\langle\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M},\,\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\rangle\;.
\end{align*}
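The ten-term expansion above is a purely algebraic identity once the fixed-point relation $\mu_{\star}(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}})^{M}=\mu_{\star}$ is used, and it can be checked numerically. The sketch below is only a sanity check under simplifying assumptions: the vectors `pushed_hat` and `pushed_opt` are hypothetical stand-ins for $\hat{\mu}_{k-1}(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}})^{M}$ and $\hat{\mu}_{k-1}(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}})^{M}$, drawn at random rather than computed from actual transition kernels, and all helper names are illustrative.

```python
import random


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def sub(u, v):
    """Componentwise difference u - v."""
    return [x - y for x, y in zip(u, v)]


def random_simplex_point(n, rng):
    """Random probability vector of length n (hypothetical test data)."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def expansion_sides(mu_prev, mu_star, zeta, pushed_hat, pushed_opt, beta):
    """Return (lhs, rhs) of the ten-term expansion of ||mu_k - mu_star||_2^2."""
    # Damped update mu_k = mu_{k-1} + beta (zeta_k - mu_{k-1}).
    mu_next = [m + beta * (z - m) for m, z in zip(mu_prev, zeta)]
    diff = sub(mu_next, mu_star)
    lhs = dot(diff, diff)

    a = sub(mu_prev, mu_star)        # mu_{k-1} - mu_star
    b = sub(zeta, pushed_hat)        # estimation error of zeta_k
    c = sub(pushed_hat, pushed_opt)  # policy error (pi_hat_k vs best response)
    d = sub(pushed_opt, mu_star)     # uses mu_star P_star^M = mu_star
    rhs = (
        (1 - beta) ** 2 * dot(a, a)
        + 2 * (1 - beta) * beta * dot(a, b)
        + beta ** 2 * dot(b, b)
        + 2 * beta ** 2 * dot(b, c)
        + 2 * beta ** 2 * dot(b, d)
        + 2 * (1 - beta) * beta * dot(a, d)
        + beta ** 2 * dot(d, d)
        + 2 * (1 - beta) * beta * dot(a, c)
        + beta ** 2 * dot(c, c)
        + 2 * beta ** 2 * dot(c, d)
    )
    return lhs, rhs


if __name__ == "__main__":
    rng = random.Random(0)
    vectors = [random_simplex_point(5, rng) for _ in range(5)]
    lhs, rhs = expansion_sides(*vectors, beta=0.3)
    print(abs(lhs - rhs))  # difference should be at floating-point level
```

The terms of `rhs` are listed in the same order as in the display, so each summand can be matched to the term that is bounded separately in the next step.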

Applying Assumption 2 and Corollary E.4 together with Lemma E.1, and following the same lines as in the proof of Proposition C.3, the previous equality implies that

\begin{align*}
\left\|\hat{\mu}_{k}-\mu_{\star}\right\|_{2}^{2}\leq{}&\left[(1-\beta_{k})\left(1+(2C_{\operatorname{op,MFG}}-1)\beta_{k}\right)+\beta_{k}^{2}C_{\pi,\mu}^{2}C_{\operatorname{Erg,M}}^{2}\right]\left\|\hat{\mu}_{k-1}-\mu_{\star}\right\|_{2}^{2}\\
&+2(1-\beta_{k})\beta_{k}\left\langle\hat{\mu}_{k-1}-\mu_{\star},\ \hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right\rangle\\
&+\beta_{k}^{2}\left\|\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right\|_{2}^{2}\\
&+2\beta_{k}^{2}\left\langle\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M},\ \hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right\rangle\\
&+2\beta_{k}^{2}\left\langle\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M},\ \hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\rangle\\
&+2(1-\beta_{k})\beta_{k}\left\langle\hat{\mu}_{k-1}-\mu_{\star},\ \hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right\rangle\\
&+\beta_{k}^{2}\left\|\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right\|_{2}^{2}\\
&+2\beta_{k}^{2}\left\langle\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M},\ \hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\rangle\\
=:{}&\left[(1-\beta_{k})\left(1+(2C_{\operatorname{op,MFG}}-1)\beta_{k}\right)+\beta_{k}^{2}C_{\pi,\mu}^{2}C_{\operatorname{Erg,M}}^{2}\right]\left\|\hat{\mu}_{k-1}-\mu_{\star}\right\|_{2}^{2}\\
&+2\beta_{k}(1-\beta_{k})\mathbf{E}_{1}+\beta_{k}^{2}\mathbf{E}_{2}+2\beta_{k}^{2}\mathbf{E}_{3}+2\beta_{k}^{2}\mathbf{E}_{4}\\
&+2\beta_{k}(1-\beta_{k})\mathbf{T}_{1}+\beta_{k}^{2}\mathbf{T}_{2}+2\beta_{k}^{2}\mathbf{T}_{3}\;.
\end{align*}
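For reference, matching the two right-hand sides of the preceding display term by term suggests the following reading of the shorthand. This is a restatement inferred from that display, not an additional definition taken from the source; the $\mathbf{E}$ terms involve the sampling error $\hat{\zeta}_{k}-\hat{\mu}_{k-1}(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}})^{M}$, while the $\mathbf{T}$ terms involve only the policy mismatch between $\hat{\pi}_{k}$ and $\pi_{\hat{\mu}_{k-1}}$.

```latex
\begin{align*}
\mathbf{E}_{1} &= \left\langle \hat{\mu}_{k-1}-\mu_{\star},\ \hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right\rangle, \\
\mathbf{E}_{2} &= \left\| \hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right\|_{2}^{2}, \\
\mathbf{E}_{3} &= \left\langle \hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M},\ \hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right\rangle, \\
\mathbf{E}_{4} &= \left\langle \hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M},\ \hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\rangle, \\
\mathbf{T}_{1} &= \left\langle \hat{\mu}_{k-1}-\mu_{\star},\ \hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right\rangle, \\
\mathbf{T}_{2} &= \left\| \hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right\|_{2}^{2}, \\
\mathbf{T}_{3} &= \left\langle \hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M},\ \hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\rangle.
\end{align*}
```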

We now proceed to study the terms $\mathbf{E}_{1}$, $\mathbf{E}_{2}$, $\mathbf{E}_{3}$, $\mathbf{E}_{4}$, $\mathbf{T}_{1}$, $\mathbf{T}_{2}$, and $\mathbf{T}_{3}$. Using Young's inequality in the forms $|\langle a,b\rangle|\leq\frac{\tau}{4}\|a\|_{2}^{2}+\frac{1}{\tau}\|b\|_{2}^{2}$ and $|\langle a,b\rangle|\leq\frac{1}{2}\|a\|_{2}^{2}+\frac{1}{2}\|b\|_{2}^{2}$, we get that

\begin{align*}
\left|\mathbf{E}_{1}\right| &\leq\frac{\tau}{4}\left\|\hat{\mu}_{k-1}-\mu_{\star}\right\|_{2}^{2}+\frac{1}{\tau}\left\|\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right\|_{2}^{2}\leq\frac{\tau}{4}\left\|\hat{\mu}_{k-1}-\mu_{\star}\right\|_{2}^{2}+\frac{1}{\tau}\mathbf{E}_{2}\;,\\
\left|\mathbf{E}_{3}\right| &\leq\frac{1}{2}\left\|\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right\|_{2}^{2}+\frac{1}{2}\left\|\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right\|_{2}^{2}\leq\frac{1}{2}\mathbf{T}_{2}+\frac{1}{2}\mathbf{E}_{2}\;,\\
\left|\mathbf{E}_{4}\right| &\leq\frac{1}{2}\left\|\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\|_{2}^{2}+\frac{1}{2}\left\|\hat{\zeta}_{k}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}\right\|_{2}^{2}\\
&\leq\frac{1}{2}C_{\pi,\mu}^{2}C_{\operatorname{Erg,M}}^{2}\left\|\hat{\mu}_{k-1}-\mu_{\star}\right\|_{2}^{2}+\frac{1}{2}\mathbf{E}_{2}\;,
\end{align*}

where we used Lemma E.1 and Corollary E.4 in the last inequality. Using Young’s inequality, we get that

\begin{align*}
\left|\mathbf{T}_{1}\right| &\leq\frac{\tau}{4}\left\|\hat{\mu}_{k-1}-\mu_{\star}\right\|_{2}^{2}+\frac{1}{\tau}\left\|\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right\|_{2}^{2}\leq\frac{\tau}{4}\left\|\hat{\mu}_{k-1}-\mu_{\star}\right\|_{2}^{2}+\frac{1}{\tau}\mathbf{T}_{2}\;,\\
\left|\mathbf{T}_{3}\right| &\leq\frac{1}{2}\left\|\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}-\mu_{\star}\left(\mathsf{P}_{\mu_{\star}}^{\pi_{\mu_{\star}}}\right)^{M}\right\|_{2}^{2}+\frac{1}{2}\left\|\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\hat{\pi}_{k}}\right)^{M}-\hat{\mu}_{k-1}\left(\mathsf{P}_{\hat{\mu}_{k-1}}^{\pi_{\hat{\mu}_{k-1}}}\right)^{M}\right\|_{2}^{2}\\
&\leq\frac{1}{2}C_{\pi,\mu}^{2}C_{\operatorname{Erg,M}}^{2}\left\|\hat{\mu}_{k-1}-\mu_{\star}\right\|_{2}^{2}+\frac{1}{2}\mathbf{T}_{2}\;,
\end{align*}

where we used Lemma E.1 and Corollary E.4 in the last inequality. Since $\tau<1$ by Assumption 2 and $\beta_{k}$ satisfies (23), a straightforward computation shows that

\begin{align*}
(1-\beta_{k})\left(1+(1-\tau)\beta_{k}\right)+3\beta_{k}^{2}C_{\pi,\mu}^{2}C_{\operatorname{Erg,M}}^{2}<1-\frac{\tau}{2}\beta_{k}\;.
\end{align*}
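The computation can be sketched as follows. Condition (23) is not reproduced in this part of the proof, so the final condition below should be read as what the algebra requires of $\beta_{k}$, assuming $\beta_{k}>0$, rather than as a restatement of (23). The shorthand $C:=C_{\pi,\mu}C_{\operatorname{Erg,M}}$ is introduced only for this sketch.

```latex
\begin{align*}
(1-\beta_{k})\left(1+(1-\tau)\beta_{k}\right)+3\beta_{k}^{2}C^{2}
&= 1-\tau\beta_{k}-(1-\tau)\beta_{k}^{2}+3C^{2}\beta_{k}^{2} \\
&= 1-\frac{\tau}{2}\beta_{k}-\beta_{k}\left[\frac{\tau}{2}-\left(3C^{2}-1+\tau\right)\beta_{k}\right].
\end{align*}
% Hence, for \beta_k > 0, the left-hand side is strictly below 1 - (\tau/2)\beta_k
% exactly when (3C^2 - 1 + \tau)\,\beta_k < \tau/2.
```

In particular, the strict inequality holds automatically when $3C^{2}\leq 1-\tau$, and otherwise whenever $\beta_{k}<\tau/\bigl(2(3C^{2}-1+\tau)\bigr)$.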

Since (25) holds, we can apply Theorem D.2, together with Lemma E.1, to get

$$\begin{aligned}
\mathbf{T}_{2} &\leq C_{\operatorname{Erg,M}} \left( \sum_{s\in\mathcal{S}} \hat{\mu}_{k-1}^{2}(s) \left\| \hat{\pi}_{k}(\cdot|s) - \pi_{\hat{\mu}_{k-1}}(\cdot|s) \right\|_{\mathrm{TV}}^{2} \right) \\
&\leq C_{\operatorname{Erg,M}} \left( \sum_{s\in\mathcal{S}} \hat{\mu}_{k-1}(s) \left\| \hat{\pi}_{k}(\cdot|s) - \pi_{\hat{\mu}_{k-1}}(\cdot|s) \right\|_{\mathrm{TV}}^{2} \right) \\
&\leq C_{\operatorname{Erg,M}} \left( J^{\operatorname{MFG}}(\pi_{\hat{\mu}_{k-1}}, \hat{\mu}_{k-1}, \hat{\mu}_{k-1}) - J^{\operatorname{MFG}}(\hat{\pi}_{k}, \hat{\mu}_{k-1}, \hat{\mu}_{k-1}) \right) \\
&\leq C_{\operatorname{Erg,M}} \frac{2}{\eta(1-\gamma)} \left( C_{\operatorname{TRPO,1}} \frac{\log(L)}{L} + C_{\operatorname{TRPO,2}}\, \epsilon \right)\;.
\end{aligned}$$

Moreover, since (24) holds, using Proposition D.3, we have that, with probability at least $1-\delta$,

$$\mathbf{E}_{2} \leq \epsilon\;.$$

Therefore, combining the previous inequalities, we have that

$$\begin{aligned}
&\left\| \hat{\mu}_{k} - \mu_{\star} \right\|_{2}^{2} \\
&\leq \left(1 - \frac{\tau}{2}\beta_{k}\right) \left\| \hat{\mu}_{k-1} - \mu_{\star} \right\|_{2}^{2} \\
&\qquad + \beta_{k} \left( \frac{2(1-\beta_{k})}{\tau} - 3\beta_{k} \right) \cdot \left[ C_{\operatorname{Erg,M}} \left( C_{\operatorname{TRPO,1}} \frac{\log(L)}{L} + C_{\operatorname{TRPO,2}}\, \epsilon \right) + \epsilon \right] \\
&\leq \left(1 - \frac{\tau}{2}\beta_{k}\right) \left\| \hat{\mu}_{k-1} - \mu_{\star} \right\|_{2}^{2} + \beta_{k} \frac{2+b_{1}}{\tau} \left[ C_{\operatorname{Erg,M}} \left( C_{\operatorname{TRPO,1}} \frac{\log(L)}{L} + C_{\operatorname{TRPO,2}}\, \epsilon \right) + \epsilon \right] \\
&\leq \left(1 - \frac{\tau}{2}\beta_{k}\right) \left\| \hat{\mu}_{k-1} - \mu_{\star} \right\|_{2}^{2} + \beta_{k}\, C_{\operatorname{MF,1}} \frac{\log(L)}{L} + \beta_{k}\, C_{\operatorname{MF,2}}\, \epsilon\;.
\end{aligned} \tag{28}$$

Unrolling the recursion (28), we obtain

$$\begin{aligned}
\left\| \hat{\mu}_{k} - \mu_{\star} \right\|_{2}^{2} \leq{}& \prod_{j=1}^{k} \left(1 - \frac{\tau}{2}\beta_{j}\right) \left\| \mu_{0} - \mu_{\star} \right\|_{2}^{2} \\
&+ \left( C_{\operatorname{MF,1}} \frac{\log(L)}{L} + C_{\operatorname{MF,2}}\, \epsilon \right) \sum_{j=1}^{k} \beta_{j} \prod_{\ell=j+1}^{k} \left(1 - \frac{\tau}{2}\beta_{\ell}\right) \\
\leq{}& \exp\left( -\frac{\tau}{2} \sum_{j=1}^{k} \beta_{j} \right) \left\| \mu_{0} - \mu_{\star} \right\|_{2}^{2} \\
&+ \left( C_{\operatorname{MF,1}} \frac{\log(L)}{L} + C_{\operatorname{MF,2}}\, \epsilon \right) \sum_{j=1}^{k} \beta_{j} \prod_{\ell=j+1}^{k} \left(1 - \frac{\tau}{2}\beta_{\ell}\right)\;.
\end{aligned}$$

Note that the second term on the r.h.s. of the previous inequality is a telescoping sum, since each summand can be rewritten as

$$\beta_{j} \prod_{\ell=j+1}^{k} \left(1 - \frac{\tau}{2}\beta_{\ell}\right) = \frac{2}{\tau} \left[ \prod_{\ell=j+1}^{k} \left(1 - \frac{\tau}{2}\beta_{\ell}\right) - \prod_{\ell=j}^{k} \left(1 - \frac{\tau}{2}\beta_{\ell}\right) \right]\;.$$

Therefore, we get (18). Moreover, since $\beta_{k}$ satisfies (14), this concludes the proof. ∎
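As a numerical sanity check on the unrolling and telescoping steps, the sketch below iterates the recursion (28) with equality and compares the result with the closed-form bound. All numbers (`tau`, `a0`, the step sizes `betas`) are arbitrary illustrative choices, and `c` is a hypothetical stand-in for $C_{\operatorname{MF,1}} \log(L)/L + C_{\operatorname{MF,2}}\,\epsilon$; none of them come from the paper.

```python
import math


def unroll_recursion(a0, betas, tau, c):
    """Iterate a_k = (1 - tau/2 * beta_k) * a_{k-1} + beta_k * c."""
    a = a0
    for beta in betas:
        a = (1.0 - 0.5 * tau * beta) * a + beta * c
    return a


def direct_sum(betas, tau):
    """Evaluate sum_j beta_j * prod_{l > j} (1 - tau/2 * beta_l) term by term."""
    total = 0.0
    for j, beta_j in enumerate(betas):
        tail = math.prod(1.0 - 0.5 * tau * b for b in betas[j + 1:])
        total += beta_j * tail
    return total


def telescoped_sum(betas, tau):
    """Closed form of the same sum after telescoping: (2/tau) * (1 - full product)."""
    full_product = math.prod(1.0 - 0.5 * tau * b for b in betas)
    return (2.0 / tau) * (1.0 - full_product)


def final_bound(a0, betas, tau, c):
    """exp(-tau/2 * sum(beta)) * a0 + (2/tau) * c, the shape of the final estimate."""
    return math.exp(-0.5 * tau * sum(betas)) * a0 + (2.0 / tau) * c


# Hypothetical values, chosen so that 0 < tau/2 * beta_j < 1 for every j.
tau, c, a0 = 0.5, 0.01, 1.0
betas = [0.2 / (1.0 + 0.1 * j) for j in range(50)]

a_k = unroll_recursion(a0, betas, tau, c)
print(a_k, final_bound(a0, betas, tau, c))
```

The telescoped sum is at most $2/\tau$ regardless of the step sizes, which is where the factors $2C_{\operatorname{MF,1}}/\tau$ and $2C_{\operatorname{MF,2}}/\tau$ in the final bound come from.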

D.5 $\varepsilon$-MFNE

We aim to characterize the proximity of an approximate Nash equilibrium, specifically an $\varepsilon$-Nash equilibrium. In this context, we address two key questions:

  1. Given a fixed budget $K$ of sampled trajectories, how close is the value function to that of the unique Nash equilibrium?

  2. Given a target approximation level $\varepsilon$, how many trajectories $K$ are required to achieve an $\varepsilon$-Nash equilibrium?

These questions are crucial for understanding the sample complexity of learning equilibria in mean-field settings and provide insights into the efficiency of our algorithmic approach.

Corollary D.5.

Suppose that Assumptions 1, 2, 3, and 4 hold. Fix $\epsilon, \delta > 0$. Assume that, for any $k \geq 0$, the learning rate $\beta_{k}$ satisfies (23), and let $P$ be the number of trajectories in each iteration for Sample-Based MF-TRPO satisfying (24). Let $\hat{\mu}_{k}$ (resp. $\hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L}$) be the output of Sample-Based MF-TRPO (resp. of Sample-Based TRPO$(\hat{\mu}_{k})$). Then, we have the following bound on the exploitability

$$\phi(\hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L}, \hat{\mu}_{k}) \leq \hat{\varepsilon}_{k}\;,$$

with

$$\begin{aligned}
\hat{\varepsilon}_{k} &:= \hat{\delta}_{\text{NE},1,k} + C_{\phi} \left( \left(1 + C_{\operatorname{Erg,\infty}} \left(1 + C_{\pi,\mu}\right)\right) \sqrt{\hat{\delta}_{\text{NE},2,k}} + \frac{2 C_{\operatorname{Erg,\infty}} \sqrt{|\mathcal{S}|}}{\eta(1-\gamma)} \cdot \sqrt{\hat{\delta}_{\text{NE},1,k}} \right)\;, \\
\hat{\delta}_{\text{NE},1,k} &= C_{\operatorname{TRPO,1}}^{\prime} \left( \frac{\left( \left\| \mathsf{r} \right\|_{\infty}^{2} + \eta^{2} \log^{2}|\mathcal{A}| \right) |\mathcal{A}|^{2} \log L}{\eta (1-\gamma)^{3} (L+1)} + \frac{\epsilon}{(1-\gamma)^{2}} \sup_{\mu\in\mathcal{P}(\mathcal{S})} \left\| \frac{\overline{\mathsf{d}}_{\mu,\mu}^{\pi_{\mu}}}{\nu} \right\|_{\infty} \right)\;, \\
\hat{\delta}_{\text{NE},2,k} &= \exp\left( -\frac{\tau}{2} \sum_{j=1}^{k} \beta_{j} \right) \left\| \mu_{0} - \mu_{\star} \right\|_{2}^{2} + \frac{2 C_{\operatorname{MF,1}}}{\tau} \frac{\log(L)}{L} + \frac{2 C_{\operatorname{MF,2}}}{\tau} \epsilon\;.
\end{aligned}$$
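To make the structure of this bound concrete, here is a minimal sketch that evaluates the three quantities of Corollary D.5 from user-supplied constants. Every argument is a hypothetical placeholder: the constants $C_{\phi}$, $C_{\operatorname{Erg,\infty}}$, $C_{\pi,\mu}$, $C_{\operatorname{MF,1}}$, $C_{\operatorname{MF,2}}$, $C_{\operatorname{TRPO,1}}^{\prime}$ and the concentrability supremum are problem-dependent and must be provided by the caller. The helper `iterations_for_target` additionally assumes a constant step size, which is an assumption of this sketch rather than of the corollary.

```python
import math


def delta_ne_1(c_trpo1_prime, r_inf, eta, gamma, n_actions, L, eps, concentrability):
    """Optimality-gap term; `concentrability` stands for the sup-norm ratio in the formula."""
    optimization = ((r_inf ** 2 + eta ** 2 * math.log(n_actions) ** 2)
                    * n_actions ** 2 * math.log(L)
                    / (eta * (1.0 - gamma) ** 3 * (L + 1)))
    approximation = eps / (1.0 - gamma) ** 2 * concentrability
    return c_trpo1_prime * (optimization + approximation)


def delta_ne_2(betas, tau, mu0_dist_sq, c_mf1, c_mf2, L, eps):
    """Mean-field term; `mu0_dist_sq` stands for ||mu_0 - mu_star||_2^2."""
    contraction = math.exp(-0.5 * tau * sum(betas)) * mu0_dist_sq
    return contraction + 2.0 * c_mf1 / tau * math.log(L) / L + 2.0 * c_mf2 / tau * eps


def exploitability_bound(d1, d2, c_phi, c_erg_inf, c_pi_mu, n_states, eta, gamma):
    """Combine the two terms d1 (policy) and d2 (mean field) into the overall bound."""
    mean_field_part = (1.0 + c_erg_inf * (1.0 + c_pi_mu)) * math.sqrt(d2)
    policy_part = (2.0 * c_erg_inf * math.sqrt(n_states)
                   / (eta * (1.0 - gamma)) * math.sqrt(d1))
    return d1 + c_phi * (mean_field_part + policy_part)


def iterations_for_target(tau, beta, mu0_dist_sq, target):
    """Iterations k so that exp(-tau*beta*k/2) * mu0_dist_sq <= target (constant step beta)."""
    if mu0_dist_sq <= target:
        return 0
    return math.ceil(2.0 / (tau * beta) * math.log(mu0_dist_sq / target))


# Illustrative call with made-up constants.
d2 = delta_ne_2([0.1] * 100, tau=0.5, mu0_dist_sq=1.0, c_mf1=1.0, c_mf2=1.0, L=1000, eps=1e-3)
print(exploitability_bound(1e-3, d2, 1.0, 1.0, 1.0, n_states=4, eta=1.0, gamma=0.5))
```

Note that the exponential term in the mean-field part decays with the cumulative step size, while the remaining terms set a floor controlled by $L$ and $\epsilon$; running more outer iterations cannot push the bound below that floor.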
Proof.

From Proposition E.5, the exploitability of a policy $\pi$ and a mean-field parameter $\mu$ can be bounded by the optimality gap of $\pi$ w.r.t. the value function $J^{\operatorname{MFG}}(\cdot,\mu,\mu)$ and the distance between $\mu$ and the stationary distribution $\lambda_{\pi,\mu}$.

From Theorem D.2, we have that

$$\max_{\pi\in\Pi} J^{\operatorname{MFG}}(\pi, \hat{\mu}_{k}, \hat{\mu}_{k}) - J^{\operatorname{MFG}}(\hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L}, \hat{\mu}_{k}, \hat{\mu}_{k}) \leq \hat{\delta}_{\text{NE},1,k}\;. \tag{29}$$

On the other hand, using the fact that $(\pi_{\mu_{\star}}, \mu_{\star})$ is a MFNE, we have

$$\hat{\mu}_{k} - \lambda_{\hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L}, \hat{\mu}_{k}} = \left( \hat{\mu}_{k} - \mu_{\star} \right) + \left( \lambda_{\pi_{\mu_{\star}}, \mu_{\star}} - \lambda_{\pi_{\hat{\mu}_{k}}, \hat{\mu}_{k}} \right) + \left( \lambda_{\pi_{\hat{\mu}_{k}}, \hat{\mu}_{k}} - \lambda_{\hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L}, \hat{\mu}_{k}} \right)\;,$$

for $\ell = 0, \dots, L$. As in the proof of Corollary C.6, we have

$$\left\| \lambda_{\pi_{\mu_{\star}}, \mu_{\star}} - \lambda_{\pi_{\hat{\mu}_{k}}, \hat{\mu}_{k}} \right\|_{2} \leq C_{\operatorname{Erg,\infty}} \left(1 + C_{\pi,\mu}\right) \left\| \mu_{\star} - \hat{\mu}_{k} \right\|_{2}\;.$$

Moreover,

$$\begin{aligned}
\left\| \lambda_{\pi_{\hat{\mu}_{k}}, \hat{\mu}_{k}} - \lambda_{\hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L}, \hat{\mu}_{k}} \right\|_{2}
&\leq C_{\operatorname{Erg,\infty}} \sum_{s\in\mathcal{S}} \hat{\mu}_{k}(s) \left\| \hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L}(\cdot|s) - \pi_{\hat{\mu}_{k}}(\cdot|s) \right\|_{\mathrm{TV}} \\
&\leq \frac{2 C_{\operatorname{Erg,\infty}} \sqrt{|\mathcal{S}|}}{\eta(1-\gamma)} \sqrt{ J^{\operatorname{MFG}}(\pi_{\hat{\mu}_{k}}, \hat{\mu}_{k}, \hat{\mu}_{k}) - J^{\operatorname{MFG}}(\hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L}, \hat{\mu}_{k}, \hat{\mu}_{k}) } \\
&\leq \frac{2 C_{\operatorname{Erg,\infty}} \sqrt{|\mathcal{S}|}}{\eta(1-\gamma)} \cdot \sqrt{\hat{\delta}_{\text{NE},1,k}}\;.
\end{aligned}$$

Using the triangle inequality, together with Theorem D.2, we can bound $\left\| \hat{\mu}_{k} - \lambda_{\hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L}, \hat{\mu}_{k}} \right\|_{2}$ as

$$\begin{aligned}
\left\| \hat{\mu}_{k} - \lambda_{\hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L}, \hat{\mu}_{k}} \right\|_{2}
&\leq \left(1 + C_{\operatorname{Erg,\infty}} \left(1 + C_{\pi,\mu}\right)\right) \left\| \mu_{\star} - \hat{\mu}_{k} \right\|_{2} + \frac{2 C_{\operatorname{Erg,\infty}} \sqrt{|\mathcal{S}|}}{\eta(1-\gamma)} \cdot \sqrt{\hat{\delta}_{\text{NE},1,k}} \\
&\leq \left(1 + C_{\operatorname{Erg,\infty}} \left(1 + C_{\pi,\mu}\right)\right) \sqrt{\hat{\delta}_{\text{NE},2,k}} + \frac{2 C_{\operatorname{Erg,\infty}} \sqrt{|\mathcal{S}|}}{\eta(1-\gamma)} \cdot \sqrt{\hat{\delta}_{\text{NE},1,k}}\;.
\end{aligned}$$

Using the last inequality and (20), together with Proposition E.5, we have that $\phi(\hat{\pi}^{\texttt{Unif},\hat{\mu}_{k}}_{L},\mu_{k})\leq\varepsilon_{k}$.

Remark D.6.

It is important to note that our analysis does not directly bound the exploitability of the last iterate, but rather that of the uniform mixture of policies over the learning process. This distinction arises from the absence of an exact counterpart to Howard's theorem (Howard, 1960) in Sample-Based TRPO, as noted in Section D.1. Unlike Exact TRPO, where policy improvement guarantees can be established step by step, sampling errors introduce additional variability that prevents such guarantees in the sample-based setting.

Despite this limitation, our results demonstrate that the learned policies perform well on average and that we approximate the MFNE accordingly. The bounded average exploitability ensures that, over time, the algorithm remains close to an equilibrium, reinforcing the practical effectiveness of the proposed approach in large-scale multi-agent learning.

Remark D.7.

From the obtained sample complexity result, it follows directly that to achieve an $\varepsilon$-MFNE, the required number of inner policy updates $L$ and outer population updates $K$ must satisfy the following scaling conditions:

\begin{align*}
L\in\widetilde{O}(1/\varepsilon^{2})\quad\text{and}\quad K\in\widetilde{O}(\log(1/\varepsilon^{2})).
\end{align*}

In addition to these requirements, we also establish that the number of episodes $P$ and the number of iterations per policy update $I_{\ell}$ must satisfy

\begin{align*}
P,I_{\ell}\in\widetilde{O}(1/\varepsilon^{4}).
\end{align*}

These additional conditions ensure that the variance introduced by the sampling procedure remains controlled, allowing for a sufficiently accurate estimation of the value function and policy updates. This highlights the tradeoff between computational efficiency and precision in approximating the MFNE, showing that our algorithm achieves a well-balanced complexity while ensuring convergence guarantees.

At each iteration of Sample-Based MF-TRPO, the total number of calls to the environment consists of those required by the Sample-Based TRPO procedure plus the additional subroutine for updating the mean-field parameter. Sample-Based TRPO requires $\widetilde{O}(1/\varepsilon^{6})$ environment calls, scaling proportionally to the product $L\times I_{\ell}$. This aligns with Shani et al. (2020); however, we highlight a distinction stemming from the chosen metrics: the metric they use corresponds to the square root of our exploitability measure, introducing a cubic dependency in terms of $\varepsilon$. Additionally, the total complexity includes a multiplicative factor $K$, whose contribution is negligible in practice due to its logarithmic scaling, preserving overall algorithmic efficiency.

On the other hand, in each iteration of Sample-Based MF-TRPO, the update step for the mean-field distribution scales as $\widetilde{O}(P\times I_{\ell}\times K)$, which means $\widetilde{O}(1/\varepsilon^{2})$ calls to the MF-MDP. This complexity arises naturally from the oracle assumption adopted, which involves an initialization step at each iteration potentially requiring up to $K$ steps to accurately initialize the mean-field distribution. While introducing additional complexity, this initialization procedure is crucial for maintaining consistency across iterative population updates, thereby ensuring the stability and convergence accuracy of the algorithm towards the mean-field Nash equilibrium.

Combining these two contributions, we obtain an overall complexity that scales as $\widetilde{O}(1/\varepsilon^{6})$, consistent with established convergence rates in the RL literature.
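To make the orders of magnitude in this remark concrete, the following sketch evaluates the stated scalings for a few target accuracies. It is illustrative only: every constant and logarithmic factor hidden by $\widetilde{O}$ is set to 1 (an assumption, not a value from the analysis), and the function name `mf_trpo_scalings` is hypothetical.

```python
import math

def mf_trpo_scalings(eps: float) -> dict:
    """Order-of-magnitude scalings from Remark D.7.

    Assumption: every hidden constant and polylog factor in the
    O-tilde bounds is replaced by 1, so only the eps-dependence is shown.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    L = 1.0 / eps**2               # inner policy updates
    K = math.log(1.0 / eps**2)     # outer population updates
    I = 1.0 / eps**4               # iterations per policy update (I_ell)
    P = 1.0 / eps**4               # episodes
    return {"L": L, "K": K, "I_ell": I, "P": P, "trpo_calls": L * I}

for eps in (0.5, 0.1, 0.01):
    s = mf_trpo_scalings(eps)
    print(f"eps={eps}: L~{s['L']:.3g}, K~{s['K']:.3g}, "
          f"L*I_ell~{s['trpo_calls']:.3g}")
```

The product `L * I_ell` reproduces the $1/\varepsilon^{6}$ dependence of Sample-Based TRPO, while `K` grows only logarithmically.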

Appendix E Technical Lemmata

E.1 Lipschitzness of the Markov reward process iterates

We show in this section that Assumption 1 implies the existence of a Lipschitz constant for the operators $\lambda_{\pi,\mu}$ and $(\mathsf{P}_{\mu}^{\pi})^{M}$, for any $\mu\in\mathcal{P}(\mathcal{S})$ and $M\geq 0$.

Lemma E.1.

Suppose Assumptions 1 and 3 hold. Fix $M\geq 0$ (resp. $M=\infty$). Then, there exists a constant $C_{\operatorname{Erg}}\geq 0$ such that

\begin{equation}
\begin{split}
\left\|\xi\Big(\mathsf{P}_{\mu}^{\pi}\Big)^{M}-\xi\Big(\mathsf{P}_{\mu'}^{\pi'}\Big)^{M}\right\|_{\mathrm{TV}}\leq\;&C_{\operatorname{Erg},M}\left(\sum_{s\in\mathcal{S}}\xi(s)\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)\\
\leq\;&C_{\operatorname{Erg},M}\left(\sup_{s\in\mathcal{S}}\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)\\
\text{(resp. }\left\|\lambda_{\pi,\mu}-\lambda_{\pi',\mu'}\right\|_{\mathrm{TV}}\leq\;&C_{\operatorname{Erg},\infty}\left(\sum_{s\in\mathcal{S}}\xi(s)\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)\\
\leq\;&C_{\operatorname{Erg},\infty}\left(\sup_{s\in\mathcal{S}}\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)\text{ )}\;,
\end{split}
\tag{30}
\end{equation}

with

\begin{align*}
C_{\operatorname{Erg},M}=C_{\operatorname{Erg}}\,L_{\mathsf{P}}\,\frac{1-\rho^{M}}{1-\rho}\;,\qquad\text{(resp. }C_{\operatorname{Erg},\infty}=\frac{C_{\operatorname{Erg}}\,L_{\mathsf{P}}}{1-\rho}\text{ )}\;,
\end{align*}

for $\pi,\pi'\in\Pi$ and $\mu,\mu'\in\mathcal{P}(\mathcal{S})$.
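As a quick numerical illustration of how the finite-horizon constant relates to its ergodic limit, the sketch below evaluates $C_{\operatorname{Erg},M}$ for increasing $M$. The helper name `erg_constant` and the numerical values of $C_{\operatorname{Erg}}$, $L_{\mathsf{P}}$ and $\rho$ are hypothetical and chosen only for illustration; the sketch assumes $0\leq\rho<1$.

```python
def erg_constant(c_erg: float, l_p: float, rho: float, M=None) -> float:
    """C_{Erg,M} = C_Erg * L_P * (1 - rho^M) / (1 - rho); M=None stands for M = infinity."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    geom = 1.0 / (1.0 - rho) if M is None else (1.0 - rho**M) / (1.0 - rho)
    return c_erg * l_p * geom

# Hypothetical values, chosen only for illustration.
c_erg, l_p, rho = 2.0, 1.5, 0.9
finite = [erg_constant(c_erg, l_p, rho, M) for M in (0, 1, 10, 100)]
limit = erg_constant(c_erg, l_p, rho)
print(finite, limit)
```

The finite-$M$ constants increase with $M$ and stay below the ergodic constant, so the bound for $\lambda_{\pi,\mu}$ is the uniform-in-$M$ version of the bound for the iterates.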

Proof.

This proof is adapted from Fort et al. (2011, Lemma 4.2) on parametrized Markov chains.

Step 1. Consider first $M<\infty$. By employing a telescoping sum, we obtain

\begin{align*}
\Big(\mathsf{P}_{\mu}^{\pi}\Big)^{M}-\Big(\mathsf{P}_{\mu'}^{\pi'}\Big)^{M}=\sum_{m=0}^{M-1}\Big(\mathsf{P}_{\mu}^{\pi}\Big)^{M-m-1}\Big(\mathsf{P}_{\mu}^{\pi}-\mathsf{P}_{\mu'}^{\pi'}\Big)\Big(\mathsf{P}_{\mu'}^{\pi'}\Big)^{m}\;.
\end{align*}
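The telescoping identity holds for any pair of square matrices, and it can be checked numerically. The sketch below does so on two random row-stochastic matrices, which are hypothetical stand-ins for $\mathsf{P}_{\mu}^{\pi}$ and $\mathsf{P}_{\mu'}^{\pi'}$; the helper names are illustrative.

```python
import numpy as np

def random_kernel(rng, n):
    """Random row-stochastic matrix; a hypothetical stand-in for P_mu^pi."""
    P = rng.random((n, n)) + 0.1
    return P / P.sum(axis=1, keepdims=True)

def telescoping_sum(A, B, M):
    """sum_{m=0}^{M-1} A^(M-m-1) (A - B) B^m."""
    mp = np.linalg.matrix_power
    n = A.shape[0]
    total = np.zeros((n, n))
    for m in range(M):
        total += mp(A, M - m - 1) @ (A - B) @ mp(B, m)
    return total

rng = np.random.default_rng(0)
A, B = random_kernel(rng, 4), random_kernel(rng, 4)
M = 6
lhs = np.linalg.matrix_power(A, M) - np.linalg.matrix_power(B, M)
# The gap between both sides should be at floating-point rounding level.
print(np.max(np.abs(lhs - telescoping_sum(A, B, M))))
```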

Consider a function $\psi\colon\mathcal{S}\to\mathbb{R}_{+}$ with $\left\|\psi\right\|_{\infty}\leq 1$. Therefore, since $\mathsf{P}_{\mu}^{\pi}-\mathsf{P}_{\mu'}^{\pi'}$ is a difference of probabilities, we have that

\begin{equation}
\begin{split}
&\sum_{s\in\mathcal{S}}\left[\xi\Big(\mathsf{P}_{\mu}^{\pi}\Big)^{M}\psi\right](s)-\left[\xi\Big(\mathsf{P}_{\mu'}^{\pi'}\Big)^{M}\psi\right](s)\\
&=\sum_{s\in\mathcal{S}}\sum_{m=0}^{M-1}\sum_{s'\in\mathcal{S}}\xi\Big(\mathsf{P}_{\mu}^{\pi}\Big)^{M-m-1}\Big(\mathsf{P}_{\mu}^{\pi}-\mathsf{P}_{\mu'}^{\pi'}\Big)\left(\Big(\mathsf{P}_{\mu'}^{\pi'}\Big)^{m}(s',s)\psi(s)-\lambda_{\pi',\mu'}\psi(s)\right)\;.
\end{split}
\tag{31}
\end{equation}

This is due to the fact that, whenever we evaluate the previous difference of probability matrices on $\psi$, the term $\sum_{s\in\mathcal{S}}\lambda_{\pi',\mu'}\psi(s)$ is perceived as a constant by the transition kernels $\mathsf{P}_{\mu}^{\pi}$ and $\mathsf{P}_{\mu'}^{\pi'}$, so this part sums to zero.
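The cancellation used here is simply that each row of a difference of two stochastic matrices sums to zero, so constant functions are annihilated. A minimal numerical sketch, with random matrices and a random probability vector as hypothetical stand-ins for the kernels and for $\lambda_{\pi',\mu'}$:

```python
import numpy as np

def random_kernel(rng, n):
    """Random row-stochastic matrix (illustrative only)."""
    P = rng.random((n, n)) + 0.1
    return P / P.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
n = 5
A, B = random_kernel(rng, n), random_kernel(rng, n)
psi = rng.random(n)                 # test function with values in [0, 1)
lam = rng.dirichlet(np.ones(n))     # any probability vector plays the role of lambda
const = lam @ psi                   # the scalar sum_s lambda(s) psi(s)

shifted = (A - B) @ (psi - const)   # subtracting the constant ...
plain = (A - B) @ psi               # ... leaves the result unchanged
print(np.max(np.abs(shifted - plain)))
```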

Define $\phi_{m}\colon\mathcal{S}\to\mathbb{R}_{+}$ as

\begin{align*}
\phi_{m}(s)=\sum_{s'\in\mathcal{S}}\Big(\mathsf{P}_{\mu}^{\pi}-\mathsf{P}_{\mu'}^{\pi'}\Big)(s,s')\sum_{s''\in\mathcal{S}}\left(\Big(\mathsf{P}_{\mu'}^{\pi'}\Big)^{m}(s',s'')\psi(s'')-\lambda_{\pi',\mu'}\psi(s'')\right)\;.
\end{align*}

From Assumption 3, we have that

\begin{equation}
\left|\Big(\mathsf{P}_{\mu'}^{\pi'}\Big)^{m}(s',s'')\psi(s'')-\lambda_{\pi',\mu'}\psi(s'')\right|\leq C_{\operatorname{Erg}}\rho^{m}\left\|\psi\right\|_{\infty}\;.
\tag{32}
\end{equation}
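To illustrate the kind of geometric decay that Assumption 3 postulates, the sketch below measures the distance between the rows of $P^{m}$ and the stationary distribution for a strictly positive random kernel. Everything here is an assumption made for illustration: the kernel is random, total variation is taken as half the $\ell_1$ distance, and the rate comes from the classical Doeblin minorization argument rather than from the paper's constants $C_{\operatorname{Erg}}$ and $\rho$.

```python
import numpy as np

def sup_tv_to_stationary(P, lam, m):
    """sup_s TV(P^m(s, .), lam), with TV taken as half the l1 distance."""
    Pm = np.linalg.matrix_power(P, m)
    return 0.5 * np.abs(Pm - lam).sum(axis=1).max()

rng = np.random.default_rng(2)
n = 5
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)       # strictly positive row-stochastic kernel

# Stationary distribution: left eigenvector of P for the eigenvalue 1.
w, V = np.linalg.eig(P.T)
lam = np.real(V[:, np.argmin(np.abs(w - 1.0))])
lam /= lam.sum()

# Doeblin rate from the minorization P(s, .) >= n * min(P) * Uniform.
rho = 1.0 - n * P.min()
for m in range(6):
    print(m, sup_tv_to_stationary(P, lam, m), rho**m)
```

In this toy setting the measured distance stays below $\rho^{m}$, which is the discrete analogue of the bound in (32) with a constant equal to 1.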

Therefore, we have

\begin{align*}
|\phi_{m}(s)|\leq\sup_{s'\in\mathcal{S}}\Big(\mathsf{P}_{\mu}^{\pi}(s',s)-\mathsf{P}_{\mu'}^{\pi'}(s',s)\Big)C_{\operatorname{Erg}}\rho^{m}\left\|\psi\right\|_{\infty}\;,
\end{align*}

and, applying Assumption 1, we get

\begin{align*}
|\phi_{m}(s)|\leq L_{\mathsf{P}}\left(\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)\times C_{\operatorname{Erg}}\rho^{m}\left\|\psi\right\|_{\infty}\;,
\end{align*}

for any $s'\in\mathcal{S}$. Combining this with (31), we get, from the characterization of the total variation norm through integrals against positive functions bounded in sup-norm by $1$, that

\begin{align*}
\left\|\xi\Big(\mathsf{P}_{\mu}^{\pi}\Big)^{M}-\xi\Big(\mathsf{P}_{\mu'}^{\pi'}\Big)^{M}\right\|_{\mathrm{TV}}
\leq\;&L_{\mathsf{P}}C_{\operatorname{Erg}}\sum_{m=0}^{M-1}\rho^{m}\sum_{s\in\mathcal{S}}\xi\Big(\mathsf{P}_{\mu}^{\pi}\Big)^{M-m-1}\left(\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)\\
\leq\;&L_{\mathsf{P}}C_{\operatorname{Erg}}\sum_{m=0}^{M-1}\rho^{m}\sum_{s\in\mathcal{S}}\xi(s)\left(\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)\\
\leq\;&L_{\mathsf{P}}C_{\operatorname{Erg}}\frac{1-\rho^{M}}{1-\rho}\left(\sum_{s\in\mathcal{S}}\xi(s)\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)\\
\leq\;&L_{\mathsf{P}}C_{\operatorname{Erg}}\frac{1-\rho^{M}}{1-\rho}\left(\sup_{s\in\mathcal{S}}\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)\;,
\end{align*}

where we have used the fact that $\mathsf{P}_{\mu}^{\pi}$ is a stochastic matrix, so its largest eigenvalue is $1$, and that $\xi$ is a vector with only positive components.
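For completeness, the last two steps of the chain can be spelled out as follows. This is a sketch of the elementary computation, assuming only that $\xi$ is a probability vector, so that $\sum_{s}\xi(s)=1$, and that $0\leq\rho<1$:

```latex
\begin{align*}
\sum_{m=0}^{M-1}\rho^{m}\sum_{s\in\mathcal{S}}\xi(s)
  \Big(\|\pi(\cdot|s)-\pi'(\cdot|s)\|_{\mathrm{TV}}+\|\mu-\mu'\|_{2}\Big)
&=\frac{1-\rho^{M}}{1-\rho}
  \Big(\sum_{s\in\mathcal{S}}\xi(s)\|\pi(\cdot|s)-\pi'(\cdot|s)\|_{\mathrm{TV}}
  +\|\mu-\mu'\|_{2}\sum_{s\in\mathcal{S}}\xi(s)\Big)\\
&=\frac{1-\rho^{M}}{1-\rho}
  \Big(\sum_{s\in\mathcal{S}}\xi(s)\|\pi(\cdot|s)-\pi'(\cdot|s)\|_{\mathrm{TV}}
  +\|\mu-\mu'\|_{2}\Big),\\
\sum_{s\in\mathcal{S}}\xi(s)\|\pi(\cdot|s)-\pi'(\cdot|s)\|_{\mathrm{TV}}
&\leq\sup_{s\in\mathcal{S}}\|\pi(\cdot|s)-\pi'(\cdot|s)\|_{\mathrm{TV}}.
\end{align*}
```

The first identity is the geometric sum $\sum_{m=0}^{M-1}\rho^{m}=(1-\rho^{M})/(1-\rho)$, and the final bound holds because a $\xi$-weighted average never exceeds the supremum.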

Step 2. Consider now the ergodic distributions. Fix $M\geq 1$. From the triangle inequality, we have

\begin{align*}
\left\|\lambda_{\pi,\mu}-\lambda_{\pi',\mu'}\right\|_{\mathrm{TV}}
\leq\left\|\lambda_{\pi,\mu}-\Big(\mathsf{P}_{\mu}^{\pi}\Big)^{M}\right\|_{\mathrm{TV}}+\left\|\Big(\mathsf{P}_{\mu}^{\pi}\Big)^{M}-\Big(\mathsf{P}_{\mu'}^{\pi'}\Big)^{M}\right\|_{\mathrm{TV}}+\left\|\Big(\mathsf{P}_{\mu'}^{\pi'}\Big)^{M}-\lambda_{\pi',\mu'}\right\|_{\mathrm{TV}}\;.
\end{align*}

From Assumption 3 together with Step 1, we obtain

$$
\begin{aligned}
\left\|\lambda_{\pi,\mu}-\lambda_{\pi',\mu'}\right\|_{\mathrm{TV}}
&\leq C_{\operatorname{Erg}}\rho^{M}+L_{\mathsf{P}}C_{\operatorname{Erg}}\frac{1-\rho^{M}}{1-\rho}\left(\sum_{s\in\mathcal{S}}\xi(s)\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)+C_{\operatorname{Erg}}\rho^{M}\\
&\leq C_{\operatorname{Erg}}\rho^{M}+\frac{L_{\mathsf{P}}C_{\operatorname{Erg}}}{1-\rho}\left(\sum_{s\in\mathcal{S}}\xi(s)\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)+C_{\operatorname{Erg}}\rho^{M}\\
&\leq C_{\operatorname{Erg}}\rho^{M}+\frac{L_{\mathsf{P}}C_{\operatorname{Erg}}}{1-\rho}\left(\sup_{s\in\mathcal{S}}\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)+C_{\operatorname{Erg}}\rho^{M}\;.
\end{aligned}
$$

As this is true for any $M \geq 1$, taking $M$ to infinity, from Fatou's lemma we get (30). ∎
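The limiting step can be sketched as follows, assuming $\rho\in[0,1)$ (as the geometric ergodicity rate in Assumption 3 suggests) and assuming that (30), which is not restated here, is the resulting Lipschitz-type bound. The only $M$-dependent contribution on the right-hand side is $2C_{\operatorname{Erg}}\rho^{M}$, which vanishes in the limit:

$$
\begin{aligned}
\left\|\lambda_{\pi,\mu}-\lambda_{\pi',\mu'}\right\|_{\mathrm{TV}}
&\leq \lim_{M\to\infty}\left[2C_{\operatorname{Erg}}\rho^{M}+\frac{L_{\mathsf{P}}C_{\operatorname{Erg}}}{1-\rho}\left(\sup_{s\in\mathcal{S}}\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)\right]\\
&=\frac{L_{\mathsf{P}}C_{\operatorname{Erg}}}{1-\rho}\left(\sup_{s\in\mathcal{S}}\left\|\pi(\cdot|s)-\pi'(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu'\right\|_{2}\right)\;.
\end{aligned}
$$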

E.2 From bound on Value function to bounds on Policy

In this section, we demonstrate how a bound on the value function naturally leads to a corresponding bound on the policies. In the seminal work by Shani et al. (2020), an $\widetilde{O}(1/N)$ bound was established for the cost functions. This result can be extended to derive a bound on the distance between policies by leveraging the properties of regularization. The connection between the value function and policies highlights the role of regularization in maintaining both theoretical guarantees and practical performance stability.

Indeed, owing to the entropic regularization, the optimization problem (3) associated with the profile $\mu$ admits a unique solution $\pi_{\mu}$. These considerations form the foundation of the following proposition.

Proposition E.2.

We have that

$$
\left\|\pi(\cdot|s_{0})-\pi_{\mu}(\cdot|s_{0})\right\|_{\mathrm{TV}}^{2}\leq\frac{2}{\eta(1-\gamma)}\left(J^{\operatorname{MFG}}(\pi_{\mu},\mu,s_{0})-J^{\operatorname{MFG}}(\pi,\mu,s_{0})\right)\;, \tag{33}
$$

for any $s_{0}\in\mathcal{S}$.

Proof.

Denote by $\Omega$ the entropic regularization term in the reward function (2) as a function of the occupation measure, i.e.,

$$
\Omega\left(\mathsf{d}_{\xi,\mu}^{\pi}\right):=\sum_{a\in\mathcal{A},s\in\mathcal{S}}\mathsf{d}_{\xi,\mu}^{\pi}(s,a)\log(\pi(a|s))\;,
$$

with $\mathsf{d}_{\xi,\mu}^{\pi}$ as in (5). Therefore, we can express $J^{\operatorname{MFG}}$ as

$$
J^{\operatorname{MFG}}(\pi,\mu,\xi)=\sum_{a\in\mathcal{A},s\in\mathcal{S}}\mathsf{r}(s,a,\mu)\,\mathsf{d}_{\xi,\mu}^{\pi}(s,a)+\eta\,\Omega\left(\mathsf{d}_{\xi,\mu}^{\pi}\right)\;. \tag{34}
$$

Taking the disintegration with respect to the spatial component, we have the following relationship between the occupation measure and its marginal

$$
\begin{aligned}
\mathsf{d}_{\xi,\mu}^{\pi}(s,a)&=\sum_{t=0}^{\infty}\gamma^{t}\pi(a|s)\,\mathsf{P}_{\pi}\Big(s_{t}=s\,\Big|\,s_{0}\sim\xi,\;s_{t+1}\sim\mathsf{P}(\cdot|s_{t},a,\mu)\Big)\\
&=\pi(a|s)\;\overline{\mathsf{d}}_{\mu,\xi}^{\pi}(s)\;,
\end{aligned} \tag{35}
$$

with $\overline{\mathsf{d}}_{\mu,\xi}^{\pi}$ as in (6). This also implies that

$$
\Omega\left(\mathsf{d}_{\xi,\mu}^{\pi}\right)=\sum_{a\in\mathcal{A},s\in\mathcal{S}}\mathsf{d}_{\xi,\mu}^{\pi}(s,a)\log(\pi(a|s))=\sum_{a\in\mathcal{A},s\in\mathcal{S}}\pi(a|s)\log(\pi(a|s))\;\overline{\mathsf{d}}_{\mu,\xi}^{\pi}(s)\;. \tag{36}
$$

From (34), as the optimal value does not depend on the initial condition, we have that the optimal policy $\pi_{\mu}$ of the previous optimization problem satisfies

$$
\nabla_{\pi}J^{\operatorname{MFG}}(\pi_{\mu},\mu,s_{0})=0\;,
$$

for any $s_{0}\in\mathcal{S}$. Combining this with (35), we obtain that the previous condition is equivalent to

$$
\mathsf{r}(s,a,\mu)=-\eta\,\nabla\Omega\left(\mathsf{d}_{s_{0},\mu}^{\pi_{\mu}}\right)(s,a)\;,
$$

for any $a\in\mathcal{A}$, $s_{0}\in\mathcal{S}$. We recall that the Bregman divergence $\mathbf{D}_{\Omega}$ with respect to the regularization $\Omega$ is defined as follows

$$
\mathbf{D}_{\Omega}(\nu\|\nu')=\Omega(\nu)-\Omega(\nu')-\nabla\Omega(\nu')^{\top}\left(\nu-\nu'\right)\;,\qquad\text{ for }\nu,\nu'\in\mathcal{A}\times\mathcal{S}\;.
$$

Therefore,

$$
\begin{aligned}
&J^{\operatorname{MFG}}(\pi,\mu,s_{0})-J^{\operatorname{MFG}}(\pi_{\mu},\mu,s_{0})\\
&=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\mathsf{r}(s,a,\mu)\left(\mathsf{d}_{s_{0},\mu}^{\pi}-\mathsf{d}_{s_{0},\mu}^{\pi_{\mu}}\right)(s,a)+\eta\,\Omega\left(\mathsf{d}_{s_{0},\mu}^{\pi}\right)-\eta\,\Omega\left(\mathsf{d}_{s_{0},\mu}^{\pi_{\mu}}\right)\\
&=-\eta\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\nabla\Omega\left(\mathsf{d}_{s_{0},\mu}^{\pi_{\mu}}\right)(s,a)\left(\mathsf{d}_{s_{0},\mu}^{\pi}-\mathsf{d}_{s_{0},\mu}^{\pi_{\mu}}\right)(s,a)+\eta\,\Omega\left(\mathsf{d}_{s_{0},\mu}^{\pi}\right)-\eta\,\Omega\left(\mathsf{d}_{s_{0},\mu}^{\pi_{\mu}}\right)\\
&=\eta\cdot\mathbf{D}_{\Omega}\left(\mathsf{d}_{s_{0},\mu}^{\pi}\,\middle\|\,\mathsf{d}_{s_{0},\mu}^{\pi_{\mu}}\right)\;.
\end{aligned} \tag{37}
$$

However, the Bregman divergence corresponding to the entropy regularization $\Omega$ has the following expression (see, e.g., Neu et al., 2017)

$$
\begin{aligned}
\mathbf{D}_{\Omega}\left(\mathsf{d}_{s_{0},\mu}^{\pi}\,\middle\|\,\mathsf{d}_{s_{0},\mu}^{\pi_{\mu}}\right)
&=\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\mathsf{d}_{s_{0},\mu}^{\pi}(s,a)\log\left(\frac{\pi(a|s)}{\pi_{\mu}(a|s)}\right)\\
&=\sum_{s\in\mathcal{S}}\overline{\mathsf{d}}_{\mu,s_{0}}^{\pi}(s)\sum_{a\in\mathcal{A}}\pi(a|s)\log\left(\frac{\pi(a|s)}{\pi_{\mu}(a|s)}\right)\\
&=\sum_{s\in\mathcal{S}}\overline{\mathsf{d}}_{\mu,s_{0}}^{\pi}(s)\,\mathrm{KL}\big(\pi(a|s)\,\big\|\,\pi_{\mu}(a|s)\big)\;.
\end{aligned}
$$
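The identity above can be checked numerically. The following is a minimal sketch, not the authors' code: it treats $\Omega$ as a function of the occupation measure alone, with the policy recovered as $\pi(a|s)=\mathsf{d}(s,a)/\sum_{a'}\mathsf{d}(s,a')$, and it assumes the gradient formula $\nabla\Omega(\mathsf{d}')(s,a)=\log\pi'(a|s)$, which follows from differentiating $\sum_{s,a}\mathsf{d}(s,a)\log\big(\mathsf{d}(s,a)/\sum_{a'}\mathsf{d}(s,a')\big)$ but is not spelled out in the text. The two small tables `d` and `d_ref` are arbitrary illustrative values.

```python
import math

def induced_policy(d):
    """pi(a|s) = d(s,a) / sum_a' d(s,a') for an occupation measure d[s][a]."""
    return [[x / sum(row) for x in row] for row in d]

def omega(d):
    """Omega(d) = sum_{s,a} d(s,a) * log pi(a|s), with pi induced by d."""
    pi = induced_policy(d)
    return sum(d[s][a] * math.log(pi[s][a])
               for s in range(len(d)) for a in range(len(d[s])))

def bregman_omega(d, d_ref):
    """D_Omega(d || d_ref), assuming grad Omega(d_ref)(s,a) = log pi_ref(a|s)."""
    pi_ref = induced_policy(d_ref)
    linear = sum(math.log(pi_ref[s][a]) * (d[s][a] - d_ref[s][a])
                 for s in range(len(d)) for a in range(len(d[s])))
    return omega(d) - omega(d_ref) - linear

def weighted_kl(d, d_ref):
    """sum_s d_bar(s) * KL(pi(.|s) || pi_ref(.|s)), with d_bar the state marginal of d."""
    pi, pi_ref = induced_policy(d), induced_policy(d_ref)
    total = 0.0
    for s in range(len(d)):
        d_bar_s = sum(d[s])
        kl_s = sum(pi[s][a] * math.log(pi[s][a] / pi_ref[s][a])
                   for a in range(len(d[s])))
        total += d_bar_s * kl_s
    return total

# Illustrative (hypothetical) occupation measures over 2 states and 2 actions.
d = [[0.3, 0.1], [0.2, 0.4]]
d_ref = [[0.25, 0.25], [0.1, 0.4]]
gap = abs(bregman_omega(d, d_ref) - weighted_kl(d, d_ref))
print("absolute gap between the two expressions:", gap)
```

The gap should be at the level of floating-point rounding, since the identity is purely algebraic and does not depend on the two tables being occupation measures of the same dynamics.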

Moreover, from the definition of $\overline{\mathsf{d}}$, extracting the first term of the series, we obtain

$$
\begin{aligned}
\overline{\mathsf{d}}_{\mu,s_{0}}^{\pi}(s)
&=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathsf{P}_{\mu}^{\pi}(s_{t}=s)\\
&=(1-\gamma)\delta_{s_{0}}(s)+(1-\gamma)\sum_{t=1}^{\infty}\gamma^{t}\,\mathsf{P}_{\mu}^{\pi}(s_{t}=s)\\
&=(1-\gamma)\delta_{s_{0}}(s)+(1-\gamma)\gamma\sum_{t=0}^{\infty}\gamma^{t}\,\mathsf{P}_{\mu}^{\pi}(s_{t+1}=s)\;.
\end{aligned}
$$

Using the decomposition of $\mathsf{P}_{\mu}^{\pi}(s_{t+1}=s)$ as

$$
\begin{aligned}
\mathsf{P}_{\mu}^{\pi}(s_{t+1}=s)
&=\sum_{(s',a')\in\mathcal{S}\times\mathcal{A}}\mathsf{P}(s_{t+1}=s\,|\,s_{t}=s',a_{t}=a',\mu)\,\mathsf{P}_{\mu}^{\pi}(s_{t}=s',a_{t}=a')\\
&=\sum_{(s',a')\in\mathcal{S}\times\mathcal{A}}\mathsf{P}(s|s',a',\mu)\,\mathsf{P}_{\mu}^{\pi}(s_{t}=s',a_{t}=a')\;,
\end{aligned}
$$

we get

$$
\begin{aligned}
\overline{\mathsf{d}}_{\mu,s_{0}}^{\pi}(s)
&=(1-\gamma)\delta_{s_{0}}(s)+\gamma\sum_{(s',a')\in\mathcal{S}\times\mathcal{A}}\mathsf{P}(s|s',a',\mu)\,(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathsf{P}_{\mu}^{\pi}(s_{t}=s',a_{t}=a')\\
&=(1-\gamma)\delta_{s_{0}}(s)+\gamma\sum_{(s',a')\in\mathcal{S}\times\mathcal{A}}\mathsf{P}(s|s',a',\mu)\,\mathsf{d}_{s_{0},\mu}^{\pi}(s',a')\;.
\end{aligned}
$$
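This recursion can be verified numerically on a toy model. The sketch below is illustrative only: the two-state, two-action kernel `P`, the policy `pi`, the discount `gamma`, and the truncation `horizon` are all hypothetical choices, the mean-field argument $\mu$ is held fixed and absorbed into `P`, and the normalized convention $\overline{\mathsf{d}}=(1-\gamma)\sum_{t}\gamma^{t}\mathsf{P}_{\mu}^{\pi}(s_{t}=\cdot)$ with $\mathsf{d}(s,a)=\pi(a|s)\,\overline{\mathsf{d}}(s)$ is assumed.

```python
def state_occupancy(P, pi, s0, gamma, horizon=2000):
    """Truncated d_bar(s) = (1 - gamma) * sum_{t < horizon} gamma^t * Prob(s_t = s | s_0 = s0)."""
    n = len(P)
    dist = [1.0 if s == s0 else 0.0 for s in range(n)]  # law of s_t, starting at delta_{s0}
    d_bar = [0.0] * n
    weight = 1.0 - gamma
    for _ in range(horizon):
        for s in range(n):
            d_bar[s] += weight * dist[s]
        # Advance the state law by one step of the chain induced by pi.
        nxt = [0.0] * n
        for sp in range(n):
            for ap in range(len(pi[sp])):
                for s in range(n):
                    nxt[s] += dist[sp] * pi[sp][ap] * P[sp][ap][s]
        dist = nxt
        weight *= gamma
    return d_bar

def flow_rhs(P, pi, d_bar, s0, gamma):
    """(1 - gamma) * delta_{s0}(s) + gamma * sum_{s',a'} P(s | s', a') * d(s', a')."""
    n = len(P)
    rhs = [(1.0 - gamma) * (1.0 if s == s0 else 0.0) for s in range(n)]
    for sp in range(n):
        for ap in range(len(pi[sp])):
            d_sa = d_bar[sp] * pi[sp][ap]  # state-action occupancy d(s', a')
            for s in range(n):
                rhs[s] += gamma * P[sp][ap][s] * d_sa
    return rhs

# Hypothetical toy model: P[s'][a'][s] is the probability of moving from s' to s under a'.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.3, 0.7]]]
pi = [[0.6, 0.4], [0.1, 0.9]]
gamma, s0 = 0.9, 0

d_bar = state_occupancy(P, pi, s0, gamma)
rhs = flow_rhs(P, pi, d_bar, s0, gamma)
residual = max(abs(x - y) for x, y in zip(d_bar, rhs))
print("largest mismatch between the two sides:", residual)
```

The truncation introduces an error of order $\gamma^{\text{horizon}}$, so the mismatch should be negligible for the values chosen here.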

Therefore, for a function $\psi:\mathcal{S}\to[0,\infty)$, we have

$$
\begin{aligned}
\sum_{s\in\mathcal{S}}\psi(s)\,\overline{\mathsf{d}}_{\mu,s_{0}}^{\pi}(s)
&=(1-\gamma)\psi(s_{0})+\gamma\sum_{(s',a')\in\mathcal{S}\times\mathcal{A}}\sum_{s\in\mathcal{S}}\psi(s)\,\mathsf{P}(s|s',a',\mu)\,\mathsf{d}_{s_{0},\mu}^{\pi}(s',a')\\
&\geq(1-\gamma)\psi(s_{0})\;,
\end{aligned}
$$

since $\psi$ is a nonnegative function and $\mathsf{d}_{s_0,\mu}^{\pi}$ is a nonnegative measure. Applying this to the nonnegative function $s\mapsto\mathrm{KL}\big(\pi(a|s)\,\big\|\,\pi_{\mu}(a|s)\big)$, together with Pinsker's inequality (see, e.g., Cover, 1999), we have

$$\begin{aligned}
\mathbf{D}_{\Omega}\left(\mathsf{d}_{s_0,\mu}^{\pi}\,\middle\|\,\mathsf{d}_{s_0,\mu}^{\pi_{\mu}}\right) &\geq (1-\gamma)\,\mathrm{KL}\big(\pi(a|s_0)\,\big\|\,\pi_{\mu}(a|s_0)\big) \\
&\geq \frac{1-\gamma}{2}\left\|\pi(a|s_0)-\pi_{\mu}(a|s_0)\right\|_{\mathrm{TV}}^{2}\;.
\end{aligned}$$

Combining this with (37), we get (33). ∎
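The Pinsker step used in the proof above can be sanity-checked numerically. The following is a minimal sketch, not part of the original argument: it assumes that $\|\cdot\|_{\mathrm{TV}}$ denotes the $\ell_1$-distance between probability vectors (the convention under which the factor $1/2$ is valid), and the helper names `kl` and `pinsker_gap` are introduced here only for illustration.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def pinsker_gap(p, q):
    """KL(p || q) - 0.5 * ||p - q||_1^2, which Pinsker's inequality says is >= 0."""
    return kl(p, q) - 0.5 * float(np.sum(np.abs(p - q))) ** 2

# Draw random pairs of action distributions and record the smallest gap.
rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))  # plays the role of pi(.|s_0)
    q = rng.dirichlet(np.ones(5))  # plays the role of pi_mu(.|s_0)
    gaps.append(pinsker_gap(p, q))
min_gap = min(gaps)
```

On random draws the smallest gap should stay nonnegative up to floating-point error, which is all this check is meant to illustrate.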

Proposition E.3.

Suppose Assumption 1 holds. Then, for any two $\mu,\mu'$, we have

$$\left|J^{\operatorname{MFG}}(\pi_{\mu},\mu,s_0)-J^{\operatorname{MFG}}(\pi_{\mu'},\mu',s_0)\right| \leq \frac{L_{\mathsf{r}}+\frac{\gamma}{1-\gamma}L_{\mathsf{P}}\left(\left\|\mathsf{r}\right\|_{\infty}+\eta\log|\mathcal{A}|\right)}{1-\gamma}\cdot\left\|\mu-\mu'\right\|_{2}\;, \tag{38}$$

and

$$\left|J^{\operatorname{MFG}}(\pi,\mu,s_0)-J^{\operatorname{MFG}}(\pi,\mu',s_0)\right| \leq \frac{L_{\mathsf{r}}+\frac{\gamma}{1-\gamma}L_{\mathsf{P}}\left(\left\|\mathsf{r}\right\|_{\infty}+\eta\log|\mathcal{A}|\right)}{1-\gamma}\cdot\left\|\mu-\mu'\right\|_{2}\;, \tag{39}$$

for any $s_0\in\mathcal{S}$ and any $\pi\in\Pi$.

Proof.

Step 1. Let us state the optimal Bellman equations:

$$\begin{aligned}
Q_{\mu}^{\pi_{\mu}}(s,a) &= \mathsf{r}(s,a,\mu)+\gamma\sum_{s'\in\mathcal{S}}\mathsf{P}(s'|s,a,\mu)\cdot J^{\operatorname{MFG}}(\pi_{\mu},\mu,s')\;, \\
J^{\operatorname{MFG}}(\pi_{\mu},\mu,s) &= \eta\log\left(\sum_{a\in\mathcal{A}}\exp\left\{\frac{1}{\eta}Q_{\mu}^{\pi_{\mu}}(s,a)\right\}\right)\;.
\end{aligned}$$

We note that the function $x\mapsto\eta\cdot\log\left(\sum_{i=1}^{d}\exp\{\frac{1}{\eta}x_i\}\right)$ is $1$-Lipschitz in the $\ell_{\infty}$-norm, since its gradient always lies in the probability simplex and therefore has $\ell_{1}$-norm equal to one (see, e.g., Geist et al., 2019). Thus, we have

$$J^{\operatorname{MFG}}(\pi_{\mu},\mu,s_0)-J^{\operatorname{MFG}}(\pi_{\mu'},\mu',s_0) \leq \max_{a\in\mathcal{A}}\left|Q_{\mu}^{\pi_{\mu}}(s_0,a)-Q_{\mu'}^{\pi_{\mu'}}(s_0,a)\right|\,.$$
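The $1$-Lipschitz property of the scaled log-sum-exp map used above can be checked numerically. The sketch below is illustrative only: the function names `soft_max_value` and `softmax` and the value of `eta` are choices made here, not taken from the paper. It verifies on random inputs that the difference of values never exceeds the $\ell_\infty$-distance of the arguments, and that the gradient is a probability vector.

```python
import numpy as np

def soft_max_value(x, eta):
    """eta * log(sum_i exp(x_i / eta)), computed with the usual max-shift for stability."""
    m = np.max(x)
    return float(m + eta * np.log(np.sum(np.exp((x - m) / eta))))

def softmax(x, eta):
    """Gradient of soft_max_value with respect to x (a point of the simplex)."""
    z = np.exp((x - np.max(x)) / eta)
    return z / z.sum()

rng = np.random.default_rng(1)
eta = 0.3  # illustrative regularization level
worst_ratio = 0.0
for _ in range(1000):
    x = rng.normal(size=6)
    y = rng.normal(size=6)
    diff = abs(soft_max_value(x, eta) - soft_max_value(y, eta))
    # Ratio |f(x) - f(y)| / ||x - y||_inf; 1-Lipschitzness means it is at most 1.
    worst_ratio = max(worst_ratio, diff / float(np.max(np.abs(x - y))))

grad = softmax(rng.normal(size=6), eta)
```

Since the gradient has nonnegative entries summing to one, the mean value theorem gives $|f(x)-f(y)|\leq\|x-y\|_{\infty}$, which is what `worst_ratio` probes empirically.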

Next, we study the Lipschitz continuity of the optimal $Q$-values for an arbitrary action $a_0\in\mathcal{A}$:

$$\begin{aligned}
\left|Q_{\mu}^{\pi_{\mu}}(s_0,a_0)-Q_{\mu'}^{\pi_{\mu'}}(s_0,a_0)\right| &\leq \left|\mathsf{r}(s_0,a_0,\mu)-\mathsf{r}(s_0,a_0,\mu')\right| \\
&\quad+\gamma\left|\sum_{s'\in\mathcal{S}}\left[\mathsf{P}(s'|s_0,a_0,\mu)-\mathsf{P}(s'|s_0,a_0,\mu')\right]\cdot J^{\operatorname{MFG}}(\pi_{\mu},\mu,s')\right| \\
&\quad+\gamma\left|\sum_{s'\in\mathcal{S}}\mathsf{P}(s'|s_0,a_0,\mu')\cdot\left[J^{\operatorname{MFG}}(\pi_{\mu},\mu,s')-J^{\operatorname{MFG}}(\pi_{\mu'},\mu',s')\right]\right|\;.
\end{aligned}$$

By Assumption 1, we have

$$\left|\mathsf{r}(s_0,a_0,\mu)-\mathsf{r}(s_0,a_0,\mu')\right| \leq L_{\mathsf{r}}\left\|\mu-\mu'\right\|_{2}\;,\quad \left\|\mathsf{P}(\cdot|s_0,a_0,\mu)-\mathsf{P}(\cdot|s_0,a_0,\mu')\right\|_{\mathrm{TV}} \leq L_{\mathsf{P}}\left\|\mu-\mu'\right\|_{2}\;,$$

and thus

$$\begin{aligned}
\left|Q_{\mu}^{\pi_{\mu}}(s_0,a_0)-Q_{\mu'}^{\pi_{\mu'}}(s_0,a_0)\right| &\leq \left(L_{\mathsf{r}}+\gamma L_{\mathsf{P}}\left\|J^{\operatorname{MFG}}(\pi_{\mu},\mu,\cdot)\right\|_{\infty}\right)\cdot\left\|\mu-\mu'\right\|_{2} \\
&\quad+\gamma\left\|J^{\operatorname{MFG}}(\pi_{\mu},\mu,\cdot)-J^{\operatorname{MFG}}(\pi_{\mu'},\mu',\cdot)\right\|_{\infty}\;.
\end{aligned}$$

Overall, taking the supremum over $s_0\in\mathcal{S}$, we obtain a recursive bound on the difference between the value functions,

$$\begin{aligned}
\left\|J^{\operatorname{MFG}}(\pi_{\mu},\mu,\cdot)-J^{\operatorname{MFG}}(\pi_{\mu'},\mu',\cdot)\right\|_{\infty} &\leq \left(L_{\mathsf{r}}+\gamma L_{\mathsf{P}}\left\|J^{\operatorname{MFG}}(\pi_{\mu},\mu,\cdot)\right\|_{\infty}\right)\cdot\left\|\mu-\mu'\right\|_{2} \\
&\quad+\gamma\left\|J^{\operatorname{MFG}}(\pi_{\mu},\mu,\cdot)-J^{\operatorname{MFG}}(\pi_{\mu'},\mu',\cdot)\right\|_{\infty}\;,
\end{aligned}$$

and therefore

$$\left\|J^{\operatorname{MFG}}(\pi_{\mu},\mu,\cdot)-J^{\operatorname{MFG}}(\pi_{\mu'},\mu',\cdot)\right\|_{\infty} \leq \frac{L_{\mathsf{r}}+\gamma L_{\mathsf{P}}\left\|J^{\operatorname{MFG}}(\pi_{\mu},\mu,\cdot)\right\|_{\infty}}{1-\gamma}\left\|\mu-\mu'\right\|_{2}\;.$$

Using the bound $\left\|J^{\operatorname{MFG}}(\pi_{\mu},\mu,\cdot)\right\|_{\infty}\leq(\left\|\mathsf{r}\right\|_{\infty}+\eta\log|\mathcal{A}|)/(1-\gamma)$, we obtain the statement (38).
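To make (38) concrete, the bound can be compared against soft value iteration on a small synthetic model. Everything in the sketch below is an assumption made for illustration: the toy reward `reward`, the toy kernel `kernel`, the coupling strengths `c` and `eps`, and the reading of $\|\cdot\|_{\mathrm{TV}}$ as the $\ell_1$-distance. Under these choices one may take $L_{\mathsf{r}}=c$ and $L_{\mathsf{P}}=2\,\mathrm{eps}$.

```python
import numpy as np

def soft_optimal_value(r, P, gamma, eta, n_iter=2000):
    """Soft value iteration: J(s) = eta * log sum_a exp(Q(s, a) / eta)."""
    J = np.zeros(r.shape[0])
    for _ in range(n_iter):
        Q = r + gamma * (P @ J)  # Q[s, a] = r[s, a] + gamma * sum_s' P[s, a, s'] * J[s']
        m = Q.max(axis=1)
        J = m + eta * np.log(np.exp((Q - m[:, None]) / eta).sum(axis=1))
    return J

rng = np.random.default_rng(2)
S, A, gamma, eta = 4, 3, 0.9, 0.5
c, eps = 0.7, 0.2                            # hypothetical coupling strengths
r0 = rng.uniform(-1.0, 1.0, size=(S, A))
P0 = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S)
P1 = rng.dirichlet(np.ones(S), size=(S, A))

def reward(mu):
    # Crowd-aversion term: |r(mu) - r(mu')| <= c * ||mu - mu'||_2, so L_r = c.
    return r0 - c * mu[:, None]

def kernel(mu):
    # Mixture of two kernels: ||P(mu) - P(mu')||_1 <= 2 * eps * ||mu - mu'||_2.
    w = eps * mu[0]
    return (1.0 - w) * P0 + w * P1

mu, mu2 = rng.dirichlet(np.ones(S)), rng.dirichlet(np.ones(S))
J_mu = soft_optimal_value(reward(mu), kernel(mu), gamma, eta)
J_mu2 = soft_optimal_value(reward(mu2), kernel(mu2), gamma, eta)

L_r, L_P = c, 2.0 * eps
r_inf = float(np.abs(r0).max()) + c          # upper bound on ||r||_inf over all mu
lip = (L_r + gamma / (1.0 - gamma) * L_P * (r_inf + eta * np.log(A))) / (1.0 - gamma)
gap = float(np.max(np.abs(J_mu - J_mu2)))
bound = lip * float(np.linalg.norm(mu - mu2))
```

The comparison of `gap` with `bound` only illustrates the inequality on one instance; it does not say anything about how tight the constant is in general.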

Step 2. Applying the Bellman equation directly, we have

$$\begin{aligned}
&\left|J^{\operatorname{MFG}}(\pi,\mu,s_0)-J^{\operatorname{MFG}}(\pi,\mu',s_0)\right| \\
&\quad\leq \sum_{a_0\in\mathcal{A}}\pi(a_0|s_0)\left|\mathsf{r}(s_0,a_0,\mu)-\mathsf{r}(s_0,a_0,\mu')\right| \\
&\qquad+\gamma\sum_{a_0\in\mathcal{A}}\pi(a_0|s_0)\left|\sum_{s'\in\mathcal{S}}\left[\mathsf{P}(s'|s_0,a_0,\mu)-\mathsf{P}(s'|s_0,a_0,\mu')\right]\cdot J^{\operatorname{MFG}}(\pi,\mu,s')\right| \\
&\qquad+\gamma\sum_{a_0\in\mathcal{A}}\pi(a_0|s_0)\left|\sum_{s'\in\mathcal{S}}\mathsf{P}(s'|s_0,a_0,\mu')\cdot\left[J^{\operatorname{MFG}}(\pi,\mu,s')-J^{\operatorname{MFG}}(\pi,\mu',s')\right]\right|\;.
\end{aligned}$$

Following the same lines as in Step 1, we then obtain (39). ∎
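For completeness, the recursion behind (39) can be spelled out as follows. This is a sketch under the assumption that $J^{\operatorname{MFG}}(\pi,\mu,\cdot)$ is the entropy-regularized value of the fixed policy $\pi$, so that it satisfies the same sup-norm bound as the optimal value in Step 1. The shorthand $\Delta_{\pi}$ is introduced here only for illustration.

```latex
\begin{align*}
\Delta_{\pi} &:= \left\| J^{\operatorname{MFG}}(\pi,\mu,\cdot) - J^{\operatorname{MFG}}(\pi,\mu',\cdot) \right\|_{\infty}, \\
\Delta_{\pi} &\leq \left( L_{\mathsf{r}} + \gamma L_{\mathsf{P}} \left\| J^{\operatorname{MFG}}(\pi,\mu,\cdot) \right\|_{\infty} \right) \left\| \mu - \mu' \right\|_{2} + \gamma \Delta_{\pi}, \\
\Delta_{\pi} &\leq \frac{L_{\mathsf{r}} + \gamma L_{\mathsf{P}} \left\| J^{\operatorname{MFG}}(\pi,\mu,\cdot) \right\|_{\infty}}{1-\gamma} \left\| \mu - \mu' \right\|_{2}
 \leq \frac{L_{\mathsf{r}} + \frac{\gamma}{1-\gamma} L_{\mathsf{P}} \left( \left\| \mathsf{r} \right\|_{\infty} + \eta \log|\mathcal{A}| \right)}{1-\gamma} \left\| \mu - \mu' \right\|_{2}.
\end{align*}
```

The second line uses Assumption 1 on each of the three terms of the Bellman difference, and the last inequality inserts $\left\|J^{\operatorname{MFG}}(\pi,\mu,\cdot)\right\|_{\infty}\leq(\left\|\mathsf{r}\right\|_{\infty}+\eta\log|\mathcal{A}|)/(1-\gamma)$.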

Corollary E.4.

Suppose Assumption 1 holds. Then, for any two $\mu,\mu'$, we have

$$\sup_{s\in\mathcal{S}}\left\|\pi_{\mu}(\cdot|s)-\pi_{\mu'}(\cdot|s)\right\|_{\mathrm{TV}}^{2} \leq C_{\pi,\mu}\left\|\mu-\mu'\right\|_{2}\;, \tag{40}$$

with

$$C_{\pi,\mu} := \frac{4}{\eta(1-\gamma)}\cdot\frac{L_{\mathsf{r}}+\frac{\gamma}{1-\gamma}L_{\mathsf{P}}\left(\left\|\mathsf{r}\right\|_{\infty}+\eta\log|\mathcal{A}|\right)}{1-\gamma}\;.$$

Proof.

From  E.2, we have that

π(|s0)πμ(|s0)TV2\displaystyle\left\|\pi(\cdot|s_{0})-\pi_{\mu}(\cdot|s_{0})\right\|_{\mathrm{% TV}}^{2}∥ italic_π ( ⋅ | italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) - italic_π start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT ( ⋅ | italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT roman_TV end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
2η(1γ)(JMFG(πμ,μ,s0)JMFG(πμ,μ,s0))absent2𝜂1𝛾superscript𝐽MFGsubscript𝜋𝜇𝜇subscript𝑠0superscript𝐽MFGsubscript𝜋superscript𝜇𝜇subscript𝑠0\displaystyle\leq\frac{2}{\eta(1-\gamma)}\left(J^{\operatorname{MFG}}(\pi_{\mu% },\mu,s_{0})-J^{\operatorname{MFG}}(\pi_{\mu^{\prime}},\mu,s_{0})\right)≤ divide start_ARG 2 end_ARG start_ARG italic_η ( 1 - italic_γ ) end_ARG ( italic_J start_POSTSUPERSCRIPT roman_MFG end_POSTSUPERSCRIPT ( italic_π start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT , italic_μ , italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) - italic_J start_POSTSUPERSCRIPT roman_MFG end_POSTSUPERSCRIPT ( italic_π start_POSTSUBSCRIPT italic_μ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT , italic_μ , italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) )
2η(1γ)(JMFG(πμ,μ,s0)JMFG(πμ,μ,s0)+JMFG(πμ,μ,s0)JMFG(πμ,μ,s0)).absent2𝜂1𝛾superscript𝐽MFGsubscript𝜋𝜇𝜇subscript𝑠0superscript𝐽MFGsubscript𝜋superscript𝜇superscript𝜇subscript𝑠0superscript𝐽MFGsubscript𝜋superscript𝜇superscript𝜇subscript𝑠0superscript𝐽MFGsubscript𝜋superscript𝜇𝜇subscript𝑠0\displaystyle\leq\frac{2}{\eta(1-\gamma)}\left(J^{\operatorname{MFG}}(\pi_{\mu% },\mu,s_{0})-J^{\operatorname{MFG}}(\pi_{\mu^{\prime}},\mu^{\prime},s_{0})+J^{% \operatorname{MFG}}(\pi_{\mu^{\prime}},\mu^{\prime},s_{0})-J^{\operatorname{% MFG}}(\pi_{\mu^{\prime}},\mu,s_{0})\right)\;.≤ divide start_ARG 2 end_ARG start_ARG italic_η ( 1 - italic_γ ) end_ARG ( italic_J start_POSTSUPERSCRIPT roman_MFG end_POSTSUPERSCRIPT ( italic_π start_POSTSUBSCRIPT italic_μ end_POSTSUBSCRIPT , italic_μ , italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) - italic_J start_POSTSUPERSCRIPT roman_MFG end_POSTSUPERSCRIPT ( italic_π start_POSTSUBSCRIPT italic_μ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT , italic_μ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) + italic_J start_POSTSUPERSCRIPT roman_MFG end_POSTSUPERSCRIPT ( italic_π start_POSTSUBSCRIPT italic_μ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT , italic_μ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) - italic_J start_POSTSUPERSCRIPT roman_MFG end_POSTSUPERSCRIPT ( italic_π start_POSTSUBSCRIPT italic_μ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT , italic_μ , italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ) .

Then, applying Proposition E.3 twice, we obtain (40). ∎

E.3 Bound on the Exploitability

To analyze the exploitability $\phi$ of a given policy $\pi$ and a given mean-field parameter $\mu$, we decompose it into two key contributions. The first term captures the suboptimality of the best response against the mean-field distribution, quantifying how much an agent can improve its reward by deviating optimally. The second term accounts for the discrepancy between the current population distribution and the stationary distribution of the Markov reward process induced by $(\pi,\mu)$. This decomposition allows us to explicitly bound the exploitability by controlling both the policy’s optimality and the convergence of the population dynamics to equilibrium.

Proposition E.5.

Fix a policy $\pi\in\Pi$ and a mean-field parameter $\mu\in\mathcal{P}(\mathcal{S})$. Then the exploitability $\phi$, as defined in (7), is bounded by

\[
\phi(\pi,\mu)\leq\left(\max_{\pi^{\prime}}J^{\operatorname{MFG}}\left(\pi^{\prime},\mu,\mu\right)-J^{\operatorname{MFG}}\left(\pi,\mu,\mu\right)\right)+C_{\phi}\left\|\lambda_{\pi,\mu}-\mu\right\|_{2}\;,
\]

with

\[
C_{\phi}:=2\,\frac{L_{\mathsf{r}}+\frac{\gamma}{1-\gamma}L_{\mathsf{P}}\left(\left\|\mathsf{r}\right\|_{\infty}+\eta\log|\mathcal{A}|\right)}{1-\gamma}+2\sqrt{|\mathcal{S}|}\cdot\frac{\left\|\mathsf{r}\right\|_{\infty}+\eta\log(|\mathcal{A}|)}{1-\gamma}\;. \tag{41}
\]
Proof.

Fix a policy $\pi\in\Pi$ and two mean-field parameters $\mu,\mu^{\prime}\in\mathcal{P}(\mathcal{S})$. Then, we have that

\[
J^{\operatorname{MFG}}\left(\pi,\mu,\mu\right)=\left(J^{\operatorname{MFG}}\left(\pi,\mu,\mu\right)-J^{\operatorname{MFG}}\left(\pi,\mu^{\prime},\mu\right)\right)+\left(J^{\operatorname{MFG}}\left(\pi,\mu^{\prime},\mu\right)-J^{\operatorname{MFG}}\left(\pi,\mu^{\prime},\mu^{\prime}\right)\right)+J^{\operatorname{MFG}}\left(\pi,\mu^{\prime},\mu^{\prime}\right)\;.
\]

On the one hand, from Proposition E.3, we have that

\[
\left|J^{\operatorname{MFG}}(\pi,\mu,\mu)-J^{\operatorname{MFG}}(\pi,\mu^{\prime},\mu)\right|\leq\frac{L_{\mathsf{r}}+\frac{\gamma}{1-\gamma}L_{\mathsf{P}}\left(\left\|\mathsf{r}\right\|_{\infty}+\eta\log|\mathcal{A}|\right)}{1-\gamma}\cdot\left\|\mu-\mu^{\prime}\right\|_{2}\;.
\]

On the other hand, we have that

\[
J^{\operatorname{MFG}}\left(\pi,\mu^{\prime},\mu\right)=\sum_{s_{0}\in\mathcal{S}}J^{\operatorname{MFG}}\left(\pi,\mu^{\prime},s_{0}\right)\mu(s_{0})\;,
\]

and a similar decomposition applies to $J^{\operatorname{MFG}}\left(\pi,\mu^{\prime},\mu^{\prime}\right)$. This means that

\begin{align*}
\left|J^{\operatorname{MFG}}(\pi,\mu^{\prime},\mu)-J^{\operatorname{MFG}}(\pi,\mu^{\prime},\mu^{\prime})\right|
&\leq\sum_{s\in\mathcal{S}}\left|J^{\operatorname{MFG}}\left(\pi,\mu^{\prime},s\right)\right|\left|\mu^{\prime}(s)-\mu(s)\right| \\
&\leq\frac{\left\|\mathsf{r}\right\|_{\infty}+\eta\log(|\mathcal{A}|)}{1-\gamma}\sum_{s\in\mathcal{S}}\left|\mu^{\prime}(s)-\mu(s)\right| \\
&\leq\sqrt{|\mathcal{S}|}\cdot\frac{\left\|\mathsf{r}\right\|_{\infty}+\eta\log(|\mathcal{A}|)}{1-\gamma}\left\|\mu^{\prime}-\mu\right\|_{2}\;,
\end{align*}

where we have applied the Cauchy-Schwarz inequality in the last bound. Therefore, we can bound the exploitability $\phi$ as defined in (7) as

\begin{align*}
\phi(\pi,\mu)&=\max_{\pi^{\prime}\in\Pi}J(\pi^{\prime},\lambda_{\pi,\mu},\lambda_{\pi,\mu})-J(\pi,\lambda_{\pi,\mu},\lambda_{\pi,\mu}) \\
&\leq\max_{\pi^{\prime}\in\Pi}J(\pi^{\prime},\mu,\mu)-J(\pi,\mu,\mu)+C_{\phi}\left\|\lambda_{\pi,\mu}-\mu\right\|_{2}\;.
\end{align*}
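To make the origin of the factor $2$ in $C_{\phi}$ explicit, the last inequality can be unpacked as follows. This is only a sketch: it assumes that $J$ in the display above denotes the same objective as $J^{\operatorname{MFG}}$, and the shorthand $\lambda$, $A$, $B$ is introduced here for illustration and does not appear elsewhere in the paper.

```latex
% Shorthand (illustrative, not from the paper):
%   \lambda := \lambda_{\pi,\mu},
%   A := \frac{L_{\mathsf{r}}+\frac{\gamma}{1-\gamma}L_{\mathsf{P}}(\|\mathsf{r}\|_{\infty}+\eta\log|\mathcal{A}|)}{1-\gamma}
%        (constant from Proposition E.3),
%   B := \sqrt{|\mathcal{S}|}\,\frac{\|\mathsf{r}\|_{\infty}+\eta\log|\mathcal{A}|}{1-\gamma}
%        (constant from the Cauchy-Schwarz step).
\begin{align*}
\left|J(\tilde{\pi},\mu,\mu)-J(\tilde{\pi},\lambda,\lambda)\right|
  &\leq (A+B)\,\|\mu-\lambda\|_{2}
  \quad\text{for every } \tilde{\pi}\in\Pi
  \text{ (two bounds above with } \mu^{\prime}=\lambda\text{)}, \\
\max_{\pi^{\prime}\in\Pi}J(\pi^{\prime},\lambda,\lambda)
  &\leq \max_{\pi^{\prime}\in\Pi}J(\pi^{\prime},\mu,\mu)+(A+B)\,\|\lambda-\mu\|_{2}, \\
-J(\pi,\lambda,\lambda)
  &\leq -J(\pi,\mu,\mu)+(A+B)\,\|\lambda-\mu\|_{2}.
\end{align*}
% Summing the last two lines gives the claim with C_{\phi}=2(A+B), which matches (41).
```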

E.4 Discussion on the monotonicity of the optimal Markov Kernel

Define the operator $\mathcal{P}_{M}$ as $\mathcal{P}_{M}(\mu):=\mu\left(\mathsf{P}_{\mu}^{\pi_{\mu}}\right)^{M}$. In this section, we outline sufficient conditions under which this operator, responsible for updating the population distribution in the MFG framework, exhibits monotonicity. Monotonicity of $\mathcal{P}_{M}$ is a crucial property that ensures stability and convergence of the iterative updates toward the Nash equilibrium.

This operator represents a generalization of the standard contractivity condition, which is traditionally formulated with $M=1$. This generalization is motivated by the fact that, as we aim at studying the regularized ergodic MFG problem (2)-(3), the condition can hold for some $M>1$.

This contractivity condition reflects the combined effect of the Lipschitz continuity of the regularized best response $\pi_{\mu}$ and the ergodicity of the Markov reward process $\mathsf{P}_{\mu}^{\pi_{\mu}}$, ensuring the stability and convergence of the mean-field population updates in the ergodic setting.
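As an illustration of the update $\mathcal{P}_{M}(\mu)=\mu\left(\mathsf{P}_{\mu}^{\pi_{\mu}}\right)^{M}$, the following minimal sketch applies a kernel $M$ times to a distribution stored as a list. The callable `kernel_fn` and the toy kernel `toy_kernel` are hypothetical stand-ins: computing the regularized best response $\pi_{\mu}$ and the induced kernel is outside the scope of this sketch.

```python
def vec_mat(mu, K):
    """Row vector times matrix: (mu K)(t) = sum_s mu(s) * K[s][t]."""
    n = len(mu)
    return [sum(mu[s] * K[s][t] for s in range(n)) for t in range(n)]


def apply_P_M(mu, kernel_fn, M):
    """Return mu (P_mu)^M, where P_mu = kernel_fn(mu) is frozen at the input mu.

    kernel_fn is a hypothetical callable returning a row-stochastic matrix
    (list of rows) that plays the role of P_mu^{pi_mu}.
    """
    K = kernel_fn(mu)
    out = list(mu)
    for _ in range(M):
        out = vec_mat(out, K)
    return out


def toy_kernel(mu):
    """Toy mu-dependent kernel (an assumption for illustration only).

    Each row mixes staying in place (weight 0.5) with a jump drawn from
    a blend of the uniform distribution and mu itself; rows sum to one.
    """
    n = len(mu)
    return [
        [0.5 * (s == t) + 0.5 * (0.8 / n + 0.2 * mu[t]) for t in range(n)]
        for s in range(n)
    ]


mu0 = [1.0, 0.0, 0.0]
mu_next = apply_P_M(mu0, toy_kernel, M=3)
```

Freezing the kernel at the input $\mu$ before taking the $M$-th power mirrors the definition above, in which both the kernel and the policy are indexed by the same $\mu$.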

Lemma E.6 (Strong monotonicity of $\mathcal{P}_{M}$).

Suppose that Assumptions 1 and 3 hold. Then, for all probability distributions $\mu$ and $\mu^{\prime}$, we have

\[
\left\langle\mathcal{P}_{M}(\mu^{\prime})-\mathcal{P}_{M}(\mu),\mu^{\prime}-\mu\right\rangle\leq C_{\operatorname{op,MFG}}\left\|\mu^{\prime}-\mu\right\|_{2}^{2}\;,
\]

with

\[
C_{\operatorname{op,MFG}}=C_{\operatorname{Erg}}L_{\mathsf{P}}\frac{1-\rho^{M}}{1-\rho}\left(1+C_{\pi,\mu}\right)+2\left|\mathcal{S}\right|C_{\operatorname{Erg}}\rho^{M}\;.
\]
Proof.

Consider the following decomposition

\[
\left\langle\mu^{\prime}-\mu,\mathcal{P}_{M}(\mu^{\prime})-\mathcal{P}_{M}(\mu)\right\rangle=\left\langle\mu^{\prime}-\mu,\mu^{\prime}\left[\Big(\mathsf{P}_{\mu^{\prime}}^{\pi_{\mu^{\prime}}}\Big)^{M}-\Big(\mathsf{P}_{\mu}^{\pi_{\mu}}\Big)^{M}\right]\right\rangle+\left\langle\mu^{\prime}-\mu,(\mu^{\prime}-\mu)\Big(\mathsf{P}_{\mu}^{\pi_{\mu}}\Big)^{M}\right\rangle\;.
\]

On the one hand, applying Lemma E.1 and the Cauchy-Schwarz inequality, we obtain

\begin{align*}
&\left\langle\mu^{\prime}-\mu,\mu^{\prime}\left[\Big(\mathsf{P}_{\mu^{\prime}}^{\pi_{\mu^{\prime}}}\Big)^{M}-\Big(\mathsf{P}_{\mu}^{\pi_{\mu}}\Big)^{M}\right]\right\rangle \\
&\quad\leq\left\|\mu^{\prime}-\mu\right\|_{2}\cdot\left\|\Big(\mathsf{P}_{\mu^{\prime}}^{\pi_{\mu^{\prime}}}\Big)^{M}-\Big(\mathsf{P}_{\mu}^{\pi_{\mu}}\Big)^{M}\right\|_{\mathrm{TV}} \\
&\quad\leq\left\|\mu^{\prime}-\mu\right\|_{2}\cdot C_{\operatorname{Erg}}L_{\mathsf{P}}\frac{1-\rho^{M}}{1-\rho}\left(\sup_{s\in\mathcal{S}}\left\|\pi_{\mu}(\cdot|s)-\pi_{\mu^{\prime}}(\cdot|s)\right\|_{\mathrm{TV}}+\left\|\mu-\mu^{\prime}\right\|_{2}\right) \\
&\quad\leq C_{\operatorname{Erg}}L_{\mathsf{P}}\frac{1-\rho^{M}}{1-\rho}\left(1+C_{\pi,\mu}\right)\left\|\mu-\mu^{\prime}\right\|_{2}^{2}\;,
\end{align*}

where in the last inequality we have applied Corollary E.4.

On the other hand, from Assumption 3, we get that

\begin{align*}
\left\langle\mu^{\prime}-\mu,(\mu^{\prime}-\mu)\Big(\mathsf{P}_{\mu}^{\pi_{\mu}}\Big)^{M}\right\rangle
&\leq\left\|\mu^{\prime}-\mu\right\|_{1}\left\|(\mu^{\prime}-\mu)\Big(\mathsf{P}_{\mu}^{\pi_{\mu}}\Big)^{M}\right\|_{\mathrm{TV}} \\
&\leq\left\|\mu^{\prime}-\mu\right\|_{1}\left(\left\|\mu\Big(\mathsf{P}_{\mu}^{\pi_{\mu}}\Big)^{M}-\lambda_{\pi_{\mu},\mu}\right\|_{\mathrm{TV}}+\left\|\lambda_{\pi_{\mu},\mu}-\mu^{\prime}\Big(\mathsf{P}_{\mu}^{\pi_{\mu}}\Big)^{M}\right\|_{\mathrm{TV}}\right) \\
&\leq\left\|\mu^{\prime}-\mu\right\|_{1}\cdot 2C_{\operatorname{Erg}}\rho^{M} \\
&\leq 2\left|\mathcal{S}\right|C_{\operatorname{Erg}}\rho^{M}\left\|\mu^{\prime}-\mu\right\|_{2}^{2}\;.
\end{align*}

The condition $C_{\operatorname{op,MFG}}<1$ is satisfied when the Lipschitz constant $L_{\mathsf{P}}$ associated with the transition kernel is sufficiently small, and the exponent $M$ is large enough.

Intuitively, a smaller $L_{\mathsf{P}}$ indicates that the transition dynamics of the MDP are less sensitive to changes in the population distribution, reducing the potential for instability. A regularity condition on $L_{\mathsf{P}}$ is a standard assumption in the literature to ensure the uniqueness of the MFNE. Similar assumptions have been employed in various works, including Becherer & Hesse (2024), Espinosa & Touzi (2015), Lacker & Zariphopoulou (2019), and Tangpi & Zhou (2024), among many others. These studies leverage regularity constraints to prevent degeneracies in equilibrium selection and to guarantee well-posedness in the associated fixed-point problems.

Meanwhile, a larger $M$ amplifies the effect of the contraction over multiple iterations of the operator, ensuring convergence even in cases where individual updates are not strongly contractive. This interplay between $L_{\mathsf{P}}$ and $M$ highlights the importance of balancing the model’s inherent dynamics with the structural assumptions to guarantee monotonicity and stability in the population updates.
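The interplay between $L_{\mathsf{P}}$ and $M$ can be explored numerically. The sketch below evaluates the expression for $C_{\operatorname{op,MFG}}$ stated in Lemma E.6 and searches for the smallest $M$ with $C_{\operatorname{op,MFG}}<1$. The function names and the numerical values in the example are illustrative assumptions, not constants taken from the experiments.

```python
def c_op_mfg(c_erg, l_p, rho, c_pi_mu, n_states, M):
    """Evaluate C_op,MFG from Lemma E.6.

    C_op,MFG = C_Erg * L_P * (1 - rho^M) / (1 - rho) * (1 + C_{pi,mu})
               + 2 * |S| * C_Erg * rho^M,   with 0 <= rho < 1.
    """
    geometric = (1.0 - rho ** M) / (1.0 - rho)
    first = c_erg * l_p * geometric * (1.0 + c_pi_mu)
    second = 2.0 * n_states * c_erg * rho ** M
    return first + second


def smallest_M(c_erg, l_p, rho, c_pi_mu, n_states, max_M=10_000):
    """Return the smallest M in [1, max_M] with C_op,MFG < 1, or None."""
    for M in range(1, max_M + 1):
        if c_op_mfg(c_erg, l_p, rho, c_pi_mu, n_states, M) < 1.0:
            return M
    return None


# Illustrative values (assumptions): a weakly mu-dependent kernel ...
M_small_lp = smallest_M(c_erg=1.0, l_p=0.05, rho=0.5, c_pi_mu=1.0, n_states=4)
# ... versus a strongly mu-dependent kernel, for which no M in range works.
M_large_lp = smallest_M(c_erg=1.0, l_p=1.0, rho=0.5, c_pi_mu=1.0, n_states=4, max_M=200)
```

As $M\to\infty$ the expression tends to $C_{\operatorname{Erg}}L_{\mathsf{P}}(1+C_{\pi,\mu})/(1-\rho)$, so increasing $M$ alone cannot help unless $L_{\mathsf{P}}$ is small enough for this limit to lie below one.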

Appendix F Additional Experiments

We present results for the Exact MF-TRPO algorithm on two extensions of the Crowd Modeling game and benchmark them against Fictitious Play (FP) (Perrin et al., 2020) and Online Mirror Descent (OMD) (Pérolat et al., 2022). Our findings demonstrate that the exact algorithm matches the performance of state-of-the-art methods, highlighting its effectiveness in these settings. In the following, we provide a detailed overview of the games employed.

Grid-based Crowd Modeling Game. This environment, inspired by the Four Rooms example from Geist et al. (2022), is based on a two-dimensional grid with obstacles. Each agent’s state is defined by her position on the grid, and she can choose from five possible actions: moving left, right, up, down, or staying in place. The reward function is designed to discourage overcrowding by penalizing agents based on the population density at their next position. Specifically, agents receive a negative reward proportional to the logarithm of the density at their destination, encouraging a more even distribution across the state space. Additionally, a small bonus is given for staying in place, while moving in any direction results in a penalty. Formally, the reward function is defined as

\[
\mathsf{r}(s,a,\mu)=-\kappa\log(\mu(s))+\Gamma(a)\;,
\]

where $\Gamma(a)=0.2\cdot\mathsf{1}_{\{a=\text{Stay}\}}-0.2\cdot\mathsf{1}_{\{a\neq\text{Stay}\}}$, with $\mathsf{1}$ being the indicator function and $\kappa$ being a crowd-aversion parameter.
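The reward above can be sketched in a few lines. The action label `STAY`, the dictionary representation of $\mu$, and the `floor` guard against $\log(0)$ are assumptions made for illustration and are not specified in the text.

```python
import math

STAY = "stay"  # hypothetical label for the "stay in place" action


def gamma_bonus(action):
    """Gamma(a): +0.2 for staying in place, -0.2 for any move."""
    return 0.2 if action == STAY else -0.2


def crowd_reward(state, action, mu, kappa, floor=1e-12):
    """r(s, a, mu) = -kappa * log(mu(s)) + Gamma(a).

    mu maps each state to its population mass. The floor is an assumption
    of this sketch that keeps the logarithm finite on empty states.
    """
    density = max(mu[state], floor)
    return -kappa * math.log(density) + gamma_bonus(action)


# Two equally populated states: crowd-aversion term is kappa * log(2).
mu_example = {"s0": 0.5, "s1": 0.5}
r_stay = crowd_reward("s0", STAY, mu_example, kappa=1.0)
r_move = crowd_reward("s0", "left", mu_example, kappa=1.0)
```

In this example the two rewards differ by exactly $0.4$, the gap between the staying bonus and the movement penalty.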

In this environment, the transition matrix does not depend on the mean-field distribution $\mu$; however, some stochasticity is introduced through a slipperiness parameter: when an agent selects an action, she is most likely to follow it, but there remains a smaller probability of performing a different valid move. In particular, for each action, a total slipperiness probability of $0.1$ is evenly distributed among the alternative actions. Furthermore, this game can be extended by introducing a designated point of interest, denoted as $s_{\text{target}}$, which guides the behavior of the players. The modified reward function is defined as

\[
\tilde{\mathsf{r}}(s,a,\mu)=\mathsf{r}(s,a,\mu)+\max\left(0.3-0.1\cdot d(s,s_{\text{target}}),0\right)\;,
\]

where $\mathsf{r}(s,a,\mu)$ denotes the previously defined reward function, and $d(s,s_{\text{target}})$ is the distance between state $s$ and the target state $s_{\text{target}}$, computed as the $\ell_{1}$ norm of their coordinate difference.
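The two ingredients described above, the point-of-interest bonus and the slipperiness of the actions, can be sketched as follows. States are assumed to be `(row, column)` tuples, and the slip model ignores walls (every alternative action is treated as valid); both are simplifying assumptions of this sketch rather than details confirmed by the text.

```python
def target_bonus(state, target):
    """max(0.3 - 0.1 * d(s, s_target), 0), with d the l1 distance on the grid."""
    distance = abs(state[0] - target[0]) + abs(state[1] - target[1])
    return max(0.3 - 0.1 * distance, 0.0)


def slip_distribution(intended, actions, slip=0.1):
    """Distribution over executed actions given the intended one.

    The intended action keeps probability 1 - slip; the total slip mass
    is shared evenly among the remaining actions (walls are ignored here).
    """
    others = [a for a in actions if a != intended]
    probs = {a: slip / len(others) for a in others}
    probs[intended] = 1.0 - slip
    return probs


ACTIONS = ["left", "right", "up", "down", "stay"]  # hypothetical labels
probs_up = slip_distribution("up", ACTIONS)
bonus_at_target = target_bonus((5, 5), (5, 5))
bonus_far_away = target_bonus((0, 0), (5, 5))
```

With five actions and a slip mass of $0.1$, each alternative action receives probability $0.025$, and the bonus vanishes at $\ell_{1}$ distance three or more from the target.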

Figure 3: The reading order is (from left to right): Four Rooms Crowd Modeling, Two-Islands-Graph Crowd Modeling, and Four Rooms Crowd Modeling with a point of interest. Solid lines denote $\eta=0.05$, whereas dashed lines indicate $\eta=0.3$.

Two-Islands-Graph Crowd Modeling. The Two-Islands-Graph Crowd Modeling game replaces the grid structure with two graphs, referred to as islands, connected by a single narrow bridge. The main challenge in this setting arises from the limited connectivity between the two sub-populations. The transition matrix is generated randomly, assigning to each node a probability distribution over its neighboring nodes, including itself. The reward function penalizes the logarithm of the mean-field distribution while encouraging movement toward the second island $i_2$:

\[
\mathsf{r}(s,\mu) = -\kappa \log(\mu(s)) \left(2 \cdot \mathsf{1}_{s \in i_2} + \mathsf{1}_{s \in i_1}\right).
\]
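As a hedged illustration of the reward above, the following sketch evaluates $\mathsf{r}(s,\mu)$ for every state. The function name, the encoding of the second island as a set of state indices, and the small floor `eps` that guards against $\log 0$ are assumptions introduced for the example and are not specified in the text.

```python
import math


def two_islands_reward(mu, island2, kappa=0.2, eps=1e-12):
    """Return the list [r(s, mu) for each state s].

    mu      : list of probabilities, one per state.
    island2 : set of state indices on the second island i_2; every other
              state is treated as belonging to the first island i_1.
    eps     : assumed floor on mu(s) to keep the logarithm finite.
    """
    rewards = []
    for s, mass in enumerate(mu):
        weight = 2.0 if s in island2 else 1.0
        rewards.append(-kappa * math.log(max(mass, eps)) * weight)
    return rewards
```

Since $\mu(s) \le 1$, every reward is nonnegative and grows as a state becomes less crowded; at equal occupancy, a state on $i_2$ pays exactly twice as much as a state on $i_1$.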

The Exact MF-TRPO algorithm is evaluated on the two proposed variants of the Grid-Based environment and on the Two-Islands-Graph Crowd Modeling game. The grid-based environment is modeled as an $11 \times 11$ grid with walls delineating four symmetric and interconnected rooms, as in Geist et al. (2022), with all the players starting clustered in the top-left corner. For the graph-based game, we consider a state space of size $|\mathcal{S}| = 14$ and an action space of size $|\mathcal{A}| = 2$, with a branching factor of 2, that is, each state is connected to exactly two neighbors. Initially, all players are positioned at location 2 on the first island $i_1$ (see Figure 5).

F.1 Experimental setting

Results are presented for two values of the regularization parameter, $\eta = 0.05$ and $\eta = 0.3$; throughout all experiments, the discount factor is set to $\gamma = 0.9$. A key feature of both the exact and sample-based methods is the use of a warm start for the policy in the RL component. Rather than resetting the policy to a uniform distribution over actions at each iteration, it is initialized with the policy learned from the previous iteration. Moreover, the step size used for updating the distribution remains constant throughout the learning phase, i.e., $\beta_k = \beta$. The key parameters for the two algorithms are summarized in Table 1.

Table 1: Parameter settings for the algorithms.
| Algorithm | $\kappa$ | $\gamma$ | $\eta$ | $L$ | $\beta$ | $I_{\ell}$ | $P$ | $M$ |
|---|---|---|---|---|---|---|---|---|
| Exact MF-TRPO | $\{0.2, 0.4\}$ | 0.9 | $\{0.05, 0.3\}$ | 10 | 0.01 | N/A | N/A | N/A |
| Sample-Based MF-TRPO | 0.2 | 0.9 | $\{0.05, 0.3\}$ | 100 | 0.1 | $3 \cdot 10^{5}$ | $3 \cdot 10^{5}$ | 100 |
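The warm start and the constant step size described in this subsection can be sketched as an outer loop. This is an illustration under stated assumptions rather than the algorithm as specified in the paper: the exact form of the distribution update is not reproduced in this section, so a damped convex-combination update is assumed here, and `improve_policy` is a hypothetical callback standing in for the TRPO policy-improvement phase. The transition kernel is taken to be independent of $\mu$, as in the grid environment.

```python
def step_distribution(mu, policy, transition):
    """Push mu one step forward: sum over (s, a) of mu(s) * pi(a|s) * P(s'|s, a)."""
    nxt = [0.0] * len(mu)
    for s, mass in enumerate(mu):
        for a, p_action in enumerate(policy[s]):
            for s_next, p_move in enumerate(transition[s][a]):
                nxt[s_next] += mass * p_action * p_move
    return nxt


def damped_update(mu, target, beta):
    """Assumed update rule: convex combination with constant step size beta."""
    return [(1.0 - beta) * m + beta * t for m, t in zip(mu, target)]


def run_outer_loop(mu0, transition, improve_policy, beta=0.01, n_iters=100):
    """Alternate policy improvement and distribution updates.

    transition[s][a] is a list of next-state probabilities.
    improve_policy(policy, mu) is a hypothetical stand-in for the RL phase.
    """
    n_states, n_actions = len(transition), len(transition[0])
    # The uniform policy is used only once, before the first iteration.
    policy = [[1.0 / n_actions] * n_actions for _ in range(n_states)]
    mu = list(mu0)
    for _ in range(n_iters):
        # Warm start: the RL phase begins from the previous iteration's policy.
        policy = improve_policy(policy, mu)
        mu = damped_update(mu, step_distribution(mu, policy, transition), beta)
    return policy, mu
```

Under this assumed update, the iterate stays on the probability simplex for any $\beta \in [0, 1]$, since it is a convex combination of two distributions.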

F.2 Results

The plots report the exploitability, defined in Equation 7, as a measure of the effectiveness of our approach, along with the evolution of the mean-field distribution over time. Compared to FP and OMD, Exact MF-TRPO is competitive in all evaluated environments and achieves the best long-term performance: as training progresses, it keeps improving its policy and ultimately outperforms the other algorithms (see Figure 3). In the grid-based games, players tend to move toward less crowded areas, gradually achieving a more uniform distribution (see Figure 4). When a point of interest is introduced, the players instead cluster around it (see Figure 6). Finally, as shown in Figure 5, the players progressively concentrate on the second island, attracted by the higher reward in that region.
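For readers who want to reproduce an exploitability curve, the sketch below computes a commonly used form of the metric: the gap between the best-response value and the value of the current policy, with the reward frozen at the mean field induced by that policy. Equation 7 is not reproduced in this section, so this unregularized, discounted version is an assumption and may differ from the paper's definition (for instance, in how regularization is handled). All function names are hypothetical.

```python
def _backup(reward, transition, values, gamma, s, a):
    """One-step lookahead: r(s, a) + gamma * E[V(s')]."""
    expected = sum(p * values[s2] for s2, p in enumerate(transition[s][a]))
    return reward[s][a] + gamma * expected


def policy_value(reward, transition, policy, gamma=0.9, n_sweeps=500):
    """Iterative policy evaluation for a fixed reward table reward[s][a]."""
    values = [0.0] * len(reward)
    for _ in range(n_sweeps):
        values = [
            sum(policy[s][a] * _backup(reward, transition, values, gamma, s, a)
                for a in range(len(reward[s])))
            for s in range(len(reward))
        ]
    return values


def best_response_value(reward, transition, gamma=0.9, n_sweeps=500):
    """Value iteration: optimal value against the frozen mean field."""
    values = [0.0] * len(reward)
    for _ in range(n_sweeps):
        values = [
            max(_backup(reward, transition, values, gamma, s, a)
                for a in range(len(reward[s])))
            for s in range(len(reward))
        ]
    return values


def exploitability(reward, transition, policy, init_dist, gamma=0.9):
    """Best-response value minus policy value, averaged over init_dist."""
    v_pi = policy_value(reward, transition, policy, gamma)
    v_br = best_response_value(reward, transition, gamma)
    return sum(d * (b - v) for d, b, v in zip(init_dist, v_br, v_pi))
```

Here `reward[s][a]` is assumed to be already evaluated at the mean field induced by `policy`. The returned value is nonnegative up to numerical error and equals zero when the policy is itself a best response, which is the equilibrium condition that the exploitability curves track.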

Figure 4: Evolution of the mean-field distribution for $\eta = 0.05$ in the Four Rooms Crowd Modeling game. From left to right: step 0, step 1000, and step 5000.
Figure 5: Evolution of the mean-field distribution for $\eta = 0.05$ in the Two-Islands-Graph Crowd Modeling game. From left to right: step 0, step 2000, and step 5000.
Figure 6: Evolution of the mean-field distribution for $\eta = 0.05$ in the Four Rooms Crowd Modeling game with the bottom-right corner being a point of interest. From left to right: step 0, step 1000, and step 5000.