A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops

Shi Fu1, Yingjie Wang1 (corresponding author), Yuzhu Chen2, Xinmei Tian2, Dacheng Tao1 (corresponding author)
1Generative AI Lab, College of Computing and Data Science,
  Nanyang Technological University, Singapore 639798,
2University of Science and Technology of China, Hefei, China
Abstract

High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Consequently, models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). However, the empirical results have been strikingly inconsistent: some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding to explain this discrepancy. This paper introduces the intriguing notion of recursive stability and presents the first theoretical generalization analysis, revealing how both model architecture and the proportion between real and synthetic data influence the success of STLs. We further extend this analysis to transformers in in-context learning, showing that even a constant-sized proportion of real data ensures convergence, while also providing insights into optimal synthetic data sizing.

1 Introduction

The quest for high-quality data is paramount in the training of generative artificial intelligence (AI), such as large language models (LLMs). However, the vast reservoir of publicly available data on the internet has nearly been exhausted (Villalobos et al., 2022), pushing the research community to seek innovative yet plausible solutions. One promising approach is to train the next generation of LLMs using synthetic data generated by earlier generations of the models themselves (Briesch et al., 2023). Additionally, reliance on synthetic data has become almost unavoidable, as many existing datasets are already polluted with synthetic content (Schuhmann et al., 2022), which proves difficult to detect reliably (Sadasivan et al., 2023). This has led to the development of Self-consuming Training Loops (STLs), as illustrated in Figure 1, where generative models are recursively trained on a mix of real and synthetic data generated by the models themselves. In theory, these STLs of data creation and refinement could propel models to new levels of capability, reducing reliance on external datasets.

However, despite their potential, the empirical results of STLs have been highly inconsistent across studies (Shumailov et al., 2024; Alemohammad et al., 2024a; Xing et al., 2025; Dohmatob et al., 2024b). Some studies (Shumailov et al., 2024) have encountered significant setbacks—certain models have shown signs of stagnation, failing to improve or adapt, while others have even regressed, leading to sharp declines in performance. Conversely, other works (Gerstgrasser et al., 2024; Gillman et al., 2024; Alemohammad et al., 2024b; Ferbach et al., 2024) have successfully avoided model collapse by incorporating sufficient real data, augmenting with synthetic data, or introducing guidance during the generation process. However, these observed phenomena lack thorough theoretical explanations.

When and how do STLs generalize effectively, thereby preventing model collapse from a theoretical perspective? Even among “refined” LLMs drawing from similar pools of model-generated data, the results vary significantly (Briesch et al., 2023; Fu et al., 2024a). These inconsistencies highlight the urgency of establishing theoretical guarantees for STLs by exploring the underlying mechanisms that determine when synthetic data generation either facilitates or impedes model development. Initial theoretical explorations have started to address these gaps. For instance, Shumailov et al. (2024) and Alemohammad et al. (2024a) demonstrated model collapse when models were trained exclusively on synthetic data, using simplified Gaussian models to illustrate this issue. In a more detailed theoretical study, Bertrand et al. (2024) derived upper bounds on parameter deviations for likelihood-based models in STLs, establishing convergence under strict statistical and optimization assumptions. Meanwhile, Fu et al. (2024b) relaxed these assumptions and provided bounds on the divergence between synthetic and real data distributions for a simplified diffusion model.

However, existing theoretical research lacks a unified framework and has yet to thoroughly investigate the generalization error of STLs. Additionally, current studies often overlook the role of model architectures in preventing model collapse. Moreover, the behavior of transformers within STLs remains largely unexamined, leaving significant theoretical gaps in the literature. Notably, there is limited exploration of the theoretical trade-offs introduced by synthetic data augmentation. This paper aims to address these gaps with the following contributions:

1. Theoretical Generalization Framework: This paper fills a gap in prior research by being the first to establish generalization error bounds. The key innovation, recursive stability, is introduced to address the core theoretical challenges, specifically the complex recursive structures and the non-i.i.d. nature of the data. Moreover, we demonstrate that the generalization error converges under the following conditions: (1) the generative model satisfies recursive stability, and (2) the proportion of real data is maintained at a non-negligible constant level, thus preventing model collapse.

2. Application to Transformers in In-Context Learning: This paper is the first to extend the theoretical framework to transformer models in in-context learning. We prove that transformers in this setting satisfy recursive stability with a constant-level proportion of real data, controlling output differences in STLs under small perturbations to the initial dataset. Consequently, we show that the generalization error is bounded by $\mathcal{O}(n^{-1}\log^{2}(n)+n^{-1/2}\log(n)+n^{-1/4})$.

3. Trade-off Analysis of Synthetic Data Augmentation: We investigate the trade-off in synthetic data augmentation. By employing decomposition techniques, we demonstrate that while synthetic data improves the generalization performance of each generation on mixed datasets, it concurrently exacerbates distribution divergence across successive generations. Our theoretical findings further show that the optimal size of synthetic augmentation increases as the size of the real dataset decreases.



Figure 1: Self-consuming Training Loops: The initial model $\mathcal{G}_{0}$ is trained on the real dataset $S_{0}$. For generation $1\leq j\leq i$, the model $\mathcal{G}_{j}$ is trained on the mixed dataset $\widetilde{S}_{j}$.

2 Related Work

This section reviews research on STLs and studies of algorithmic stability.

Self-consuming Training Loops. Recent research has increasingly focused on generative models trained within STLs (Shumailov et al., 2024), with much of the analysis conducted from an empirical perspective (Martínez et al., 2023). For example, Shumailov et al. (2024) and Briesch et al. (2023) observe a decline in diversity in language models when a portion of the model’s outputs is recursively used as inputs. Additionally, Wyllie et al. (2024) highlight that recursive training on synthetic data amplifies biases, resulting in significant fairness concerns. To mitigate model collapse, some studies suggest incorporating real data into the training process (Alemohammad et al., 2024a), expanding the size of synthetic datasets (Dohmatob et al., 2024c; Gerstgrasser et al., 2024; Dohmatob et al., 2024a; Feng et al., 2024b), or providing guidance during the generation process (Gillman et al., 2024; Alemohammad et al., 2024b; Feng et al., 2024a).

While empirical research has extensively explored STLs of generative models, theoretical studies on this process remain relatively sparse (Kanabar & Gastpar, 2025; Seddik et al., 2024; Marchi et al., 2024; Gerstgrasser et al., 2024; Zhu et al., 2024; Tao et al., 2024). Notably, Shumailov et al. (2024) and Alemohammad et al. (2024a) offer initial theoretical insights by analyzing a simplified Gaussian model. In a more comprehensive analysis, Bertrand et al. (2024) derive upper bounds on parameter deviations between those obtained within an STL and the optimal values, relying on assumptions about statistical and optimization error bounds. In contrast, Fu et al. (2024b) propose bounds on the divergence between synthetic and real-world data distributions, without such assumptions. However, current research lacks a unified theoretical framework that accounts for the influence of different model architectures and does not provide generalization error bounds for STLs, thus failing to rigorously establish the conditions that guarantee the prevention of model collapse. Furthermore, the behavior of transformers within STLs remains unexplored, leaving substantial theoretical gaps.

Algorithmic stability. Algorithmic stability ensures generalization bounds independent of model capacity. A key measure, uniform stability, was introduced by Bousquet & Elisseeff (2002) and has been instrumental in analyzing the generalization behavior of regularization methods. This measure was later extended to SGD (Hardt et al., 2016), including non-convex and non-smooth settings (Charles & Papailiopoulos, 2018; Bassily et al., 2020; Lei, 2023). Recent work shows that uniform stability can also provide near-optimal bounds with high probability (Feldman & Vondrak, 2019; Bousquet et al., 2020; Klochkov & Zhivotovskiy, 2021; Li & Liu, 2022; Wang et al., 2024).

Building on these foundations, recent research has focused on stability in more complex, non-i.i.d. settings. A common approach models data from stationary and mixing sequences (Doukhan, 1994; Yu, 1994), where weakening dependencies allow stability bounds through mixing coefficients (Mohri & Rostamizadeh, 2010; He et al., 2016; Fu et al., 2023). However, estimating these coefficients remains challenging. Additionally, some studies (Zheng et al., 2023) address non-i.i.d. data by leveraging conditional independence properties. Nonetheless, current methodologies struggle with the complexities of STLs, as the non-i.i.d. nature of mixed datasets, where each generation’s data is influenced by previous generations, presents unresolved challenges for stability frameworks.

Remark 1.

Building on previous challenges, our work advances this area by developing a more comprehensive theoretical framework for analyzing generative models within STLs. Specifically, we present the first generalization error bound by addressing the additional complexity arising from the non-i.i.d. nature of mixed datasets. To address this, we propose the key innovation of recursive stability, which quantifies error propagation across generations of synthetic data. Moreover, we are the first to extend this theoretical framework to transformers, explicitly utilizing error decomposition to illustrate the trade-off introduced by augmenting datasets with synthetic data.

3 Preliminaries

In this section, we begin by formally describing the training process of generative models in STLs, then introduce algorithmic stability with a focus on uniform stability, and finally define recursive stability to address the challenges specific to STLs.

3.1 Generative Models within Self-consuming Training Loops

Generative models have made significant strides in producing highly realistic data, such as images and text, which are frequently shared online and often indistinguishable from real content. Meanwhile, the supply of real data has nearly been exhausted. Consequently, deep generative models increasingly rely on synthetic data, either unintentionally (Schuhmann et al., 2022) or intentionally (Huang et al., 2022). This reliance creates a recursive cycle where successive generations are trained on mixed datasets of real and synthetic data, a process known as an STL, as shown in Figure 1.

More concretely, we explore a stochastic process that evolves through sequential generations. In an STL, we start with an initial dataset $S_{0}$, consisting of real data points $\bm{z}\in\mathcal{Z}$, sampled from the original real distribution $\mathcal{D}_{0}$. The initial generative model $\mathcal{G}_{0}$ is trained on this real dataset $S_{0}$, producing the first generation synthetic dataset $S_{1}$, whose distribution is denoted as $\mathcal{D}_{1}$. Next, the real dataset $S_{0}$ is combined with the synthetic dataset $S_{1}$ in a certain proportion to form a new mixed dataset $\widetilde{S}_{1}$, with distribution $\widetilde{\mathcal{D}}_{1}$. The next generation generative model $\mathcal{G}_{1}$ is then trained on this mixed dataset $\widetilde{S}_{1}$. Moving forward, for each subsequent generation $1\leq j\leq i$, the mixed dataset $\widetilde{S}_{j}$ is composed of real data and synthetic data from previous generations. The generative model $\mathcal{G}_{j}$ is trained on $\widetilde{S}_{j}$, producing the synthetic dataset $S_{j+1}$. This STL proceeds iteratively until the maximum generation, denoted as $i$, is reached.
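The loop described above can be illustrated with a toy simulation. The following is a minimal sketch, not the setting analyzed in this paper: it assumes a one-dimensional Gaussian "generative model" fit by matching the sample mean and standard deviation, and a mixed dataset of fixed size $n$ containing `round(alpha * n)` real points. The function name `run_stl` and all parameters are hypothetical.

```python
import random
import statistics


def run_stl(real_data, alpha, generations, seed=0):
    """Toy self-consuming training loop with a 1-D Gaussian 'generative model'.

    Assumptions (not from the paper): the model is fit by matching the sample
    mean and standard deviation, and every mixed dataset has the same size n
    as the real dataset, with round(alpha * n) real points.
    Returns the fitted (mean, std) of G_0, ..., G_generations.
    """
    rng = random.Random(seed)
    n = len(real_data)
    n_real = round(alpha * n)

    def fit(dataset):
        return statistics.fmean(dataset), statistics.pstdev(dataset)

    # G_0 is trained on the real dataset S_0 only.
    params = [fit(real_data)]
    for _ in range(generations):
        mu, sigma = params[-1]
        # Synthetic samples drawn from the most recent model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        # Mixed dataset: n_real real points plus n - n_real synthetic points.
        mixed = rng.sample(real_data, n_real) + synthetic
        params.append(fit(mixed))
    return params


# Compare a fully synthetic loop (alpha = 0) with a half-real loop (alpha = 0.5).
data_rng = random.Random(1)
s0 = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
for alpha in (0.0, 0.5):
    mu, sigma = run_stl(s0, alpha, generations=50)[-1]
    print(f"alpha={alpha}: mean={mu:.3f}, std={sigma:.3f}")
```

Running the sketch with different values of `alpha` gives a quick, informal view of how retaining real data changes the drift of the fitted parameters across generations.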

3.2 Algorithmic Stability

Algorithmic stability measures the impact of modifying or removing a small number of examples from the training set on the resulting model, a key concept in statistical learning theory (Bousquet & Elisseeff, 2002). Its primary advantage lies in providing generalization bounds independent of model capacity. Among various stability notions (Shalev-Shwartz et al., 2010), we focus on uniform stability, the most widely studied form. Let $S$ and $S^{\prime}$ be two datasets differing by one point. Then, we formally define uniform stability as follows:

Definition 1.

(Uniform Stability (Bousquet & Elisseeff, 2002)). Algorithm $\mathcal{A}$ is uniformly $\beta_{n}$-stable with respect to the loss function $\ell$ if the following holds

$$\forall S,\ S^{\prime}\in\mathcal{Z}^{n},\ \forall\bm{z}\in\mathcal{Z},\ \sup_{\bm{z}}\left|\ell(\mathcal{A}(S),\bm{z})-\ell\left(\mathcal{A}\left(S^{\prime}\right),\bm{z}\right)\right|\leq\beta_{n}.$$
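As a concrete illustration of uniform stability, consider a hypothetical example that is not taken from this paper: the algorithm returns the sample mean of data in $[0,1]$ and the loss is the squared error. Replacing one example moves the mean by at most $1/n$, so the loss changes by at most $2/n$. The sketch below checks this numerically on randomly chosen neighbouring datasets; it only probes some pairs $(S, S^{\prime})$ and is not a proof.

```python
import random
import statistics


def mean_estimator(dataset):
    """Hypothetical learning algorithm A: returns the sample mean."""
    return statistics.fmean(dataset)


def squared_loss(theta, z):
    """Loss ell(theta, z) = (theta - z)^2, bounded by 1 when both lie in [0, 1]."""
    return (theta - z) ** 2


def stability_gap(s, s_prime, grid_size=1001):
    """Largest loss difference over a grid of test points z in [0, 1]."""
    theta, theta_prime = mean_estimator(s), mean_estimator(s_prime)
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return max(
        abs(squared_loss(theta, z) - squared_loss(theta_prime, z)) for z in grid
    )


rng = random.Random(0)
n = 100
s = [rng.random() for _ in range(n)]
worst = 0.0
for _ in range(200):
    s_prime = list(s)
    s_prime[rng.randrange(n)] = rng.random()  # replace a single example
    worst = max(worst, stability_gap(s, s_prime))

print(f"largest observed gap: {worst:.5f}, analytic bound 2/n: {2 / n:.5f}")
```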

Traditional notions of stability have predominantly been studied in the context of learning algorithms, such as SGD (Lei & Ying, 2020). More recently, there has been significant progress in extending the concept of stability to generative models (Farnia & Ozdaglar, 2021; Zheng et al., 2023; Li et al., 2023). Building on these advancements, we propose recursive stability to specifically address generative models within STLs. This new stability measure is designed to quantify the differences in a generative model’s outputs after multiple generations of recursive training when small perturbations are applied to the initial real dataset. The formal definition of recursive stability is presented below.

Definition 2.

(Recursive Stability) Let $S_{0}$ represent the original real dataset, and $S^{\prime}_{0}$ denote a dataset differing from $S_{0}$ by a single example. A generative model $\mathcal{G}$ is said to be recursively $\gamma_{n}^{i,\alpha}$-stable with respect to the distance measure $d$ after the $i$-th generation of STLs, where the ratio of real to synthetic data is set to $\alpha$, if the following condition holds:

$$\forall S_{0},S^{\prime}_{0}\in\mathcal{Z}^{n},\quad d\left(\mathcal{G}^{(i)}(S_{0}),\mathcal{G}^{(i)}(S^{\prime}_{0})\right)\leq\gamma_{n}^{i,\alpha},$$

where $\mathcal{G}^{(i)}$ denotes the output of the generative model at the $i$-th generation in the STLs. The distance measure $d$ quantifies the deviation between the outputs generated from inputs $S_{0}$ and $S_{0}^{\prime}$ across STLs. Specifically, $d$ can be defined using Total Variation (TV) distance, Kullback-Leibler (KL) divergence, or various norms (e.g., $\ell_{2}$ norm), allowing flexibility in assessing the differences in generated outputs.
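To make the definition more tangible, the following sketch estimates a recursive-stability-style gap for a toy one-dimensional Gaussian model. Everything here is an illustrative assumption rather than the paper's construction: the model is fit by moment matching, the distance $d$ is taken to be the $\ell_{2}$ norm between the fitted parameters (mean, standard deviation), and the two runs share the same sampling noise (a coupling) so that the only difference between them is the single perturbed example in $S_{0}$.

```python
import math
import random
import statistics


def stl_params(real_data, alpha, generations, noise):
    """Fitted (mean, std) after `generations` rounds of a toy Gaussian STL.

    `noise[j]` holds the standard-normal draws used in round j, so two runs
    can share the same randomness (a coupling).
    """
    n = len(real_data)
    n_real = round(alpha * n)
    mu, sigma = statistics.fmean(real_data), statistics.pstdev(real_data)
    for j in range(generations):
        synthetic = [mu + sigma * e for e in noise[j][: n - n_real]]
        mixed = real_data[:n_real] + synthetic
        mu, sigma = statistics.fmean(mixed), statistics.pstdev(mixed)
    return mu, sigma


def recursive_gap(n, alpha, generations, seed=0):
    """l2 distance between final parameters for S_0 and a neighbouring S_0'."""
    rng = random.Random(seed)
    s0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    s0_prime = [s0[0] + 1.0] + s0[1:]  # differs from S_0 in a single example
    noise = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(generations)]
    p = stl_params(s0, alpha, generations, noise)
    q = stl_params(s0_prime, alpha, generations, noise)
    return math.dist(p, q)


for n in (100, 1000):
    print(n, recursive_gap(n, alpha=0.5, generations=10))
```

This measures the gap for one particular pair of neighbouring datasets, whereas the definition requires a bound over all such pairs, so the printed values are at best a lower estimate of $\gamma_{n}^{i,\alpha}$ for this toy model.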

4 General Theoretical Results

In this section, we present a general framework for analyzing generalization error. Moving beyond traditional analyses of parameter changes (Bertrand et al., 2024) and distributional discrepancies (Fu et al., 2024b), we focus on evaluating the utility of synthetic data after recursive training (Hittmeir et al., 2019; Xu et al., 2023). Specifically, we examine the behavior of a uniformly stable learning algorithm $\mathcal{A}$ trained on the mixed dataset $\widetilde{S}_{i}$ in the $i$-th generation. Our goal is to study the generalization error of the hypothesis $\mathcal{A}(\widetilde{S}_{i})$. Formally, we aim to bound $|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))|$, where $R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))=\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}[\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z})]$ represents the population risk of $\mathcal{A}(\widetilde{S}_{i})$ under the real distribution $\mathcal{D}_{0}$, and $\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))=\frac{1}{n}\sum_{\bm{z}_{i}\in\widetilde{S}_{i}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})$ denotes the empirical risk on the mixed dataset. To derive this bound, we first decompose the generalization error as follows.

$$\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|\leq\underbrace{\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|}_{\text{Cumulative distribution shift across generations}}+\underbrace{\left|R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|}_{\text{Generalization error on mixed distributions}}.$$
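For completeness, the decomposition is an instance of the triangle inequality: one adds and subtracts the population risk under the mixed distribution $\widetilde{\mathcal{D}}_{i}$. The shorthand $h=\mathcal{A}(\widetilde{S}_{i})$ below is introduced only for this sketch.

```latex
\begin{align*}
\left| R_{\mathcal{D}_{0}}(h) - \widehat{R}_{\widetilde{S}_{i}}(h) \right|
&= \left| R_{\mathcal{D}_{0}}(h) - R_{\widetilde{\mathcal{D}}_{i}}(h)
        + R_{\widetilde{\mathcal{D}}_{i}}(h) - \widehat{R}_{\widetilde{S}_{i}}(h) \right| \\
&\leq \left| R_{\mathcal{D}_{0}}(h) - R_{\widetilde{\mathcal{D}}_{i}}(h) \right|
    + \left| R_{\widetilde{\mathcal{D}}_{i}}(h) - \widehat{R}_{\widetilde{S}_{i}}(h) \right|,
\qquad h = \mathcal{A}(\widetilde{S}_{i}).
\end{align*}
```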

The first term captures the accumulation of error and distribution divergence over multiple generations within the STLs. This heavily depends on the capacity of the generative model to preserve distributional fidelity across generations, requiring recursive techniques to manage error propagation. The second term reflects the generalization performance of the learning algorithm on the non-i.i.d. mixed dataset, where synthetic data points are influenced by the initial real dataset. Drawing on Zheng et al. (2023), we observe that while $S_{0}$ satisfies the i.i.d. assumption, the synthetic datasets $S_{i}$ follow a conditional i.i.d. assumption given $S_{0}$. Leveraging this, along with moment bounds and concentration inequalities, we address the challenge of bounding the second term and managing dependencies within the STLs. We now present the following result.

Theorem 1 (General Generalization Bound).

Assume that $\mathcal{A}$ is a $\beta_{n}$-uniformly stable learning algorithm and the loss function $\ell$ is bounded by $M$. Let $n$ represent the sample size of the mixed dataset $\widetilde{S}_{j}$, defined as $\widetilde{S}_{j}=\alpha S_{0}+(1-\alpha)S_{j}$ for $1\leq j\leq i$, where $0<\alpha\leq 1$ denotes the proportion of real data. Assume further that the generative model $\mathcal{G}$ is recursively $\gamma_{n}^{i}$-stable, and the TV distance for each generation $TV(\widetilde{\mathcal{D}}_{j},\mathcal{D}_{j+1})$ is of the same order, denoted by $d_{\mathrm{TV}}(n)$. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following holds:

$$\begin{aligned}
\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|&\lesssim\gamma_{n}^{i}\alpha M\log(n\alpha)\log(1/\delta)+n^{-1/2}M\sqrt{\log(1/\delta)}\\
&\quad+\beta_{n}\left(\log n\log(1/\delta)+\alpha\sqrt{(1-\alpha)n\log(1/\delta)}\right)+d_{\mathrm{TV}}(n)M\left(1-(1-\alpha)^{i}\right)\alpha^{-1},
\end{aligned}\tag{1}$$

where $\gamma_{n}^{i}=\sup_{j}TV(\mathcal{D}_{i}^{n(1-\alpha)}(S_{0}^{\prime}),\mathcal{D}_{i}^{n(1-\alpha)}(S_{0}))$, with $S_{0}$ and $S_{0}^{\prime}$ representing two real datasets of size $n$, differing by only a single data point.
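To get a feel for how the four terms in Eq. (1) interact, the right-hand side can be evaluated numerically. The sketch below is only illustrative: the constant hidden by $\lesssim$ is set to 1, and the rates plugged in for $\beta_{n}$, $\gamma_{n}^{i}$, and $d_{\mathrm{TV}}(n)$ in the usage loop are hypothetical choices, not values established by the theorem. The function name `theorem1_bound` is likewise an assumption.

```python
import math


def theorem1_bound(n, i, alpha, beta_n, gamma_n_i, d_tv, M=1.0, delta=0.05):
    """Right-hand side of Eq. (1) with the hidden constant set to 1 (an assumption)."""
    log_inv_delta = math.log(1.0 / delta)
    # Recursive stability term: gamma * alpha * M * log(n * alpha) * log(1/delta).
    recursive_term = gamma_n_i * alpha * M * math.log(n * alpha) * log_inv_delta
    # Concentration term: n^{-1/2} * M * sqrt(log(1/delta)).
    concentration_term = n ** -0.5 * M * math.sqrt(log_inv_delta)
    # Uniform stability term.
    stability_term = beta_n * (
        math.log(n) * log_inv_delta
        + alpha * math.sqrt((1.0 - alpha) * n * log_inv_delta)
    )
    # Cumulative distribution shift term.
    shift_term = d_tv * M * (1.0 - (1.0 - alpha) ** i) / alpha
    return recursive_term + concentration_term + stability_term + shift_term


# Hypothetical rates: beta_n = gamma_n^i = 1/n and d_TV(n) = n^{-1/4}.
for n in (10**3, 10**5):
    value = theorem1_bound(
        n, i=10, alpha=0.5, beta_n=1.0 / n, gamma_n_i=1.0 / n, d_tv=n ** -0.25
    )
    print(f"n={n}: bound (up to constants) = {value:.4f}")
```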

Remark 2.

Recursive Stability in STLs. In Theorem 1, the recursive stability parameter is quantified using the TV distance to measure the divergence between the distributions of the $n(1-\alpha)$ synthetic data points generated by the model $\mathcal{G}_{i}$ at the $i$-th generation. Notably, the concept of recursive stability, introduced in Definition 2, is adaptable to various metrics, making it applicable across different types of generative models. In Theorem 2, the recursive stability parameter for transformers is instead defined using the $\ell_{2}$ norm between tokens, allowing this concept to be generalized to a broader range of model architectures.

Moreover, Theorem 1 demonstrates that generative models with higher recursive stability exhibit better performance after undergoing the STL. Specifically, the results indicate that the recursive stability parameter only needs to converge faster than $\mathcal{O}(1/\log n)$ to control the error, which is a relatively mild condition. Furthermore, Theorem 2 shows that, under mild assumptions, the recursive stability parameter for transformers in in-context learning settings achieves a convergence rate of $\gamma_{n}^{i}=\mathcal{O}(1/n)$ when measured by the $\ell_{2}$ norm between tokens.

Remark 3.

Effect of Real Data Proportion on Error Control. Previous experimental results (Shumailov et al., 2024; Alemohammad et al., 2024a) have demonstrated that incorporating real data can mitigate model collapse and help control errors. This remark explores the effect of the real data proportion $\alpha$ on the generalization error within STLs. As shown in Theorem 1, the real data proportion $\alpha$ plays a significant role in the cumulative distribution shift across generations, specifically in the term $2M\left(1-(1-\alpha)^{i}\right)\alpha^{-1}d_{\mathrm{TV}}(n)$.

When $\alpha\to 0$, we observe that $\frac{1-(1-\alpha)^{i}}{\alpha}\to i$, leading to a linear accumulation of errors due to the distribution shift, which makes the overall error increasingly difficult to control. This observation aligns with the theoretical results reported in Shumailov et al. (2024); Dohmatob et al. (2024a); Fu et al. (2024b). However, the conditions on $\alpha$ for controlling this term are not strict. In fact, as long as $\alpha$ remains at a non-negligible constant level, the expression $\left(1-(1-\alpha)^{i}\right)\alpha^{-1}$ is bounded by $\alpha^{-1}$ for every $i$, effectively controlling the error. This aligns with theoretical intuition: when $\alpha$ is too small, the mixed dataset contains insufficient real data, resulting in a more severe distribution shift.
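The two regimes of the accumulation factor can be checked numerically. The following is a minimal sketch, not part of the analysis itself: the helper name `shift_factor` is hypothetical, and it evaluates $\left(1-(1-\alpha)^{i}\right)\alpha^{-1}$ through the equivalent geometric sum $\sum_{k=0}^{i-1}(1-\alpha)^{k}$ so that the limiting case $\alpha=0$ needs no division.

```python
def shift_factor(alpha: float, generations: int) -> float:
    """Return (1 - (1 - alpha)^i) / alpha, evaluated as the geometric sum
    sum_{k=0}^{i-1} (1 - alpha)^k so that alpha = 0 is handled without division."""
    return sum((1.0 - alpha) ** k for k in range(generations))


# Illustrative values only: the factor grows like i for tiny alpha
# and stays below 1 / alpha once alpha is a fixed constant.
for alpha in (0.0, 0.01, 0.1, 0.5):
    row = [round(shift_factor(alpha, i), 3) for i in (1, 10, 100)]
    print(f"alpha={alpha}: factor at i=1, 10, 100 -> {row}")
```

The geometric-sum form also makes the bound by $\alpha^{-1}$ transparent, since the infinite series sums to exactly $\alpha^{-1}$ for $\alpha\in(0,1]$.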

Moreover, the proportion of real data $\alpha$ also impacts the generalization error on mixed distributions, primarily through its effect on the recursive stability parameter $\gamma_{n}^{i}$. As $\alpha$ increases, the generative model becomes more recursively stable. We will further explore the influence of $\alpha$ on the recursive stability parameter $\gamma_{n}^{i}$ for specific generative models, such as transformers, in Theorem 3, particularly in Remark 8.

Remark 4.

Convergence Rate of Uniform Stability Parameter. With respect to the uniform stability parameter $\beta_{n}$, we observe from the third term on the right-hand side of inequality (1) that the convergence rate of $\beta_{n}$ must be at least $\mathcal{O}(1/\sqrt{n})$ to adequately control the error. This is a relatively mild requirement.

For example, in the case of widely used algorithms such as SGD, it has been shown that the uniform stability parameter $\beta_{n}$ converges at a rate of $\mathcal{O}(\log(n)/n)$ under the assumptions of Lipschitz continuity and smoothness of the loss function (Zhang et al., 2022). Additionally, for regularization-based algorithms, such as kernel regularization schemes and the Minimum Relative Entropy (MRE) algorithm, it has been demonstrated that $\beta_{n}$ can achieve a convergence rate of $\mathcal{O}(1/n)$ under certain conditions (Bousquet & Elisseeff, 2002).

Remark 5.

Convergence of the Distribution Shift Term $d_{\mathrm{TV}}(n)$. Regarding the convergence of the term $2M\left(1-(1-\alpha)^{i}\right)\alpha^{-1}d_{\mathrm{TV}}(n)$, as discussed in Remark 3, when $\alpha$ remains a non-negligible constant, attention turns to the distribution shift term $d_{\mathrm{TV}}(n)$. This term critically depends on the generative model’s capacity and quantifies the divergence between the learned distribution and the input distribution in each generation.

Theoretical studies have provided various convergence rates for $d_{\mathrm{TV}}(n)$ across different generative models. For instance, in diffusion models, $d_{\mathrm{TV}}(n)$ has been shown to converge at a rate of $\mathcal{O}\left(1/n^{1/4}\right)$ (Fu et al., 2024b). Similarly, for GANs, the convergence rate is also $\mathcal{O}\left(1/n^{1/4}\right)$ (Liang, 2021). More generally, by applying Pinsker’s inequality to relate KL divergence and TV distance, the convergence rates for other models, such as bias potential models and normalizing flows, have been explored in previous works (Yang, 2022). Additionally, we will further examine the behavior of transformer models in Theorem 3, demonstrating the flexibility of our theoretical framework across a wide range of generative models.

Remark 6.

Comparison with Previous Works. In the realm of theoretical research on the STL, where models are recursively trained on the synthetic data they generate, the foundational work was introduced by Shumailov et al. (2024) and Alemohammad et al. (2024a). They provided the initial theoretical definitions and analyzed the behavior of a simplistic multivariate Gaussian toy model in such loops. However, their analyses were limited to basic theoretical insights and lacked in-depth exploration of more complex generative models.

Recent advancements in this field have primarily come from Bertrand et al. (2024) and Fu et al. (2024b). Bertrand et al. (2024) established an upper bound on the deviation of likelihood-based model output parameters from the optimal ones, denoted as $\left\|\theta_{i}-\theta^{*}\right\|$. This was achieved by making direct assumptions on the upper bounds of both statistical and optimization errors in generative models, as outlined in their Assumption 3. In contrast, Fu et al. (2024b) derived bounds on the TV distance, addressing the distribution divergence between the synthetic data distributions produced by future models and the original real data distribution, with a specific focus on diffusion models. Our work makes significant theoretical advancements over both Bertrand et al. (2024) and Fu et al. (2024b) in several key aspects:

1. Innovative Concept of Recursive Stability. A central technical contribution of our work is the extension of the traditional notion of algorithmic stability. We define recursive stability, a crucial factor for controlling error propagation across generations. This novel concept tackles the theoretical challenges posed by non-i.i.d. data and recursive structures within STLs, while also incorporating the influence of model architectures into the generalization error. Moreover, recursive stability serves as a new measure for assessing the stability of generative models within STLs. In Theorem 2, we further establish an upper bound on the recursive stability parameter for transformers under mild conditions, underscoring the broad applicability and robustness of our framework.

2. Establishing the First Generalization Error Bound for STLs. While Bertrand et al. (2024) primarily focused on parameter deviations in generative models and Fu et al. (2024b) concentrated on distribution divergence, our work emphasizes the utility of the generated data produced by STLs. Specifically, by utilizing recursive stability, we present the first generalization error bound that quantifies the gap between the population risk on the initial real data distribution $\mathcal{D}_{0}$ and the empirical risk of the hypothesis $\mathcal{A}(\widetilde{S}_{i})$, generated by applying learning algorithms to the synthetic data produced after multiple generations of STLs. This introduces a new layer of complexity compared to prior work, as it necessitates handling not only the distribution shifts within STLs but also the challenges arising from the non-i.i.d. nature of the mixed datasets, where each generation’s data is influenced by all preceding generations.

3. A More General Framework Accounting for Model Structure. Our proposed theoretical framework is more comprehensive than previous studies. Bertrand et al. (2024) restricted their analysis to simplified likelihood-based generative models, while Fu et al. (2024b) focused specifically on diffusion models. Importantly, neither of their theoretical results accounted for the impact of different model architectures. In contrast, as discussed in Remark 5, our framework explicitly incorporates the effects of varying model structures, thereby extending its applicability to a broader range of generative models. Notably, we are the first to extend the theory of STLs to transformer models, further broadening the scope of our approach across diverse generative model architectures.

4. Comprehensive Collapse Prevention Through Recursive Stability. In addition to the existing theoretical work, which primarily analyzes conditions to avoid model collapse based on the proportion of real data (e.g., Bertrand et al. (2024); Fu et al. (2024b)), our work extends these analyses by considering the impact of model architecture. Specifically, Theorem 1 demonstrates that under a recursive stability condition and a non-negligible constant level of real data, model collapse can be avoided across a variety of model architectures. This analysis offers broader conditions for preventing collapse by incorporating recursive stability, deepening the understanding of how model architecture affects training robustness.

Remark 7.

Proof Sketch of Theorem 1. We first decompose the generalization error of STLs into two distinct terms: (1) the cumulative distribution shift across generations, and (2) the generalization error on the mixed dataset.

Cumulative Distribution Shift: This term measures the shift between the real data distribution $\mathcal{D}_{0}$ and the mixed distribution $\mathcal{D}_{i}$ after the $i$-th generation. Using the TV distance to quantify the shift introduced by the generative model, we bound the difference as:

$$
\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|\leq(1-\alpha)\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))\right|+2(1-\alpha)M\,\mathrm{TV}(\widetilde{\mathcal{D}}_{i-1},\mathcal{D}_{i}).
$$

By leveraging the recursive structure of the generative process, this cumulative distribution shift can be bounded across all generations as:

$$
\left|R_{\mathcal{D}_{0}}(A(S_{i}))-R_{\mathcal{D}_{i}}(A(S_{i}))\right|\leq 2M\left(1-(1-\alpha)^{i}\right)\alpha^{-1}d_{\mathrm{TV}}(n).
$$
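The passage from the one-generation recursion to the closed-form bound can be sketched by unrolling. This is an illustrative derivation under two assumptions that the sketch does not spell out: the per-generation shift satisfies $\mathrm{TV}(\widetilde{\mathcal{D}}_{j-1},\mathcal{D}_{j})\leq d_{\mathrm{TV}}(n)$ for every generation $j$, and the shift at generation $0$ vanishes. The shorthand $e_{j}$, denoting the risk gap between $\mathcal{D}_{0}$ and the generation-$j$ mixed distribution for the fixed hypothesis $\mathcal{A}(\widetilde{S}_{i})$, is introduced here only for this illustration.

```latex
\begin{align*}
e_{j} &\le (1-\alpha)\,e_{j-1} + 2(1-\alpha)M\,d_{\mathrm{TV}}(n), \qquad e_{0} = 0, \\
e_{i} &\le 2(1-\alpha)M\,d_{\mathrm{TV}}(n)\sum_{k=0}^{i-1}(1-\alpha)^{k}
       = 2(1-\alpha)M\,d_{\mathrm{TV}}(n)\,\frac{1-(1-\alpha)^{i}}{\alpha} \\
      &\le 2M\left(1-(1-\alpha)^{i}\right)\alpha^{-1}d_{\mathrm{TV}}(n).
\end{align*}
```

The last step simply drops the factor $(1-\alpha)\leq 1$, which recovers the stated form of the bound.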

Generalization Error on the Mixed Dataset: The second term quantifies the generalization error when training on the mixed dataset $\widetilde{S}_{i}$, which consists of both real and synthetic data. Our goal is to establish a moment bound on the generalization error, which can be decomposed as follows:

$$
\left\|\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p}+\left\|(1-\alpha)R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{i,1-\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p}.
$$

In this context, $S_{0,\alpha}$ represents a proportion $\alpha$ of the $n$ data points in $S_{0}$, leading to a total of $n\times\alpha$ data points. Similarly, $S_{i,1-\alpha}$ denotes a subset of the synthetic dataset $S_{i}$, where $S_{i,1-\alpha}\subseteq S_{i}$ and its size is $(1-\alpha)\times|S_{i}|$. For each term, we leverage the uniform stability $\beta_{n}$ of the learning algorithm $\mathcal{A}$ and the recursive stability $\gamma_{n}^{i}$ of the generative model to address the non-i.i.d. nature of the mixed dataset. The mixed dataset exhibits conditional independence (Zheng et al., 2023), with synthetic data conditioned on the initial real dataset $S_{0}$, allowing the application of recursive techniques to derive the moment bound. Subsequently, Lemma 8 and Lemma 9 are utilized to derive the high-probability bound for the final result.

5 Theoretical Analysis of Transformers in In-Context Learning

In this section, we first present the transformer in in-context learning (ICL) and its settings within STLs in Section 5.1. In Section 5.2, we prove that it satisfies recursive stability, followed by the derivation of the generalization error bound for transformers in ICL in Section 5.3. Finally, in Section 5.4, we explore the scenario of synthetic data augmentation and investigate the associated trade-offs.

5.1 Settings of Transformer in In-context Learning

In-Context Learning Setting. ICL involves a transformer model processing a sequence of input-output examples to perform inference without parameter updates. Unlike traditional supervised learning, where a model is trained on a fixed dataset and then makes predictions, ICL allows the model to adapt on-the-fly to new queries based on the provided examples. We denote a prompt, containing $n$ in-context examples followed by the $(n+1)$-th query input, as $S_{0}=\left(\bm{z}_{1},\bm{z}_{2},\ldots,\bm{z}_{n},\bm{x}_{n+1}\right)$, where $\left(\bm{z}_{i}\right)_{i=1}^{n}=\left(\bm{x}_{i},\bm{y}_{i}\right)_{i=1}^{n}\in\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ represents i.i.d. in-context samples, and $\bm{x}_{n+1}\in\mathcal{X}$ is the query input whose label we want to predict.
The transformer model, denoted as $\mathrm{TF}(\cdot)$, takes the prompt $S_{0}$ as input and outputs the predicted label $\hat{\bm{y}}_{n+1}$ for the query $\bm{x}_{n+1}$: $\hat{\bm{y}}_{n+1}=\mathrm{TF}(S_{0})$.

Recursive Data Generation in STLs with ICL. We extend the traditional ICL setting to an STL, where the transformer recursively generates new data using its own ICL predictions. The initial real dataset $S_{0}$ serves as the initial real in-context examples for the transformer. The process begins by sampling the first-generation queries $\left\{\bm{x}_{1,j}\right\}_{j=1}^{n}$ i.i.d. from the input distribution $\mathcal{X}$. Each query $\bm{x}_{1,j}$ is incorporated into the in-context examples from $S_{0}$ as a new query $\bm{x}_{0,n+1}$, and the transformer predicts the corresponding label $\hat{\bm{y}}_{1,j}$. This produces a synthetic dataset $S_{1}$, consisting of inputs $\left\{\bm{x}_{1,j}\right\}_{j=1}^{n}$ and their predicted labels $\left\{\hat{\bm{y}}_{1,j}\right\}_{j=1}^{n}$.
A mixed dataset $\widetilde{S}_{j}$ is then formed and used as the in-context examples for the next generation. This process continues, with each generation producing a new synthetic dataset $S_{j+1}$ based on the updated mixed dataset $\widetilde{S}_{j}$.
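The loop described above can be sketched in a few lines. This is only an illustration under loudly stated assumptions: `icl_predict` is a hypothetical stand-in for $\mathrm{TF}(\cdot)$ (a softmax-weighted average of the in-context labels, not a trained transformer), labels are scalars, queries are drawn uniformly from $[-1,1]^{d}$ as a placeholder input distribution, and the mixed dataset keeps $\alpha n$ real and $(1-\alpha)n$ synthetic examples chosen at random.

```python
import math
import random


def icl_predict(context, query):
    """Hypothetical stand-in for TF(.): softmax-weighted average of context labels."""
    scores = [sum(a * b for a, b in zip(x, query)) for x, _ in context]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * y for w, (_, y) in zip(weights, context)) / total


def run_stl(real_data, alpha, generations, rng):
    """Return the mixed dataset after `generations` rounds of the loop."""
    n = len(real_data)
    dim = len(real_data[0][0])
    n_real = round(alpha * n)
    mixed = list(real_data)  # the first generation sees only real examples
    for _ in range(generations):
        # Sample fresh queries (assumed uniform input distribution).
        queries = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
        # Label each query with the current in-context examples.
        synthetic = [(q, icl_predict(mixed, q)) for q in queries]
        # Mix alpha * n real points with (1 - alpha) * n synthetic points.
        mixed = rng.sample(real_data, n_real) + rng.sample(synthetic, n - n_real)
    return mixed


rng = random.Random(0)
real = [([rng.uniform(-1, 1), rng.uniform(-1, 1)], rng.uniform(-1, 1)) for _ in range(20)]
print(len(run_stl(real, alpha=0.5, generations=3, rng=rng)))
```

Because the stand-in predictor only averages existing labels, it cannot illustrate collapse or its avoidance quantitatively; it is meant to make the data flow between $S_{0}$, $S_{j}$ and $\widetilde{S}_{j}$ concrete.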

5.2 Recursive Stability of In-Context Learning with Transformers

In this section, we demonstrate that transformers exhibit recursive stability within the ICL framework. Following the ICL setting from Li et al. (2023), we show that the model effectively controls error propagation from perturbations in the initial real dataset, ensuring stability across the STLs.

Theorem 2.

Let $S_{0},S_{0}^{\prime}$ be two initial real datasets that only differ at the inputs $\bm{z}_{j}=\left(\bm{x}_{j},\bm{y}_{j}\right)$ and $\bm{z}_{j}^{\prime}=\left(\bm{x}_{j}^{\prime},\bm{y}_{j}^{\prime}\right)$ where $1\leq j\leq n$. Assume the inputs and labels lie within the unit Euclidean ball in $\mathbb{R}^{d}$. Represent the prompts $S_{0}$ and $S_{0}^{\prime}$ as matrices $\bm{Z}_{0},\bm{Z}_{0}^{\prime}\in\mathbb{R}^{(2n+1)\times d}$. Let $\mathrm{TF}(\cdot)$ be an $L$-layer transformer.
Given $\bm{Z}_{0}$ as the initial input, the $k$-th layer applies MLPs and self-attention, producing the output:

$$
\bm{Z}_{k}=\operatorname{Parallel\_MLPs}\left(\operatorname{ATTN}\left(\bm{Z}_{k-1}\right)\right)\quad\text{where}\quad\operatorname{ATTN}(\bm{Z}):=\operatorname{softmax}\left(\bm{Z}\bm{W}\bm{Z}^{\top}\right)\bm{Z}\bm{V}.
$$

Assume $\mathrm{TF}$ is normalized as $\|\bm{V}\|\leq 1,\|\bm{W}\|\leq B_{W}$ and $\operatorname{MLPs}$ obey $\operatorname{MLP}(\bm{z})=\operatorname{ReLU}(\bm{M}\bm{z})$ with $\|\bm{M}\|\leq 1$. Let $\mathrm{TF}$ output the last token of the final layer $\bm{Z}_{L}$ that corresponds to the query $\bm{x}_{j,n+1}$. Let $n$ represent the sample size of the mixed dataset $\widetilde{S}_{j}$, where $\widetilde{S}_{j}=\alpha S_{0}+(1-\alpha)S_{j}$ for $1\leq j\leq i$. Then, we obtain:

$$
\left\|\operatorname{TF}(\widetilde{S}_{i})-\operatorname{TF}(\widetilde{S}_{i}^{\prime})\right\|_{\ell_{2}}\lesssim(1-\alpha)^{i}\frac{\widetilde{B}_{W}^{(i+1)L}}{2n+1},
$$

where $\widetilde{B}_{W}=\left(1+2B_{W}\right)e^{2B_{W}}$ and $\widetilde{S}_{i}^{\prime}$ denotes the mixed dataset at the $i$-th generation in the STL when the initial real dataset is $S_{0}^{\prime}$. Additionally, if the measure $d$ for the recursive stability parameter in Definition 2 is taken as the $\ell_{2}$ norm, then the recursive stability $\gamma_{n}^{i}\lesssim(1-\alpha)^{i}\frac{\widetilde{B}_{W}^{(i+1)L}}{2n+1}$.
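The $1/(2n+1)$ scaling can be illustrated with a single attention layer. The sketch below is not the proof of Theorem 2 and does not reproduce its constant: it assumes the simplifying choice $\bm{W}=B_{W}\bm{I}$ and $\bm{V}=\bm{I}$ (which satisfies the normalization), omits the MLP, replaces exactly one non-query token, and compares the change of the last output token against the elementary bound $2e^{2B_{W}}/N$ for $N=2n+1$ tokens, which holds in this simplified setting because each softmax weight is at most $e^{2B_{W}}/N$. All function names are hypothetical.

```python
import math
import random


def unit_ball_point(rng, dim):
    """Sample a point with Euclidean norm at most 1."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in direction))
    radius = rng.random()
    return [radius * c / norm for c in direction]


def last_token_attention(tokens, b_w):
    """Last row of softmax(Z W Z^T) Z V for the assumed choice W = b_w * I, V = I."""
    query = tokens[-1]
    scores = [b_w * sum(q * c for q, c in zip(query, z)) for z in tokens]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * z[k] for w, z in zip(weights, tokens)) / total
            for k in range(len(query))]


def perturbation_gap(num_tokens, b_w, rng, dim=4):
    """l2 change of the last output token when one non-query token is replaced."""
    tokens = [unit_ball_point(rng, dim) for _ in range(num_tokens)]
    perturbed = list(tokens)
    perturbed[0] = unit_ball_point(rng, dim)
    out_a = last_token_attention(tokens, b_w)
    out_b = last_token_attention(perturbed, b_w)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(out_a, out_b)))


rng = random.Random(0)
for n in (5, 50, 500):
    num_tokens = 2 * n + 1
    gap = perturbation_gap(num_tokens, b_w=1.0, rng=rng)
    bound = 2.0 * math.exp(2.0) / num_tokens
    print(f"n={n}: gap={gap:.2e}, elementary bound={bound:.2e}")
```

Stacking $L$ layers and iterating over generations is what introduces the factor $\widetilde{B}_{W}^{(i+1)L}$ in the theorem; the sketch only isolates the inverse dependence on the prompt length.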

Remark 8.

Controlling Exponential Growth with Real Data Proportion. In this remark, we further investigate the influence of the proportion of real data $\alpha$ on the recursive stability of transformers. As outlined in Theorem 2, the upper bound of the recursive stability parameter includes a term that grows exponentially with the number of generations $i$ in the STL, specifically $\widetilde{B}_{W}^{(i+1)L}$. However, we show that even a constant proportion of real data, $\alpha$, is sufficient to control this growth.

Specifically, setting $\alpha=\Omega(1-\widetilde{B}_{W}^{-((i+1)L)/i})$, we establish that the recursive stability parameter in Theorem 2 satisfies $\gamma_{n}^{i}\lesssim\frac{1}{2n+1}$. Additionally, as the number of generations $i$ in the STL approaches infinity, this threshold on $\alpha$ converges to $1-\widetilde{B}_{W}^{-L}$. Indeed, with $\alpha=1-\widetilde{B}_{W}^{-L}$, the factor $(1-\alpha)^{i}\widetilde{B}_{W}^{(i+1)L}$ equals $\widetilde{B}_{W}^{L}$ for every $i$, which does not depend on the generation number. Notably, the depth $L$ is typically small in practical settings. For example, studies on LLM performance in STLs, such as Briesch et al. (2023), often employ models with $L=6$. Furthermore, techniques like layer normalization effectively constrain the norm of weights $B_{W}$, ensuring numerical stability during training. Thus, with a constant real data proportion $\alpha$ independent of the STL generation number $i$, the exponential growth term $\widetilde{B}_{W}^{(i+1)L}$ can be effectively controlled, ensuring that $\gamma_{n}^{i}=\mathcal{O}(1/n)$.
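The thresholds in Remark 8 can be evaluated numerically. The following is a minimal sketch, not part of the original analysis: the function names and the sample values $B_{W}=0.1$, $L=6$ in the usage note are illustrative assumptions, and the hidden constant in $\lesssim$ is taken to be 1.

```python
import math


def b_tilde(b_w: float) -> float:
    """B~_W = (1 + 2 B_W) * exp(2 B_W), as defined after Theorem 2."""
    return (1.0 + 2.0 * b_w) * math.exp(2.0 * b_w)


def alpha_threshold(b_w: float, depth: int, generation: int) -> float:
    """Value of alpha for which (1 - alpha)^i * B~_W^{(i+1)L} equals 1."""
    return 1.0 - b_tilde(b_w) ** (-(generation + 1) * depth / generation)


def alpha_limit(b_w: float, depth: int) -> float:
    """Limit of alpha_threshold as the generation index i grows: 1 - B~_W^{-L}."""
    return 1.0 - b_tilde(b_w) ** (-depth)


def stability_growth_factor(alpha: float, b_w: float, depth: int, generation: int) -> float:
    """(1 - alpha)^i * B~_W^{(i+1)L}, computed in log space to avoid overflow."""
    log_value = generation * math.log1p(-alpha)
    log_value += (generation + 1) * depth * math.log(b_tilde(b_w))
    return math.exp(log_value)
```

For instance, `stability_growth_factor(alpha_limit(0.1, 6), 0.1, 6, i)` can be compared across several values of `i` to see that the factor stays flat, while a smaller `alpha` makes it grow with `i`.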

5.3 Generalization Bound for Transformers in In-Context Learning

In this section, we investigate the behavior of transformers under the ICL framework in STLs. We select SGD as the learning algorithm $\mathcal{A}$ and consider a binary task with $\mathcal{Y}=\{0,1\}$. Applying our general theoretical framework from Theorem 1, we derive the generalization error bound by addressing the terms $\beta_{n}$ and $d_{\mathrm{TV}}(n)$ using recent results on SGD (Zhang et al., 2022) and ICL (Zhang et al., 2023). The recursive stability parameter $\gamma_{n}^{i}$ is obtained from Theorem 2. We assume that the loss function $\ell(\cdot;z)$ is $\kappa$-smooth and $\rho$-Lipschitz, which are standard assumptions in related works (Hardt et al., 2016; Lei & Ying, 2020), with formal definitions provided in Appendix A.1. Examples include logistic and Huber losses. We now present the generalization error bound:

Theorem 3.

Consider an $L$-layer transformer under the setting described in Theorem 2. Let $n$ represent the sample size of the mixed dataset $\widetilde{S}_{j}$, where $\widetilde{S}_{j}=\alpha S_{0}+(1-\alpha)S_{j}$ for $1\leq j\leq i$. Suppose that the loss function $\ell(\cdot;\bm{z})$ is $\kappa$-smooth, $\rho$-Lipschitz and bounded by $M>0$ for every $\bm{z}$. Let $\mathcal{A}(\widetilde{S}_{i})$ denote the output after running SGD for $T\gtrsim n$ iterations with a step size $\eta_{t}=\mathcal{O}(\frac{1}{\kappa t})$ on the mixed dataset $\widetilde{S}_{i}$. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following holds:

$$\begin{aligned}
\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| &\lesssim n^{-1/2}\log(n)M\rho^{2}\alpha\sqrt{1-\alpha}\log\frac{1}{\delta} \\
&\quad+n^{-1}\log^{2}(n)\rho^{2}((1-\alpha)\widetilde{B}_{W}^{L})^{i}\alpha\log(\frac{1}{\delta})+n^{-1/4}\alpha^{-1}M\left(1-(1-\alpha)^{i}\right)\log(\frac{1}{\delta}).
\end{aligned}\tag{2}$$
Remark 9.

In this remark, we provide a detailed explanation of the theoretical results of Theorem 3. As discussed earlier in Remark 8, $\alpha$ is set to $1-\widetilde{B}_{W}^{-L}$. To enhance clarity and focus on the primary results, we omit constant terms and the $\log(1/\delta)$ factor. Consequently, the bound in Theorem 3 can be expressed as follows:

$$\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|\lesssim n^{-1/2}\log(n)+n^{-1}\log^{2}(n)+n^{-1/4}.$$

In this bound, the terms $n^{-1/2}\log(n)+n^{-1}\log^{2}(n)$ correspond to the generalization error on the mixed dataset, while the term $n^{-1/4}$ represents the cumulative distribution shift across generations, which is primarily governed by the learnability of the generative model.

It is evident from this result that the generative model’s capacity plays a crucial role in the performance within the STLs. The ability of the generative model to maintain distributional fidelity over multiple generations directly impacts the generalization error and determines how well the model can control the propagation of errors across generations.
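As a sketch of the substitution behind Remark 9, each term of Theorem 3 can be checked separately once $\alpha=1-\widetilde{B}_{W}^{-L}$. Treating $M$, $\rho$, $\widetilde{B}_{W}$ and $L$ as constants is our reading of "constant terms" in the remark, and the derivation assumes $\widetilde{B}_{W}>1$ so that $\alpha\in(0,1)$:

```latex
\begin{align*}
(1-\alpha)\widetilde{B}_{W}^{L}
  &= \widetilde{B}_{W}^{-L}\,\widetilde{B}_{W}^{L} = 1
  &&\Longrightarrow\quad \big((1-\alpha)\widetilde{B}_{W}^{L}\big)^{i} = 1 \ \text{for every } i, \\
\alpha\sqrt{1-\alpha} &\le 1, \qquad \alpha \le 1
  &&\Longrightarrow\quad \text{first two terms} \lesssim n^{-1/2}\log(n) + n^{-1}\log^{2}(n), \\
1-(1-\alpha)^{i} &\le 1, \qquad \alpha^{-1} = \big(1-\widetilde{B}_{W}^{-L}\big)^{-1}
  &&\Longrightarrow\quad \text{third term} \lesssim n^{-1/4}.
\end{align*}
```

None of the three reduced terms depends on the generation index $i$, and for large $n$ the $n^{-1/4}$ term is the slowest to decay.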

5.4 Synthetic Data Augmentation

The previous theorem addresses the scenario where the training dataset is unintentionally contaminated by synthetic data, leading to STLs. In contrast, many researchers intentionally incorporate synthetic data to augment the real dataset, also creating STLs. Next, we explore this synthetic data augmentation scenario, where each generation's synthetic data is added to the mixed dataset, i.e., $\widetilde{S}_{i}=\sum_{j=0}^{i}S_{j}$.
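The bookkeeping for this accumulation scheme can be sketched as follows. This is an illustrative helper, not part of the original analysis; it assumes, as in Theorem 4, that the real dataset has $n$ points and that each generation contributes $\lambda n$ synthetic points (rounded to an integer here).

```python
def mixed_dataset_size(n_real: int, lam: float, generation: int) -> int:
    """Size of S~_i = S_0 + S_1 + ... + S_i when each S_j (j >= 1) has lam * n_real points."""
    synthetic_per_generation = round(lam * n_real)
    return n_real + generation * synthetic_per_generation


def real_data_fraction(n_real: int, lam: float, generation: int) -> float:
    """Share of real points in S~_i; equals 1 / (1 + i * lam) when lam * n_real is an integer."""
    return n_real / mixed_dataset_size(n_real, lam, generation)
```

Unlike the fixed-proportion setting of Theorem 3, where the real share stays at $\alpha$, the real share here shrinks with every generation, which is why the factor $(1+i\lambda)$ appears throughout the bound that follows.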

Theorem 4.

Consider an $L$-layer transformer under the setting described in Theorem 2. Let $n$ and $\lambda n$ represent the sample size of the real dataset $S_{0}$ and the synthetic dataset $S_{j}$, respectively, where $1\leq j\leq i$. The mixed dataset $\widetilde{S}_{i}$ is denoted as $\sum_{j=0}^{i}S_{j}$. Suppose that the loss function $\ell(\cdot;\bm{z})$ is $\kappa$-smooth, $\rho$-Lipschitz and bounded by $M>0$ for every $\bm{z}$. Let $\mathcal{A}(\widetilde{S}_{i})$ denote the output after running SGD for $T\gtrsim n$ iterations with a step size $\eta_{t}=\mathcal{O}(\frac{1}{\kappa t})$ on the mixed dataset $\widetilde{S}_{i}$. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following holds:

$$\begin{aligned}
\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| &\lesssim n^{-\frac{1}{4}}\log((1+i\lambda)n)M\log\frac{1}{\delta} \\
&\quad+n^{-1}\frac{\rho^{2}}{(1+i\lambda)^{2}}\log((1+i\lambda)n)\,i!\,\widetilde{B}_{W}^{(i+1)L}\log\frac{1}{\delta}+n^{-\frac{1}{2}}\frac{Mi}{1+i\lambda}\sqrt{\log\frac{1}{\delta}}.
\end{aligned}$$
Remark 10.

Analyzing the Trade-off in Synthetic Data Augmentation for STLs. In this remark, we examine the trade-off between generalization and distribution shifts from increased synthetic data, providing insights into optimal synthetic data size. At each generation, $\lambda n$ synthetic data points are added to the mixed dataset. We analyze how the coefficient $\lambda$, representing the scale of synthetic data augmentation, affects the generalization error in STLs. From the bound in Theorem 4, we observe that the Cumulative Distribution Shift Across Generations term is expressed as:

$$n^{-\frac{1}{4}}\log((1+i\lambda)n)M\log(1/\delta).$$

As the coefficient $\lambda$ increases, the cumulative distribution shift correspondingly grows, thereby amplifying the associated error. This behavior aligns with intuition, as an increase in $\lambda$ reduces the proportion of real data within the mixed dataset at each generation. Consequently, this reduction in real data leads to a greater divergence between the mixed distribution and the true underlying distribution, exacerbating the deviation and compounding the error across successive generations. In contrast, for the Generalization Error on Mixed Distributions term:

$$n^{-1}\frac{\rho^{2}}{(1+i\lambda)^{2}}\log((1+i\lambda)n)\,i!\,\widetilde{B}_{W}^{(i+1)L}\log\frac{1}{\delta}+n^{-\frac{1}{2}}\frac{Mi}{1+i\lambda}\sqrt{\log\frac{1}{\delta}}.$$

We observe that as $\lambda$ increases, the corresponding error decreases. This outcome is consistent with theoretical intuition, as augmenting the dataset with synthetic data effectively enlarges the mixed dataset. A larger dataset provides a more comprehensive representation of the mixed distribution, which in turn reduces the generalization error associated with this distribution. By incorporating more synthetic data, the mixed dataset better approximates the underlying mixed distribution, leading to improved generalization performance.
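The opposite monotonic behavior of the two terms can be checked numerically. The sketch below evaluates both expressions from Theorem 4 with the constant hidden in $\lesssim$ set to 1. The function names and all default values ($M$, $\rho$, $B_{W}$, $L$, $\delta$) are illustrative assumptions, not values from the analysis.

```python
import math


def b_tilde(b_w: float) -> float:
    """B~_W = (1 + 2 B_W) * exp(2 B_W), as defined after Theorem 2."""
    return (1.0 + 2.0 * b_w) * math.exp(2.0 * b_w)


def shift_term(lam, n, generation, M=1.0, delta=0.05):
    """Cumulative distribution shift term: n^{-1/4} log((1 + i*lam) n) M log(1/delta)."""
    return n ** -0.25 * math.log((1.0 + generation * lam) * n) * M * math.log(1.0 / delta)


def mixed_generalization_term(lam, n, generation, rho=1.0, M=1.0,
                              b_w=0.1, depth=2, delta=0.05):
    """Generalization error on the mixed distribution (second and third terms of Theorem 4)."""
    x = 1.0 + generation * lam
    log_inv_delta = math.log(1.0 / delta)
    growth = math.factorial(generation) * b_tilde(b_w) ** ((generation + 1) * depth)
    first = rho ** 2 / (n * x ** 2) * math.log(x * n) * growth * log_inv_delta
    second = n ** -0.5 * M * generation / x * math.sqrt(log_inv_delta)
    return first + second
```

Evaluating both functions on an increasing grid of `lam` values for fixed `n` and `generation` shows `shift_term` rising and `mixed_generalization_term` falling, which is the trade-off discussed in this remark.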

From the above discussion, we can conclude that the inclusion of synthetic data introduces a trade-off: on one hand, it increases the error from the cumulative distribution shift, while on the other, it reduces the generalization error on the mixed distribution. This trade-off has been explored theoretically in Fu et al. (2024b), though they primarily provided theoretical intuition. In contrast, our work explicitly decomposes the error into two terms, offering a deeper understanding of this trade-off and its implications for model performance in STLs. As for the optimal augmentation coefficient $\lambda^{*}$, it must satisfy the following condition:

$$\begin{aligned}
\lambda^{*}=\inf_{\lambda}\Big\{ & n^{-\frac{1}{4}}\log((1+i\lambda)n)M\log(1/\delta) \\
& \lesssim n^{-1}\frac{\rho^{2}}{(1+i\lambda)^{2}}\log((1+i\lambda)n)\,i!\,\widetilde{B}_{W}^{(i+1)L}\log\frac{1}{\delta}+n^{-\frac{1}{2}}\frac{Mi}{1+i\lambda}\sqrt{\log\frac{1}{\delta}}\Big\}.
\end{aligned}$$

Unfortunately, obtaining a closed-form solution for $\lambda^{*}$ from this condition is analytically intractable. However, we can derive the relationship between $\lambda^{*}$ and the size of the real dataset $n$ from it. Specifically, by omitting irrelevant constants and the $\log(1/\delta)$ term, we obtain that $\lambda^{*}$ should satisfy the following expression:

$$\frac{i!\,\widetilde{B}_{W}^{(i+1)L}}{n^{3/4}(1+i\lambda^{*})^{2}}+\frac{i}{n^{1/4}(1+i\lambda^{*})\log((1+i\lambda^{*})n)}=\mathcal{O}(1).$$

We observe an important trend: the value of $\lambda^{*}$ increases as the size of the real dataset $n$ decreases. This aligns with theoretical intuition, as a smaller real dataset struggles to adequately represent the underlying distribution, leading to higher generalization error. Consequently, more synthetic data is required to control the generalization error of each generation on the mixed distribution. Conversely, when the real dataset is sufficiently large, the need for synthetic data augmentation diminishes.
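Although no closed form is available, the balance expression can be solved numerically. The sketch below is a hypothetical illustration: it replaces the unspecified $\mathcal{O}(1)$ constant by a user-chosen `target` (default 1) and finds, by bisection, the $\lambda$ at which the left-hand side equals that target. Because the true constant is unknown, only the trend in $n$ is meaningful, not the numerical value of the root.

```python
import math


def b_tilde(b_w: float) -> float:
    """B~_W = (1 + 2 B_W) * exp(2 B_W), as defined after Theorem 2."""
    return (1.0 + 2.0 * b_w) * math.exp(2.0 * b_w)


def balance_ratio(lam, n, generation, b_w, depth):
    """Left-hand side of the balance expression for lambda* (constants dropped)."""
    x = 1.0 + generation * lam
    growth = math.factorial(generation) * b_tilde(b_w) ** ((generation + 1) * depth)
    return growth / (n ** 0.75 * x ** 2) + generation / (n ** 0.25 * x * math.log(x * n))


def balancing_lambda(n, generation, b_w, depth, target=1.0, tol=1e-10):
    """Smallest lam >= 0 with balance_ratio <= target; the ratio is decreasing in lam."""
    if balance_ratio(0.0, n, generation, b_w, depth) <= target:
        return 0.0  # already balanced without any synthetic data
    lo, hi = 0.0, 1.0
    while balance_ratio(hi, n, generation, b_w, depth) > target:
        hi *= 2.0  # expand the bracket until the ratio drops below the target
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if balance_ratio(mid, n, generation, b_w, depth) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Calling `balancing_lambda` for a decreasing sequence of `n` with the other arguments fixed returns a non-decreasing sequence of values, matching the trend described above; for sufficiently large `n` it returns `0.0`.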

6 Conclusion

As real-world data becomes increasingly scarce and existing datasets are progressively contaminated with synthetic content, STLs have emerged as a necessary strategy. STLs enable generative models to recursively train on a mix of real and synthetic data. However, empirical outcomes have varied significantly, revealing the need for a theoretical foundation to guide their successful application.

In this work, we introduced recursive stability as a key technical innovation and established the first generalization error bounds for STLs, which consider the impact of different model architectures. Our analysis demonstrated that preventing model collapse requires two critical conditions: maintaining a non-negligible proportion of real data and ensuring that models satisfy recursive stability. Furthermore, we were the first to extend this framework to transformers in in-context learning, showing that they also satisfy recursive stability and establishing their generalization error bounds. Finally, we explored the trade-off introduced by synthetic data augmentation, balancing generalization improvement with potential distributional shifts. These contributions provide new insights into enhancing the stability and performance of generative models in STLs.

Acknowledgement

This project is supported by the National Research Foundation, Singapore, under its NRF Professorship Award No. NRF-P2024-001.

References

  • Alemohammad et al. (2024a) Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard Baraniuk. Self-consuming generative models go mad. In The Twelfth International Conference on Learning Representations, 2024a.
  • Alemohammad et al. (2024b) Sina Alemohammad, Ahmed Imtiaz Humayun, Shruti Agarwal, John Collomosse, and Richard Baraniuk. Self-improving diffusion models with synthetic data. arXiv preprint arXiv:2408.16333, 2024b.
  • Bassily et al. (2020) Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33, 2020.
  • Bertrand et al. (2024) Quentin Bertrand, Joey Bose, Alexandre Duplessis, Marco Jiralerspong, and Gauthier Gidel. On the stability of iterative retraining of generative models on their own data. In The Twelfth International Conference on Learning Representations, 2024.
  • Bousquet & Elisseeff (2002) Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
  • Bousquet et al. (2020) Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pp.  610–626, 2020.
  • Briesch et al. (2023) Martin Briesch, Dominik Sobania, and Franz Rothlauf. Large language models suffer from their own output: An analysis of the self-consuming training loop. arXiv preprint arXiv:2311.16822, 2023.
  • Charles & Papailiopoulos (2018) Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning, pp.  744–753, 2018.
  • Dohmatob et al. (2024a) Elvis Dohmatob, Yunzhen Feng, and Julia Kempe. Model collapse demystified: The case of regression. arXiv preprint arXiv:2402.07712, 2024a.
  • Dohmatob et al. (2024b) Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian, and Julia Kempe. Strong model collapse. arXiv preprint arXiv:2410.04840, 2024b.
  • Dohmatob et al. (2024c) Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, and Julia Kempe. A tale of tails: Model collapse as a change of scaling laws. In Forty-first International Conference on Machine Learning, 2024c.
  • Doukhan (1994) P. Doukhan. Mixing: Properties and examples. Lecture notes in statistics. New York: Springer, 1994.
  • Farnia & Ozdaglar (2021) Farzan Farnia and Asuman Ozdaglar. Train simultaneously, generalize better: Stability of gradient-based minimax learners. In International Conference on Machine Learning, pp.  3174–3185. PMLR, 2021.
  • Feldman & Vondrak (2019) Vitaly Feldman and Jan Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory, pp.  1270–1279, 2019.
  • Feng et al. (2024a) Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, and Julia Kempe. Beyond model collapse: Scaling up with synthesized data requires reinforcement. In ICML 2024 Workshop on Theoretical Foundations of Foundation Models, 2024a.
  • Feng et al. (2024b) Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, Julia Kempe, and FAIR Meta. Beyond model collapse: Scaling up with synthesized data requires verification. arXiv preprint arXiv:2406.07515, 2024b.
  • Ferbach et al. (2024) Damien Ferbach, Quentin Bertrand, Avishek Joey Bose, and Gauthier Gidel. Self-consuming generative models with curated data provably optimize human preferences. arXiv preprint arXiv:2407.09499, 2024.
  • Fu et al. (2023) Shi Fu, Yunwen Lei, Qiong Cao, Xinmei Tian, and Dacheng Tao. Sharper bounds for uniformly stable algorithms with stationary mixing process. In The Eleventh International Conference on Learning Representations, 2023.
  • Fu et al. (2024a) Shi Fu, Yuzhu Chen, Yingjie Wang, and Dacheng Tao. On championing foundation models: From explainability to interpretability. arXiv preprint arXiv:2410.11444, 2024a.
  • Fu et al. (2024b) Shi Fu, Sen Zhang, Yingjie Wang, Xinmei Tian, and Dacheng Tao. Towards theoretical understandings of self-consuming generative models. In Forty-first International Conference on Machine Learning, 2024b.
  • Gerstgrasser et al. (2024) Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Tomasz Korbak, Henry Sleight, Rajashree Agrawal, John Hughes, Dhruv Bhandarkar Pai, Andrey Gromov, Dan Roberts, Diyi Yang, David L. Donoho, and Sanmi Koyejo. Is model collapse inevitable? breaking the curse of recursion by accumulating real and synthetic data. In First Conference on Language Modeling, 2024.
  • Gillman et al. (2024) Nate Gillman, Michael Freeman, Daksh Aggarwal, HSU Chia-Hong, Calvin Luo, Yonglong Tian, and Chen Sun. Self-correcting self-consuming loops for generative model training. In Forty-first International Conference on Machine Learning, 2024.
  • Hardt et al. (2016) Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pp.  1225–1234, 2016.
  • He et al. (2016) Fangchao He, Ling Zuo, and Hong Chen. Stability analysis for ranking with stationary $\varphi$-mixing samples. Neurocomputing, 171:1556–1562, 2016.
  • Hittmeir et al. (2019) Markus Hittmeir, Andreas Ekelhart, and Rudolf Mayer. On the utility of synthetic data: An empirical evaluation on machine learning tasks. In Proceedings of the 14th International Conference on Availability, Reliability and Security, pp.  1–6, 2019.
  • Huang et al. (2022) Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
  • Kanabar & Gastpar (2025) Millen Kanabar and Michael Gastpar. Minimax discrete distribution estimation with self-consumption. arXiv preprint arXiv:2501.19273, 2025.
  • Klochkov & Zhivotovskiy (2021) Yegor Klochkov and Nikita Zhivotovskiy. Stability and deviation optimal risk bounds with convergence rate O(1/n). In Advances in Neural Information Processing Systems, 2021.
  • Lei (2023) Yunwen Lei. Stability and generalization of stochastic optimization with nonconvex and nonsmooth problems. In The Thirty Sixth Annual Conference on Learning Theory, pp.  191–227. PMLR, 2023.
  • Lei & Ying (2020) Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pp.  5809–5819, 2020.
  • Li & Liu (2022) Shaojie Li and Yong Liu. High probability generalization bounds with fast rates for minimax problems. In International Conference on Learning Representations, 2022.
  • Li et al. (2023) Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning. In International Conference on Machine Learning, pp.  19565–19594. PMLR, 2023.
  • Liang (2021) Tengyuan Liang. How well generative adversarial networks learn distributions. Journal of Machine Learning Research, 22(228):1–41, 2021.
  • Marchi et al. (2024) Matteo Marchi, Stefano Soatto, Pratik Chaudhari, and Paulo Tabuada. Heat death of generative models in closed-loop learning. arXiv preprint arXiv:2404.02325, 2024.
  • Martínez et al. (2023) Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar. Towards understanding the interplay of generative artificial intelligence and the internet. In International Workshop on Epistemic Uncertainty in Artificial Intelligence, pp.  59–73. Springer, 2023.
  • Mohri & Rostamizadeh (2010) Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary $\varphi$-mixing and $\beta$-mixing processes. Journal of Machine Learning Research, 11(2), 2010.
  • Sadasivan et al. (2023) Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can ai-generated text be reliably detected? arXiv preprint arXiv:2303.11156, 2023.
  • Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • Seddik et al. (2024) Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, and Merouane Debbah. How bad is training on synthetic data? a statistical analysis of language model collapse. arXiv preprint arXiv:2404.05090, 2024.
  • Shalev-Shwartz et al. (2010) Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635–2670, 2010.
  • Shumailov et al. (2024) Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. Ai models collapse when trained on recursively generated data. Nature, 631(8022):755–759, 2024.
  • Tao et al. (2024) Zhengwei Tao, Ting-En Lin, Xiancai Chen, Hangyu Li, Yuchuan Wu, Yongbin Li, Zhi Jin, Fei Huang, Dacheng Tao, and Jingren Zhou. A survey on self-evolution of large language models. arXiv preprint arXiv:2404.14387, 2024.
  • Villalobos et al. (2022) Pablo Villalobos, Jaime Sevilla, Lennart Heim, Tamay Besiroglu, Marius Hobbhahn, and Anson Ho. Will we run out of data? an analysis of the limits of scaling datasets in machine learning. arXiv preprint arXiv:2211.04325, 2022.
  • Wang et al. (2024) Peng Wang, Li Shen, Zerui Tao, Shuaida He, and Dacheng Tao. Generalization analysis of stochastic weight averaging with general sampling. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=XwVkqvyziD.
  • Wyllie et al. (2024) Sierra Wyllie, Ilia Shumailov, and Nicolas Papernot. Fairness feedback loops: training on synthetic data amplifies bias. In The 2024 ACM Conference on Fairness, Accountability, and Transparency, pp.  2113–2147, 2024.
  • Xing et al. (2025) Xiaodan Xing, Fadong Shi, Jiahao Huang, Yinzhe Wu, Yang Nan, Sheng Zhang, Yingying Fang, Michael Roberts, Carola-Bibiane Schönlieb, Javier Del Ser, et al. On the caveats of ai autophagy. Nature Machine Intelligence, pp.  1–9, 2025.
  • Xu et al. (2023) Shirong Xu, Will Wei Sun, and Guang Cheng. Utility theory of synthetic data generation. arXiv preprint arXiv:2305.10015, 2023.
  • Yang (2022) Hongkang Yang. A mathematical framework for learning probability distributions. arXiv preprint arXiv:2212.11481, 2022.
  • Yu (1994) Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pp.  94–116, 1994.
  • Zhang et al. (2022) Yikai Zhang, Wenjia Zhang, Sammy Bald, Vamsi Pingali, Chao Chen, and Mayank Goswami. Stability of sgd: Tightness analysis and improved bounds. In Uncertainty in artificial intelligence, pp.  2364–2373. PMLR, 2022.
  • Zhang et al. (2023) Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization. arXiv preprint arXiv:2305.19420, 2023.
  • Zheng et al. (2023) Chenyu Zheng, Guoqiang Wu, and Chongxuan Li. Toward understanding generative data augmentation. Advances in Neural Information Processing Systems, 36:54046–54060, 2023.
  • Zhu et al. (2024) Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, and Bowen Zhou. How to synthesize text data without model collapse? arXiv preprint arXiv:2412.14689, 2024.

Appendix A Appendix

A.1 Auxiliary Definitions

Below, we present some essential definitions.

Definition 3.

(Lipschitz and Smoothness). Let constants $\kappa,\rho>0$. Consider the function $\ell:\mathcal{W}\times\mathcal{Z}\rightarrow\mathbb{R}$. We define the following properties:

  • Lipschitz Continuity: The loss $\ell$ is said to be $\rho$-Lipschitz continuous if $\|\ell(\bm{w}_{1},\bm{z})-\ell(\bm{w}_{2},\bm{z})\|\leq\rho\|\bm{w}_{1}-\bm{w}_{2}\|$ for any $\bm{w}_{1},\bm{w}_{2},\bm{z}$.

  • Smoothness: The loss $\ell$ is said to be $\kappa$-smooth if $\|\nabla_{\bm{w}}\ell(\bm{w}_{1},\bm{z})-\nabla_{\bm{w}}\ell(\bm{w}_{2},\bm{z})\|\leq\kappa\|\bm{w}_{1}-\bm{w}_{2}\|$ for any $\bm{w}_{1},\bm{w}_{2},\bm{z}$.

A.2 Expansion to Gaussian Mixture Models

We adopt the setup from prior work (Zheng et al., 2023) and consider a binary classification task where $Y=\{-1,1\}$. Given a vector $\mu\in\mathbb{R}^{d}$ with $\|\mu\|_{2}=1$ and noise variance $\sigma^{2}>0$, the data distribution is specified as follows: $y\sim\text{uniform}\{-1,1\}$ and $x\mid y\sim\mathcal{N}(y\mu,\sigma^{2}I_{d})$. We define the conditional generative model using parameters $\mu_{y}$ and $\sigma_{k}^{2}$, where $y\in\{-1,1\}$ and $k\in[d]$. For $n$ data points, let $n_{y}$ represent the number of samples in class $y$. The parameters of the Gaussian mixture model are then learned as:

$$\hat{\mu}_{y}=\frac{\sum_{y_{i}=y}x_{i}}{n_{y}},\quad\hat{\sigma}_{k}^{2}=\sum_{y}\frac{n_{y}}{n}\frac{\sum_{y_{i}=y}(x_{ik}-\hat{\mu}_{yk})^{2}}{n_{y}-1}.$$

Then we can generate new samples from the distribution: $y\sim\text{uniform}\{-1,1\}$ and $x\mid y\sim\mathcal{N}(\hat{\mu}_{y},\Sigma)$, where $\Sigma=\text{diag}(\sigma_{1}^{2},\dots,\sigma_{d}^{2})$. Additionally, the learning algorithm functions as a linear classifier, parameterized by $\theta\in\mathbb{R}^{d}$, with predictions given by $\hat{y}=\text{sign}(\theta^{\top}\mathbf{x})$. The loss function is defined as:

$$\ell(\theta,(x,y))=\frac{1}{2\sigma^{2}}(x-y\theta)^{\top}(x-y\theta).$$

Thus, the output is $\hat{\theta}=\frac{1}{m}\sum_{i=1}^{m}y_{i}x_{i}$.
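The sampling, estimation, and classification steps above can be sketched in a few lines of standard-library Python. This is only an illustrative simulation under stated assumptions, not the authors' experimental code: the function names (`sample`, `fit`, `classifier`, `run_loop`), the default values, and the mixing rule (reuse the first $\lfloor\alpha n\rfloor$ real points and draw the rest from the current model) are hypothetical, and the classifier is trained on the full mixed set rather than on $m=\mathcal{O}(\sqrt{n})$ points.

```python
import random

def sample(n, means, stds, rng):
    """Draw n labelled points: y uniform on {-1, 1}, x | y ~ N(means[y], diag(stds^2))."""
    data = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        x = [rng.gauss(m, s) for m, s in zip(means[y], stds)]
        data.append((x, y))
    return data

def fit(data, d):
    """Class means and pooled per-coordinate variances, following the displayed estimator."""
    n = len(data)
    means, var = {}, [0.0] * d
    for label in (-1, 1):
        xs = [x for x, y in data if y == label]
        n_y = len(xs)
        means[label] = [sum(x[k] for x in xs) / n_y for k in range(d)]
        for k in range(d):
            ss = sum((x[k] - means[label][k]) ** 2 for x in xs)
            var[k] += (n_y / n) * ss / (n_y - 1)
    return means, var

def classifier(data, d):
    """theta_hat = (1/m) * sum_i y_i x_i, the minimizer of the quadratic loss."""
    m = len(data)
    return [sum(y * x[k] for x, y in data) / m for k in range(d)]

def run_loop(n=2000, d=3, alpha=0.1, generations=5, sigma=1.0, seed=0):
    """Assumed mixing rule: alpha*n real points plus (1-alpha)*n synthetic points per generation."""
    rng = random.Random(seed)
    mu = [d ** -0.5] * d  # unit-norm mean vector
    real = sample(n, {1: mu, -1: [-v for v in mu]}, [sigma] * d, rng)
    means, var = fit(real, d)  # generation-0 generative model, fitted on real data
    n_real = int(alpha * n)
    mixed = real
    for _ in range(generations):
        synth = sample(n - n_real, means, [v ** 0.5 for v in var], rng)
        mixed = real[:n_real] + synth
        means, var = fit(mixed, d)  # refit the generative model on the mixed set
    return classifier(mixed, d), mu

theta_hat, mu_true = run_loop()
print("estimated theta:", [round(t, 3) for t in theta_hat])
print("true mu:        ", [round(m, 3) for m in mu_true])
```

Setting `alpha=0.0` in `run_loop` gives the fully synthetic loop discussed after Theorem 5, which makes it easy to compare the two regimes empirically.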

In this setting, we demonstrate recursive stability for the Gaussian mixture model as follows:

Theorem 5.

Let $S_{0},S_{0}^{\prime}$ denote two initial real datasets differing by a single example. Let $n$ represent the sample size of the mixed dataset $\tilde{S}_{j}$, where $\tilde{S}_{j}=\alpha S_{0}+(1-\alpha)S_{j}$ for $1\leq j\leq i$. Choose $m=\mathcal{O}(\sqrt{n})$. Consider the previously described sampling and learning steps, where real data samples are drawn from the Gaussian Mixture Model distribution $\mathcal{D}$, and the synthetic data for the $i$-th generation is generated from the learned Gaussian Mixture distribution of the $i$-th generation. Then with probability at least $1-\delta$, we have:

$$\gamma_{n}^{i}\lesssim n^{-1/2}\alpha^{-1}(1-(1-\alpha)^{i})\log(nd/\delta), \tag{3}$$

where the measure for the recursive stability parameter is taken as the KL divergence.

As $\alpha$ approaches 0, indicating that no real data is incorporated during each generation of training, we observe

$$\gamma_{n}^{i}\lesssim in^{-1/2}\log\frac{nd}{\delta},$$

which suggests a linear accumulation of errors. This finding aligns closely with the theoretical insights presented in Shumailov et al. (2024); Alemohammad et al. (2024a), where a Gaussian model trained without real data demonstrated a linear divergence in variance. Thus, this underscores the validity of our theoretical results, confirming that the derived bound is meaningful and not vacuous.
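The limiting behaviour of the factor $\alpha^{-1}(1-(1-\alpha)^{i})$ is not derived explicitly in the text; one way to see it (a short sketch using only the geometric-sum identity) is:

$$\frac{1-(1-\alpha)^{i}}{\alpha}=\sum_{j=0}^{i-1}(1-\alpha)^{j}\;\xrightarrow[\alpha\to 0]{}\;i,\qquad\text{and for }\alpha\in(0,1]:\quad\frac{1-(1-\alpha)^{i}}{\alpha}\leq\min\left\{i,\frac{1}{\alpha}\right\}.$$

Each of the $i$ summands is at most 1, which gives the bound $i$, while dropping the nonnegative term $(1-\alpha)^{i}$ from the numerator gives the bound $1/\alpha$.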

Moreover, by leveraging the generalization error bound established in Theorem 1, we derive the following:

Theorem 6.

Consider the Gaussian Mixture Model in the setting outlined above. Let $n$ represent the sample size of the mixed dataset $\tilde{S}_{j}$, where $\tilde{S}_{j}=\alpha S_{0}+(1-\alpha)S_{j}$ for $1\leq j\leq i$. Suppose the loss function is defined as $\ell(\theta,(\mathbf{x},y))=\frac{1}{2\sigma^{2}}(\mathbf{x}-y\theta)^{\top}(\mathbf{x}-y\theta)$. Let $\mathcal{A}(\tilde{S}_{i})$ denote the output of applying the linear classifier described above to the mixed dataset $\tilde{S}_{i}$. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following holds:

$$\begin{aligned}\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|&\lesssim n^{-1/2}(d+\log(n/\delta))\log n\log(1/\delta)\\&\quad+n^{-1/4}(1-(1-\alpha)^{i})\alpha^{-1}(d+\log(n/\delta))\sqrt{d\log(nd/\delta)}.\end{aligned}\tag{4}$$

We observe that when $\alpha$ is set to a constant (e.g., $\alpha=0.1$), the generalization error can be effectively controlled, preventing model collapse. This result aligns with the experimental findings in Alemohammad et al. (2024a) for Gaussian models.
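To make the effect of a constant $\alpha$ concrete, the factor $(1-(1-\alpha)^{i})\alpha^{-1}$ that multiplies the second term of (4) can be tabulated across generations. The helper name `shift_factor` and the particular values of $\alpha$ and $i$ below are illustrative assumptions; the $\alpha=0$ branch uses the limiting value $i$.

```python
def shift_factor(alpha, i):
    """Factor (1 - (1 - alpha)**i) / alpha from bounds (3) and (4); equals i in the limit alpha -> 0."""
    if alpha == 0:
        return float(i)
    return (1.0 - (1.0 - alpha) ** i) / alpha

# Compare the fully synthetic loop (alpha = 0) with constant real-data fractions.
for alpha in (0.0, 0.1, 0.5):
    row = [round(shift_factor(alpha, i), 3) for i in (1, 5, 10, 50, 100)]
    print(f"alpha={alpha}: {row}")
```

For $\alpha=0$ the factor keeps growing with $i$, whereas for any fixed $\alpha>0$ it stays below $1/\alpha$ no matter how many generations are run.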

A.3 Additional Comparison with Related Work on Theorem 1

Dohmatob et al. (2024a) examined a linear regression setting, focusing solely on statistical approximation error without addressing the functional approximation error described in Shumailov et al. (2024). They did not consider incorporating real data to prevent collapse and demonstrated a linear dependency of degradation on the generation number in the case of fully synthetic data. Similarly, Alemohammad et al. (2024a) and Shumailov et al. (2024) provided theoretical insights using simple Gaussian models without incorporating real data, proving that the variance diverges linearly with the generation number. Seddik et al. (2024) explored a linear softmax classifier and, while also neglecting functional approximation error, demonstrated that adding real data can mitigate model collapse. Marchi et al. (2024) used asymptotic analysis to study parameter variance, assuming an infinite number of training generations and considering scenarios where the generative model is controlled via a “temperature” parameter. They proved that parameter variance is bounded under these conditions.

In contrast, our work addresses a much more complex and realistic scenario by introducing the novel concept of recursive stability and providing the first generalization analysis for STLs. Our analysis accounts for statistical approximation error, functional approximation error, and optimization error during the training of generative models. Unlike the settings explored in prior theoretical works, such as linear regression (Dohmatob et al., 2024a; Gerstgrasser et al., 2024), Gaussian models (Alemohammad et al., 2024a; Shumailov et al., 2024), or asymptotic assumptions (Marchi et al., 2024), our framework accommodates more complex generative model architectures, such as transformers. Specifically, we reveal how both model architecture and the ratio of real to synthetic data influence the success of STLs. For example, in Theorem 3, we demonstrate how our general generalization bound applies to transformer-based generative models, providing a theoretical framework that aligns with practical and more sophisticated use cases.

Additionally, while Marchi et al. (2024) assumed an infinite number of training generations for their asymptotic analysis, we consider finite generations, which is more practical since most experimental setups limit generations to fewer than 10 (as noted in Shumailov et al. (2024)). Moreover, our results confirm that when $\alpha=0$ (i.e., no real data is used), the last term in our bound, representing the Cumulative Distribution Shift ($d_{\text{TV}}(n)M(1-(1-\alpha)^{i})\alpha^{-1}$), grows linearly. This finding aligns with the theoretical results of Dohmatob et al. (2024a); Alemohammad et al. (2024a); Shumailov et al. (2024); Fu et al. (2024b). Furthermore, we show that introducing even a constant proportion of real data significantly mitigates model collapse, aligning with experimental findings by Alemohammad et al. (2024a) and Bertrand et al. (2024).

A.4 Additional Comparison with Related Work on Theorem 4

Gerstgrasser et al. (2024) also explored the use of accumulating data to prevent model collapse. They considered a simple linear regression setting without accounting for the dynamic process of training generative models, focusing solely on statistical approximation error. They demonstrated that under the assumption of fixed synthetic data quality matching the original real data, statistical approximation error can be controlled.

By contrast, our work addresses a much more complex and realistic scenario, incorporating the dynamic behavior of transformer-based generative models, learning algorithms, and both statistical and functional approximation errors. Additionally, we allow for dynamic regulation of synthetic data size via a $\lambda$ coefficient, enabling us to identify the optimal synthetic dataset size for avoiding model collapse in these more challenging settings.

A.5 Auxiliary Lemmas

In this section, we begin by introducing a set of auxiliary theorems that will be utilized in the subsequent proofs.

Lemma 7 (McDiarmid’s Inequality).

Consider independent random variables $Z_{1},\cdots,Z_{n}\in\mathcal{Z}$ and a mapping $\phi:\mathcal{Z}^{n}\rightarrow\mathbb{R}$. If, for all $i\in\{1,\cdots,n\}$, and for all $z_{1},\cdots,z_{n},z_{i}^{\prime}\in\mathcal{Z}$, the function $\phi$ satisfies

$$\left|\phi\left(z_{1},\cdots,z_{i-1},z_{i},z_{i+1},\cdots,z_{n}\right)-\phi\left(z_{1},\cdots,z_{i-1},z_{i}^{\prime},z_{i+1},\cdots,z_{n}\right)\right|\leq c,$$

then,

$$P\left(\left|\phi\left(Z_{1},\cdots,Z_{n}\right)-\mathbb{E}\phi\left(Z_{1},\ldots,Z_{n}\right)\right|\geq t\right)\leq 2\exp\left(\frac{-2t^{2}}{nc^{2}}\right).$$

Furthermore, for any $p\geq 2$,

$$\left\|\phi\left(Z_{1},\ldots,Z_{n}\right)-\mathbb{E}\left[\phi\left(Z_{1},\ldots,Z_{n}\right)\right]\right\|_{p}\leq 2\sqrt{np}\,c.$$
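As a rough numerical sanity check of the tail bound in Lemma 7, the sketch below takes $\phi$ to be the sample mean of $n$ independent Uniform$[0,1]$ variables, so that changing a single coordinate moves $\phi$ by at most $c=1/n$. The choice of distribution, the function names, and the values of `n`, `t`, and `trials` are illustrative assumptions and do not come from the paper.

```python
import math
import random

def mcdiarmid_bound(t, n, c):
    """Right-hand side of the two-sided tail bound: 2 * exp(-2 t^2 / (n c^2))."""
    return 2.0 * math.exp(-2.0 * t * t / (n * c * c))

def empirical_tail(t, n, trials=5000, seed=0):
    """Monte Carlo estimate of P(|mean - 1/2| >= t) for n Uniform[0, 1] draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) >= t:
            hits += 1
    return hits / trials

n, t = 50, 0.1
c = 1.0 / n  # bounded-difference constant of the sample mean on [0, 1]
bound = mcdiarmid_bound(t, n, c)
freq = empirical_tail(t, n)
print(f"empirical tail = {freq:.4f}, McDiarmid bound = {bound:.4f}")
```

The bound is distribution-free, so the empirical frequency is expected to sit well below it for a benign distribution such as the uniform one used here.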
Lemma 8.

(Bousquet et al., 2020). Let $\bm{z}=\left(Z_{1},\ldots,Z_{n}\right)$ be a vector of independent random variables each taking values in $\mathcal{Z}$, and let $g_{1},\ldots,g_{n}$ be some functions $g_{i}:\mathcal{Z}^{n}\rightarrow\mathbb{R}$ such that the following holds for any $i\in[n]$:

  • $\left|\mathbb{E}\left[g_{i}(\bm{z})\mid Z_{i}\right]\right|\leq M$,

  • $\mathbb{E}\left[g_{i}(\bm{z})\mid\bm{z}^{\backslash i}\right]=0$,

  • $g_{i}$ has a bounded difference $\beta$ with respect to all variables except the $i$-th variable, that is, for all $j\neq i$, $\bm{z}=\left(Z_{1},\ldots,Z_{n}\right)$ and $\bm{z}^{j}=\left(Z_{1},\ldots,Z_{j}^{\prime},\ldots,Z_{n}\right)\in\mathbb{R}^{n}$, we have $\left|g_{i}(\bm{z})-g_{i}\left(\bm{z}^{j}\right)\right|\leq\beta$.

Then, for any $p\geq 2$,

$$\left\|\sum_{i=1}^{n}g_{i}(\bm{z})\right\|_{p}\leq 12\sqrt{2}\,pn\beta\log n+4M\sqrt{pn}.$$
Lemma 9.

If $\|Y\|_{p}\leq\sqrt{p}a+pb$ for any $p\geq 1$, then for any $\delta\in(0,1)$, with probability at least $1-\delta$,

$$|Y|\leq e\left(a\sqrt{\log\left(\frac{e}{\delta}\right)}+b\log\left(\frac{e}{\delta}\right)\right).$$
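Lemma 9 is stated without proof. One standard route, sketched here as an assumption about the intended argument rather than the authors' own derivation, combines Markov's inequality applied to $|Y|^{p}$ with the choice $p=\log(e/\delta)\geq 1$:

$$\begin{aligned}P\left(|Y|\geq e\|Y\|_{p}\right)&=P\left(|Y|^{p}\geq e^{p}\,\mathbb{E}|Y|^{p}\right)\leq e^{-p}=\frac{\delta}{e}\leq\delta,\\ e\|Y\|_{p}&\leq e\left(\sqrt{p}\,a+p\,b\right)=e\left(a\sqrt{\log\left(\frac{e}{\delta}\right)}+b\log\left(\frac{e}{\delta}\right)\right).\end{aligned}$$

Hence, with probability at least $1-\delta$, $|Y|$ is smaller than $e\|Y\|_{p}$, which is in turn bounded by the right-hand side of the lemma.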

In addition, we introduce the definition of the Total Variation (TV) distance as follows:

Definition 4 (Total Variation Distance).

Given two probability distributions $p$ and $q$ over a multidimensional space $\mathbb{R}^{d}$, the Total Variation Distance between $p$ and $q$ is:

$$TV(p,q)=\frac{1}{2}\int_{\mathbb{R}^{d}}|p(\bm{z})-q(\bm{z})|\,d\bm{z}.$$
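For intuition, Definition 4 can be evaluated numerically in one dimension. The sketch below approximates the integral with a midpoint rule for two unit-variance Gaussians and compares it with the known closed form $2\Phi(|\mu_{1}-\mu_{2}|/2)-1$ for that special case. The function names, the integration range, and the step count are illustrative assumptions.

```python
import math

def normal_pdf(z, mean, std):
    """Density of N(mean, std^2) at z."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def tv_distance_1d(p, q, lo=-12.0, hi=12.0, steps=100_000):
    """Midpoint-rule approximation of 0.5 * integral |p(z) - q(z)| dz on [lo, hi]."""
    dz = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        z = lo + (k + 0.5) * dz
        total += abs(p(z) - q(z))
    return 0.5 * total * dz

tv = tv_distance_1d(lambda z: normal_pdf(z, 0.0, 1.0),
                    lambda z: normal_pdf(z, 1.0, 1.0))
# Closed form for N(0, 1) vs N(1, 1): 2 * Phi(1/2) - 1 = erf(0.5 / sqrt(2)).
closed_form = math.erf(0.5 / math.sqrt(2.0))
print(f"numerical TV = {tv:.5f}, closed form = {closed_form:.5f}")
```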

A.6 Proof of Theorem 1

In this section, we prove Theorem 1 by first decomposing the generalization error into two components: the Cumulative Distribution Shift Across Generations and the Generalization Error on Mixed Distributions. We then proceed to bound the Cumulative Distribution Shift Across Generations by leveraging the properties of the generative model and recursive techniques. For the Generalization Error on Mixed Distributions, we follow the framework of Zheng et al. (2023), leveraging the fact that within the mixed dataset $\widetilde{S}_{i}$, the set $S_{i}$ satisfies the conditional i.i.d. assumption when $S_{0}$ is fixed. Combined with moment bounds, this allows us to effectively bound the generalization error.

The main proof is as follows:

Proof of Theorem 1.

We begin by decomposing the generalization error as follows:

$$\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|\leq\underbrace{\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|}_{\text{Cumulative distribution shift across generations}}+\underbrace{\left|R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|}_{\text{Generalization error on mixed distributions}}.$$

Upper Bounding Cumulative Distribution Shift Term

For the term $\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|$, we first note that $\widetilde{\mathcal{D}}_{i}=\alpha\mathcal{D}_{0}+(1-\alpha)\mathcal{D}_{i}$. Since the risk is linear in the data distribution, we obtain:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\quad=\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-(1-\alpha)R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\quad=(1-\alpha)\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|.
\end{aligned}
\tag{5}
$$
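The identity $\left|R_{\mathcal{D}_{0}}-R_{\widetilde{\mathcal{D}}_{i}}\right|=(1-\alpha)\left|R_{\mathcal{D}_{0}}-R_{\mathcal{D}_{i}}\right|$ for a fixed model can be checked numerically. The following is a minimal sketch on a small discrete sample space; the support size `k`, the value of `alpha`, and the randomly drawn loss values and distributions are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def risk(loss, probs):
    """Expected loss of a fixed model under a discrete distribution."""
    return float(np.dot(loss, probs))

rng = np.random.default_rng(0)
k = 6          # hypothetical support size
alpha = 0.3    # hypothetical fraction of real data in the mixture

loss = rng.uniform(0.0, 1.0, size=k)     # loss of one fixed model at each point z
d0 = rng.dirichlet(np.ones(k))           # stand-in for the real distribution D_0
di = rng.dirichlet(np.ones(k))           # stand-in for the synthetic distribution D_i
d_mix = alpha * d0 + (1 - alpha) * di    # mixed distribution alpha*D_0 + (1-alpha)*D_i

lhs = abs(risk(loss, d0) - risk(loss, d_mix))
rhs = (1 - alpha) * abs(risk(loss, d0) - risk(loss, di))
print(f"lhs = {lhs:.12f}, rhs = {rhs:.12f}")
```

The two printed values agree up to floating-point error because the risk is a linear functional of the distribution.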

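By the triangle inequality, inserting the intermediate term $R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))$, we can further decompose the right-hand side as follows: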

$$
\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|
\leq\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))\right|
+\left|R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|.
\tag{6}
$$

By substituting inequality 6 into equation 5, we obtain:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\quad\leq(1-\alpha)\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))\right|
+(1-\alpha)\left|R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|.
\end{aligned}
\tag{7}
$$

Then, for the term $\left|R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|$, we have:

$$
\begin{aligned}
\left|R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|
&=\left|\int_{\bm{z}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z})\left(\mathbb{P}_{\widetilde{\mathcal{D}}_{i-1}}(\bm{z})-\mathbb{P}_{\mathcal{D}_{i}}(\bm{z})\right)d\bm{z}\right| \\
&\leq\int_{\bm{z}}\left|\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z})\left(\mathbb{P}_{\widetilde{\mathcal{D}}_{i-1}}(\bm{z})-\mathbb{P}_{\mathcal{D}_{i}}(\bm{z})\right)\right|d\bm{z} \\
&\leq M\int_{\bm{z}}\left|\mathbb{P}_{\widetilde{\mathcal{D}}_{i-1}}(\bm{z})-\mathbb{P}_{\mathcal{D}_{i}}(\bm{z})\right|d\bm{z} \\
&=2M\,\mathrm{TV}\left(\widetilde{\mathcal{D}}_{i-1},\mathcal{D}_{i}\right).
\end{aligned}
\tag{8}
$$
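The step from a risk difference to a total variation distance relies only on the loss being bounded by $M$ and on the identity $\int|\mathbb{P}-\mathbb{Q}|=2\,\mathrm{TV}(\mathbb{P},\mathbb{Q})$. As a hedged illustration, the sketch below verifies $\left|R_{\mathbb{P}}-R_{\mathbb{Q}}\right|\leq 2M\,\mathrm{TV}(\mathbb{P},\mathbb{Q})$ on random discrete distributions; the helper `tv_distance`, the support size `k`, and the bound `M = 2.0` are assumptions made for the example.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(p - q).sum())

rng = np.random.default_rng(1)
k, M = 8, 2.0                          # hypothetical support size and loss bound

loss = rng.uniform(0.0, M, size=k)     # loss values in [0, M] for one fixed model
p = rng.dirichlet(np.ones(k))          # stand-in for the mixed distribution of generation i-1
q = rng.dirichlet(np.ones(k))          # stand-in for the synthetic distribution D_i

gap = abs(float(np.dot(loss, p - q)))  # |R_P - R_Q|
bound = 2 * M * tv_distance(p, q)      # 2 * M * TV(P, Q)
print(f"risk gap = {gap:.4f} <= bound = {bound:.4f}")
```

The bound is typically loose, since it ignores any cancellation between regions where the two distributions differ in opposite directions.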

Incorporating inequality 8 into inequality 7, we arrive at:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\quad\leq(1-\alpha)\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))\right|
+2(1-\alpha)M\,\mathrm{TV}\left(\widetilde{\mathcal{D}}_{i-1},\mathcal{D}_{i}\right).
\end{aligned}
\tag{9}
$$

Next, we unroll this bound recursively. Applying the same argument to the mixed distribution $\widetilde{\mathcal{D}}_{i-1}$, we first obtain

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\quad\leq(1-\alpha)\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i-2}}(\mathcal{A}(\widetilde{S}_{i}))\right|
+2(1-\alpha)M\,\mathrm{TV}\left(\widetilde{\mathcal{D}}_{i-2},\mathcal{D}_{i-1}\right).
\end{aligned}
\tag{10}
$$

Plugging inequality 10 into inequality 9, we obtain:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\quad\leq(1-\alpha)^{2}\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i-2}}(\mathcal{A}(\widetilde{S}_{i}))\right|
+2(1-\alpha)^{2}M\,\mathrm{TV}\left(\widetilde{\mathcal{D}}_{i-2},\mathcal{D}_{i-1}\right)
+2(1-\alpha)M\,\mathrm{TV}\left(\widetilde{\mathcal{D}}_{i-1},\mathcal{D}_{i}\right).
\end{aligned}
$$

By recursion, we obtain:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\quad\leq(1-\alpha)^{i-1}\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{1}}(\mathcal{A}(\widetilde{S}_{i}))\right|
+2(1-\alpha)^{i-1}M\,\mathrm{TV}\left(\widetilde{\mathcal{D}}_{1},\mathcal{D}_{2}\right)+\cdots
+2(1-\alpha)M\,\mathrm{TV}\left(\widetilde{\mathcal{D}}_{i-1},\mathcal{D}_{i}\right) \\
&\quad\leq 2(1-\alpha)^{i}M\,\mathrm{TV}\left(\mathcal{D}_{0},\mathcal{D}_{1}\right)
+2(1-\alpha)^{i-1}M\,\mathrm{TV}\left(\widetilde{\mathcal{D}}_{1},\mathcal{D}_{2}\right)+\cdots
+2(1-\alpha)M\,\mathrm{TV}\left(\widetilde{\mathcal{D}}_{i-1},\mathcal{D}_{i}\right).
\end{aligned}
$$

Let $n_{0}$ denote the sample size of the real dataset $S_{0}$, and let $n_{i}$ denote the sample size of the mixed dataset $\widetilde{S}_{i}$ in the $i$-th generation. Then $\mathrm{TV}\left(\widetilde{\mathcal{D}}_{j},\mathcal{D}_{j+1}\right)$ can be written as a function of $n_{j}$. Assuming that every generation's dataset has the same sample size, i.e., $n_{0}=n_{1}=\cdots=n_{i}=n$, and that the TV distance for each generation is of the same order, denoted by $d_{\mathrm{TV}}(n)$, we can derive the following result:

$$
\begin{aligned}
\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|
&\leq 2Md_{\mathrm{TV}}(n)\left[(1-\alpha)^{i}+(1-\alpha)^{i-1}+\cdots+(1-\alpha)\right] \\
&\leq 2M\left(1-(1-\alpha)^{i}\right)\alpha^{-1}d_{\mathrm{TV}}(n).
\end{aligned}
\tag{11}
$$
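The geometric sum $(1-\alpha)^{i}+\cdots+(1-\alpha)$ equals $(1-\alpha)\left(1-(1-\alpha)^{i}\right)\alpha^{-1}$, which is at most $\left(1-(1-\alpha)^{i}\right)\alpha^{-1}$ and never exceeds $\alpha^{-1}$, however many generations are run. The sketch below tabulates these quantities for a few illustrative values; the function names and the chosen values of `alpha` and `i` are assumptions made for the example.

```python
def shift_factor_sum(alpha, i):
    """Exact value of (1 - alpha)^i + ... + (1 - alpha), summed term by term."""
    return sum((1 - alpha) ** k for k in range(1, i + 1))

def shift_factor_bound(alpha, i):
    """Closed-form multiplier (1 - (1 - alpha)^i) / alpha from the bound."""
    return (1 - (1 - alpha) ** i) / alpha

for alpha in (0.1, 0.5, 0.9):
    for i in (1, 5, 50):
        s = shift_factor_sum(alpha, i)
        b = shift_factor_bound(alpha, i)
        print(f"alpha={alpha:.1f} i={i:2d}  sum={s:.4f}  bound={b:.4f}  1/alpha={1 / alpha:.4f}")
```

For a fixed $\alpha>0$ the multiplier saturates as $i$ grows, so under the stated assumptions the distribution shift term is controlled by $2M\alpha^{-1}d_{\mathrm{TV}}(n)$ uniformly over generations.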

Combining inequality 11 with the initial decomposition of the generalization error, we obtain:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\quad\leq\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|
+\left|R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\quad\leq 2M\left(1-(1-\alpha)^{i}\right)\alpha^{-1}d_{\mathrm{TV}}(n)
+\left|R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|.
\end{aligned}
\tag{12}
$$

Upper Bounding Generalization Error on Mixed Distributions Term

Next, we turn our attention to the term $\left|R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|$. Our primary objective is to establish a moment bound for this expression.

$$
\begin{aligned}
&\left\|R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right\|_{p} \\
&\quad=\left\|\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))+(1-\alpha)R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))
-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})
-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{i,1-\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p} \\
&\quad\leq\underbrace{\left\|\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p}}_{\text{Term 1}}
+\underbrace{\left\|(1-\alpha)R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{i,1-\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p}}_{\text{Term 2}}.
\end{aligned}
\tag{13}
$$

The newly sampled dataset $S_{0,\alpha}\subseteq S_{0}$ is a subset of the original dataset $S_{0}$ of size $\alpha\times\left|S_{0}\right|$: it contains a proportion $\alpha$ of the $n$ data points in $S_{0}$, that is, $n\alpha$ data points. Similarly, $S_{i,1-\alpha}\subseteq S_{i}$ is a subset of the synthetic dataset $S_{i}$ of size $(1-\alpha)\times\left|S_{i}\right|$: it contains a proportion $1-\alpha$ of the $n$ data points in $S_{i}$, that is, $n(1-\alpha)$ data points.
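As a concrete illustration of the two subsets, the following minimal sketch draws $S_{0,\alpha}$ and $S_{i,1-\alpha}$ without replacement and concatenates them into a mixed set of $n$ points. The function name `mix_datasets`, the seeded generator, and the rounding of $n\alpha$ to an integer are assumptions made for illustration only; the analysis itself uses nothing beyond the sizes $n\alpha$ and $n(1-\alpha)$.

```python
import random

def mix_datasets(real, synthetic, alpha, seed=0):
    """Return (S_0_alpha, S_i_rest) sampled without replacement.

    real, synthetic: lists that both contain n data points.
    alpha: proportion of real data, in [0, 1].
    """
    n = len(real)
    if len(synthetic) != n:
        raise ValueError("real and synthetic must both contain n points")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n_real = round(alpha * n)        # |S_{0,alpha}| = n * alpha (rounded, an assumption)
    n_syn = n - n_real               # |S_{i,1-alpha}| = n * (1 - alpha)
    rng = random.Random(seed)
    s0_alpha = rng.sample(real, n_real)
    si_rest = rng.sample(synthetic, n_syn)
    return s0_alpha, si_rest

# Hypothetical placeholder data: tagged indices stand in for data points.
real = [("real", k) for k in range(100)]
synthetic = [("syn", k) for k in range(100)]
s0_alpha, si_rest = mix_datasets(real, synthetic, alpha=0.3)
mixed = s0_alpha + si_rest           # mixed training set of n points
```

With $\alpha=0.3$ and $n=100$, the mixed set holds 30 real and 70 synthetic points.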

We observe that for any function $f(S)$, if the bound $\|f\|_{p}\left(S_{j}\right)\leq C$ holds for some subset $S_{j}\subseteq S$, where $\|f\|_{p}\left(S_{j}\right)$ is the $L_{p}$ norm of $f$ conditional on $S_{j}$, then the same bound holds unconditionally:

$$
\|f\|_{p}=\left(\mathbb{E}\,\mathbb{E}\left[|f|^{p}\mid S_{j}\right]\right)^{1/p}\leq\left(\mathbb{E}\left[C^{p}\right]\right)^{1/p}\leq C.
$$
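The inequality above can be checked exactly on a small discrete example. The sketch below is only an illustration: the three values of $S_{j}$, their probabilities, and the conditional distributions of $f$ are hypothetical toy numbers, and $C$ is taken as the largest conditional norm.

```python
def p_norm(values, probs, p):
    """L_p norm of a discrete random variable with the given values and probabilities."""
    return sum(q * abs(v) ** p for v, q in zip(values, probs)) ** (1.0 / p)

# Hypothetical toy setting: S_j takes the values "a", "b", "c";
# conditional on each value, f follows a two-point distribution.
conditional = {
    "a": ([0.5, -1.0], [0.5, 0.5]),
    "b": ([2.0, 0.0], [0.25, 0.75]),
    "c": ([1.5, 1.5], [0.5, 0.5]),
}
prior = {"a": 0.2, "b": 0.5, "c": 0.3}
p = 4

# Conditional norms ||f||_p(S_j) and their uniform bound C.
cond_norms = {s: p_norm(v, q, p) for s, (v, q) in conditional.items()}
C = max(cond_norms.values())

# Tower property: E|f|^p = E[ E[|f|^p | S_j] ].
uncond = sum(prior[s] * cond_norms[s] ** p for s in conditional) ** (1.0 / p)

print(uncond <= C)
```

The unconditional norm is a weighted power mean of the conditional norms, so it can never exceed their maximum.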

Fix $S_{0}$; then the data in $S_{i}$ are independent. We use this property and Lemma 8 to bound Term 2. We introduce functions $f_{j}(S_{i,1-\alpha})$, which play the same role as the $g_{j}$'s in Lemma 8:

$$
f_{j}(S_{i,1-\alpha})=\mathbb{E}_{\bm{z}_{i,j}^{\prime}\sim\mathcal{D}_{i}}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{i}}\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z})-\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z}_{i,j})\right],
$$

where $\bm{z}_{i,j}$ is the $j$-th data point in $S_{i,1-\alpha}$, and $S_{i,1-\alpha}^{j}$ is obtained by replacing $\bm{z}_{i,j}$ with $\bm{z}_{i,j}^{\prime}$. Next, we prove that $f_{j}$ satisfies the three conditions of Lemma 8. First, we show that $|f_{j}|\leq M$:

$$
\begin{aligned}
|f_{j}| &=\left|\mathbb{E}_{\bm{z}_{i,j}^{\prime}\sim\mathcal{D}_{i}}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{i}}\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z})-\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z}_{i,j})\right]\right| \\
&\leq\mathbb{E}_{\bm{z}_{i,j}^{\prime}\sim\mathcal{D}_{i}}\mathbb{E}_{\bm{z}\sim\mathcal{D}_{i}}\left|\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z})-\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z}_{i,j})\right| \\
&\leq M.
\end{aligned}
$$

We then prove the condition $\mathbb{E}[f_{j}\mid S_{i,1-\alpha}^{\setminus j}]=0$:

$$
\begin{aligned}
&\mathbb{E}\left[f_{j}\mid S_{i,1-\alpha}^{\setminus j}\right] \\
&=\mathbb{E}_{\bm{z}_{i,j}\sim\mathcal{D}_{i}}\left[\mathbb{E}_{\bm{z}_{i,j}^{\prime}\sim\mathcal{D}_{i}}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{i}}\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z})-\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z}_{i,j})\right]\mid S_{i,1-\alpha}^{\setminus j}\right] \\
&=\mathbb{E}_{\bm{z}_{i,j}^{\prime}\sim\mathcal{D}_{i}}\left[\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{i}}\ell\left(\mathcal{A}\left(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}\right),\bm{z}\right)-\mathbb{E}_{\bm{z}_{i,j}\sim\mathcal{D}_{i}}\ell\left(\mathcal{A}\left(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}\right),\bm{z}_{i,j}\right)\right]\mid S_{i,1-\alpha}^{\setminus j}\right] \\
&=\mathbb{E}_{\bm{z}_{i,j}^{\prime}\sim\mathcal{D}_{i}}\left[0\mid S_{i,1-\alpha}^{\setminus j}\right]=0.
\end{aligned}
$$

Finally, we prove that $f_{j}$ has bounded differences $2\beta_{n}$ with respect to all variables except the $j$-th. Let $t\neq j$; then we obtain:

$$
\begin{aligned}
&\left|f_{j}\left(S_{i,1-\alpha}\right)-f_{j}\left(S_{i,1-\alpha}^{t}\right)\right| \\
&=\Big|\mathbb{E}_{\bm{z}_{i,j}^{\prime}\sim\mathcal{D}_{i}}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{i}}\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z})-\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z}_{i,j})\right] \\
&\qquad-\mathbb{E}_{\bm{z}_{i,j}^{\prime}\sim\mathcal{D}_{i}}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{i}}\ell(\mathcal{A}(S_{0,\alpha}\cup(S_{i,1-\alpha}^{t})^{j}),\bm{z})-\ell(\mathcal{A}(S_{0,\alpha}\cup(S_{i,1-\alpha}^{t})^{j}),\bm{z}_{i,j})\right]\Big| \\
&\leq\left|\mathbb{E}_{\bm{z}_{i,j}^{\prime}\sim\mathcal{D}_{i}}\mathbb{E}_{\bm{z}\sim\mathcal{D}_{i}}\left[\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z})-\ell(\mathcal{A}(S_{0,\alpha}\cup(S_{i,1-\alpha}^{t})^{j}),\bm{z})\right]\right| \\
&\qquad+\left|\mathbb{E}_{\bm{z}_{i,j}^{\prime}\sim\mathcal{D}_{i}}\left[\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z}_{i,j})-\ell(\mathcal{A}(S_{0,\alpha}\cup(S_{i,1-\alpha}^{t})^{j}),\bm{z}_{i,j})\right]\right| \\
&\leq\beta_{n}+\beta_{n}=2\beta_{n}.
\end{aligned}
$$
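The three conditions just verified can also be checked by exhaustive enumeration on a toy instance. The sketch below is not the setting of the paper. As assumptions for illustration, $\mathcal{D}_{i}$ is uniform on $\{0,1\}$, the algorithm returns the empirical mean, the loss is the squared error (so $M=1$), the real part is a fixed pair of points, and `beta` is computed as the worst change of the loss under a single replacement.

```python
from itertools import product

SUPPORT = (0.0, 1.0)       # toy D_i: uniform distribution on {0, 1}
REAL_PART = (0.0, 1.0)     # toy S_{0,alpha}, held fixed throughout
M = 1.0                    # the squared loss below lies in [0, 1]

def algorithm(synthetic):
    """Toy A: empirical mean of the fixed real part plus the synthetic part."""
    data = REAL_PART + synthetic
    return sum(data) / len(data)

def loss(h, z):
    return (h - z) ** 2

def expect(fn):
    """Exact expectation of fn(z) for z ~ D_i."""
    return sum(fn(z) for z in SUPPORT) / len(SUPPORT)

def replace(sample, k, value):
    """Copy of sample with its k-th point replaced by value."""
    return sample[:k] + (value,) + sample[k + 1:]

def f_j(sample, j):
    """E_{z'}[E_z loss(A(S^j), z) - loss(A(S^j), z_j)], with S^j = sample, z_j -> z'."""
    z_j = sample[j]
    def inner(z_prime):
        h = algorithm(replace(sample, j, z_prime))
        return expect(lambda z: loss(h, z)) - loss(h, z_j)
    return expect(inner)

m, j, t = 4, 0, 2          # m plays the role of n(1 - alpha); t != j
samples = list(product(SUPPORT, repeat=m))

# Condition 1: |f_j| <= M.
max_abs = max(abs(f_j(s, j)) for s in samples)

# Condition 2: E[f_j | all points except the j-th] = 0.
max_cond_mean = max(
    abs(expect(lambda z_j: f_j(replace(s, j, z_j), j))) for s in samples
)

# Condition 3: replacing the t-th point moves f_j by at most 2 * beta.
beta = max(
    abs(loss(algorithm(s), z) - loss(algorithm(replace(s, k, v)), z))
    for s in samples for k in range(m) for v in SUPPORT for z in SUPPORT
)
max_gap = max(
    abs(f_j(s, j) - f_j(replace(s, t, v), j)) for s in samples for v in SUPPORT
)

print(max_abs <= M, max_cond_mean < 1e-12, max_gap <= 2 * beta + 1e-12)
```

Because the support is finite, every expectation here is an exact finite average, so the checks do not depend on sampling noise.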

Therefore, for any fixed $S_{0}$ and any $p\geq 2$, Lemma 8 gives

$$
\left\|\sum_{j=1}^{n(1-\alpha)}f_{j}\left(S_{i,1-\alpha}\right)\right\|_{p}\lesssim pn(1-\alpha)\beta_{n}\log(n(1-\alpha))+M\sqrt{pn(1-\alpha)}.
\tag{14}
$$

We note that Term 2 differs from $\frac{1}{n}\sum_{j=1}^{n(1-\alpha)}f_{j}$ by at most $2(1-\alpha)\beta_{n}$, since replacing a single data point changes each loss term by at most $\beta_{n}$. Consequently, for any fixed $S_{0}$, we can bound Term 2 using inequality (14) as follows.

$$
\begin{aligned}
&\left\|(1-\alpha)R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{i,1-\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p} \\
&=\left\|(1-\alpha)R_{\mathcal{D}_{i}}(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}))-\frac{1}{n}\sum_{j=1}^{n(1-\alpha)}\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}),\bm{z}_{i,j})\right\|_{p}\qquad\text{(fix } S_{0,\alpha}\text{)} \\
&=\left\|\frac{1}{n}\sum_{j=1}^{n(1-\alpha)}\left(\mathbb{E}_{\bm{z}\sim\mathcal{D}_{i}}\ell\left(\mathcal{A}\left(S_{0,\alpha}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left(S_{0,\alpha}\cup S_{i,1-\alpha}\right),\bm{z}_{i,j}\right)\right)\right\|_{p} \\
&\leq\frac{1}{n}\left\|\sum_{j=1}^{n(1-\alpha)}\left(\mathbb{E}_{\bm{z}_{i,j}^{\prime}\sim\mathcal{D}_{i}}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{i}}\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z})-\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z}_{i,j})\right]\right)\right\|_{p}+(1-\alpha)\left\|2\beta_{n}\right\|_{p} \\
&=\frac{1}{n}\left\|\sum_{j=1}^{n(1-\alpha)}f_{j}\left(S_{i,1-\alpha}\right)\right\|_{p}+(1-\alpha)\left\|2\beta_{n}\right\|_{p} \\
&\lesssim p(1-\alpha)\beta_{n}\log(n(1-\alpha))+M\sqrt{\frac{p(1-\alpha)}{n}}+2(1-\alpha)\beta_{n} \\
&\lesssim p(1-\alpha)\beta_{n}\log(n(1-\alpha))+M\sqrt{\frac{p(1-\alpha)}{n}}.
\end{aligned}
$$
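The last two steps of this chain are a rescaling of inequality (14). As a worked step, dividing the right-hand side of (14) by $n$ gives

$$
\frac{1}{n}\Big(pn(1-\alpha)\beta_{n}\log(n(1-\alpha))+M\sqrt{pn(1-\alpha)}\Big)=p(1-\alpha)\beta_{n}\log(n(1-\alpha))+M\sqrt{\frac{p(1-\alpha)}{n}}.
$$

The leftover term satisfies $2(1-\alpha)\beta_{n}\leq p(1-\alpha)\beta_{n}$ because $p\geq 2$. It is therefore absorbed into the first term under $\lesssim$, under the added assumption (not stated explicitly in the text) that $\log(n(1-\alpha))$ is bounded below by a positive constant.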

Since this bound holds for every fixed $S_{0}$, the observation above on conditional norms yields the same bound for Term 2 unconditionally:

$$
\left\|(1-\alpha)R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{i,1-\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p}\lesssim p(1-\alpha)\beta_{n}\log(n(1-\alpha))+M\sqrt{\frac{p(1-\alpha)}{n}}.
\tag{15}
$$
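To get a feel for how the right-hand side of (15) depends on the proportion $\alpha$ of real data, the following sketch evaluates it numerically. The function name `term2_bound` is hypothetical, the constant hidden in $\lesssim$ is set to 1, and the stability rate $\beta_{n}=1/n$ is an illustrative assumption rather than a rate derived here.

```python
import math

def term2_bound(n, alpha, p, beta_n, M):
    """Right-hand side of (15), with the hidden constant set to 1 (an assumption)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    m = n * (1.0 - alpha)                 # number of synthetic points, n(1 - alpha)
    if m < 1.0:
        raise ValueError("need at least one synthetic point")
    stability = p * (1.0 - alpha) * beta_n * math.log(m)
    sampling = M * math.sqrt(p * (1.0 - alpha) / n)
    return stability + sampling

n = 10_000
for alpha in (0.1, 0.5, 0.9):
    # beta_n = 1/n is a hypothetical stability rate chosen for illustration.
    print(alpha, term2_bound(n, alpha, p=2, beta_n=1.0 / n, M=1.0))
```

Both summands shrink as $\alpha$ grows, reflecting that Term 2 only accounts for the synthetic part of the mixed dataset.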

Now, we use a similar idea to bound Term 1, $\left\|\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p}$. We decompose it as follows.

\begin{align*}
&\left\|\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p} \\
&\leq \underbrace{\left\|\left(\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right)-\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}}\left(\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right)\right\|_{p}}_{\text{Term 3}} \\
&\quad+\underbrace{\left\|\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}}\left(\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right)\right\|_{p}}_{\text{Term 4}}. \tag{16}
\end{align*}

We proceed by bounding each term. Specifically, Term 3 can be bounded using McDiarmid’s inequality, as outlined in Lemma 7, and Term 4 can be bounded by applying Lemma 8.

To bound Term 3, we begin by fixing $S_{0,\alpha}$ and utilizing the conditional independence property of $S_{i}$ once again. In order to apply Lemma 7, we must show that $\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})$ exhibits a bounded difference with respect to $S_{i,1-\alpha}$ when $S_{0,\alpha}$ is fixed. Writing $S_{i,1-\alpha}^{j}$ for $S_{i,1-\alpha}$ with its $j$-th point replaced, this can be formulated as follows.

\begin{align*}
&\Bigg|\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}),\bm{z}_{i}) \\
&\quad-\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}))+\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}),\bm{z}_{i})\Bigg| \\
&\leq\alpha\left|R_{\mathcal{D}_{0}}(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}))-R_{\mathcal{D}_{0}}(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}))\right| \\
&\quad+\frac{1}{n}\left|\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(S_{0,\alpha}\cup S_{i,1-\alpha}),\bm{z}_{i})-\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell\left(\mathcal{A}\left(S_{0,\alpha}\cup S_{i,1-\alpha}^{j}\right),\bm{z}_{i}\right)\right| \\
&\leq\alpha\beta_{n}+\alpha\beta_{n}=2\alpha\beta_{n}.
\end{align*}
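
As a hedged sketch of where the two $\alpha\beta_{n}$ contributions come from, assume that $\mathcal{A}$ is $\beta_{n}$-uniformly stable (replacing one training point changes the loss at any evaluation point by at most $\beta_{n}$), that $R_{\mathcal{D}_{0}}(\mathcal{A}(\cdot))=\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell(\mathcal{A}(\cdot),\bm{z})$, and that $S_{0,\alpha}$ contains $n\alpha$ points. The shorthand $T=S_{0,\alpha}\cup S_{i,1-\alpha}$ and $T^{j}=S_{0,\alpha}\cup S_{i,1-\alpha}^{j}$ is introduced only for this illustration.

\begin{align*}
\alpha\left|R_{\mathcal{D}_{0}}(\mathcal{A}(T))-R_{\mathcal{D}_{0}}(\mathcal{A}(T^{j}))\right|
&=\alpha\left|\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\left[\ell(\mathcal{A}(T),\bm{z})-\ell(\mathcal{A}(T^{j}),\bm{z})\right]\right|\leq\alpha\beta_{n}, \\
\frac{1}{n}\left|\sum_{\bm{z}_{i}\in S_{0,\alpha}}\left[\ell(\mathcal{A}(T),\bm{z}_{i})-\ell(\mathcal{A}(T^{j}),\bm{z}_{i})\right]\right|
&\leq\frac{1}{n}\cdot n\alpha\cdot\beta_{n}=\alpha\beta_{n}.
\end{align*}

Under these assumptions, the first line uses stability inside the expectation, and the second applies it to each of the $n\alpha$ summands.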

Thus, by McDiarmid's inequality, we have

\begin{align*}
&\left\|\left(\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right)-\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}}\left(\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right)\right\|_{p} \\
&\leq 4\sqrt{n(1-\alpha)p}\,\alpha\beta_{n}\lesssim\sqrt{n(1-\alpha)p}\,\alpha\beta_{n}. \tag{17}
\end{align*}
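
The constant in (17) can be checked against a moment form of McDiarmid's inequality. Lemma 7 is not restated in this section, so the exact form below, including the factor $2$, is an assumption made for illustration; the symbols $f$, $m$ and $c$ are likewise introduced only here. Suppose $f$ depends on $m$ independent variables and changes by at most $c$ when any single one is replaced, and that the lemma reads $\|f-\mathbb{E}f\|_{p}\leq 2\sqrt{mp}\,c$. Substituting the number of synthetic points and the bounded difference obtained above gives:

\begin{align*}
m=n(1-\alpha),\quad c=2\alpha\beta_{n}
\quad\Longrightarrow\quad
\|f-\mathbb{E}f\|_{p}\leq 2\sqrt{n(1-\alpha)p}\cdot 2\alpha\beta_{n}=4\sqrt{n(1-\alpha)p}\,\alpha\beta_{n}.
\end{align*}

If Lemma 7 carries a different constant, only the numerical factor changes; the $\lesssim$ form of the bound is unaffected.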

We now introduce a set of functions and apply Lemma 8 once more to bound Term 4. Specifically, we define $h_{j}(S)$, which serves a similar role to the $g_{i}$'s in Lemma 8, as follows:

\begin{align*}
&h_{j}(S_{0,\alpha}) \\
={}&\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left(S_{0,\alpha}^{j}\right)}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left(S_{0,\alpha}^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left(S_{0,\alpha}^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right] \tag{18}
\end{align*}

where $\bm{z}_{0,j}$ denotes the $j$-th data point in $S_{0,\alpha}$, and $S_{0,\alpha}^{j}$ represents the dataset obtained by replacing $\bm{z}_{0,j}$ with $\bm{z}_{0,j}^{\prime}$. Additionally, note that $S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left(S_{0,\alpha}^{j}\right)$ indicates that $S_{i,1-\alpha}$ is the synthetic dataset produced by the self-consuming loop after $i$ generations when a single data point of the initial real dataset $S_{0}$ has been modified. This complex scenario can be addressed using the recursive stability defined for the self-consuming loop in Definition 2.
Moreover, similar to the process above, we observe that $\left|h_{j}\right|\leq M$ and $\mathbb{E}\left[h_{j}\mid S_{0,\alpha}^{\backslash j}\right]=0$. It remains to prove the more intricate fact that $h_{j}$ exhibits a bounded difference, which we demonstrate as follows.

\begin{align*}
&\left|h_{j}(S_{0,\alpha})-h_{j}\left(S_{0,\alpha}^{t}\right)\right| \\
={}&\Big|\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left(S_{0,\alpha}^{j}\right)}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left(S_{0,\alpha}^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left(S_{0,\alpha}^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right] \\
&-\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left((S_{0,\alpha}^{t})^{j}\right)}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right]\Big| \\
\leq{}&\Big|\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left(S_{0,\alpha}^{j}\right)}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left(S_{0,\alpha}^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left(S_{0,\alpha}^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right] \\
&-\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left(S_{0,\alpha}^{j}\right)}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right]\Big| \tag{19} \\
&+\Big|\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left(S_{0,\alpha}^{j}\right)}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right] \\
&-\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left((S_{0,\alpha}^{t})^{j}\right)}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right]\Big|. \tag{20}
\end{align*}
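
The add-and-subtract step above can be restated compactly. The symbols $T$, $T^{\prime}$, $\mathcal{P}_{T}$ and $F$ below are shorthand introduced only for this illustration and are not part of the original derivation.

\begin{align*}
T&:=S_{0,\alpha}^{j},\qquad T^{\prime}:=(S_{0,\alpha}^{t})^{j},\qquad \mathcal{P}_{T}:=\mathcal{D}_{i}^{n(1-\alpha)}(T), \\
F(T,\mathcal{P})&:=\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{P}}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left(T\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left(T\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right], \\
\left|F(T,\mathcal{P}_{T})-F(T^{\prime},\mathcal{P}_{T^{\prime}})\right|
&\leq\left|F(T,\mathcal{P}_{T})-F(T^{\prime},\mathcal{P}_{T})\right|+\left|F(T^{\prime},\mathcal{P}_{T})-F(T^{\prime},\mathcal{P}_{T^{\prime}})\right|.
\end{align*}

In this notation, the first summand corresponds to (19): only the training set passed to $\mathcal{A}$ changes, while the sampling distribution of the synthetic data stays fixed. The second summand corresponds to (20): the training-set argument stays fixed and only the distribution from which $S_{i,1-\alpha}$ is drawn changes.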

We can bound the term in Equation (19) by applying uniform stability, which yields an upper bound of $2\beta_{n}$. For the term in Equation (20), for ease of notation, let $Q$ denote $\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)$. Consequently, we obtain the following:

\begin{align*}
&\Big|\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left(S_{0,\alpha}^{j}\right)}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right] \\
&-\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left((S_{0,\alpha}^{t})^{j}\right)}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right]\Big| \\
={}&\left|\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\left[\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left(S_{0,\alpha}^{j}\right)}Q-\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left((S_{0,\alpha}^{t})^{j}\right)}Q\right]\right|
\end{align*}
=𝔼𝒛0,j𝒟0𝔼𝒛𝒟0|Si,1α((Si,1αS0,αj)(Si,1α(S0,αt)j))QdSi,1α|\displaystyle=\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_% {\bm{z}\sim\mathcal{D}_{0}}\left|\int_{S_{i,1-\alpha}}\left(\mathbb{P}\left(S_% {i,1-\alpha}\mid S^{j}_{0,\alpha}\right)-\mathbb{P}\left(S_{i,1-\alpha}\mid% \left(S^{t}_{0,\alpha}\right)^{j}\right)\right)QdS_{i,1-\alpha}\right|= blackboard_E start_POSTSUBSCRIPT bold_italic_z start_POSTSUBSCRIPT 0 , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∼ caligraphic_D start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT bold_italic_z ∼ caligraphic_D start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT | ∫ start_POSTSUBSCRIPT italic_S start_POSTSUBSCRIPT italic_i , 1 - italic_α end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( blackboard_P ( italic_S start_POSTSUBSCRIPT italic_i , 1 - italic_α end_POSTSUBSCRIPT ∣ italic_S start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 , italic_α end_POSTSUBSCRIPT ) - blackboard_P ( italic_S start_POSTSUBSCRIPT italic_i , 1 - italic_α end_POSTSUBSCRIPT ∣ ( italic_S start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 , italic_α end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) ) italic_Q italic_d italic_S start_POSTSUBSCRIPT italic_i , 1 - italic_α end_POSTSUBSCRIPT |
𝔼𝒛0,j𝒟0𝔼𝒛𝒟0[Si,1α|((Si,1αS0,αj)(Si,1α(S0,αt)j))Q|dSi,1α]\displaystyle\leq\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{% E}_{\bm{z}\sim\mathcal{D}_{0}}\left[\int_{S_{i,1-\alpha}}\left|\left(\mathbb{P% }\left(S_{i,1-\alpha}\mid S^{j}_{0,\alpha}\right)-\mathbb{P}\left(S_{i,1-% \alpha}\mid\left(S^{t}_{0,\alpha}\right)^{j}\right)\right)Q\right|dS_{i,1-% \alpha}\right]≤ blackboard_E start_POSTSUBSCRIPT bold_italic_z start_POSTSUBSCRIPT 0 , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∼ caligraphic_D start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT bold_italic_z ∼ caligraphic_D start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∫ start_POSTSUBSCRIPT italic_S start_POSTSUBSCRIPT italic_i , 1 - italic_α end_POSTSUBSCRIPT end_POSTSUBSCRIPT | ( blackboard_P ( italic_S start_POSTSUBSCRIPT italic_i , 1 - italic_α end_POSTSUBSCRIPT ∣ italic_S start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 , italic_α end_POSTSUBSCRIPT ) - blackboard_P ( italic_S start_POSTSUBSCRIPT italic_i , 1 - italic_α end_POSTSUBSCRIPT ∣ ( italic_S start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 , italic_α end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) ) italic_Q | italic_d italic_S start_POSTSUBSCRIPT italic_i , 1 - italic_α end_POSTSUBSCRIPT ]
M𝔼𝒛0,j𝒟0𝔼𝒛𝒟0[Si,1α|((Si,1αS0,αj)(Si,1α(S0,αt)j))|dSi,1α]\displaystyle\leq M\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}% \mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\left[\int_{S_{i,1-\alpha}}\left|\left(% \mathbb{P}\left(S_{i,1-\alpha}\mid S^{j}_{0,\alpha}\right)-\mathbb{P}\left(S_{% i,1-\alpha}\mid\left(S^{t}_{0,\alpha}\right)^{j}\right)\right)\right|dS_{i,1-% \alpha}\right]≤ italic_M blackboard_E start_POSTSUBSCRIPT bold_italic_z start_POSTSUBSCRIPT 0 , italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∼ caligraphic_D start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT bold_italic_z ∼ caligraphic_D start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∫ start_POSTSUBSCRIPT italic_S start_POSTSUBSCRIPT italic_i , 1 - italic_α end_POSTSUBSCRIPT end_POSTSUBSCRIPT | ( blackboard_P ( italic_S start_POSTSUBSCRIPT italic_i , 1 - italic_α end_POSTSUBSCRIPT ∣ italic_S start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 , italic_α end_POSTSUBSCRIPT ) - blackboard_P ( italic_S start_POSTSUBSCRIPT italic_i , 1 - italic_α end_POSTSUBSCRIPT ∣ ( italic_S start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 , italic_α end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) ) | italic_d italic_S start_POSTSUBSCRIPT italic_i , 1 - italic_α end_POSTSUBSCRIPT ]
2MsupjTV(𝒟in(1α)(S0j),𝒟in(1α)(S0))absent2𝑀subscriptsupremum𝑗𝑇𝑉superscriptsubscript𝒟𝑖𝑛1𝛼superscriptsubscript𝑆0𝑗superscriptsubscript𝒟𝑖𝑛1𝛼subscript𝑆0\displaystyle\leq 2M\sup_{j}TV\left(\mathcal{D}_{i}^{n(1-\alpha)}(S_{0}^{j}),% \mathcal{D}_{i}^{n(1-\alpha)}(S_{0})\right)≤ 2 italic_M roman_sup start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_T italic_V ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n ( 1 - italic_α ) end_POSTSUPERSCRIPT ( italic_S start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) , caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n ( 1 - italic_α ) end_POSTSUPERSCRIPT ( italic_S start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) )
=2Mγni.absent2𝑀superscriptsubscript𝛾𝑛𝑖\displaystyle=2M\gamma_{n}^{i}.= 2 italic_M italic_γ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT . (21)
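The last two steps of inequality 21 rest on a generic fact: if $|Q|\leq M$, the expectations of $Q$ under two distributions differ by at most $2M$ times their total variation distance. The Python sketch below checks this numerically on small discrete distributions. The helper names (`tv_distance`, `check_tv_bound`) and the random test setup are illustrative assumptions and do not come from the paper.

```python
import random


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def expectation(p, values):
    """Expectation of `values` under the discrete distribution `p`."""
    return sum(a * v for a, v in zip(p, values))


def random_distribution(k, rng):
    """Draw a random probability vector of length k."""
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def check_tv_bound(trials=1000, k=6, M=3.0, seed=0):
    """Return True if |E_p[Q] - E_q[Q]| <= 2 * M * TV(p, q) in every trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = random_distribution(k, rng)
        q = random_distribution(k, rng)
        # Q plays the role of the loss difference, bounded by M in absolute value.
        Q = [rng.uniform(-M, M) for _ in range(k)]
        gap = abs(expectation(p, Q) - expectation(q, Q))
        if gap > 2.0 * M * tv_distance(p, q) + 1e-12:
            return False
    return True
```

Calling `check_tv_bound()` should return `True`, since the inequality holds for every pair of distributions and every $Q$ bounded by $M$.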

Thus, $h_{j}$ exhibits a bounded difference of $2\beta_{n}+2M\gamma_{n}^{i}$ with respect to all variables except the $j$-th variable. By applying Lemma 8, we obtain the following:

\begin{align*}
\left\|\sum_{j=1}^{n\alpha}h_{j}(S_{0,\alpha})\right\|_{p}
&\leq 12\sqrt{2}\,pn\alpha\left(2\beta_{n}+2M\gamma_{n}^{i}\right)\log(n\alpha)+4M\sqrt{pn\alpha}\\
&\lesssim pn\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)\log(n\alpha)+M\sqrt{pn\alpha}.
\end{align*}

We observe that the difference between Term 4 and $\left\|\sum_{j=1}^{n\alpha}h_{j}\left(S_{0,\alpha}\right)\right\|_{p}$ is negligible. Thus, we can bound Term 4 as follows:

\begin{align*}
&\left\|\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}}\left[\alpha R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell\left(\mathcal{A}\left(\widetilde{S}_{i}\right),\bm{z}_{i}\right)\right]\right\|_{p}\\
&=\left\|\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}}\left[R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\ell\left(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i}\right)\right]\right\|_{p}\\
&\leq\left\|\frac{1}{n}\sum_{j=1}^{n\alpha}\left(\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left(S_{0,\alpha}^{j}\right)}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left(S_{0,\alpha}^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left(S_{0,\alpha}^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right]\right)\right\|_{p}\\
&\quad+\left\|2\alpha\beta_{n}+2\alpha M\gamma_{n}^{i}\right\|_{p}\\
&=\left\|\frac{1}{n}\sum_{j=1}^{n\alpha}h_{j}(S_{0,\alpha})\right\|_{p}+\left\|2\alpha\beta_{n}+2\alpha M\gamma_{n}^{i}\right\|_{p}\\
&\lesssim p\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)\log(n\alpha)+M\sqrt{p\alpha n^{-1}}+\alpha\beta_{n}+\alpha M\gamma_{n}^{i}\\
&\lesssim p\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)\log(n\alpha)+M\sqrt{p\alpha n^{-1}}.
\end{align*}
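The last two lines of this chain compress two bookkeeping steps, which can be spelled out as follows. The side condition $p\log(n\alpha)\geq 1$ is an assumption made here for illustration (it holds, for example, when $p\geq 1$ and $n\alpha\geq e$); the paper does not state it explicitly. First, dividing the Lemma 8 bound on $\left\|\sum_{j}h_{j}\right\|_{p}$ by $n$ gives

\begin{align*}
\frac{1}{n}\left(pn\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)\log(n\alpha)+M\sqrt{pn\alpha}\right)
=p\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)\log(n\alpha)+M\sqrt{p\alpha n^{-1}}.
\end{align*}

Second, the leftover term is absorbed into the first summand, since

\begin{align*}
\alpha\beta_{n}+\alpha M\gamma_{n}^{i}
=\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)
\leq p\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)\log(n\alpha)
\quad\text{whenever } p\log(n\alpha)\geq 1.
\end{align*}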

By substituting the above inequality and inequality 17 into the decomposition 16, we obtain:

\begin{align}
&\left\|\alpha R_{\mathcal{D}_{0}}\left(\mathcal{A}\left(\widetilde{S}_{i}\right)\right)-\frac{1}{n}\sum_{\bm{z}_{i}\in S_{0,\alpha}}\ell\left(\mathcal{A}\left(\widetilde{S}_{i}\right),\bm{z}_{i}\right)\right\|_{p} \notag\\
&\lesssim\sqrt{n(1-\alpha)p}\,\alpha\beta_{n}+p\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)\log(n\alpha)+M\sqrt{p\alpha n^{-1}} \notag\\
&\lesssim\sqrt{p}\left(\sqrt{(1-\alpha)n}\,\alpha\beta_{n}+M\sqrt{\alpha n^{-1}}\right)+p\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)\log(n\alpha). \tag{22}
\end{align}

Plugging inequalities 22 and 15 into inequality 57, we obtain:

\begin{align}
&\left\|R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right\|_{p} \notag\\
&\lesssim p(1-\alpha)\beta_{n}\log(n(1-\alpha))+M\sqrt{p(1-\alpha)n^{-1}}+\sqrt{p}\left(\sqrt{(1-\alpha)n}\,\alpha\beta_{n}+M\sqrt{\alpha n^{-1}}\right) \notag\\
&\quad+p\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)\log(n\alpha) \notag\\
&=\sqrt{p}\left(\sqrt{(1-\alpha)n}\,\alpha\beta_{n}+Mn^{-1/2}\left(\sqrt{1-\alpha}+\sqrt{\alpha}\right)\right) \notag\\
&\quad+p\left((1-\alpha)\beta_{n}\log(n(1-\alpha))+\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)\log(n\alpha)\right). \tag{23}
\end{align}

By applying Lemma 8, we can derive a bound on the generalization error with respect to the mixed distribution, $\left|R_{\widetilde{\mathcal{D}}_{i}}\left(\mathcal{A}\left(\widetilde{S}_{i}\right)\right)-\widehat{R}_{\widetilde{S}_{i}}\left(\mathcal{A}\left(\widetilde{S}_{i}\right)\right)\right|$, as follows:

\begin{align*}
&\left|R_{\widetilde{\mathcal{D}}_{i}}\left(\mathcal{A}\left(\widetilde{S}_{i}\right)\right)-\widehat{R}_{\widetilde{S}_{i}}\left(\mathcal{A}\left(\widetilde{S}_{i}\right)\right)\right|\\
&\lesssim\left(\sqrt{(1-\alpha)n}\,\alpha\beta_{n}+Mn^{-1/2}\left(\sqrt{1-\alpha}+\sqrt{\alpha}\right)\right)\sqrt{\log\left(\frac{1}{\delta}\right)}\\
&\quad+\left((1-\alpha)\beta_{n}\log(n(1-\alpha))+\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)\log(n\alpha)\right)\log\left(\frac{1}{\delta}\right).
\end{align*}
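For intuition about how the moment bound 23 turns into a bound involving $\log(1/\delta)$, one standard route is sketched below. This is an assumption about the mechanism, not a restatement of the lemma the paper invokes, and the symbols $Y$, $a$ and $b$ are introduced only for this sketch. Write $Y$ for the generalization gap and suppose $\|Y\|_{p}\leq\sqrt{p}\,a+p\,b$ for all $p\geq 2$, where $a$ and $b$ are the two bracketed factors in inequality 23. Markov's inequality applied to $|Y|^{p}$ gives

\begin{align*}
\mathbb{P}\left(|Y|\geq e\|Y\|_{p}\right)
=\mathbb{P}\left(|Y|^{p}\geq e^{p}\,\mathbb{E}|Y|^{p}\right)
\leq e^{-p}.
\end{align*}

Choosing $p=\log(1/\delta)$, which requires $\delta\leq e^{-2}$ so that $p\geq 2$, yields with probability at least $1-\delta$

\begin{align*}
|Y|\leq e\left(a\sqrt{\log\left(\frac{1}{\delta}\right)}+b\log\left(\frac{1}{\delta}\right)\right),
\end{align*}

which matches the form of the displayed bound up to an absolute constant.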

Finally, we conclude that:

\begin{align}
&\left|R_{\mathcal{D}_{0}}\left(\mathcal{A}\left(\widetilde{S}_{i}\right)\right)-\widehat{R}_{\widetilde{S}_{i}}\left(\mathcal{A}\left(\widetilde{S}_{i}\right)\right)\right| \notag\\
&\leq\left|R_{\mathcal{D}_{0}}\left(\mathcal{A}\left(\widetilde{S}_{i}\right)\right)-R_{\widetilde{\mathcal{D}}_{i}}\left(\mathcal{A}\left(\widetilde{S}_{i}\right)\right)\right|+\left|R_{\widetilde{\mathcal{D}}_{i}}\left(\mathcal{A}\left(\widetilde{S}_{i}\right)\right)-\widehat{R}_{\widetilde{S}_{i}}\left(\mathcal{A}\left(\widetilde{S}_{i}\right)\right)\right| \notag\\
&\lesssim 2M\left(1-(1-\alpha)^{i}\right)\alpha^{-1}d_{\mathrm{TV}}(n) \notag\\
&\quad+\left((1-\alpha)\beta_{n}\log(n(1-\alpha))+\alpha\left(\beta_{n}+M\gamma_{n}^{i}\right)\log(n\alpha)\right)\log\left(\frac{1}{\delta}\right) \notag\\
&\quad+\left(\sqrt{(1-\alpha)n}\,\alpha\beta_{n}+Mn^{-1/2}\left(\sqrt{1-\alpha}+\sqrt{\alpha}\right)\right)\sqrt{\log\left(\frac{1}{\delta}\right)}. \tag{24}
\end{align}
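To see how the three groups of terms in inequality 24 trade off, the following Python sketch evaluates its right-hand side for given inputs. It is only an illustration under explicit assumptions: every constant hidden by $\lesssim$ is set to 1, and $\beta_{n}$, $\gamma_{n}^{i}$ and $d_{\mathrm{TV}}(n)$ are passed in as plain numbers, whereas in the paper they depend on the model and on the generation $i$. The function name `stl_generalization_bound` is hypothetical.

```python
import math


def stl_generalization_bound(n, alpha, i, beta_n, gamma_n_i, d_tv_n,
                             M=1.0, delta=0.05):
    """Right-hand side of inequality 24 with all hidden constants set to 1.

    n: sample size; alpha: real-data proportion in (0, 1); i: generation index.
    beta_n, gamma_n_i, d_tv_n: user-supplied values (assumed, not derived here).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    log_term = math.log(1.0 / delta)
    # Distribution-shift term: 2M (1 - (1 - alpha)^i) / alpha * d_TV(n).
    shift = 2.0 * M * (1.0 - (1.0 - alpha) ** i) / alpha * d_tv_n
    # Stability terms multiplied by log(1/delta).
    linear = ((1.0 - alpha) * beta_n * math.log(n * (1.0 - alpha))
              + alpha * (beta_n + M * gamma_n_i) * math.log(n * alpha)) * log_term
    # Terms multiplied by sqrt(log(1/delta)).
    root = (math.sqrt((1.0 - alpha) * n) * alpha * beta_n
            + M * n ** -0.5 * (math.sqrt(1.0 - alpha) + math.sqrt(alpha)))
    root *= math.sqrt(log_term)
    return shift + linear + root
```

With the other inputs held fixed, increasing `i` only changes the first term, which grows with the number of generations but never exceeds $2M\alpha^{-1}d_{\mathrm{TV}}(n)$. This is the sense in which a constant proportion $\alpha$ of real data keeps the accumulated shift bounded.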

A.7 Proof of Theorem 2

In this section, we prove that transformers in in-context learning exhibit recursive stability. Specifically, we utilize the framework and lemmas from Li et al. (2023), combined with recursive techniques, to establish the proof.

Lemma 10 (Li et al. (2023)).

Let $\bm{z},\bm{\varepsilon}\in\mathbb{R}^{n}$ be vectors obeying $\|\bm{z}\|_{\ell_{\infty}},\|\bm{z}+\bm{\varepsilon}\|_{\ell_{\infty}}\leq c$. Then, there exists a constant $C=C(c)$, such that

$$\|\operatorname{softmax}(\bm{z})\|_{\ell_{\infty}}\leq e^{2c}/n\quad\text{and}\quad\|\operatorname{softmax}(\bm{z})-\operatorname{softmax}(\bm{z}+\bm{\varepsilon})\|_{\ell_{1}}\leq e^{2c}\|\bm{\varepsilon}\|_{\ell_{1}}/n.$$
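The two quantities compared in Lemma 10 are easy to evaluate numerically. The following is a minimal sketch in pure Python, not part of the original proof: the function name `lemma10_sides` and the random test vectors are assumptions made for illustration, and $c$ is taken as the smallest constant satisfying the hypothesis for the chosen pair. It only reports both sides of each inequality on one instance.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def lemma10_sides(z, eps):
    """Return both sides of the two inequalities in Lemma 10 for z and eps."""
    n = len(z)
    z_pert = [a + b for a, b in zip(z, eps)]
    # Smallest c with ||z||_inf <= c and ||z + eps||_inf <= c (hypothetical choice).
    c = max(max(abs(v) for v in z), max(abs(v) for v in z_pert))
    s, s_pert = softmax(z), softmax(z_pert)
    return {
        "c": c,
        # ||softmax(z)||_inf and its bound e^{2c} / n
        "linf_softmax": max(s),
        "linf_bound": math.exp(2 * c) / n,
        # ||softmax(z) - softmax(z + eps)||_1 and its bound e^{2c} ||eps||_1 / n
        "l1_diff": sum(abs(a - b) for a, b in zip(s, s_pert)),
        "l1_bound": math.exp(2 * c) * sum(abs(e) for e in eps) / n,
    }


rng = random.Random(0)
n = 8
z = [rng.uniform(-1.0, 1.0) for _ in range(n)]
eps = [rng.uniform(-0.1, 0.1) for _ in range(n)]
for name, value in lemma10_sides(z, eps).items():
    print(f"{name}: {value:.6f}")
```

Both bounds scale as $1/n$, which is the property the proof of Theorem 2 relies on when it sums over the $n$ rows of the attention matrix.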
Proof of Theorem 2.

Let $\bm{Z}=\left[\bm{z}_{1},\ldots,\bm{z}_{n}\right]^{\top}$ and $\bm{E}=\left[\bm{\varepsilon}_{1},\ldots,\bm{\varepsilon}_{n}\right]^{\top}$ be the input and perturbation matrices, respectively. The tokens $\bm{z}_{i}$ lie in the unit ball, and we assume that the perturbed tokens $\bm{z}_{i}+\bm{\varepsilon}_{i}$ also lie in the unit ball. For a matrix, let $\|\cdot\|_{2,p}$ denote the $\ell_{p}$-norm of the vector formed by the $\ell_{2}$-norms of its rows. We then have $\|\bm{Z}\|_{2,\infty}\leq 1$ and $\|\bar{\bm{Z}}\|_{2,\infty}=\|\bm{Z}+\bm{E}\|_{2,\infty}\leq 1$.
Let the attention outputs be defined as $\bm{A}=\operatorname{softmax}\left(\bm{Z}\bm{W}\bm{Z}^{\top}\right)\bm{Z}\bm{V}$ and $\bar{\bm{A}}=\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}\bar{\bm{Z}}^{\top}\right)\bar{\bm{Z}}\bm{V}$. Define the perturbation as $\bar{\bm{E}}=\bar{\bm{A}}-\bm{A}:=\left[\bar{\bm{\varepsilon}}_{1},\ldots,\bar{\bm{\varepsilon}}_{n}\right]^{\top}$.

Let us examine the attention output difference $\bar{\bm{E}}=\bar{\bm{A}}-\bm{A}$, which can be further decomposed as follows:

$$\begin{aligned}
\bar{\bm{E}}&=\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}\bar{\bm{Z}}^{\top}\right)\bar{\bm{Z}}\bm{V}-\operatorname{softmax}\left(\bm{Z}\bm{W}\bm{Z}^{\top}\right)\bm{Z}\bm{V}\\
&=\underbrace{\left[\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}\bar{\bm{Z}}^{\top}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}\bm{Z}^{\top}\right)\right]\bm{Z}\bm{V}}_{\bar{\bm{E}}_{1}}+\underbrace{\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}\bar{\bm{Z}}^{\top}\right)\bm{E}\bm{V}}_{\bar{\bm{E}}_{2}}.
\end{aligned} \tag{25}$$
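The decomposition (25) is an exact algebraic identity, so it can be confirmed on a small random instance. The sketch below is illustrative only: it uses pure Python, assumes for simplicity that $\bm{W}$ and $\bm{V}$ are diagonal (stored as the vectors `w` and `v`), and all helper names are hypothetical rather than taken from the paper.

```python
import math
import random


def softmax_rows(M):
    """Apply a numerically stable softmax to every row of M."""
    out = []
    for row in M:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        t = sum(e)
        out.append([x / t for x in e])
    return out


def matmul(A, B):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def logits(Z, w):
    """Z W Z^T for the diagonal matrix W = diag(w)."""
    return [[sum(wk * a * b for wk, a, b in zip(w, zi, zj)) for zj in Z] for zi in Z]


def times_diag(Z, v):
    """Z V for the diagonal matrix V = diag(v)."""
    return [[vk * a for vk, a in zip(v, row)] for row in Z]


def sub(A, B):
    """Entrywise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def decomposition(Z, E, w, v):
    """Return (A_bar - A, E1, E2) as defined in the decomposition (25)."""
    Z_bar = [[a + b for a, b in zip(rz, re)] for rz, re in zip(Z, E)]
    S, S_bar = softmax_rows(logits(Z, w)), softmax_rows(logits(Z_bar, w))
    A = matmul(S, times_diag(Z, v))
    A_bar = matmul(S_bar, times_diag(Z_bar, v))
    E1 = matmul(sub(S_bar, S), times_diag(Z, v))  # softmax difference times Z V
    E2 = matmul(S_bar, times_diag(E, v))          # perturbed softmax times E V
    return sub(A_bar, A), E1, E2


rng = random.Random(0)
n, d = 5, 3
Z = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n)]
E = [[rng.uniform(-0.05, 0.05) for _ in range(d)] for _ in range(n)]
w = [rng.uniform(-1.0, 1.0) for _ in range(d)]
v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
E_bar, E1, E2 = decomposition(Z, E, w, v)
residual = max(
    abs(x - (y1 + y2))
    for r, r1, r2 in zip(E_bar, E1, E2)
    for x, y1, y2 in zip(r, r1, r2)
)
print(f"max |E_bar - (E1 + E2)| = {residual:.2e}")
```

The residual is at the level of floating-point rounding, reflecting that the first term isolates the change in the attention weights and the second the change in the values.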

We first observe that $\bm{V}$ does not increase norms, so that $\|\bm{Z}\bm{V}\|_{2,\infty}\leq\|\bm{Z}\|_{2,\infty}\leq 1$ and $\|\bm{E}\bm{V}\|_{2,1}\leq\|\bm{E}\|_{2,1}$. Moreover, for any pair of tokens, it holds that $\left|\bm{z}_{i}^{\top}\bm{W}\bm{z}_{j}\right|\leq B_{W}$. Applying Lemma 10, we can therefore derive the following:

$$\left\|\bar{\bm{E}}_{2}\right\|_{2,1}=\left\|\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}\bar{\bm{Z}}^{\top}\right)\bm{E}\bm{V}\right\|_{2,1}\leq e^{2B_{W}}\|\bm{E}\|_{2,1}. \tag{26}$$
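For completeness, the bound (26) can be unpacked row by row. The following is a sketch under the stated assumptions ($\bm{V}$ does not increase norms, and every entry of the softmax matrix is at most $e^{2B_{W}}/n$ by Lemma 10); the symbol $\bar{s}_{ij}$ is notation introduced here for the $(i,j)$ entry of $\operatorname{softmax}(\bar{\bm{Z}}\bm{W}\bar{\bm{Z}}^{\top})$.

```latex
\begin{aligned}
\left\|\bar{\bm{E}}_{2}\right\|_{2,1}
&=\sum_{i=1}^{n}\Big\|\sum_{j=1}^{n}\bar{s}_{ij}\,\bm{V}^{\top}\bm{\varepsilon}_{j}\Big\|_{\ell_{2}}
\leq\sum_{i=1}^{n}\sum_{j=1}^{n}\bar{s}_{ij}\left\|\bm{\varepsilon}_{j}\right\|_{\ell_{2}}\\
&\leq n\cdot\frac{e^{2B_{W}}}{n}\sum_{j=1}^{n}\left\|\bm{\varepsilon}_{j}\right\|_{\ell_{2}}
=e^{2B_{W}}\|\bm{E}\|_{2,1}.
\end{aligned}
```

The first inequality is the triangle inequality together with $\|\bm{V}^{\top}\bm{\varepsilon}_{j}\|_{\ell_{2}}\leq\|\bm{\varepsilon}_{j}\|_{\ell_{2}}$; the second bounds each $\bar{s}_{ij}$ uniformly and sums over the $n$ rows.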

Subsequently, for $\bar{\bm{E}}_{1}$, we establish the following expression

$$\begin{aligned}
\left\|\bar{\bm{E}}_{1}\right\|_{2,1}&=\left\|\left[\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}\bar{\bm{Z}}^{\top}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}\bm{Z}^{\top}\right)\right]\bm{Z}\bm{V}\right\|_{2,1}\\
&\leq\left\|\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}\bar{\bm{Z}}^{\top}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}\bm{Z}^{\top}\right)\right\|_{\ell_{1}}\|\bm{Z}\bm{V}\|_{2,\infty}\\
&\leq\left\|\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}\bar{\bm{Z}}^{\top}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}\bm{Z}^{\top}\right)\right\|_{\ell_{1}}.
\end{aligned}$$

To advance the analysis, we introduce the $\delta$-scaled perturbation $\bm{E}'=\delta\bm{E}=\bar{\bm{Z}}'-\bm{Z}$, where $0\leq\delta\leq 1$. Our approach is to first bound the derivative as $\delta\rightarrow 0$ and then integrate this bound along the path of $\bm{E}$, covering the interval from $\delta=0$ to $\delta=1$. As $\delta\rightarrow 0$, the quadratic terms, which are proportional to $\delta^{2}$, become negligible, which simplifies the analysis in this limit:

$$\begin{aligned}
&\left\|\operatorname{softmax}\left(\bar{\bm{Z}}'\bm{W}\bar{\bm{Z}}'^{\top}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}\bm{Z}^{\top}\right)\right\|_{\ell_{1}}\\
&\quad\leq\left\|\operatorname{softmax}\left(\bar{\bm{Z}}'\bm{W}\bm{Z}^{\top}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}\bm{Z}^{\top}\right)\right\|_{\ell_{1}}+\left\|\operatorname{softmax}\left(\bm{Z}\bm{W}\bar{\bm{Z}}'^{\top}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}\bm{Z}^{\top}\right)\right\|_{\ell_{1}}.
\end{aligned}$$

To bound the two terms on the right-hand side, we focus on each row separately. Consider a row of $\bm{Z}$ and the corresponding row of its perturbed version $\bm{Z}+\bm{E}'$, represented by the pair $\left(\bm{z},\bm{z}+\bm{\varepsilon}'\right)$. For any token $\bm{z}_{i}$, we have the guarantees $\left|\left(\bm{z}+\bm{\varepsilon}'\right)^{\top}\bm{W}\bm{z}_{i}\right|\leq B_{W}$ and $\left|\bm{z}^{\top}\bm{W}\bm{z}_{i}\right|\leq B_{W}$. Additionally, the bounds $\left\|\bm{\varepsilon}'^{\top}\bm{W}\bm{Z}^{\top}\right\|_{\ell_{1}}\leq B_{W}n\left\|\bm{\varepsilon}'\right\|_{\ell_{2}}$ and $\left\|\bm{z}^{\top}\bm{W}\bm{E}'^{\top}\right\|_{\ell_{1}}\leq B_{W}\left\|\bm{E}'\right\|_{2,1}$ hold. Applying the perturbation bounds provided by Lemma 10, we obtain the desired result

$$\begin{aligned}
\left\|\operatorname{softmax}\left(\left(\bm{z}+\bm{\varepsilon}'\right)^{\top}\bm{W}\bm{Z}^{\top}\right)-\operatorname{softmax}\left(\bm{z}^{\top}\bm{W}\bm{Z}^{\top}\right)\right\|_{\ell_{1}}&\leq B_{W}e^{2B_{W}}\left\|\bm{\varepsilon}'\right\|_{\ell_{2}},\\
\left\|\operatorname{softmax}\left(\bm{z}^{\top}\bm{W}\left(\bm{Z}+\bm{E}'\right)^{\top}\right)-\operatorname{softmax}\left(\bm{z}^{\top}\bm{W}\bm{Z}^{\top}\right)\right\|_{\ell_{1}}&\leq B_{W}e^{2B_{W}}\left\|\bm{E}'\right\|_{2,1}/n.
\end{aligned}$$

Summing across all $n$ rows and using $\bm{E}'=\delta\bm{E}$, we obtain the following:

$$\lim_{\delta\rightarrow 0}\left\|\operatorname{softmax}\left((\bm{Z}+\delta\bm{E})\bm{W}(\bm{Z}+\delta\bm{E})^{\top}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}\bm{Z}^{\top}\right)\right\|_{\ell_{1}}/\delta\leq 2B_{W}e^{2B_{W}}\|\bm{E}\|_{2,1}.$$

By integrating the derivative over the interval $\delta=0$ to $\delta=1$, we obtain the final expression,

$$\left\|\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}\bar{\bm{Z}}^{\top}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}\bm{Z}^{\top}\right)\right\|_{\ell_{1}}\leq 2B_{W}e^{2B_{W}}\|\bm{E}\|_{2,1}. \tag{27}$$

By substituting inequalities (27) and (26) into the decomposition (25), we derive the following result:

$$\|\bar{\bm{A}}-\bm{A}\|_{2,1}=\|\bar{\bm{E}}\|_{2,1}\leq\left(2B_{W}+1\right)e^{2B_{W}}\|\bm{E}\|_{2,1}. \tag{28}$$
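To get a feel for the stability bound (28), both of its sides can be evaluated on random tokens. The sketch below is a hedged illustration in pure Python, not the authors' code: it assumes diagonal $\bm{W}$ and $\bm{V}$ (vectors `w` and `v` with $|v_{k}|\leq 1$), takes $B_{W}=\max_{k}|w_{k}|$, keeps both the original and the perturbed tokens inside the unit ball, and uses hypothetical helper names throughout.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    e = [math.exp(x - m) for x in row]
    t = sum(e)
    return [x / t for x in e]


def attention(Z, w, v):
    """softmax(Z W Z^T) Z V with diagonal W = diag(w) and V = diag(v)."""
    out = []
    for zi in Z:
        probs = softmax([sum(wk * a * b for wk, a, b in zip(w, zi, zj)) for zj in Z])
        out.append([vk * sum(p * zj[k] for p, zj in zip(probs, Z)) for k, vk in enumerate(v)])
    return out


def norm_2_1(M):
    """Sum of the Euclidean norms of the rows of M, i.e. the (2,1)-norm."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in M)


def random_unit_ball_token(rng, d):
    """Gaussian direction rescaled to a radius drawn uniformly from [0, 1)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    radius = rng.random()
    return [radius * x / norm for x in g]


def clip_to_unit_ball(z):
    """Rescale z onto the unit sphere if it lies outside the unit ball."""
    norm = math.sqrt(sum(x * x for x in z))
    return z if norm <= 1.0 else [x / norm for x in z]


def stability_sides(Z, Z_bar, w, v):
    """Return (||A_bar - A||_{2,1}, (2 B_W + 1) e^{2 B_W} ||E||_{2,1})."""
    B_W = max(abs(x) for x in w)  # operator norm of diag(w), so |z_i^T W z_j| <= B_W
    E = [[b - a for a, b in zip(rz, rb)] for rz, rb in zip(Z, Z_bar)]
    A, A_bar = attention(Z, w, v), attention(Z_bar, w, v)
    diff = [[b - a for a, b in zip(ra, rb)] for ra, rb in zip(A, A_bar)]
    return norm_2_1(diff), (2 * B_W + 1) * math.exp(2 * B_W) * norm_2_1(E)


rng = random.Random(0)
n, d = 6, 4
Z = [random_unit_ball_token(rng, d) for _ in range(n)]
Z_bar = [clip_to_unit_ball([x + rng.gauss(0.0, 0.05) for x in z]) for z in Z]
w = [rng.uniform(-1.0, 1.0) for _ in range(d)]
v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
lhs, rhs = stability_sides(Z, Z_bar, w, v)
print(f"||A_bar - A||_(2,1) = {lhs:.4f}, right-hand side of (28) = {rhs:.4f}")
```

The right-hand side depends on the perturbation only through $\|\bm{E}\|_{2,1}$, with a constant that involves $B_{W}$ but not the sequence length $n$.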

To continue, we aim to control the output for a specific index $j$ where the input perturbation remains small, specifically $\left\|\bm{\varepsilon}_{j}\right\|_{\ell_{2}}\leq\frac{\|\bm{E}\|_{2,1}}{n}$. To this end, we apply the same argument, focusing on the $j$-th token. For the $j$-th token (omitting subscripts for clarity), let the inputs be denoted as $\bm{z},\bar{\bm{z}},\bm{\varepsilon}=\bar{\bm{z}}-\bm{z}$, and the corresponding outputs as $\bm{a},\bar{\bm{a}},\bar{\bm{\varepsilon}}=\bar{\bm{a}}-\bm{a}$. Similar to the previous decomposition, we can derive the following:

$$\bar{\bm{\varepsilon}}=\underbrace{\bm{V}^{\top}\bm{Z}^{\top}\left[\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}^{\top}\bar{\bm{z}}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}^{\top}\bm{z}\right)\right]}_{\bar{\bm{\varepsilon}}_{1}}+\underbrace{\bm{V}^{\top}\bm{E}^{\top}\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}^{\top}\bar{\bm{z}}\right)}_{\bar{\bm{\varepsilon}}_{2}}. \tag{29}$$

By leveraging the fact that $\left|\bm{z}_{i}^{\top}\bm{W}\bm{z}_{j}\right|\leq B_{W}$ for all $i,j$, and applying Lemma 10, we can establish a bound similar to that in (26). Specifically, we can constrain the terms involved as follows:

$$\left\|\bar{\bm{\varepsilon}}_{2}\right\|_{\ell_{2}}\leq\left\|\bm{E}^{\top}\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}^{\top}\bar{\bm{z}}\right)\right\|_{\ell_{2}}\leq\frac{e^{2B_{W}}}{n}\|\bm{E}\|_{2,1}. \tag{30}$$

Similarly, for $\bar{\bm{\varepsilon}}_{1}$, we can derive the following:

$$\begin{aligned}
\left\|\bar{\bm{\varepsilon}}_{1}\right\|_{\ell_{2}}&\leq\left\|\bm{Z}^{\top}\left[\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}^{\top}\bar{\bm{z}}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}^{\top}\bm{z}\right)\right]\right\|_{\ell_{2}}\\
&\leq\|\bm{Z}\|_{2,\infty}\left\|\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}^{\top}\bar{\bm{z}}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}^{\top}\bm{z}\right)\right\|_{\ell_{1}}\\
&\leq\left\|\operatorname{softmax}\left(\bar{\bm{Z}}\bm{W}^{\top}\bar{\bm{z}}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}^{\top}\bm{z}\right)\right\|_{\ell_{1}}.
\end{aligned}$$

Now, considering the perturbation $\bm{E}'=\delta\bm{E}$, and letting $\delta\rightarrow 0$, we apply the triangle inequality to obtain the following result:

$$\begin{aligned}
\lim_{\delta\rightarrow 0}\;&\delta^{-1}\left\|\operatorname{softmax}\left((\bm{Z}+\delta\bm{E})\bm{W}^{\top}(\bm{z}+\delta\bm{\varepsilon})\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}^{\top}\bm{z}\right)\right\|_{\ell_{1}}\\
\leq\;&\lim_{\delta\rightarrow 0}\delta^{-1}\left\|\operatorname{softmax}\left((\bm{Z}+\delta\bm{E})\bm{W}^{\top}\bm{z}\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}^{\top}\bm{z}\right)\right\|_{\ell_{1}}\\
&\quad+\delta^{-1}\left\|\operatorname{softmax}\left(\bm{Z}\bm{W}^{\top}(\bm{z}+\delta\bm{\varepsilon})\right)-\operatorname{softmax}\left(\bm{Z}\bm{W}^{\top}\bm{z}\right)\right\|_{\ell_{1}}\\
\leq\;&B_{W}e^{2B_{W}}\|\bm{E}\|_{2,1}/n+B_{W}e^{2B_{W}}\|\bm{\varepsilon}\|_{\ell_{2}}\\
\leq\;&2B_{W}e^{2B_{W}}\|\bm{E}\|_{2,1}/n.
\end{aligned} \tag{31}$$

In a similar manner to the previous steps, we can derive the following:

$$\left\|\bar{\bm{\varepsilon}}\right\|_{\ell_{2}}\leq\frac{1}{n}(2B_{W}+1)e^{2B_{W}}\|\bm{E}\|_{2,1}. \tag{32}$$

Next, we examine the effect of the MLP layer. Let $(\bm{M}_{i})_{i=1}^{n}$, with each $\bm{M}_{i}\in\mathbb{R}^{d\times d}$, denote the weights of the parallel MLPs that follow the self-attention mechanism, where $\left\|\bm{M}_{i}\right\|\leq 1$. We denote the MLP outputs corresponding to the self-attention results $\bm{A}$ and $\bar{\bm{A}}$ by $\bm{U}$ and $\bar{\bm{U}}$, respectively.

Let $\phi$ denote the ReLU function, which is a 1-Lipschitz continuous activation function with $\phi(0)=0$. First, observe that each row of $\bm{U}$ is given by $\bm{u}_{i}=\phi\left(\bm{M}_{i}\bm{a}_{i}\right)$, where $\bm{M}_{i}\in\mathbb{R}^{d\times d}$ represents the weights of the MLPs. Given the properties of the ReLU function, we can derive the following bound:

$$\left\|\bm{u}_{i}\right\|_{\ell_{2}}\leq\left\|\phi\left(\bm{M}_{i}\bm{a}_{i}\right)\right\|_{\ell_{2}}\leq\left\|\bm{M}_{i}\bm{a}_{i}\right\|_{\ell_{2}}\leq\left\|\bm{a}_{i}\right\|_{\ell_{2}}\leq 1.$$

Next, we consider the difference between the perturbed and original outputs. We can bound the difference as $\left\|\bm{u}_{i}-\bar{\bm{u}}_{i}\right\|_{\ell_{2}}\leq\left\|\phi\left(\bm{M}_{i}\bm{a}_{i}\right)-\phi\left(\bm{M}_{i}\bar{\bm{a}}_{i}\right)\right\|_{\ell_{2}}$, which, due to the 1-Lipschitz property of $\phi$, is further bounded by $\left\|\bm{M}_{i}\left(\bm{a}_{i}-\bar{\bm{a}}_{i}\right)\right\|_{\ell_{2}}\leq\left\|\bm{a}_{i}-\bar{\bm{a}}_{i}\right\|_{\ell_{2}}$. Finally, we obtain:

$$\left\|\bm{u}_{i}-\bar{\bm{u}}_{i}\right\|_{\ell_{2}}\leq\left\|\bm{a}_{i}-\bar{\bm{a}}_{i}\right\|_{\ell_{2}}. \tag{33}$$

Thus, we conclude that the perturbations in the rows of $\bm{U}$ are controlled by the corresponding perturbations in $\bm{A}$. Consequently, we establish the bound

$$\|\bm{U}-\bar{\bm{U}}\|_{2,1}\leq\|\bm{A}-\bar{\bm{A}}\|_{2,1}.$$
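The row-wise contraction can be checked numerically. The following is a minimal sketch, assuming that the $(2,1)$-norm is the sum of the Euclidean norms of the rows and that $\|\bm{M}_i\|$ is the spectral norm. The helper names (`norm_2_1`, `random_contraction`, `mlp_rows`) and the sizes `n`, `d` are illustrative choices, not taken from the paper.

```python
import numpy as np

def norm_2_1(X):
    """Sum of the Euclidean norms of the rows of X (assumed (2,1)-norm)."""
    return float(np.linalg.norm(X, axis=1).sum())

def random_contraction(d, rng):
    """Random d x d matrix rescaled so its spectral norm is at most 1."""
    M = rng.standard_normal((d, d))
    return M / max(1.0, np.linalg.norm(M, 2))

def mlp_rows(A, Ms):
    """Apply u_i = ReLU(M_i a_i) to each row a_i of A."""
    return np.stack([np.maximum(M @ a, 0.0) for M, a in zip(Ms, A)])

rng = np.random.default_rng(0)
n, d = 8, 5  # illustrative sizes
Ms = [random_contraction(d, rng) for _ in range(n)]

A = rng.standard_normal((n, d))                # attention output
A_bar = A + 0.1 * rng.standard_normal((n, d))  # perturbed attention output

U, U_bar = mlp_rows(A, Ms), mlp_rows(A_bar, Ms)
lhs, rhs = norm_2_1(U - U_bar), norm_2_1(A - A_bar)
print(f"||U - U_bar||_2,1 = {lhs:.4f}, ||A - A_bar||_2,1 = {rhs:.4f}")
```

The check does not require $\|\bm{a}_i\|_{\ell_2}\leq 1$, since the difference bound only uses the 1-Lipschitz property of ReLU and $\|\bm{M}_i\|\leq 1$.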

Combining this with inequality (28), we derive the following result:

$$\|\bm{U}-\bar{\bm{U}}\|_{2,1}\leq\left(2B_{W}+1\right)e^{2B_{W}}\|\bm{E}\|_{2,1}. \tag{34}$$

Furthermore, for any $i\in[n]$ such that $\left\|\bm{\varepsilon}_{i}\right\|_{\ell_{2}}\leq\frac{\|\bm{E}\|_{2,1}}{n}$, it holds that

$$\left\|\bm{u}_{i}-\bar{\bm{u}}_{i}\right\|_{\ell_{2}}\leq\frac{1}{n}(2B_{W}+1)e^{2B_{W}}\|\bm{E}\|_{2,1},$$

where $\bm{u}_{i}$ represents the $i$-th row of $\bm{U}$. This establishes the stability of the single-layer transformer. We now extend the analysis to the stability of the $L$-layer transformer. First, we can derive the following:

$$\left\|\bm{Z}_{(k)}-\bar{\bm{Z}}_{(k)}\right\|_{2,1}\leq(1+2B_{W})e^{2B_{W}}\left\|\bm{Z}_{(k-1)}-\bar{\bm{Z}}_{(k-1)}\right\|_{2,1},$$

where $1\leq k\leq L$ indexes the layers of the transformer. Then, for the $L$-layer transformer, we have the following:

$$\left\|\bm{Z}_{(L)}-\bar{\bm{Z}}_{(L)}\right\|_{2,1}\leq\left((1+2B_{W})e^{2B_{W}}\right)^{L}\left\|\bm{Z}_{(0)}-\bar{\bm{Z}}_{(0)}\right\|_{2,1}.$$

What remains is to perform induction on the difference between the last tokens $\bm{z}_{n}^{(i)}-\bm{z}_{n}^{\prime(i)}$. We claim that, for all layers,

$$\left\|\bm{z}_{n}^{(i)}-\bm{z}_{n}^{\prime(i)}\right\|_{\ell_{2}}\leq\frac{1}{n}\left((1+2B_{W})e^{2B_{W}}\right)^{i}\left\|\bm{Z}_{(0)}-\bar{\bm{Z}}_{(0)}\right\|_{2,1}.$$

This claim holds at $i=0$ because the change in the last token is at most $\left\|\bm{Z}_{(0)}-\bar{\bm{Z}}_{(0)}\right\|_{2,1}/n$. By induction, the claim holds for all layers, and we conclude the proof by setting $i=L$, covering the entire depth of the $L$-layer transformer. Finally, we obtain:

$$\left\|\bm{z}_{n}^{(L)}-\bm{z}_{n}^{\prime(L)}\right\|_{\ell_{2}}\leq\frac{1}{n}\left((1+2B_{W})e^{2B_{W}}\right)^{L}\left\|\bm{Z}_{(0)}-\bar{\bm{Z}}_{(0)}\right\|_{2,1}. \tag{35}$$
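To make the depth dependence concrete, the sketch below unrolls the one-layer inequality $L$ times and compares the result with the closed-form factor $\left((1+2B_W)e^{2B_W}\right)^{L}$. The function names and the sample values of $B_W$, $L$, and $n$ are hypothetical and chosen only for illustration; they are not values used in the paper.

```python
import math

def layer_factor(B_W: float) -> float:
    """Per-layer amplification factor (1 + 2*B_W) * exp(2*B_W)."""
    return (1.0 + 2.0 * B_W) * math.exp(2.0 * B_W)

def unrolled_bound(B_W: float, L: int, input_diff: float) -> float:
    """Apply the one-layer inequality L times, starting from the
    input perturbation ||Z_(0) - Zbar_(0)||_{2,1}."""
    diff = input_diff
    for _ in range(L):
        diff *= layer_factor(B_W)
    return diff

def last_token_bound(B_W: float, L: int, n: int, input_diff: float) -> float:
    """Right-hand side of the last-token bound: depth-L factor scaled by 1/n."""
    return layer_factor(B_W) ** L * input_diff / n

# Illustrative values only.
for B_W in (0.1, 0.5, 1.0):
    for L in (1, 4, 8):
        full = unrolled_bound(B_W, L, input_diff=2.0)
        last = last_token_bound(B_W, L, n=100, input_diff=2.0)
        print(f"B_W={B_W}, L={L}: matrix bound={full:.3e}, last-token bound={last:.3e}")
```

The factor grows exponentially in both $L$ and $B_W$, while the last-token bound gains a $1/n$ scaling from the number of in-context examples.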

Next, we further analyze the self-consuming process. Let $S_{0}=[\bm{z}_{0,1},\ldots,\bm{z}_{0,j},\ldots,\bm{z}_{0,n}]^{\top}$ and $S_{0}^{\prime}=[\bm{z}_{0,1},\ldots,\bm{z}_{0,j}^{\prime},\ldots,\bm{z}_{0,n}]^{\top}$ represent two initial real datasets that differ only in their $j$-th example, specifically $\bm{z}_{0,j}=\left(\bm{x}_{0,j},\bm{y}_{0,j}\right)$ and $\bm{z}_{0,j}^{\prime}=\left(\bm{x}_{0,j}^{\prime},\bm{y}_{0,j}^{\prime}\right)$, where $j\leq n$. Since $\|S_{0}-S_{0}^{\prime}\|_{2,1}\leq 2$, we have the following:

$$\begin{aligned}
\left\|\mathrm{TF}\left(S_{0}\right)-\mathrm{TF}\left(S_{0}^{\prime}\right)\right\|_{\ell_{2}} &\leq\frac{1}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}\|S_{0}-S_{0}^{\prime}\|_{2,1}\\
&\leq\frac{2}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}.
\end{aligned} \tag{36}$$

Then, $S_{0}$ and $S_{0}^{\prime}$ are used as in-context examples, and i.i.d. queries $\{\bm{x}_{1,j}\}_{j=1}^{n}$ are sampled from $\mathcal{X}$. These queries, along with the in-context examples $S_{0}$ and $S_{0}^{\prime}$, are processed through the transformer model to predict their respective labels. As a result, the first generation of synthetic datasets, $S_{1}=[\bm{z}_{1,1},\ldots,\bm{z}_{1,j},\ldots,\bm{z}_{1,n}]^{\top}$ and $S_{1}^{\prime}=[\bm{z}_{1,1}^{\prime},\ldots,\bm{z}_{1,j}^{\prime},\ldots,\bm{z}_{1,n}^{\prime}]^{\top}$, is produced. Then we obtain:

$$\|S_{1}-S_{1}^{\prime}\|_{2,1}\leq\frac{2n}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}. \tag{37}$$

Consider the mixed dataset $\widetilde{S}_{j}=\alpha S_{0}+(1-\alpha)S_{j}$ for $1\leq j\leq i$, which combines the original dataset $S_{0}$ with the synthetic dataset $S_{j}$. By the triangle inequality,

$$\begin{aligned}
\|\widetilde{S}_{1}-\widetilde{S}_{1}^{\prime}\|_{2,1} &\leq\alpha\|S_{0}-S_{0}^{\prime}\|_{2,1}+(1-\alpha)\|S_{1}-S_{1}^{\prime}\|_{2,1}\\
&\leq 2\alpha+(1-\alpha)\frac{2n}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}.
\end{aligned} \tag{38}$$

Feeding the mixed datasets $\widetilde{S}_{1}$ and $\widetilde{S}_{1}^{\prime}$ back into the transformer as in-context examples, with the query set $\left\{\bm{x}_{2,j}\right\}_{j=1}^{n}$ sampled i.i.d. from the distribution $\mathcal{X}$, we can bound the change in the transformer's output as in Equation (36):

$$\begin{aligned}
&\left\|\mathrm{TF}\left(\widetilde{S}_{1}\right)-\mathrm{TF}\left(\widetilde{S}_{1}^{\prime}\right)\right\|_{\ell_{2}}\\
&\leq\frac{1}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}\|\widetilde{S}_{1}-\widetilde{S}_{1}^{\prime}\|_{2,1}\\
&\leq\frac{1}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}\left(2\alpha+(1-\alpha)\frac{2n}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}\right)\\
&\leq(1-\alpha)\frac{2n}{(2n+1)^{2}}\left((1+2B_{W})e^{2B_{W}}\right)^{2L}+\alpha\frac{2}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}.
\end{aligned} \tag{39}$$

From the above expression, we can further derive that

$$\left\|S_{2}-S_{2}^{\prime}\right\|_{2,1}\leq(1-\alpha)\frac{2n^{2}}{(2n+1)^{2}}\left((1+2B_{W})e^{2B_{W}}\right)^{2L}+\alpha\frac{2n}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}.$$

Thus,

$$\begin{aligned}
&\|\widetilde{S}_{2}-\widetilde{S}_{2}^{\prime}\|_{2,1}\\
&\leq\alpha\|S_{0}-S_{0}^{\prime}\|_{2,1}+(1-\alpha)\|S_{2}-S_{2}^{\prime}\|_{2,1}\\
&\leq 2\alpha+(1-\alpha)^{2}\frac{2n^{2}}{(2n+1)^{2}}\left((1+2B_{W})e^{2B_{W}}\right)^{2L}+\alpha(1-\alpha)\frac{2n}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}.
\end{aligned}$$
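The generation-by-generation bounds follow a simple recursion: the transformer output bound is the previous mixed-dataset bound times $\frac{1}{2n+1}\left((1+2B_W)e^{2B_W}\right)^{L}$, the synthetic-dataset bound is $n$ times that, and mixing with weight $\alpha$ adds $2\alpha$. The following is a minimal sketch of that recursion, checked against the closed forms above. The function names and the parameter values are hypothetical and used only for illustration.

```python
import math

def layer_factor(B_W: float, L: int) -> float:
    """Depth-L amplification factor ((1 + 2*B_W) * exp(2*B_W)) ** L."""
    return ((1.0 + 2.0 * B_W) * math.exp(2.0 * B_W)) ** L

def generation_bounds(alpha: float, n: int, B_W: float, L: int, generations: int):
    """Iterate the bounds over generations.

    Returns a list of tuples (i, tf, synth, mixed), where
      tf    bounds ||TF(S~_{i-1}) - TF(S~'_{i-1})||_2 (with S~_0 = S_0),
      synth bounds ||S_i - S_i'||_{2,1},
      mixed bounds ||S~_i - S~_i'||_{2,1}.
    """
    c = layer_factor(B_W, L)
    mixed = 2.0  # ||S_0 - S_0'||_{2,1} <= 2
    rows = []
    for i in range(1, generations + 1):
        tf = c / (2 * n + 1) * mixed                # one transformer pass
        synth = n * tf                              # n synthetic examples
        mixed = 2.0 * alpha + (1.0 - alpha) * synth # mix with real data
        rows.append((i, tf, synth, mixed))
    return rows

# Illustrative parameters only.
for i, tf, synth, mixed in generation_bounds(alpha=0.3, n=50, B_W=0.2, L=2, generations=5):
    print(f"generation {i}: tf={tf:.4e}, synth={synth:.4e}, mixed={mixed:.4e}")
```

The recursion is linear in the previous mixed bound with ratio $(1-\alpha)\frac{n}{2n+1}\left((1+2B_W)e^{2B_W}\right)^{L}$, so varying `alpha`, `B_W`, and `L` in the sketch shows directly how the proportion of real data and the depth affect the growth of the bound.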

Similarly, for the 2-th generation, following analogous steps, we can derive that

$$
\begin{aligned}
\left\|\mathrm{TF}\left(\widetilde{S}_{2}\right)-\mathrm{TF}\left(\widetilde{S}_{2}^{\prime}\right)\right\|_{\ell_{2}}
&\leq \frac{1}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}\|\widetilde{S}_{2}-\widetilde{S}_{2}^{\prime}\|_{2,1} \\
&\leq (1-\alpha)^{2}\frac{2n^{2}}{(2n+1)^{3}}\left((1+2B_{W})e^{2B_{W}}\right)^{3L}+\alpha(1-\alpha)\frac{2n}{(2n+1)^{2}}\left((1+2B_{W})e^{2B_{W}}\right)^{2L} \\
&\quad +\alpha\frac{2}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}.
\end{aligned}
\tag{40}
$$

Building on the above expression, we can further deduce that

$$
\begin{aligned}
\left\|S_{3}-S_{3}^{\prime}\right\|_{2,1}
&\leq (1-\alpha)^{2}\frac{2n^{3}}{(2n+1)^{3}}\left((1+2B_{W})e^{2B_{W}}\right)^{3L}+\alpha(1-\alpha)\frac{2n^{2}}{(2n+1)^{2}}\left((1+2B_{W})e^{2B_{W}}\right)^{2L} \\
&\quad +\alpha\frac{2n}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}.
\end{aligned}
\tag{41}
$$

The discrepancy between the mixed datasets is as follows:

$$
\begin{aligned}
\|\widetilde{S}_{3}-\widetilde{S}_{3}^{\prime}\|_{2,1}
&\leq \alpha\|S_{0}-S_{0}^{\prime}\|_{2,1}+(1-\alpha)\|S_{3}-S_{3}^{\prime}\|_{2,1} \\
&\leq (1-\alpha)^{3}\frac{2n^{3}}{(2n+1)^{3}}\left((1+2B_{W})e^{2B_{W}}\right)^{3L}+\alpha(1-\alpha)^{2}\frac{2n^{2}}{(2n+1)^{2}}\left((1+2B_{W})e^{2B_{W}}\right)^{2L} \\
&\quad +\alpha(1-\alpha)\frac{2n}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}+2\alpha.
\end{aligned}
\tag{42}
$$

Utilizing recursive techniques, we can obtain the following:

$$
\begin{aligned}
\|\widetilde{S}_{i}-\widetilde{S}_{i}^{\prime}\|_{2,1}
&\leq (1-\alpha)^{i}\frac{2n^{i}}{(2n+1)^{i}}\left((1+2B_{W})e^{2B_{W}}\right)^{iL}+\alpha(1-\alpha)^{i-1}\frac{2n^{i-1}}{(2n+1)^{i-1}}\left((1+2B_{W})e^{2B_{W}}\right)^{(i-1)L} \\
&\quad +\ldots+\alpha(1-\alpha)^{2}\frac{2n^{2}}{(2n+1)^{2}}\left((1+2B_{W})e^{2B_{W}}\right)^{2L}+\alpha(1-\alpha)\frac{2n}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L} \\
&\quad +2\alpha \\
&\leq 2(1-\alpha)^{i}\frac{n^{i}}{(2n+1)^{i}}\left((1+2B_{W})e^{2B_{W}}\right)^{iL} \\
&\quad +2\alpha\left[1-(1-\alpha)\frac{n}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}\right]^{-1}\left[1-(1-\alpha)^{i}\frac{n^{i}}{(2n+1)^{i}}\left((1+2B_{W})e^{2B_{W}}\right)^{iL}\right].
\end{aligned}
\tag{43}
$$
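
The last step in (43) sums a geometric series with ratio $q=(1-\alpha)\frac{n}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}$, which requires $q\neq 1$. The following is a minimal numerical sketch of that step, not part of the original proof: it evaluates the term-by-term sum and the closed form and checks that they agree. The function names and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def b_tilde(b_w: float) -> float:
    """Per-layer factor (1 + 2*B_W) * exp(2*B_W) appearing in the bound."""
    return (1.0 + 2.0 * b_w) * math.exp(2.0 * b_w)


def common_ratio(alpha: float, n: int, b_w: float, depth: int) -> float:
    """Ratio q = (1 - alpha) * n / (2n + 1) * b_tilde^L of the geometric sum."""
    return (1.0 - alpha) * n / (2 * n + 1) * b_tilde(b_w) ** depth


def gap_term_by_term(alpha: float, n: int, b_w: float, depth: int, i: int) -> float:
    """First bound in (43): 2*q^i + 2*alpha*(1 + q + ... + q^(i-1))."""
    q = common_ratio(alpha, n, b_w, depth)
    return 2.0 * q**i + 2.0 * alpha * sum(q**k for k in range(i))


def gap_closed_form(alpha: float, n: int, b_w: float, depth: int, i: int) -> float:
    """Second bound in (43): the same sum written in closed form (needs q != 1)."""
    q = common_ratio(alpha, n, b_w, depth)
    if math.isclose(q, 1.0):
        raise ValueError("the closed form requires q != 1")
    return 2.0 * q**i + 2.0 * alpha * (1.0 - q**i) / (1.0 - q)


if __name__ == "__main__":
    # Illustrative (assumed) values: alpha = 0.3, n = 50, B_W = 0.1, L = 2.
    for i in (1, 3, 10):
        lhs = gap_term_by_term(0.3, 50, 0.1, 2, i)
        rhs = gap_closed_form(0.3, 50, 0.1, 2, i)
        print(i, lhs, rhs)
```

For $i=3$ the term-by-term sum reduces to the right-hand side of (42), which gives a quick consistency check between the explicit and the general expressions.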

Ultimately, the discrepancy between the transformer outputs after $i$ generations of the self-consuming loop for $S_{0}$ and $S_{0}^{\prime}$ can be obtained as follows:

$$
\begin{aligned}
\left\|\mathrm{TF}\left(\widetilde{S}_{i}\right)-\mathrm{TF}\left(\widetilde{S}_{i}^{\prime}\right)\right\|_{\ell_{2}}
&\leq \frac{1}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}\|\widetilde{S}_{i}-\widetilde{S}_{i}^{\prime}\|_{2,1} \\
&\leq (1-\alpha)^{i}\frac{2n^{i}}{(2n+1)^{i+1}}\left((1+2B_{W})e^{2B_{W}}\right)^{(i+1)L}+\alpha(1-\alpha)^{i-1}\frac{2n^{i-1}}{(2n+1)^{i}}\left((1+2B_{W})e^{2B_{W}}\right)^{iL} \\
&\quad +\ldots+\alpha(1-\alpha)^{2}\frac{2n^{2}}{(2n+1)^{3}}\left((1+2B_{W})e^{2B_{W}}\right)^{3L}+\alpha(1-\alpha)\frac{2n}{(2n+1)^{2}}\left((1+2B_{W})e^{2B_{W}}\right)^{2L} \\
&\quad +2\alpha\frac{1}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L} \\
&\leq 2(1-\alpha)^{i}\frac{n^{i}}{(2n+1)^{i+1}}\left((1+2B_{W})e^{2B_{W}}\right)^{(i+1)L}+2\alpha\left[\frac{1}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}\right] \\
&\quad \times\left[1-(1-\alpha)\frac{n}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}\right]^{-1}\left[1-(1-\alpha)^{i}\frac{n^{i}}{(2n+1)^{i}}\left((1+2B_{W})e^{2B_{W}}\right)^{iL}\right].
\end{aligned}
$$
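
To see how the output bound above behaves across generations, it can be evaluated numerically. The sketch below is an illustration under assumed parameter values, not part of the original proof; `output_gap_bound` is a hypothetical helper name. With $q=(1-\alpha)\frac{n}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}$, the bound stays bounded in $i$ whenever $q<1$ and approaches $\frac{2\alpha}{1-q}\cdot\frac{1}{2n+1}\left((1+2B_{W})e^{2B_{W}}\right)^{L}$, since $q^{i}\to 0$.

```python
import math


def output_gap_bound(alpha: float, n: int, b_w: float, depth: int, i: int) -> float:
    """Evaluate the final bound on ||TF(S~_i) - TF(S~_i')|| for given parameters.

    alpha: proportion of real data, n: sample size, b_w: weight bound B_W,
    depth: number of layers L, i: generation index.
    """
    b_tilde = (1.0 + 2.0 * b_w) * math.exp(2.0 * b_w)
    lip = b_tilde**depth / (2 * n + 1)   # factor (1 / (2n + 1)) * b_tilde^L
    q = (1.0 - alpha) * n * lip          # ratio (1 - alpha) * n / (2n + 1) * b_tilde^L
    if math.isclose(q, 1.0):
        geometric = float(i)             # 1 + q + ... + q^(i-1) when q == 1
    else:
        geometric = (1.0 - q**i) / (1.0 - q)
    return lip * (2.0 * q**i + 2.0 * alpha * geometric)


if __name__ == "__main__":
    # Illustrative (assumed) values: n = 50, B_W = 0.1, L = 2.
    for alpha in (0.1, 0.3, 0.9):
        print(alpha, [output_gap_bound(alpha, 50, 0.1, 2, i) for i in (1, 5, 20)])
```

Under these assumed values $q<1$, so the printed bounds settle to a constant as $i$ grows. For deeper networks or larger $B_{W}$ the ratio $q$ can exceed 1, in which case the same expression grows geometrically with $i$.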

Subsequently, given that $\widetilde{B}_{W}=(1+2B_{W})e^{2B_{W}}$, we define the measure $d$ as the $\ell_{2}$-norm to quantify the output discrepancy of the generative transformer model after $i$ iterations of the self-consuming loop, starting from the initial real datasets $S_{0}$ and $S_{0}^{\prime}$. In this context, the recursive stability parameter $\gamma_{n}^{i}$, as described in Definition 2, can be bounded by the following expression, providing a formal measure of the model’s stability across iterations:

$$
\left\|\operatorname{TF}(\widetilde{S}_{i})-\operatorname{TF}(\widetilde{S}_{i}^{\prime})\right\|_{\ell_{2}}\lesssim(1-\alpha)^{i}\frac{\widetilde{B}_{W}^{(i+1)L}}{2n+1}.
$$

The proof is complete.

A.8 Proof of Theorem 3

In this section, building on the general theoretical framework established in Theorem 1, we provide the proof of Theorem 3 by analyzing the terms $\beta_{n}$ and $d_{\mathrm{TV}}(n)$, leveraging recent advancements in SGD (Zhang et al., 2022) and ICL (Zhang et al., 2023). The recursive stability parameter $\gamma_{n}^{i}$ is derived from Theorem 2.

Lemma 11.

(Uniform stability of SGD in the non-convex case (Zhang et al., 2022)). Assume $f$ is $\kappa$-smooth and $\rho$-Lipschitz. When SGD is run for $T\gtrsim n$ iterations with step size $\eta_{t}=\frac{1}{\beta t}$, its uniform stability satisfies

$$
\beta_{n}\lesssim\frac{16\rho^{2}\log n}{n}.
$$

Lemma 12.

(Zhang et al., 2023) Let $\mathbb{P}_{\theta}$ represent the probability distribution induced by the transformer with parameter $\theta$. Additionally, the model $\mathbb{P}_{\hat{\theta}}$ is pretrained by the algorithm:

$$
\hat{\theta}=\underset{\theta\in\Theta}{\operatorname{argmin}}-\frac{1}{n}\sum_{t=1}^{n-1}\log\mathbb{P}_{\theta}\left(\bm{x}_{t+1}^{n}\mid S_{t}^{n}\right),
$$

where $S_{t}^{n}=\left(\bm{x}_{1},\bm{y}_{1},\ldots,\bm{x}_{t},\bm{y}_{t}\right)$. Furthermore, we consider the realizable setting, where the ground-truth probability distribution $\mathbb{P}(\cdot\mid S)$ and $\mathbb{P}_{\theta^{*}}(\cdot\mid S)$ are consistent for some $\theta^{*}\in\Theta$. Then, with probability at least $1-\delta$, the following inequality holds:

$$
\operatorname{TV}\left(\mathbb{P}(\cdot\mid S),\mathbb{P}_{\hat{\theta}}(\cdot\mid S)\right)\lesssim\frac{1}{n^{1/2}}\log(1+n)+\frac{1}{n^{1/4}}\log(1/\delta),
\tag{44}
$$

where $\lesssim$ denotes that we omit constants that are independent of $n$ and $\delta$.
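
The rates in Lemmas 11 and 12 can be compared numerically. The sketch below is only an illustration: it drops the constants hidden by $\lesssim$, so the printed values indicate scaling in $n$, not actual error levels. The function names and the choices $\rho=1$, $\delta=0.05$ are assumptions made for the example.

```python
import math


def sgd_stability_rate(n: int, rho: float) -> float:
    """Rate 16 * rho^2 * log(n) / n from Lemma 11 (hidden constants omitted)."""
    return 16.0 * rho**2 * math.log(n) / n


def tv_rate(n: int, delta: float) -> float:
    """Rate log(1 + n) / n^(1/2) + log(1 / delta) / n^(1/4) from (44), constants omitted."""
    return math.log(1 + n) / math.sqrt(n) + math.log(1.0 / delta) / n**0.25


if __name__ == "__main__":
    # Assumed values: rho = 1.0, delta = 0.05.
    for n in (10**2, 10**4, 10**6):
        print(n, sgd_stability_rate(n, 1.0), tv_rate(n, 0.05))
```

For fixed $\delta$, the $n^{-1/4}\log(1/\delta)$ term decays more slowly than the $n^{-1/2}\log(1+n)$ term, so it is the part of (44) that dominates for large $n$.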

Proof of Theorem 3.

First, we note that in the setting where the transformer generates data through in-context learning, the generalization error of the self-consuming loop is given by:

$$
\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|=\left|\mathbb{E}_{\bm{z}\sim\mathbb{P}(\cdot\mid S_{0})}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z})-\frac{1}{n}\sum_{\bm{z}_{i}\in\widetilde{S}_{i}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right|.
\tag{45}
$$

Now, we are ready to prove Theorem 3. The main idea is to bound the uniform stability parameter $\beta_{n}$, the recursive stability parameter $\gamma_{n}^{i}$, and the learnability of the generative model through the total variation distance $d_{\mathrm{TV}}(n)$, as stated in Theorem 1. We begin with the bound on the total variation distance $d_{\mathrm{TV}}(n)$. Equation 8 in the proof of Theorem 1 can be rewritten in the setting of in-context learning as follows:

$\displaystyle \left|R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| = \left|\mathbb{E}_{\bm{z}\sim\mathbb{P}(\cdot\mid\widetilde{S}_{i-1})}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z})-\mathbb{E}_{\bm{z}\sim\mathbb{P}(\cdot\mid S_{i})}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z})\right|$
$\displaystyle = \left|\mathbb{E}_{\bm{z}\sim\mathbb{P}(\cdot\mid\widetilde{S}_{i-1})}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z})-\mathbb{E}_{\bm{z}\sim\mathbb{P}_{\hat{\theta}}(\cdot\mid\widetilde{S}_{i-1})}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z})\right|$
$\displaystyle = \left|\int_{\bm{z}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z})\left(\mathbb{P}\left(\bm{z}\mid\widetilde{S}_{i-1}\right)-\mathbb{P}_{\hat{\theta}}\left(\bm{z}\mid\widetilde{S}_{i-1}\right)\right)d\bm{z}\right|$
$\displaystyle \leq \int_{\bm{z}}\left|\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z})\left(\mathbb{P}\left(\bm{z}\mid\widetilde{S}_{i-1}\right)-\mathbb{P}_{\hat{\theta}}\left(\bm{z}\mid\widetilde{S}_{i-1}\right)\right)\right|d\bm{z}$
$\displaystyle \leq M\int_{\bm{z}}\left|\mathbb{P}\left(\bm{z}\mid\widetilde{S}_{i-1}\right)-\mathbb{P}_{\hat{\theta}}\left(\bm{z}\mid\widetilde{S}_{i-1}\right)\right|d\bm{z}$
=2MTV((S~i1),θ^(S~i1)).\displaystyle=2MTV\left(\mathbb{P}(\cdot\mid\widetilde{S}_{i-1}),\mathbb{P}_{% \hat{\theta}}(\cdot\mid\widetilde{S}_{i-1})\right).= 2 italic_M italic_T italic_V ( blackboard_P ( ⋅ ∣ over~ start_ARG italic_S end_ARG start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT ) , blackboard_P start_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG end_POSTSUBSCRIPT ( ⋅ ∣ over~ start_ARG italic_S end_ARG start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT ) ) . (46)

Here, the second equality holds because, in the $(i-1)$-th generation of the self-consuming loop, the mixed data distribution from the $(i-1)$-th generation is reintroduced as the ground-truth distribution to train the transformer. As a result, the transformer outputs the synthetic data distribution for the $i$-th generation. Thus, $TV\left(\mathbb{P}(\cdot\mid\widetilde{S}_{j}),\mathbb{P}_{\hat{\theta}}(\cdot\mid\widetilde{S}_{j})\right)$ corresponds to $d_{\mathrm{TV}}(n)$ in Theorem 1. Finally, the bound for the total variation distance $d_{\mathrm{TV}}(n)$ follows from Lemma 12:

$$
d_{\mathrm{TV}}(n)\lesssim\frac{1}{n^{1/2}}\log(1+n)+\frac{1}{n^{1/4}}\log(1/\delta).
\tag{47}
$$
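The step in (46) from a gap between two expected losses to $2M\,TV(\cdot,\cdot)$ can be checked numerically on a finite sample space. The sketch below is only an illustration under stated assumptions: the loss is assumed to take values in $[0, M]$, the two distributions are random discrete distributions standing in for $\mathbb{P}(\cdot\mid\widetilde{S}_{i-1})$ and $\mathbb{P}_{\hat{\theta}}(\cdot\mid\widetilde{S}_{i-1})$, and the names `tv_distance`, `risk`, and `d_tv_rate` are hypothetical helpers, not part of the paper. The function `d_tv_rate` evaluates the right-hand side of (47) with the hidden constant set to 1.

```python
import math
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(p - q).sum())

def risk(loss, p):
    """Expected loss under a discrete distribution p."""
    return float(np.dot(loss, p))

def d_tv_rate(n, delta):
    """Right-hand side of (47), with the constant hidden by the ≲ sign set to 1."""
    return math.log(1 + n) / n ** 0.5 + math.log(1 / delta) / n ** 0.25

rng = np.random.default_rng(0)
k, M = 50, 3.0                       # support size and loss bound (illustrative values)
p = rng.dirichlet(np.ones(k))        # stands in for the ground-truth distribution
q = rng.dirichlet(np.ones(k))        # stands in for the learned distribution
loss = rng.uniform(0.0, M, size=k)   # bounded loss values in [0, M]

gap = abs(risk(loss, p) - risk(loss, q))
bound = 2.0 * M * tv_distance(p, q)
print(f"risk gap = {gap:.4f}, 2*M*TV = {bound:.4f}")
print(f"rate at n=1e2: {d_tv_rate(10**2, 0.05):.4f}, at n=1e6: {d_tv_rate(10**6, 0.05):.4f}")
```

The risk gap never exceeds the bound, and the rate in (47) shrinks with $n$, with the $n^{-1/4}$ term decaying more slowly than the $n^{-1/2}\log(1+n)$ term.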

Similarly, for the recursive stability parameter in the self-consuming loop, we rederive Equation 21 from the proof of Theorem 1 under the in-context learning setting:

$$
\begin{aligned}
&\Big|\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left(S_{0,\alpha}^{j}\right)}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right] \\
&\quad-\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left((S_{0,\alpha}^{t})^{j}\right)}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left((S_{0,\alpha}^{t})^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\right]\Big| \\
&=\Big|\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\Big[\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left(S_{0,\alpha}^{j}\right)}\ell\left(\mathcal{A}\left(\left(S_{0,\alpha}^{t}\right)^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right) \\
&\qquad-\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left((S_{0,\alpha}^{t})^{j}\right)}\ell\left(\mathcal{A}\left(\left(S_{0,\alpha}^{t}\right)^{j}\cup S_{i,1-\alpha}\right),\bm{z}\right)\Big]\Big| \\
&\quad+\Big|\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\Big[\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left(S_{0,\alpha}^{j}\right)}\ell\left(\mathcal{A}\left(\left(S_{0,\alpha}^{t}\right)^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right) \\
&\qquad-\mathbb{E}_{S_{i,1-\alpha}\sim\mathcal{D}_{i}^{n(1-\alpha)}\left((S_{0,\alpha}^{t})^{j}\right)}\ell\left(\mathcal{A}\left(\left(S_{0,\alpha}^{t}\right)^{j}\cup S_{i,1-\alpha}\right),\bm{z}_{0,j}\right)\Big]\Big| \\
&\leq 2n(1-\alpha)\beta_{n}\left\|\mathrm{TF}\left(\left(S_{0,\alpha}^{t}\right)^{j}\cup S_{i-1,1-\alpha}\right)-\mathrm{TF}\left(\left(S_{0,\alpha}^{t}\right)^{j}\cup S_{i-1,1-\alpha}^{\prime}\right)\right\|_{\ell_{2}} \\
&\lesssim 2n(1-\alpha)\beta_{n}\frac{2\widetilde{B}_{W}^{L}}{2n+1}\left[((1-\alpha)\widetilde{B}_{W}^{L})^{i-1}+\alpha\frac{1-((1-\alpha)\widetilde{B}_{W}^{L})^{i-1}}{1-(1-\alpha)\widetilde{B}_{W}^{L}}\right] \\
&=2n(1-\alpha)\beta_{n}\gamma_{n}^{i-1}.
\end{aligned}
\tag{48}
$$
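To see how the recursive stability parameter in (48) behaves across generations, the following sketch evaluates the quantity identified there with $\gamma_{n}^{i-1}$. It is a hedged illustration: `gamma_term` is a hypothetical helper, `b_l` stands for $\widetilde{B}_{W}^{L}$, the numeric values are arbitrary, and the constant hidden by $\lesssim$ is taken to be 1. The geometric factor is computed as an explicit sum so that the case $(1-\alpha)\widetilde{B}_{W}^{L}=1$ needs no special handling.

```python
def gamma_term(n, alpha, b_l, i):
    """Evaluate the expression identified with gamma_n^{i-1} in (48).

    n     : sample size per generation
    alpha : proportion of real data
    b_l   : stands for B_W^L (with tilde) in the paper; value is illustrative
    i     : generation index, i >= 1
    """
    c = (1.0 - alpha) * b_l
    # sum_{k=0}^{i-2} c^k equals (1 - c^(i-1)) / (1 - c) whenever c != 1
    geom = sum(c ** k for k in range(i - 1))
    return 2.0 * b_l / (2 * n + 1) * (c ** (i - 1) + alpha * geom)

# Contracting case: (1 - alpha) * b_l = 0.75 < 1, so the term stays bounded in i.
for i in (1, 5, 50, 200):
    print(i, gamma_term(n=1000, alpha=0.5, b_l=1.5, i=i))

# Expanding case: (1 - alpha) * b_l = 1.35 > 1, so the term grows with i.
for i in (1, 10, 30):
    print(i, gamma_term(n=1000, alpha=0.1, b_l=1.5, i=i))
```

When $(1-\alpha)\widetilde{B}_{W}^{L}<1$, the value approaches $\frac{2\widetilde{B}_{W}^{L}}{2n+1}\cdot\frac{\alpha}{1-(1-\alpha)\widetilde{B}_{W}^{L}}$ as $i$ grows; otherwise it increases with the generation index.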

For the uniform stability parameter $\beta_{n}$ of the SGD algorithm, we can derive the bound from Lemma 11. Substituting the above results into Theorem 3, we obtain the following conclusion:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\leq\left((1-\alpha)\beta_{n}\log(n(1-\alpha))+\alpha\left(\beta_{n}+(1-\alpha)\rho^{2}\gamma_{n}^{i-1}\right)\log(n\alpha)\right)\log\left(\frac{1}{\delta}\right) \\
&\quad+\left(\sqrt{(1-\alpha)n}\,\alpha\beta_{n}+Mn^{-1/2}(\sqrt{1-\alpha}+\sqrt{\alpha})\right)\sqrt{\log\left(\frac{1}{\delta}\right)}+2M\left(1-(1-\alpha)^{i}\right)\alpha^{-1}d_{\mathrm{TV}}(n) \\
&\leq\beta_{n}\left[(1-\alpha)\log(n(1-\alpha))\log\left(\frac{1}{\delta}\right)+\alpha\log(n\alpha)\log\left(\frac{1}{\delta}\right)+\alpha\sqrt{(1-\alpha)n\log\frac{1}{\delta}}\right] \\
&\quad+\gamma_{n}^{i-1}\alpha(1-\alpha)\rho^{2}\log(n\alpha)\log\left(\frac{1}{\delta}\right)+n^{-1/2}M(\sqrt{1-\alpha}+\sqrt{\alpha})\sqrt{\log\left(\frac{1}{\delta}\right)}+2d_{\mathrm{TV}}(n)M\left(1-(1-\alpha)^{i}\right)\alpha^{-1} \\
&\lesssim n^{-1/2}\log(n)M\rho^{2}\alpha\sqrt{1-\alpha}\log\frac{1}{\delta}+n^{-1}\rho^{2}((1-\alpha)\widetilde{B}_{W}^{L})^{i}\alpha\log^{2}(n)\log\left(\frac{1}{\delta}\right) \\
&\quad+n^{-1/4}\alpha^{-1}M\left(1-(1-\alpha)^{i}\right)\log\left(\frac{1}{\delta}\right).
\end{aligned}
\tag{49}
$$
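The three terms on the last line of (49) depend on the generation index $i$ in different ways, which the sketch below makes concrete. It is an illustration only: `bound_terms` is a hypothetical helper, `b_l` stands for $\widetilde{B}_{W}^{L}$, all constants hidden by $\lesssim$ are set to 1, and the parameter values are arbitrary.

```python
import math

def bound_terms(n, alpha, i, M, rho, b_l, delta):
    """Return the three terms on the last line of (49), constants set to 1."""
    log_inv_delta = math.log(1.0 / delta)
    # Term 1: does not depend on the generation index i.
    t1 = n ** -0.5 * math.log(n) * M * rho ** 2 * alpha * math.sqrt(1 - alpha) * log_inv_delta
    # Term 2: scales with ((1 - alpha) * b_l)^i, so it decays when that factor is below 1.
    t2 = n ** -1.0 * rho ** 2 * ((1 - alpha) * b_l) ** i * alpha * math.log(n) ** 2 * log_inv_delta
    # Term 3: grows with i but is capped by n^{-1/4} * M * log(1/delta) / alpha.
    t3 = n ** -0.25 * (1.0 / alpha) * M * (1 - (1 - alpha) ** i) * log_inv_delta
    return t1, t2, t3

for i in (1, 5, 20, 100):
    t1, t2, t3 = bound_terms(n=10**4, alpha=0.5, i=i, M=1.0, rho=1.0, b_l=1.0, delta=0.05)
    print(f"i={i:3d}  t1={t1:.4e}  t2={t2:.4e}  t3={t3:.4e}")
```

With a constant real-data proportion $\alpha$, the third term saturates at $n^{-1/4}\alpha^{-1}M\log(1/\delta)$ instead of growing without bound in $i$, which matches the role of $\alpha$ in the statement of the bound.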

A.9 Proof of Theorem 4

In this section, we prove Theorem 4. The proof follows a similar approach to that of Theorem 3 but is more intricate, because the mixed dataset in Theorem 4 contains synthetic data from all previous generations. Each generation’s synthetic dataset depends on the synthetic datasets of previous generations, leading to a more complex non-i.i.d. setting. As in Theorem 3, we begin by decomposing the generalization error into two components: the cumulative distribution shift across generations and the generalization error on mixed distributions.

The main proof is as follows:

Proof of Theorem 4.

We begin by decomposing the generalization error as follows:

$$
\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|\leq\underbrace{\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|}_{\text{Cumulative distribution shift across generations}}+\underbrace{\left|R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|}_{\text{Generalization error on mixed distributions}}.
$$

Upper Bounding Cumulative Distribution Shift Term

For the term $\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|$, we first note that $\widetilde{\mathcal{D}}_{i}=\frac{1}{1+i\lambda}\mathcal{D}_{0}+\frac{\lambda}{1+i\lambda}\mathcal{D}_{1}+\frac{\lambda}{1+i\lambda}\mathcal{D}_{2}+\dots+\frac{\lambda}{1+i\lambda}\mathcal{D}_{i}$. Therefore, we obtain:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&=\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{1+i\lambda}R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{\lambda}{1+i\lambda}R_{\mathcal{D}_{1}}(\mathcal{A}(\widetilde{S}_{i}))-\dots-\frac{\lambda}{1+i\lambda}R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&=\left|\frac{i\lambda}{1+i\lambda}R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{\lambda}{1+i\lambda}R_{\mathcal{D}_{1}}(\mathcal{A}(\widetilde{S}_{i}))-\dots-\frac{\lambda}{1+i\lambda}R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\leq\frac{\lambda}{1+i\lambda}\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{1}}(\mathcal{A}(\widetilde{S}_{i}))\right|+\dots+\frac{\lambda}{1+i\lambda}\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\leq\frac{\lambda}{1+i\lambda}\sum_{j=1}^{i}\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{j}}(\mathcal{A}(\widetilde{S}_{i}))\right|.
\end{aligned}
\tag{50}
$$
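Inequality (50) relies only on the linearity of the risk in the data distribution and on the triangle inequality, so it can be sanity-checked on a finite sample space. The sketch below is an illustration under stated assumptions: the distributions $\mathcal{D}_{0},\dots,\mathcal{D}_{i}$ are replaced by random discrete distributions, the fixed vector `loss` plays the role of $\ell(\mathcal{A}(\widetilde{S}_{i}),\cdot)$, and all variable names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
k, i, lam = 20, 5, 0.7                      # support size, generations, lambda (illustrative)

loss = rng.uniform(0.0, 1.0, size=k)        # loss of one fixed model at each support point
dists = rng.dirichlet(np.ones(k), size=i + 1)  # row 0 is D_0, rows 1..i are D_1..D_i

# Mixture weights of the mixed distribution: 1/(1+i*lam) on D_0, lam/(1+i*lam) on each D_j.
weights = np.array([1.0] + [lam] * i) / (1.0 + i * lam)
mixed = weights @ dists                     # the mixed distribution as a probability vector

risks = dists @ loss                        # risk of the fixed model under each D_j
lhs = abs(risks[0] - float(mixed @ loss))   # |R_{D_0} - R_{mixed}|
rhs = lam / (1.0 + i * lam) * float(np.abs(risks[0] - risks[1:]).sum())

print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")
```

The weights sum to one, so the mixed vector is a valid distribution, and the left-hand side never exceeds the weighted sum of pairwise risk gaps.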

Each summand can be decomposed further via the triangle inequality:

$$
\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{j}}(\mathcal{A}(\widetilde{S}_{i}))\right|
\leq\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{j-1}}(\mathcal{A}(\widetilde{S}_{i}))\right|
+\left|R_{\widetilde{\mathcal{D}}_{j-1}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{j}}(\mathcal{A}(\widetilde{S}_{i}))\right|.
\tag{51}
$$

By substituting inequality 51 into inequality 50, we obtain:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\leq \frac{\lambda}{1+i\lambda}\sum_{j=1}^{i}\left(\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{j-1}}(\mathcal{A}(\widetilde{S}_{i}))\right|+\left|R_{\widetilde{\mathcal{D}}_{j-1}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{j}}(\mathcal{A}(\widetilde{S}_{i}))\right|\right).
\end{aligned}
\tag{52}
$$

From Equation 46 in the proof of Theorem 3 and Lemma 12, we obtain:

$$
\begin{aligned}
\left|R_{\widetilde{\mathcal{D}}_{j-1}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\mathcal{D}_{j}}(\mathcal{A}(\widetilde{S}_{i}))\right|
&\leq 2M\,\mathrm{TV}\left(\mathbb{P}\left(\cdot\mid\widetilde{S}_{j-1}\right),\mathbb{P}_{\hat{\theta}}\left(\cdot\mid\widetilde{S}_{j-1}\right)\right) \\
&\lesssim M n_{j-1}^{-1/4}\log n_{j-1}\log(1/\delta).
\end{aligned}
\tag{53}
$$

Incorporating inequality 53 into inequality 52, we arrive at:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\lesssim \frac{\lambda}{1+i\lambda}\sum_{j=1}^{i}\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{j-1}}(\mathcal{A}(\widetilde{S}_{i}))\right|+\frac{\lambda}{1+i\lambda}\sum_{j=0}^{i-1}M n_{j}^{-1/4}\log n_{j}\log(1/\delta).
\end{aligned}
\tag{54}
$$
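
As a sanity check on the chain (50), (51), (53), the case $i=1$ can be written out in full. The sketch below assumes $\widetilde{\mathcal{D}}_{0}=\mathcal{D}_{0}$, i.e. that the generation-0 mixture is the real distribution itself. This is inferred from the mixture weights at $i=0$ rather than stated in this passage. Under that assumption the first term of the decomposition vanishes and only the model-estimation error of generation 0 remains.

```latex
% Worked instance i = 1, ASSUMING \widetilde{\mathcal{D}}_0 = \mathcal{D}_0.
\begin{align*}
\left|R_{\mathcal{D}_0}(\mathcal{A}(\widetilde{S}_1)) - R_{\widetilde{\mathcal{D}}_1}(\mathcal{A}(\widetilde{S}_1))\right|
&\leq \frac{\lambda}{1+\lambda}
   \left|R_{\mathcal{D}_0}(\mathcal{A}(\widetilde{S}_1)) - R_{\mathcal{D}_1}(\mathcal{A}(\widetilde{S}_1))\right|
   && \text{(50) with } i=1 \\
&\leq \frac{\lambda}{1+\lambda}\Big(
   \underbrace{\left|R_{\mathcal{D}_0}(\mathcal{A}(\widetilde{S}_1)) - R_{\widetilde{\mathcal{D}}_0}(\mathcal{A}(\widetilde{S}_1))\right|}_{=\,0}
   + \left|R_{\widetilde{\mathcal{D}}_0}(\mathcal{A}(\widetilde{S}_1)) - R_{\mathcal{D}_1}(\mathcal{A}(\widetilde{S}_1))\right|\Big)
   && \text{(51) with } j=1 \\
&\lesssim \frac{\lambda}{1+\lambda}\, M n_0^{-1/4} \log n_0 \log(1/\delta)
   && \text{(53) with } j=1 .
\end{align*}
```

The right-hand side is the $j=0$ summand of the second sum in (54), scaled by $\lambda/(1+\lambda)$, so for $i=1$ the first sum in (54) contributes nothing.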

Let $f(i)=\sum_{j=0}^{i-1}M n_{j}^{-1/4}\log n_{j}\log(1/\delta)$. Then we obtain:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\lesssim \frac{\lambda}{1+i\lambda}\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))\right|+\dots+\frac{\lambda}{1+i\lambda}\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{1}}(\mathcal{A}(\widetilde{S}_{i}))\right|+\frac{\lambda}{1+i\lambda}f(i).
\end{aligned}
$$

Similarly, we get:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i-1}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\lesssim \frac{\lambda}{1+(i-1)\lambda}\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i-2}}(\mathcal{A}(\widetilde{S}_{i}))\right|+\dots+\frac{\lambda}{1+(i-1)\lambda}\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{1}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\quad+\frac{\lambda}{1+(i-1)\lambda}f(i-1).
\end{aligned}
$$

Then, we have

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|
\lesssim \frac{\lambda}{1+i\lambda}f(i)+\frac{\lambda}{1+i\lambda}\frac{\lambda}{1+(i-1)\lambda}f(i-1) \\
&\quad+\left(\frac{\lambda}{1+i\lambda}+\frac{\lambda}{1+i\lambda}\frac{\lambda}{1+(i-1)\lambda}\right)\left(\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i-2}}(\mathcal{A}(\widetilde{S}_{i}))\right|+\dots+\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{1}}(\mathcal{A}(\widetilde{S}_{i}))\right|\right).
\end{aligned}
\tag{55}
$$

Thus, by applying recursive techniques, we obtain the following result:

$$
\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right| \\
&\lesssim \frac{\lambda}{1+i\lambda}f(i)+\frac{\lambda}{1+i\lambda}\frac{\lambda}{1+(i-1)\lambda}f(i-1)+\left(\frac{\lambda}{1+i\lambda}\frac{\lambda}{1+(i-2)\lambda}+\mathcal{O}\!\left(\frac{1}{(1+i\lambda)^{2}}\right)\right)f(i-2) \\
&\quad+\dots+\left(\frac{\lambda}{1+i\lambda}\frac{\lambda}{1+\lambda}+\mathcal{O}\!\left(\frac{1}{1+i\lambda}\right)\right)f(1) \\
&\lesssim \frac{\lambda}{1+i\lambda}\left[f(i)+\frac{\lambda}{1+(i-1)\lambda}f(i-1)+\frac{\lambda}{1+(i-2)\lambda}f(i-2)+\dots+\frac{\lambda}{1+\lambda}f(1)\right] \\
&\lesssim M\log\frac{1}{\delta}\,\frac{\lambda}{1+i\lambda}\Big[n_{i-1}^{-\frac{1}{4}}\log(n_{i-1})+\left(1+\frac{\lambda}{1+(i-1)\lambda}\right)n_{i-2}^{-\frac{1}{4}}\log(n_{i-2}) \\
&\quad+\left(1+\frac{\lambda}{1+(i-1)\lambda}+\frac{\lambda}{1+(i-2)\lambda}\right)n_{i-3}^{-\frac{1}{4}}\log(n_{i-3})+\dots+\left(1+\dots+\frac{\lambda}{1+\lambda}\right)n_{0}^{-\frac{1}{4}}\log(n_{0})\Big] \\
&\lesssim n^{-\frac{1}{4}}\log((1+i\lambda)n)\,M\log\frac{1}{\delta}.
\end{aligned}
\tag{56}
$$
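
The step from the bracketed sum in (56) to its final rate can be checked numerically. The sketch below is illustrative only: it assumes $n_j=(1+j\lambda)n$ (the size of the mixed dataset at generation $j$, which this passage does not state explicitly), and the function names `per_round_error`, `f`, `unrolled_bound` and `final_rate` are hypothetical helpers, not part of the paper.

```python
import math


def per_round_error(n_j, M=1.0, delta=0.05):
    """One summand of f: M * n_j^(-1/4) * log(n_j) * log(1/delta)."""
    return M * n_j ** -0.25 * math.log(n_j) * math.log(1.0 / delta)


def f(k, n, lam, M=1.0, delta=0.05):
    """f(k) = sum_{j=0}^{k-1} per_round_error(n_j).

    ASSUMPTION: n_j = (1 + j*lam) * n, the size of the mixed dataset at round j.
    """
    return sum(per_round_error((1 + j * lam) * n, M, delta) for j in range(k))


def unrolled_bound(i, n, lam, M=1.0, delta=0.05):
    """lam/(1+i*lam) * [f(i) + sum_{k=1}^{i-1} lam/(1+k*lam) * f(k)]."""
    bracket = f(i, n, lam, M, delta) + sum(
        lam / (1 + k * lam) * f(k, n, lam, M, delta) for k in range(1, i)
    )
    return lam / (1 + i * lam) * bracket


def final_rate(i, n, lam, M=1.0, delta=0.05):
    """n^(-1/4) * log((1+i*lam)*n) * M * log(1/delta), the last line of (56)."""
    return n ** -0.25 * math.log((1 + i * lam) * n) * M * math.log(1.0 / delta)


if __name__ == "__main__":
    n, lam = 1000, 0.5
    for i in (1, 5, 20, 50):
        ratio = unrolled_bound(i, n, lam) / final_rate(i, n, lam)
        print(f"i={i:2d}  unrolled/final = {ratio:.3f}")
```

Under the assumed $n_j$ and for $n$ large enough that $x^{-1/4}\log x$ is decreasing on $[n,\infty)$, each $f(k)$ is at most $k$ times the generation-0 summand, so the ratio printed above stays below a constant independent of $i$. This is what the $\lesssim$ in the last line of (56) asserts.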

Upper Bounding Generalization Error on Mixed Distributions Term

Next, we turn our attention to the term $\left|R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|$. Our primary objective is to establish a moment bound for this expression.

$$
\begin{aligned}
&\left\|R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right\|_{p} \\
&=\Big\|\frac{1}{1+i\lambda}R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))+\frac{\lambda}{1+i\lambda}R_{\mathcal{D}_{1}}(\mathcal{A}(\widetilde{S}_{i}))+\frac{\lambda}{1+i\lambda}R_{\mathcal{D}_{2}}(\mathcal{A}(\widetilde{S}_{i}))+\dots+\frac{\lambda}{1+i\lambda}R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i})) \\
&\quad-\frac{1}{(1+i\lambda)n}\sum_{\bm{z}_{i}\in S_{0}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})-\frac{1}{(1+i\lambda)n}\sum_{\bm{z}_{i}\in S_{1}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})-\dots-\frac{1}{(1+i\lambda)n}\sum_{\bm{z}_{i}\in S_{i}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\Big\|_{p} \\
&\leq\underbrace{\left\|\frac{1}{1+i\lambda}R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{(1+i\lambda)n}\sum_{\bm{z}_{i}\in S_{0}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p}}_{\text{Term 0}}+\underbrace{\left\|\frac{\lambda}{1+i\lambda}R_{\mathcal{D}_{1}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{(1+i\lambda)n}\sum_{\bm{z}_{i}\in S_{1}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p}}_{\text{Term 1}} \\
&\quad+\dots+\underbrace{\left\|\frac{\lambda}{1+i\lambda}R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{(1+i\lambda)n}\sum_{\bm{z}_{i}\in S_{i}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p}}_{\text{Term } i}.
\end{aligned}
\tag{57}
$$
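
The split in (57) relies on two bookkeeping identities: the mixture weights $\frac{1}{1+i\lambda},\frac{\lambda}{1+i\lambda},\dots,\frac{\lambda}{1+i\lambda}$ sum to one, and the empirical risk on $\widetilde{S}_{i}$ is the sum of per-generation loss totals, each divided by $(1+i\lambda)n$. The sketch below checks both on synthetic loss values. It assumes $|S_{0}|=n$ and $|S_{j}|=\lambda n$ for $j\geq 1$, which is inferred from the normalisation $(1+i\lambda)n$ rather than stated in this passage, and all function names are hypothetical.

```python
import math
import random


def mixed_empirical_risk(losses):
    """Average loss over the union of S_0, ..., S_i (one list of losses per generation)."""
    total = sum(len(g) for g in losses)
    return sum(sum(g) for g in losses) / total


def per_generation_parts(losses, n, lam):
    """Each generation's share of the empirical risk: sum of its losses / ((1+i*lam)*n)."""
    i = len(losses) - 1
    denom = (1 + i * lam) * n
    return [sum(g) / denom for g in losses]


def mixture_weights(i, lam):
    """Weights of D_0, D_1, ..., D_i in the mixed distribution."""
    return [1 / (1 + i * lam)] + [lam / (1 + i * lam)] * i


if __name__ == "__main__":
    rng = random.Random(0)
    n, lam, i = 100, 0.5, 3
    # ASSUMED sizes: |S_0| = n real samples, |S_j| = lam * n synthetic samples each.
    sizes = [n] + [int(lam * n)] * i
    losses = [[rng.random() for _ in range(m)] for m in sizes]
    print(mixed_empirical_risk(losses), sum(per_generation_parts(losses, n, lam)))
    print(sum(mixture_weights(i, lam)))
```

Pairing the $j$-th weight with the $j$-th empirical share and applying the triangle inequality for $\|\cdot\|_{p}$ gives exactly the Term 0 through Term $i$ grouping.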

Fixing $S_{0},S_{1},\dots,S_{i-1}$, the data in $S_{i}$ are independent. Following a similar approach to the proof of Theorem 1, we utilize this property along with Lemma 8 to bound Term $i$. Consequently, from Equation 15 in the proof of Theorem 1, we obtain:

$$
\left\|\frac{\lambda}{1+i\lambda}R_{\mathcal{D}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{(1+i\lambda)n}\sum_{\bm{z}_{i}\in S_{i}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p}
\lesssim p\,\frac{\lambda}{1+i\lambda}\,\beta_{(1+i\lambda)n}\log(\lambda n)+\frac{M}{1+i\lambda}\sqrt{\frac{p\lambda}{n}}.
\tag{58}
$$

Next, we consider Term 0. As in the proof of Theorem 3, we first introduce a set of functions and then apply Lemma 8 to bound Term 0. Specifically, we define $h_{j}(S)$, which plays a role similar to that of the $g_{i}$'s in Lemma 8, as follows:

$$h_{j}(S_{0})=\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left(S_{0}^{j}\cup S_{1}\cup\dots\cup S_{i}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left(S_{0}^{j}\cup S_{1}\cup\dots\cup S_{i}\right),\bm{z}_{0,j}\right)\right], \tag{59}$$

where $\bm{z}_{0,j}$ denotes the $j$-th data point in $S_{0}$, and $S_{0}^{j}$ represents the dataset obtained by replacing $\bm{z}_{0,j}$ with $\bm{z}_{0,j}^{\prime}$. Moreover, following the procedure above, we observe that $\left|h_{j}\right|\leq M$ and $\mathbb{E}\left[h_{j}\mid S_{0,\alpha}^{\backslash j}\right]=0$. It remains to prove that $h_{j}$ has bounded differences, which is more delicate: $S_{1},\ldots,S_{i}$ all depend on $S_{0}$, so when a single data point in $S_{0}$ is changed, these datasets change as well. We denote the modified datasets by $S_{1}^{\prime},\ldots,S_{i}^{\prime}$, and consequently we have the following:

$$\begin{aligned}
|h_{j}(S_{0})-h_{j}(S_{0}^{t})|
&=\Big|\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left(S_{0}^{j}\cup S_{1}\cup\dots\cup S_{i}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left(S_{0}^{j}\cup S_{1}\cup\dots\cup S_{i}\right),\bm{z}_{0,j}\right)\right]\\
&\qquad-\mathbb{E}_{\bm{z}_{0,j}^{\prime}\sim\mathcal{D}_{0}}\left[\mathbb{E}_{\bm{z}\sim\mathcal{D}_{0}}\ell\left(\mathcal{A}\left((S_{0}^{t})^{j}\cup S_{1}^{\prime}\cup\dots\cup S_{i}^{\prime}\right),\bm{z}\right)-\ell\left(\mathcal{A}\left((S_{0}^{t})^{j}\cup S_{1}^{\prime}\cup\dots\cup S_{i}^{\prime}\right),\bm{z}_{0,j}\right)\right]\Big|\\
&\leq 2\beta_{(1+i\lambda)n}\Big(\|S_{0}^{j}-(S_{0}^{t})^{j}\|_{\ell_{2}}+\|S_{1}-S_{1}^{\prime}\|_{\ell_{2}}+\dots+\|S_{i}-S_{i}^{\prime}\|_{\ell_{2}}\Big).
\end{aligned} \tag{60}$$

We now apply the recursive stability established in Theorem 2. Note first that in Theorem 2 the mixed dataset is defined as $\widetilde{S}_{j}=\alpha S_{0}+(1-\alpha)S_{j}$, whereas in this theorem it is defined as $\widetilde{S}_{i}=\sum_{j=0}^{i}S_{j}$. Following the proof steps of Theorem 2 with this definition, we can derive the following:

$$|h_{j}(S_{0})-h_{j}(S_{0}^{t})|\lesssim 2\beta_{(1+i\lambda)n}\Big(i!\,\widetilde{B}_{W}^{iL}\Big).$$

With this bounded-difference property in hand, we apply Lemma 8:

$$\begin{aligned}
\left\|\sum_{j=1}^{n}h_{j}(S_{0})\right\|_{p}
&\leq 12\sqrt{2}\,pn\cdot 2\beta_{(1+i\lambda)n}\left(i!\,\widetilde{B}_{W}^{iL}\right)\log(n)+4M\sqrt{pn}\\
&\lesssim p\frac{\rho^{2}}{1+i\lambda}\Big(i!\,\widetilde{B}_{W}^{iL}\Big)\log(n(1+i\lambda))+M\sqrt{pn}.
\end{aligned}$$

We observe that the difference between Term 0 and $\frac{1}{(1+i\lambda)n}\left\|\sum_{j=1}^{n}h_{j}(S_{0})\right\|_{p}$ is negligible. Thus, we can bound Term 0 as follows:

$$\begin{aligned}
&\left\|\frac{1}{1+i\lambda}R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{(1+i\lambda)n}\sum_{\bm{z}_{i}\in S_{0}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{i})\right\|_{p}\\
&\qquad\lesssim p\frac{\rho^{2}}{(1+i\lambda)^{2}n}\Big(i!\,\widetilde{B}_{W}^{iL}\Big)\log(n(1+i\lambda))+\frac{1}{1+i\lambda}M\sqrt{p/n}.
\end{aligned} \tag{61}$$
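The step from the bound on $\left\|\sum_{j=1}^{n}h_{j}(S_{0})\right\|_{p}$ to inequality (61) can be read as a rescaling by the Term 0 prefactor $\frac{1}{(1+i\lambda)n}$. Under that reading, and with constants absorbed into $\lesssim$, the two terms transform as follows:

```latex
\begin{aligned}
\frac{1}{(1+i\lambda)n}\cdot p\frac{\rho^{2}}{1+i\lambda}\Big(i!\,\widetilde{B}_{W}^{iL}\Big)\log(n(1+i\lambda))
&= p\frac{\rho^{2}}{(1+i\lambda)^{2}n}\Big(i!\,\widetilde{B}_{W}^{iL}\Big)\log(n(1+i\lambda)),\\
\frac{1}{(1+i\lambda)n}\cdot M\sqrt{pn}
&= \frac{1}{1+i\lambda}\,M\sqrt{p/n}.
\end{aligned}
```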

Using the same method, for Term $j$, where $1\leq j\leq i-1$, we can derive the following:

$$\begin{aligned}
&\left\|\frac{\lambda}{1+i\lambda}R_{\mathcal{D}_{j}}(\mathcal{A}(\widetilde{S}_{i}))-\frac{1}{(1+i\lambda)n}\sum_{\bm{z}_{j}\in S_{j}}\ell(\mathcal{A}(\widetilde{S}_{i}),\bm{z}_{j})\right\|_{p}\\
&\qquad\lesssim p\frac{\rho^{2}}{(1+i\lambda)^{2}n}\Big(j!\,\widetilde{B}_{W}^{jL}\Big)\log(n(1+i\lambda))+\frac{1}{1+i\lambda}M\sqrt{p/n}.
\end{aligned} \tag{62}$$

In summary, we can finally bound the Generalization Error on the Mixed Distributions term as follows:

$$\begin{aligned}
&\left\|R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right\|_{p}\\
&\qquad\lesssim p\frac{\rho^{2}}{(1+i\lambda)^{2}n}\log((1+i\lambda)n)\,i!\,\widetilde{B}_{W}^{(i+1)L}+\frac{Mi}{1+i\lambda}\sqrt{\frac{p}{n}}.
\end{aligned}$$

Then, according to Lemma 9, we obtain, with probability at least $1-\delta$:

$$\begin{aligned}
&\left|R_{\widetilde{\mathcal{D}}_{i}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|\\
&\qquad\lesssim\frac{\rho^{2}}{(1+i\lambda)^{2}n}\log((1+i\lambda)n)\,i!\,\widetilde{B}_{W}^{(i+1)L}\log\frac{1}{\delta}+\frac{Mi}{1+i\lambda}\sqrt{\frac{1}{n}\log\frac{1}{\delta}}.
\end{aligned}$$
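The passage from a moment bound to a high-probability bound can be illustrated with Markov's inequality. The sketch below is a generic argument for a random variable $Z$ satisfying $\|Z\|_{p}\leq ap+b\sqrt{p}$; the symbols $Z$, $a$, and $b$ are placeholders introduced here, and the exact statement and constants of Lemma 9 may differ. It assumes $\|Z\|_{p}>0$ and that $\delta$ is small enough for $p=\log\frac{1}{\delta}$ to lie in the range where the moment bound holds.

```latex
\begin{aligned}
\mathbb{P}\left(|Z|\geq e\,\|Z\|_{p}\right)
&\leq\frac{\mathbb{E}|Z|^{p}}{e^{p}\,\|Z\|_{p}^{p}}=e^{-p},\\
p=\log\frac{1}{\delta}
\;\Longrightarrow\;
|Z|&<e\left(a\log\frac{1}{\delta}+b\sqrt{\log\frac{1}{\delta}}\right)
\quad\text{with probability at least }1-\delta.
\end{aligned}
```

Taking $a$ to be the stability term and $b$ the term involving $M$ gives the $\log\frac{1}{\delta}$ and $\sqrt{\log\frac{1}{\delta}}$ factors that appear in the high-probability bound.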

Combining the above inequality with inequality 56, we obtain:

$$\begin{aligned}
&\left|R_{\mathcal{D}_{0}}(\mathcal{A}(\widetilde{S}_{i}))-\widehat{R}_{\widetilde{S}_{i}}(\mathcal{A}(\widetilde{S}_{i}))\right|\\
&\quad\lesssim n^{-\frac{1}{4}}\log((1+i\lambda)n)M\log\frac{1}{\delta}+\frac{\rho^{2}}{(1+i\lambda)^{2}n}\log((1+i\lambda)n)\,i!\,\widetilde{B}_{W}^{(i+1)L}\log\frac{1}{\delta}+\frac{Mi}{1+i\lambda}\sqrt{\frac{1}{n}\log\frac{1}{\delta}}\\
&\quad\lesssim n^{-\frac{1}{2}}\frac{Mi}{1+i\lambda}\sqrt{\log\frac{1}{\delta}}+n^{-1}\frac{\rho^{2}}{(1+i\lambda)^{2}}\log((1+i\lambda)n)\,i!\,\widetilde{B}_{W}^{(i+1)L}\log\frac{1}{\delta}\\
&\quad\quad+n^{-\frac{1}{4}}\log((1+i\lambda)n)M\log\frac{1}{\delta}.
\end{aligned}$$

The proof is complete.
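To get a feel for how the three terms of the final bound trade off, the following sketch evaluates each of them numerically. It is only an illustration: the function name `bound_terms`, the dictionary keys, and all parameter values are assumptions made here, and every constant hidden by $\lesssim$ is set to 1.

```python
import math


def bound_terms(n, i, lam, M, rho, B_W, L, delta):
    """Evaluate the three terms of the final bound.

    Assumption: all constants hidden by the "lesssim" notation are set to 1,
    so only the relative scaling of the terms is meaningful.
    """
    mixed_size = (1 + i * lam) * n      # (1 + i*lambda) * n
    log_mixed = math.log(mixed_size)    # log((1 + i*lambda) * n)
    log_inv_delta = math.log(1.0 / delta)

    # n^{-1/2} * M*i / (1 + i*lambda) * sqrt(log(1/delta))
    sqrt_term = n ** -0.5 * M * i / (1 + i * lam) * math.sqrt(log_inv_delta)

    # n^{-1} * rho^2 / (1 + i*lambda)^2 * log(.) * i! * B_W^{(i+1)L} * log(1/delta)
    stability_term = (
        rho ** 2 / ((1 + i * lam) ** 2 * n)
        * log_mixed
        * math.factorial(i)
        * B_W ** ((i + 1) * L)
        * log_inv_delta
    )

    # n^{-1/4} * log(.) * M * log(1/delta)
    quarter_term = n ** -0.25 * log_mixed * M * log_inv_delta

    return {
        "sqrt_term": sqrt_term,
        "stability_term": stability_term,
        "quarter_term": quarter_term,
    }


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    for name, value in bound_terms(
        n=10_000, i=2, lam=1.0, M=1.0, rho=1.0, B_W=1.0, L=2, delta=0.05
    ).items():
        print(f"{name}: {value:.3e}")
```

Because the stability term carries the factor $i!\,\widetilde{B}_{W}^{(i+1)L}$, it grows quickly with the number of generations $i$ whenever $\widetilde{B}_{W}>1$, which can be checked by varying `i` and `B_W` in the call above.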

Appendix B Experiments

In this section, we present some experimental results. Specifically, we trained transformer models to in-context learn linear functions within STLs.

In these experiments, we considered the class of linear functions:

$$\mathcal{F}=\left\{f\mid f(\bm{x})=\bm{w}^{\top}\bm{x},\ \bm{w}\in\mathbb{R}^{d}\right\},$$

in $d=5$ dimensions. We sampled $\bm{x}_{1},\ldots,\bm{x}_{k},\bm{x}_{\text{query}}$, and $\bm{w}$ independently from the isotropic Gaussian distribution $\mathcal{N}(0,I_{d})$. For each $\bm{x}_{i}$, we computed $y_{i}=\bm{w}^{\top}\bm{x}_{i}$ and constructed the prompt as:

$$P=(\bm{x}_{1},y_{1},\bm{x}_{2},y_{2},\ldots,\bm{x}_{k},y_{k},\bm{x}_{\text{query}}).$$
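The prompt construction can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the code used for the experiments: the function name `sample_prompt`, the list-based prompt layout, and the `seed` argument are assumptions, while the defaults $d=5$ and $k=40$ follow the setup described in this section.

```python
import random


def sample_prompt(d=5, k=40, seed=None):
    """Sample one linear-regression prompt (x_1, y_1, ..., x_k, y_k, x_query).

    Returns the prompt, the hidden weight vector w, and the target
    y_query = w^T x_query that a model would be asked to predict.
    """
    rng = random.Random(seed)

    def gaussian_vector():
        # One draw from the isotropic Gaussian N(0, I_d).
        return [rng.gauss(0.0, 1.0) for _ in range(d)]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w = gaussian_vector()
    xs = [gaussian_vector() for _ in range(k)]
    x_query = gaussian_vector()

    prompt = []
    for x in xs:
        prompt.append(x)          # input x_i
        prompt.append(dot(w, x))  # label y_i = w^T x_i
    prompt.append(x_query)

    return prompt, w, dot(w, x_query)


if __name__ == "__main__":
    prompt, w, y_query = sample_prompt(seed=0)
    print(len(prompt), "prompt entries; target for the query:", y_query)
```

The prompt alternates inputs and labels, so it has $2k+1$ entries; the label for $\bm{x}_{\text{query}}$ is held out as the prediction target.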

We employed a 12-layer, 8-head GPT-2 model with a hidden size of 256, trained on an $\mathbb{R}^{5}$ linear regression task with 40 in-context examples. Two cases were considered:

  • Mixed Case: Fresh data and generated data were mixed in a 0.5 ratio.

  • Full Synthetic Case: No fresh data was used.
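The two cases differ only in how each generation's training set is assembled. The sketch below shows one way to express that choice. The helper `build_training_set` and its `real_fraction` parameter are hypothetical, and reading "a 0.5 ratio" as half fresh and half generated examples is an assumption; the actual sampling procedure used in the experiments is not specified here.

```python
import random


def build_training_set(fresh, generated, real_fraction, size, seed=None):
    """Draw `size` examples for one loop of a self-consuming training run.

    A `real_fraction` share comes from fresh data and the remainder from
    data generated by the previous model (sampling without replacement).
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_fresh = round(real_fraction * size)
    n_generated = size - n_fresh
    batch = rng.sample(fresh, n_fresh) + rng.sample(generated, n_generated)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    # Placeholder examples tagged with their origin, for illustration only.
    fresh = [("fresh", j) for j in range(100)]
    generated = [("generated", j) for j in range(100)]

    mixed = build_training_set(fresh, generated, real_fraction=0.5, size=10, seed=0)
    full_synthetic = build_training_set(fresh, generated, real_fraction=0.0, size=10, seed=0)
    print(sum(tag == "fresh" for tag, _ in mixed), "fresh examples in the mixed set")
    print(sum(tag == "fresh" for tag, _ in full_synthetic), "fresh examples in the full synthetic set")
```

Setting `real_fraction=0.0` corresponds to the full synthetic case, and any positive constant value corresponds to keeping a constant-sized proportion of real data.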

The results of these experiments are summarized below:

$$\begin{array}{|c|c|c|c|c|c|c|}
\hline
\text{Loop}&1&2&3&4&5&6\\
\hline
\text{Full Synthetic}&0.3817&1.4975&1.5396&2.0836&2.3912&2.8764\\
\hline
\text{Mixed}&0.3817&0.4208&0.4391&0.4503&0.4641&0.4702\\
\hline
\end{array}$$

As observed, the error accumulates with each additional self-consuming loop. The growth is rapid in the full synthetic case, where the loss rises from 0.3817 at loop 1 to 2.8764 at loop 6. In contrast, maintaining a constant-sized proportion of real data keeps the loss at 0.4702 after six loops, which is consistent with our theoretical findings.
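The contrast between the two rows of the table can be quantified with a short calculation. The snippet below copies the reported values and computes, for each case, the ratio of the loop-6 loss to the loop-1 loss and the loop-to-loop increments; the helper names `growth_factor` and `per_loop_increase` are introduced here for illustration.

```python
# Loss values copied from the results table (loops 1 to 6).
loss = {
    "Full Synthetic": [0.3817, 1.4975, 1.5396, 2.0836, 2.3912, 2.8764],
    "Mixed": [0.3817, 0.4208, 0.4391, 0.4503, 0.4641, 0.4702],
}


def growth_factor(values):
    """Ratio of the final loss to the initial loss."""
    return values[-1] / values[0]


def per_loop_increase(values):
    """Difference in loss between consecutive loops."""
    return [round(later - earlier, 4) for earlier, later in zip(values, values[1:])]


if __name__ == "__main__":
    for name, values in loss.items():
        print(f"{name}: growth x{growth_factor(values):.2f}, "
              f"increments {per_loop_increase(values)}")
```

On these numbers the full synthetic loss grows by a factor of roughly 7.5 over six loops, while the mixed loss grows by a factor of roughly 1.2, and the loss increases at every loop in both cases.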