A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops
Abstract
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Consequently, models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). However, the empirical results have been strikingly inconsistent: some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding to explain this discrepancy. This paper introduces the intriguing notion of recursive stability and presents the first theoretical generalization analysis, revealing how both model architecture and the proportion between real and synthetic data influence the success of STLs. We further extend this analysis to transformers in in-context learning, showing that even a constant-sized proportion of real data ensures convergence, while also providing insights into optimal synthetic data sizing.
1 Introduction
The quest for high-quality data is paramount in the training of generative artificial intelligence (AI), such as large language models (LLMs). However, the vast reservoir of publicly available data on the internet has nearly been exhausted (Villalobos et al., 2022), pushing the research community to seek innovative yet plausible solutions. One promising approach is to train the next generation of LLMs using synthetic data generated by earlier generations of the models themselves (Briesch et al., 2023). Additionally, reliance on synthetic data has become almost unavoidable, as many existing datasets are already polluted with synthetic content (Schuhmann et al., 2022), which proves difficult to detect reliably (Sadasivan et al., 2023). This has led to the development of Self-consuming Training Loops (STLs), as illustrated in Figure 1, where generative models are recursively trained on a mix of real and synthetic data generated by the models themselves. In theory, these STLs of data creation and refinement could propel models to new levels of capability, reducing reliance on external datasets.
However, despite their potential, the empirical results of STLs have been highly inconsistent across studies (Shumailov et al., 2024; Alemohammad et al., 2024a; Xing et al., 2025; Dohmatob et al., 2024b). Some studies (Shumailov et al., 2024) have encountered significant setbacks: certain models stagnated, failing to improve or adapt, while others regressed, with sharp declines in performance. Conversely, other works (Gerstgrasser et al., 2024; Gillman et al., 2024; Alemohammad et al., 2024b; Ferbach et al., 2024) have successfully avoided model collapse by incorporating sufficient real data, augmenting with synthetic data, or introducing guidance during the generation process. These observed phenomena nevertheless lack thorough theoretical explanations.
When and how do STLs generalize effectively, thereby preventing model collapse from a theoretical perspective? Even among “refined” LLMs drawing from similar pools of model-generated data, the results vary significantly (Briesch et al., 2023; Fu et al., 2024a). These inconsistencies highlight the urgency of establishing theoretical guarantees for STLs by exploring the underlying mechanisms that determine when synthetic data generation either facilitates or impedes model development. Initial theoretical explorations have started to address these gaps. For instance, Shumailov et al. (2024) and Alemohammad et al. (2024a) demonstrated model collapse when models were trained exclusively on synthetic data, using simplified Gaussian models to illustrate this issue. In a more detailed theoretical study, Bertrand et al. (2024) derived upper bounds on parameter deviations for likelihood-based models in STLs, establishing convergence under strict statistical and optimization assumptions. Meanwhile, Fu et al. (2024b) relaxed these assumptions and provided bounds on the divergence between synthetic and real data distributions for a simplified diffusion model.
However, existing theoretical research lacks a unified framework and has yet to thoroughly investigate the generalization error of STLs. Additionally, current studies often overlook the role of model architectures in preventing model collapse. Moreover, the behavior of transformers within STLs remains largely unexamined, leaving significant theoretical gaps in the literature. Notably, there is limited exploration of the theoretical trade-offs introduced by synthetic data augmentation. This paper aims to address these gaps with the following contributions:
1. Theoretical Generalization Framework: This paper fills a gap in prior research by being the first to establish generalization error bounds. The key innovation, recursive stability, is introduced to address the core theoretical challenges, specifically the complex recursive structures and the non-i.i.d. nature of the data. Moreover, we demonstrate that the generalization error converges under the following conditions: (1) the generative model satisfies recursive stability, and (2) the proportion of real data is maintained at a non-negligible constant level, thus preventing model collapse.
2. Application to Transformers in In-Context Learning: This paper is the first to extend the theoretical framework to transformer models in in-context learning. We prove that transformers in this setting satisfy recursive stability with a constant-level proportion of real data, controlling output differences in STLs under small perturbations to the initial dataset. Consequently, we show that the generalization error is bounded by .
3. Trade-off Analysis of Synthetic Data Augmentation: We investigate the trade-off in synthetic data augmentation. By employing decomposition techniques, we demonstrate that while synthetic data improves the generalization performance of each generation on mixed datasets, it concurrently exacerbates distribution divergence across successive generations. Our theoretical findings further show that the optimal size of synthetic augmentation increases as the size of the real dataset decreases.
2 Related Work
This section reviews research on STLs and on algorithmic stability.
Self-consuming Training Loops. Recent research has increasingly focused on generative models trained within STLs (Shumailov et al., 2024), with much of the analysis conducted from an empirical perspective (Martínez et al., 2023). For example, Shumailov et al. (2024) and Briesch et al. (2023) observe a decline in diversity in language models when a portion of the model’s outputs is recursively used as inputs. Additionally, Wyllie et al. (2024) highlight that recursive training on synthetic data amplifies biases, resulting in significant fairness concerns. To mitigate model collapse, some studies suggest incorporating real data into the training process (Alemohammad et al., 2024a), expanding the size of synthetic datasets (Dohmatob et al., 2024c; Gerstgrasser et al., 2024; Dohmatob et al., 2024a; Feng et al., 2024b), or providing guidance during the generation process (Gillman et al., 2024; Alemohammad et al., 2024b; Feng et al., 2024a).
While empirical research has extensively explored STLs of generative models, theoretical studies on this process remain relatively sparse (Kanabar & Gastpar, 2025; Seddik et al., 2024; Marchi et al., 2024; Gerstgrasser et al., 2024; Zhu et al., 2024; Tao et al., 2024). Notably, Shumailov et al. (2024) and Alemohammad et al. (2024a) offer initial theoretical insights by analyzing a simplified Gaussian model. In a more comprehensive analysis, Bertrand et al. (2024) derive upper bounds on parameter deviations between those obtained within an STL and the optimal values, relying on assumptions about statistical and optimization error bounds. In contrast, Fu et al. (2024b) propose bounds on the divergence between synthetic and real-world data distributions, without such assumptions. However, current research lacks a unified theoretical framework that accounts for the influence of different model architectures and does not provide generalization error bounds for STLs, thus failing to rigorously establish the conditions that guarantee the prevention of model collapse. Furthermore, the behavior of transformers within STLs remains unexplored, leaving substantial theoretical gaps.
Algorithmic stability. Algorithmic stability ensures generalization bounds independent of model capacity. A key measure, uniform stability, was introduced by Bousquet & Elisseeff (2002) and has been instrumental in analyzing the generalization behavior of regularization methods. This measure was later extended to SGD (Hardt et al., 2016), including non-convex and non-smooth settings (Charles & Papailiopoulos, 2018; Bassily et al., 2020; Lei, 2023). Recent work shows that uniform stability can also provide near-optimal bounds with high probability (Feldman & Vondrak, 2019; Bousquet et al., 2020; Klochkov & Zhivotovskiy, 2021; Li & Liu, 2022; Wang et al., 2024).
Building on these foundations, recent research has focused on stability in more complex, non-i.i.d. settings. A common approach models data from stationary and mixing sequences (Doukhan, 1994; Yu, 1994), where weakening dependencies allow stability bounds through mixing coefficients (Mohri & Rostamizadeh, 2010; He et al., 2016; Fu et al., 2023). However, estimating these coefficients remains challenging. Additionally, some studies (Zheng et al., 2023) address non-i.i.d. data by leveraging conditional independence properties. Nonetheless, current methodologies struggle with the complexities of STLs, as the non-i.i.d. nature of mixed datasets, where each generation’s data is influenced by previous generations, presents unresolved challenges for stability frameworks.
Remark 1.
Building on previous challenges, our work advances this area by developing a more comprehensive theoretical framework for analyzing generative models within STLs. Specifically, we present the first generalization error bound by addressing the additional complexity arising from the non-i.i.d. nature of mixed datasets. To address this, we propose the key innovation of recursive stability, which quantifies error propagation across generations of synthetic data. Moreover, we are the first to extend this theoretical framework to transformers, explicitly utilizing error decomposition to illustrate the trade-off introduced by augmenting datasets with synthetic data.
3 Preliminaries
In this section, we begin by formally describing the training process of generative models in STLs, then introduce algorithmic stability with a focus on uniform stability, and finally define recursive stability to address the challenges specific to STLs.
3.1 Generative Models within Self-consuming Training Loops
Generative models have made significant strides in producing highly realistic data, such as images and text, which are frequently shared online and often indistinguishable from real content. Meanwhile, the supply of real data has nearly been exhausted. Consequently, deep generative models increasingly rely on synthetic data, either unintentionally (Schuhmann et al., 2022) or intentionally (Huang et al., 2022). This reliance creates a recursive cycle where successive generations are trained on mixed datasets of real and synthetic data, a process known as an STL, as shown in Figure 1.
More concretely, we explore a stochastic process that evolves through sequential generations. In an STL, we start with an initial dataset , consisting of real data points , sampled from the original real distribution . The initial generative model is trained on this real dataset , producing the first generation synthetic dataset , whose distribution is denoted as . Next, the real dataset is combined with the synthetic dataset in a certain proportion to form a new mixed dataset , with distribution . The next generation generative model is then trained on this mixed dataset . Moving forward, for each subsequent generation , the mixed dataset is composed of real data and synthetic data from previous generations. The generative model is trained on , producing the synthetic dataset . This STL proceeds iteratively until the maximum generation, denoted as , is reached.
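The loop described above can be sketched in a few lines. The following is a minimal illustration, not the paper's experimental setup: it assumes a one-dimensional Gaussian as the "generative model" (fit by sample mean and standard deviation), and the names `fit_gaussian`, `sample_gaussian`, `run_stl`, and the parameter `alpha` (the fraction of real data in each mixed dataset) are hypothetical.

```python
import random
import statistics

def fit_gaussian(data):
    """Toy 'generative model': estimate mean and standard deviation."""
    return statistics.fmean(data), statistics.pstdev(data)

def sample_gaussian(params, size, rng):
    """Draw `size` synthetic points from the fitted Gaussian."""
    mu, sigma = params
    return [rng.gauss(mu, sigma) for _ in range(size)]

def run_stl(real_data, alpha, generations, rng):
    """Run a toy self-consuming training loop.

    alpha is the assumed fraction of real points in each mixed dataset;
    the remaining points are synthetic samples from the latest model.
    Returns the fitted (mean, std) after every generation.
    """
    n = len(real_data)
    n_real = round(alpha * n)
    params = fit_gaussian(real_data)           # generation 0: real data only
    history = [params]
    for _ in range(generations):
        synthetic = sample_gaussian(params, n - n_real, rng)
        mixed = rng.sample(real_data, n_real) + synthetic
        params = fit_gaussian(mixed)           # retrain on the mixed dataset
        history.append(params)
    return history

rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(200)]
for alpha in (0.0, 0.5):
    mu, sigma = run_stl(real, alpha, generations=50, rng=random.Random(1))[-1]
    print(f"alpha={alpha}: mean={mu:.3f}, std={sigma:.3f}")
```

Comparing the final parameters for `alpha = 0.0` and `alpha = 0.5` gives a rough feel for how anchoring each generation to real data limits drift in this toy setting.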
3.2 Algorithmic Stability
Algorithmic stability measures the impact of modifying or removing a small number of examples from the training set on the resulting model, a key concept in statistical learning theory (Bousquet & Elisseeff, 2002). Its primary advantage lies in providing generalization bounds independent of model capacity. Among various stability notions (Shalev-Shwartz et al., 2010), we focus on uniform stability, the most widely studied form. Let and be two datasets differing by one point. Then, we formally define uniform stability as follows:
Definition 1.
(Uniform Stability (Bousquet & Elisseeff, 2002)). Algorithm is uniformly -stable with respect to the loss function if the following holds
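For reference, the usual statement of uniform stability from Bousquet & Elisseeff (2002) can be written as follows. The symbols are assumptions chosen for this sketch: $\mathcal{A}(S)$ is the hypothesis returned by the algorithm on dataset $S$, $\ell$ is the loss, $\beta_n$ is the stability parameter, and $S, S'$ range over pairs of datasets of size $n$ that differ in one point.

```latex
\sup_{S,\,S'} \; \sup_{z} \;
  \bigl| \ell(\mathcal{A}(S), z) - \ell(\mathcal{A}(S'), z) \bigr|
  \;\le\; \beta_n .
```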
Traditional notions of stability have predominantly been studied in the context of learning algorithms, such as SGD (Lei & Ying, 2020). More recently, there has been significant progress in extending the concept of stability to generative models (Farnia & Ozdaglar, 2021; Zheng et al., 2023; Li et al., 2023). Building on these advancements, we propose recursive stability to specifically address generative models within STLs. This new stability measure is designed to quantify the differences in a generative model’s outputs after multiple generations of recursive training when small perturbations are applied to the initial real dataset. The formal definition of recursive stability is presented below.
Definition 2.
(Recursive Stability) Let represent the original real dataset, and denote a dataset differing from by a single example. A generative model is said to be recursively -stable with respect to the distance measure after the -th generation of STLs, where the ratio of real to synthetic data is set to , if the following condition holds:
where denotes the output of the generative model at the -th generation in the STLs. The distance measure quantifies the deviation between the outputs generated from inputs and across STLs. Specifically, can be defined using Total Variation (TV) distance, Kullback-Leibler (KL) divergence, or various norms (e.g., norm), allowing flexibility in assessing the differences in generated outputs.
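To make the choice of distance concrete, the sketch below computes the three kinds of measure named above for two small discrete distributions, with the Euclidean norm standing in for the norm-based option. It is only an illustration: the function names and the example probability vectors are hypothetical, and in the setting of Definition 2 the arguments would be the outputs of the generative model obtained from the two neighbouring initial datasets.

```python
import math

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def l2_distance(u, v):
    """Euclidean (l2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(tv_distance(p, q), kl_divergence(p, q), l2_distance(p, q))
```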
4 General Theoretical Results
In this section, we present a general framework for analyzing generalization error. Moving beyond traditional analyses of parameter changes (Bertrand et al., 2024) and distributional discrepancies (Fu et al., 2024b), we focus on evaluating the utility of synthetic data after recursive training (Hittmeir et al., 2019; Xu et al., 2023). Specifically, we examine the behavior of a uniformly stable learning algorithm trained on the mixed dataset in the -th generation. Our goal is to study the generalization error of the hypothesis . Formally, we aim to bound , where represents the population risk of under the real distribution , and denotes the empirical risk on the mixed dataset. To derive this bound, we first decompose the generalization error as follows.
The first term captures the accumulation of error and distribution divergence over multiple generations within the STLs. This heavily depends on the capacity of the generative model to preserve distributional fidelity across generations, requiring recursive techniques to manage error propagation. The second term reflects the generalization performance of the learning algorithm on the non-i.i.d. mixed dataset, where synthetic data points are influenced by the initial real dataset. Drawing on Zheng et al. (2023), we observe that while satisfies the i.i.d. assumption, the synthetic datasets follow a conditional i.i.d. assumption given . Leveraging this, along with moment bounds and concentration inequalities, we address the challenge of bounding the second term and managing dependencies within the STLs. We now present the following result.
Theorem 1 (General Generalization Bound).
Assume that is a -uniformly stable learning algorithm and the loss function is bounded by . Let represent the sample size of the mixed dataset , defined as for , where denotes the proportion of real data. Assume further that the generative model is recursively -stable, and the TV distance for each generation is of the same order, denoted by . Then, for any , with probability at least , the following holds:
(1)
where , with and representing two real datasets of size , differing by only a single data point.
Remark 2.
Recursive Stability in STLs. In Theorem 1, the recursive stability parameter is quantified using the TV distance to measure the divergence between the distributions of the synthetic data points generated by the model at the -th generation. Notably, the concept of recursive stability, introduced in Definition 2, is adaptable to various metrics, making it applicable across different types of generative models. In Theorem 2, the recursive stability parameter for transformers is instead defined using the norm between tokens, allowing this concept to be generalized to a broader range of model architectures.
Moreover, Theorem 1 demonstrates that generative models with higher recursive stability exhibit better performance after undergoing the STL. Specifically, the results indicate that the convergence rate of recursive stability parameter is at least faster than , which is a relatively mild condition. Furthermore, Theorem 2 shows that, under mild assumptions, the recursive stability parameter for transformers in in-context learning settings achieves a convergence rate of when measured by the norm between tokens.
Remark 3.
Effect of Real Data Proportion on Error Control. Previous experimental results (Shumailov et al., 2024; Alemohammad et al., 2024a) have demonstrated that incorporating real data can mitigate model collapse and help control errors. This remark focuses on exploring the effect of the real data proportion on the generalization error within the STLs. As shown in Theorem 1, the real data proportion plays a significant role in the cumulative distribution shift across generations, specifically in the term .
When , we observe that , leading to a linear accumulation of errors due to the distribution shift, making it increasingly challenging to control the overall error. This observation aligns with the theoretical results reported in Shumailov et al. (2024); Dohmatob et al. (2024a); Fu et al. (2024b). However, it is important to note that the conditions on for controlling this term are not strict. In fact, as long as remains at a non-negligible constant level, the expression remains bounded, effectively controlling the error. This aligns with theoretical intuition: when is too small, the mixed dataset contains insufficient real data, resulting in a more severe distribution shift.
Moreover, the proportion of real data also impacts the generalization error on mixed distributions, primarily through its effect on the recursive stability parameter . As increases, the generative model becomes more recursively stable. We will further explore the influence of on the recursive stability parameter for specific generative models, such as transformers, in Theorem 2, particularly in Remark 8.
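The contrast between linear accumulation and a bounded shift can be checked numerically. The sketch below assumes that the cumulative factor behaves like a geometric sum $\sum_{j=1}^{i}(1-\alpha)^j$, where $\alpha$ stands for the real-data proportion and $i$ for the generation index. This form is an assumption chosen to match the behaviour described in this remark (growth proportional to $i$ with no real data, boundedness for a constant proportion) and may differ from the exact expression in Theorem 1 by constants. The function name is hypothetical.

```python
def cumulative_shift_factor(alpha, generations):
    """Assumed geometric accumulation: sum_{j=1}^{i} (1 - alpha)^j."""
    return sum((1.0 - alpha) ** j for j in range(1, generations + 1))

for alpha in (0.0, 0.1, 0.5):
    factors = [cumulative_shift_factor(alpha, i) for i in (1, 10, 100)]
    print(alpha, [round(f, 3) for f in factors])
```

Under this assumed form, `alpha = 0` makes the factor equal to the number of generations, whereas any fixed `alpha > 0` keeps it below `(1 - alpha) / alpha` regardless of how many generations are run.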
Remark 4.
Convergence Rate of Uniform Stability Parameter. With respect to the uniform stability parameter , we observe from the third term on the right-hand side of inequality (1) that the convergence rate of must be at least to adequately control the error. This is a relatively mild requirement.
For example, in the case of widely used algorithms such as SGD, it has been shown that the uniform stability parameter converges at a rate of under the assumptions of Lipschitz continuity and smoothness of the loss function (Zhang et al., 2022). Additionally, for regularization-based algorithms, such as kernel regularization schemes and the Minimum Relative Entropy (MRE) algorithm, it has been demonstrated that can achieve a convergence rate of under certain conditions (Bousquet & Elisseeff, 2002).
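The dependence of the stability parameter on the sample size can be probed empirically. The sketch below is a toy check, not a result from the paper: it uses one-dimensional ridge regression (a regularization-based algorithm) with the squared loss, replaces a single training point, and measures the largest change in loss over a grid of test points. The data model, the regularization strength `lam`, and all function names are hypothetical choices.

```python
import random

def ridge_fit(data, lam):
    """Minimise (1/n) * sum (w*x - y)^2 + lam * w^2 for a scalar weight w."""
    n = len(data)
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + n * lam)

def stability_gap(n, lam, rng):
    """Max loss change over a test grid when one training point is replaced."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = max(-1.0, min(1.0, 0.5 * x + rng.gauss(0.0, 0.1)))
        data.append((x, y))
    neighbour = [(1.0, -1.0)] + data[1:]        # differs in exactly one point
    w, w_prime = ridge_fit(data, lam), ridge_fit(neighbour, lam)
    grid = [(-1.0 + 0.1 * a, -1.0 + 0.1 * b)
            for a in range(21) for b in range(21)]
    return max(abs((w * x - y) ** 2 - (w_prime * x - y) ** 2) for x, y in grid)

rng = random.Random(0)
for n in (50, 500, 5000):
    print(n, stability_gap(n, lam=0.1, rng=rng))
```

The measured gap is only a lower estimate of the true stability parameter, since it examines one particular replacement rather than the worst case, but it shows the shrinking sensitivity to a single point as the sample grows.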
Remark 5.
Convergence of the Distribution Shift Term . Regarding the convergence of the term , as discussed in Remark 3, when remains a non-negligible constant, attention turns to the distribution shift term . This term critically depends on the generative model’s capacity and quantifies the divergence between the learned distribution and the input distribution in each generation.
Theoretical studies have provided various convergence rates for across different generative models. For instance, in diffusion models, has been shown to converge at a rate of (Fu et al., 2024b). Similarly, for GANs, the convergence rate is also (Liang, 2021). More generally, by applying Pinsker’s inequality to relate KL divergence and TV distance, the convergence rates for other models, such as bias potential models and normalizing flows, have been explored in previous works (Yang, 2022). Additionally, we will further examine the behavior of transformer models in Theorem 3, demonstrating the flexibility of our theoretical framework across a wide range of generative models.
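For completeness, Pinsker's inequality, which is the step that converts a KL bound into a TV bound, reads as follows for two distributions $P$ and $Q$ (generic symbols introduced here for illustration, with the KL divergence measured in nats):

```latex
d_{\mathrm{TV}}(P, Q) \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, Q)} .
```

Consequently, a KL convergence rate of order $r(n)$ translates into a TV rate of order $\sqrt{r(n)}$.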
Remark 6.
Comparison with Previous Works. In the realm of theoretical research on the STL, where models are recursively trained on the synthetic data they generate, the foundational work was introduced by Shumailov et al. (2024) and Alemohammad et al. (2024a). They provided the initial theoretical definitions and analyzed the behavior of a simplistic multivariate Gaussian toy model in such loops. However, their analyses were limited to basic theoretical insights and lacked in-depth exploration of more complex generative models.
Recent advancements in this field have primarily come from Bertrand et al. (2024) and Fu et al. (2024b). Bertrand et al. (2024) established an upper bound on the deviation of likelihood-based model output parameters from the optimal ones, denoted as . This was achieved by making direct assumptions on the upper bounds of both statistical and optimization errors in generative models, as outlined in their Assumption 3. In contrast, Fu et al. (2024b) derived bounds on the TV distance, addressing the distribution divergence between the synthetic data distributions produced by future models and the original real data distribution, with a specific focus on diffusion models. Our work makes significant theoretical advancements over both Bertrand et al. (2024) and Fu et al. (2024b) in several key aspects:
1. Innovative Concept of Recursive Stability. A central technical contribution of our work is the extension of the traditional notion of algorithmic stability. We define recursive stability, a crucial factor for controlling error propagation across generations. This novel concept tackles the theoretical challenges posed by non-i.i.d. data and recursive structures within STLs, while also incorporating the influence of model architectures into the generalization error. Moreover, recursive stability serves as a new measure for assessing the stability of generative models within STLs. In Theorem 2, we further establish an upper bound on the recursive stability parameter for transformers under mild conditions, underscoring the broad applicability and robustness of our framework.
2. Establishing the First Generalization Error Bound for STLs. While Bertrand et al. (2024) primarily focused on parameter deviations in generative models and Fu et al. (2024b) concentrated on distribution divergence, our work emphasizes the utility of the generated data produced by STLs. Specifically, by utilizing recursive stability, we present the first generalization error bound that quantifies the gap between the population risk on the initial real data distribution and the empirical risk of the hypothesis , generated by applying learning algorithms to the synthetic data produced after multiple generations of STLs. This introduces a new layer of complexity compared to prior work, as it necessitates handling not only the distribution shifts within STLs but also the challenges arising from the non-i.i.d. nature of the mixed datasets, where each generation’s data is influenced by all preceding generations.
3. A More General Framework Accounting for Model Structure. Our proposed theoretical framework is more comprehensive than previous studies. Bertrand et al. (2024) restricted their analysis to simplified likelihood-based generative models, while Fu et al. (2024b) focused specifically on diffusion models. Importantly, neither of their theoretical results accounted for the impact of different model architectures. In contrast, as discussed in Remark 5, our framework explicitly incorporates the effects of varying model structures, thereby extending its applicability to a broader range of generative models. Notably, we are the first to extend the theory of STLs to transformer models, further broadening the scope of our approach across diverse generative model architectures.
4. Comprehensive Collapse Prevention Through Recursive Stability. Existing theoretical work primarily analyzes conditions for avoiding model collapse in terms of the proportion of real data (e.g., Bertrand et al. (2024); Fu et al. (2024b)). Our work extends these analyses by also considering the impact of model architecture. Specifically, Theorem 1 demonstrates that under a recursive stability condition and a non-negligible constant level of real data, model collapse can be avoided across a variety of model architectures. This analysis offers broader conditions for preventing collapse by incorporating recursive stability, deepening the understanding of how model architecture affects training robustness.
Remark 7.
Proof Sketch of Theorem 1. We first decompose the generalization error of STLs into two distinct terms: (1) the cumulative distribution shift across generations, and (2) the generalization error on the mixed dataset.
Cumulative Distribution Shift: This term measures the shift between the real dataset and the mixed distribution after the -th generation. Using the TV distance to quantify the shift introduced by the generative model, we bound the difference as:
By leveraging the recursive structure of the generative process, this cumulative distribution shift can be bounded across all generations as:
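One way to see how such a recursion unrolls is the following sketch. It relies on assumed notation that is not taken verbatim from the paper: $\mathcal{D}_0$ is the real distribution, $\mathcal{D}_j$ the synthetic distribution of generation $j$, $\widetilde{\mathcal{D}}_j = \alpha \mathcal{D}_0 + (1-\alpha)\mathcal{D}_j$ the mixed distribution with real-data proportion $\alpha > 0$ (and $\widetilde{\mathcal{D}}_0 = \mathcal{D}_0$), and $d_{\mathrm{TV}}(n)$ a uniform bound on the per-generation shift $\mathrm{TV}(\widetilde{\mathcal{D}}_{j-1}, \mathcal{D}_j)$. The constants may differ from those in the formal proof.

```latex
\begin{aligned}
\mathrm{TV}(\mathcal{D}_0, \widetilde{\mathcal{D}}_i)
  &= (1-\alpha)\,\mathrm{TV}(\mathcal{D}_0, \mathcal{D}_i) \\
  &\le (1-\alpha)\bigl[\mathrm{TV}(\mathcal{D}_0, \widetilde{\mathcal{D}}_{i-1})
       + \mathrm{TV}(\widetilde{\mathcal{D}}_{i-1}, \mathcal{D}_i)\bigr] \\
  &\le (1-\alpha)\,\mathrm{TV}(\mathcal{D}_0, \widetilde{\mathcal{D}}_{i-1})
       + (1-\alpha)\,d_{\mathrm{TV}}(n) \\
  &\le \dots \le d_{\mathrm{TV}}(n)\sum_{j=1}^{i}(1-\alpha)^{j}
   = d_{\mathrm{TV}}(n)\,\frac{(1-\alpha)\bigl(1-(1-\alpha)^{i}\bigr)}{\alpha}.
\end{aligned}
```

The first line uses the fact that mixing with $\mathcal{D}_0$ scales the TV distance to $\mathcal{D}_0$ by $(1-\alpha)$, and the second is the triangle inequality. In the limiting case $\alpha = 0$ the sum equals $i$, which corresponds to linear accumulation of the per-generation shift.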
Generalization Error on the Mixed Dataset: The second term quantifies the generalization error when training on the mixed dataset , which consists of both real and synthetic data. Our goal is to establish a moment bound on the generalization error, which can be decomposed as follows:
In this context, represents a proportion of the data points in , leading to a total of data points. Similarly, denotes a subset of the synthetic dataset , where and its size is . For each term, we leverage the uniform stability of the learning algorithm and the recursive stability of the generative model to address the non-i.i.d. nature of the mixed dataset. The mixed dataset exhibits conditional independence (Zheng et al., 2023), with synthetic data conditioned on the initial real dataset , allowing the application of recursive techniques to derive the moment bound. Subsequently, Lemma 8 and Lemma 9 are utilized to derive the high-probability bound for the final result.
5 Theoretical Analysis of Transformers in In-Context Learning
In this section, we first present the transformer in in-context learning (ICL) and its settings within STLs in Section 5.1. In Section 5.2, we prove that it satisfies recursive stability, followed by the derivation of the generalization error bound for transformers in ICL in Section 5.3. Finally, in Section 5.4, we explore the scenario of synthetic data augmentation and investigate the associated trade-offs.
5.1 Settings of Transformer in In-context Learning
In-Context Learning Setting. ICL involves a transformer model processing a sequence of input-output examples to perform inference without parameter updates. Unlike traditional supervised learning, where a model is trained on a fixed dataset and then makes predictions, ICL allows the model to adapt on-the-fly to new queries based on the provided examples. We denote a prompt, containing in-context examples followed by the ()-th query input, as where represents i.i.d. in-context samples, and is the query input whose label we want to predict. The transformer model, denoted as , takes the prompt as input and outputs the predicted label for the query : .
Recursive Data Generation in STLs with ICL. We extend the traditional ICL setting to an STL, where the transformer recursively generates new data using its own ICL predictions. Starting with an initial real dataset , this serves as the initial real in-context examples for the transformer. The process begins by sampling the first generation queries i.i.d. from the input distribution . Each query is incorporated into the in-context examples from as a new query , and the transformer predicts the corresponding label . This produces a synthetic dataset , consisting of inputs and their predicted labels . A mixed dataset is then formed and used as the in-context examples for the next generation. This process continues, with each generation producing a new synthetic dataset based on the updated mixed dataset .
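The recursive procedure can be sketched as below. This is a toy stand-in rather than the transformer studied in the paper: the predictor is a single softmax-attention readout over the in-context examples (scalar inputs, labels in [-1, 1]), and the names `attention_predict` and `icl_stl`, the temperature `tau`, the proportion `alpha`, and the uniform query distribution are all hypothetical choices made for illustration.

```python
import math
import random

def attention_predict(context, query, tau=0.1):
    """Softmax-attention readout: weighted average of in-context labels."""
    scores = [-(x - query) ** 2 / tau for x, _ in context]
    top = max(scores)                            # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * y for w, (_, y) in zip(weights, context)) / total

def icl_stl(real, alpha, generations, rng):
    """Recursively label fresh queries using the current in-context set."""
    n = len(real)
    n_real = max(1, round(alpha * n))            # keep at least one real example
    context = list(real)
    for _ in range(generations):
        queries = [rng.uniform(-1.0, 1.0) for _ in range(n - n_real)]
        synthetic = [(x, attention_predict(context, x)) for x in queries]
        context = rng.sample(real, n_real) + synthetic   # next mixed dataset
    return context

rng = random.Random(0)
real = [(x, math.sin(3.0 * x))
        for x in (rng.uniform(-1.0, 1.0) for _ in range(100))]
final_context = icl_stl(real, alpha=0.5, generations=10, rng=rng)
error = sum(abs(attention_predict(final_context, x) - math.sin(3.0 * x))
            for x in (-0.5, 0.0, 0.5)) / 3
print(f"mean abs error after 10 generations: {error:.3f}")
```

Each generation only ever sees labels produced from the previous mixed context, so any prediction error made early on is carried forward unless real examples are retained, which is the dependence that recursive stability is meant to control.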
5.2 Recursive Stability of In-Context Learning with Transformers
In this section, we demonstrate that transformers exhibit recursive stability within the ICL framework. Following the ICL setting from Li et al. (2023), we show that the model effectively controls error propagation from perturbations in the initial real dataset, ensuring stability across the STLs.
Theorem 2.
Let be two initial real datasets that only differ at the inputs and where . Assume the inputs and labels lie within the unit Euclidean ball in . Represent the prompts and as matrices . Let be an -layer transformer. Given as the initial input, the -th layer applies MLPs and self-attention, producing the output:
Assume is normalized as and obey with . Let output the last token of the final layer that corresponds to the query . Let represent the sample size of the mixed dataset , where for . Then, we obtain:
where and denotes the mixed dataset at the -th generation in the STL when the initial real dataset is . Additionally, if the measure for the recursive stability parameter in Definition 2 is taken as the norm, then the recursive stability .
Remark 8.
Controlling Exponential Growth with Real Data Proportion. In this remark, we further investigate the influence of the proportion of real data on the recursive stability of transformers. As outlined in Theorem 2, the upper bound of the recursive stability parameter includes a term that grows exponentially with the number of generations in the STL, specifically . However, we show that even a constant proportion of real data, , is sufficient to control this growth.
Specifically, setting , we establish that the recursive stability parameter in Theorem 2 satisfies . Additionally, as the number of generations in the STL approaches infinity, the proportion asymptotically converges to . Notably, the depth is typically small in practical settings. For example, studies on LLM performance in STLs, such as Briesch et al. (2023), often employ models with . Furthermore, techniques like layer normalization effectively constrain the norm of weights , ensuring numerical stability during training. Thus, with a constant real data proportion independent of the STL generation number , the exponential growth term can be effectively controlled, ensuring that .
5.3 Generalization Bound for Transformers in In-Context Learning
In this section, we investigate the behavior of transformers under the ICL framework in STLs. We select SGD as the learning algorithm and consider a binary task with . Applying our general theoretical framework from Theorem 1, we derive the generalization error bound by addressing the terms and using recent results on SGD (Zhang et al., 2022) and ICL (Zhang et al., 2023). The recursive stability parameter is obtained from Theorem 2. We assume that the loss function is -smooth and -Lipschitz, which are standard assumptions in related works (Hardt et al., 2016; Lei & Ying, 2020), with formal definitions provided in Appendix A.1. Examples include logistic and Huber losses. We now present the generalization error bound:
Theorem 3.
Consider an -layer transformer under the setting described in Theorem 2. Let represent the sample size of the mixed dataset , where for . Suppose that the loss function is -smooth, -Lipschitz and bounded by for every . Let denote the output after running SGD for iterations with a step size on the mixed dataset . Then, for any , with probability at least , the following holds:
(2)
Remark 9.
In this remark, we provide a detailed explanation of the theoretical results of Theorem 3. As discussed earlier in Remark 8, is set to . To enhance clarity and focus on the primary results, we omit constant terms and the factor. Consequently, the bound in Theorem 3 can be expressed as follows:
In this bound, the terms correspond to the generalization error on the mixed dataset, while the term represents the cumulative distribution shift across generations, which is primarily governed by the learnability of the generative model.
It is evident from this result that the generative model’s capacity plays a crucial role in the performance within the STLs. The ability of the generative model to maintain distributional fidelity over multiple generations directly impacts the generalization error and determines how well the model can control the propagation of errors across generations.
5.4 Synthetic Data Augmentation
The previous theorem addresses the scenario where the training dataset is unintentionally contaminated by synthetic data, leading to STLs. In contrast, many researchers intentionally incorporate synthetic data to augment the real dataset, also creating STLs. Next, we explore this synthetic data augmentation scenario, where each generation’s synthetic data is added to the mixed dataset, i.e., .
Theorem 4.
Consider an -layer transformer under the setting described in Theorem 2. Let and represent the sample size of the real dataset and the synthetic dataset , respectively, where . The mixed dataset is denoted as . Suppose that the loss function is -smooth, -Lipschitz and bounded by for every . Let denote the output after running SGD for iterations with a step size on the mixed dataset . Then, for any , with probability at least , the following holds:
Remark 10.
Analyzing the Trade-off in Synthetic Data Augmentation for STLs. In this remark, we examine the trade-off between generalization and distribution shifts from increased synthetic data, providing insights into optimal synthetic data size. At each generation, synthetic data points are added to the mixed dataset. We analyze how the coefficient , representing the scale of synthetic data augmentation, affects the generalization error in STLs. From the bound in Theorem 4, we observe that the Cumulative Distribution Shift Across Generations term is expressed as:
As the coefficient increases, the cumulative distribution shift correspondingly grows, thereby amplifying the associated error. This behavior aligns with intuition, as an increase in reduces the proportion of real data within the mixed dataset at each generation. Consequently, this reduction in real data leads to a greater divergence between the mixed distribution and the true underlying distribution, exacerbating the deviation and compounding the error across successive generations. In contrast, for the Generalization Error on Mixed Distributions term:
We observe that as increases, the corresponding error decreases. This outcome is consistent with theoretical intuition, as augmenting the dataset with synthetic data effectively enlarges the mixed dataset. A larger dataset provides a more comprehensive representation of the mixed distribution, which in turn reduces the generalization error associated with this distribution. By incorporating more synthetic data, the mixed dataset better approximates the underlying mixed distribution, leading to improved generalization performance.
From the above discussion, we can conclude that the inclusion of synthetic data introduces a trade-off: on one hand, it increases the error from the cumulative distribution shift, while on the other, it reduces the generalization error on the mixed distribution. This trade-off has been explored theoretically in Fu et al. (2024b), though they primarily provided theoretical intuition. In contrast, our work explicitly decomposes the error into two terms, offering a deeper understanding of this trade-off and its implications for model performance in STLs. As for the optimal augmentation coefficient , it must satisfy the following condition:
Unfortunately, a closed-form solution for from this equation is analytically intractable. However, the equation still shows how the optimal coefficient depends on the size of the real dataset. Specifically, by omitting irrelevant constants and the term, we obtain that should satisfy the following expression:
We observe an important trend: the value of increases as the size of the real dataset decreases. This aligns with theoretical intuition, as a smaller real dataset struggles to adequately represent the underlying distribution, leading to higher generalization error. Consequently, more synthetic data is required to control the generalization error of each generation on the mixed distribution. Conversely, when the real dataset is sufficiently large, the need for synthetic data augmentation diminishes.
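To make the dataset-size side of this trade-off concrete, the following minimal sketch tabulates how the mixed dataset grows and how the share of real data shrinks under accumulation. It assumes a real dataset of size `n_real` and that every generation appends `lam * n_real` synthetic points to the mixed dataset. The names `accumulation_schedule`, `n_real`, and `lam` are illustrative and do not come from the paper, and the sketch does not evaluate the bound in Theorem 4.

```python
def accumulation_schedule(n_real, lam, generations):
    """Size of the mixed dataset and share of real data per generation.

    Assumption (illustrative): generation i holds the n_real real points
    plus i batches of lam * n_real synthetic points.
    """
    rows = []
    for i in range(generations + 1):
        total = n_real + i * lam * n_real
        rows.append((i, total, n_real / total))
    return rows


if __name__ == "__main__":
    for gen, total, real_share in accumulation_schedule(1000, 0.5, 4):
        print(f"generation {gen}: size={total:.0f}, real share={real_share:.3f}")
```

With `lam = 0.5`, the real share drops to one third after four generations while the dataset triples in size. This is the tension described in Remark 10: a larger mixed dataset on one side, a smaller proportion of real data on the other.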
6 Conclusion
As real-world data becomes increasingly scarce and existing datasets are progressively contaminated with synthetic content, STLs have emerged as a necessary strategy. STLs enable generative models to recursively train on a mix of real and synthetic data. However, empirical outcomes have varied significantly, revealing the need for a theoretical foundation to guide their successful application.
In this work, we introduced recursive stability as a key technical innovation and established the first generalization error bounds for STLs, which consider the impact of different model architectures. Our analysis demonstrated that preventing model collapse requires two critical conditions: maintaining a non-negligible proportion of real data and ensuring that models satisfy recursive stability. Furthermore, we were the first to extend this framework to transformers in in-context learning, showing that they also satisfy recursive stability and establishing their generalization error bounds. Finally, we explored the trade-off introduced by synthetic data augmentation, balancing generalization improvement with potential distributional shifts. These contributions provide new insights into enhancing the stability and performance of generative models in STLs.
Acknowledgement
This project is supported by the National Research Foundation, Singapore, under its NRF Professorship Award No. NRF-P2024-001.
References
- Alemohammad et al. (2024a) Sina Alemohammad, Josue Casco-Rodriguez, Lorenzo Luzi, Ahmed Imtiaz Humayun, Hossein Babaei, Daniel LeJeune, Ali Siahkoohi, and Richard Baraniuk. Self-consuming generative models go mad. In The Twelfth International Conference on Learning Representations, 2024a.
- Alemohammad et al. (2024b) Sina Alemohammad, Ahmed Imtiaz Humayun, Shruti Agarwal, John Collomosse, and Richard Baraniuk. Self-improving diffusion models with synthetic data. arXiv preprint arXiv:2408.16333, 2024b.
- Bassily et al. (2020) Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33, 2020.
- Bertrand et al. (2024) Quentin Bertrand, Joey Bose, Alexandre Duplessis, Marco Jiralerspong, and Gauthier Gidel. On the stability of iterative retraining of generative models on their own data. In The Twelfth International Conference on Learning Representations, 2024.
- Bousquet & Elisseeff (2002) Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
- Bousquet et al. (2020) Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pp. 610–626, 2020.
- Briesch et al. (2023) Martin Briesch, Dominik Sobania, and Franz Rothlauf. Large language models suffer from their own output: An analysis of the self-consuming training loop. arXiv preprint arXiv:2311.16822, 2023.
- Charles & Papailiopoulos (2018) Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning, pp. 744–753, 2018.
- Dohmatob et al. (2024a) Elvis Dohmatob, Yunzhen Feng, and Julia Kempe. Model collapse demystified: The case of regression. arXiv preprint arXiv:2402.07712, 2024a.
- Dohmatob et al. (2024b) Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian, and Julia Kempe. Strong model collapse. arXiv preprint arXiv:2410.04840, 2024b.
- Dohmatob et al. (2024c) Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, and Julia Kempe. A tale of tails: Model collapse as a change of scaling laws. In Forty-first International Conference on Machine Learning, 2024c.
- Doukhan (1994) P. Doukhan. Mixing: Properties and examples. Lecture notes in statistics. New York: Springer, 1994.
- Farnia & Ozdaglar (2021) Farzan Farnia and Asuman Ozdaglar. Train simultaneously, generalize better: Stability of gradient-based minimax learners. In International Conference on Machine Learning, pp. 3174–3185. PMLR, 2021.
- Feldman & Vondrak (2019) Vitaly Feldman and Jan Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory, pp. 1270–1279, 2019.
- Feng et al. (2024a) Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, and Julia Kempe. Beyond model collapse: Scaling up with synthesized data requires reinforcement. In ICML 2024 Workshop on Theoretical Foundations of Foundation Models, 2024a.
- Feng et al. (2024b) Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, Julia Kempe, and FAIR Meta. Beyond model collapse: Scaling up with synthesized data requires verification. arXiv preprint arXiv:2406.07515, 2024b.
- Ferbach et al. (2024) Damien Ferbach, Quentin Bertrand, Avishek Joey Bose, and Gauthier Gidel. Self-consuming generative models with curated data provably optimize human preferences. arXiv preprint arXiv:2407.09499, 2024.
- Fu et al. (2023) Shi Fu, Yunwen Lei, Qiong Cao, Xinmei Tian, and Dacheng Tao. Sharper bounds for uniformly stable algorithms with stationary mixing process. In The Eleventh International Conference on Learning Representations, 2023.
- Fu et al. (2024a) Shi Fu, Yuzhu Chen, Yingjie Wang, and Dacheng Tao. On championing foundation models: From explainability to interpretability. arXiv preprint arXiv:2410.11444, 2024a.
- Fu et al. (2024b) Shi Fu, Sen Zhang, Yingjie Wang, Xinmei Tian, and Dacheng Tao. Towards theoretical understandings of self-consuming generative models. In Forty-first International Conference on Machine Learning, 2024b.
- Gerstgrasser et al. (2024) Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Tomasz Korbak, Henry Sleight, Rajashree Agrawal, John Hughes, Dhruv Bhandarkar Pai, Andrey Gromov, Dan Roberts, Diyi Yang, David L. Donoho, and Sanmi Koyejo. Is model collapse inevitable? breaking the curse of recursion by accumulating real and synthetic data. In First Conference on Language Modeling, 2024.
- Gillman et al. (2024) Nate Gillman, Michael Freeman, Daksh Aggarwal, HSU Chia-Hong, Calvin Luo, Yonglong Tian, and Chen Sun. Self-correcting self-consuming loops for generative model training. In Forty-first International Conference on Machine Learning, 2024.
- Hardt et al. (2016) Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225–1234, 2016.
- He et al. (2016) Fangchao He, Ling Zuo, and Hong Chen. Stability analysis for ranking with stationary φ-mixing samples. Neurocomputing, 171:1556–1562, 2016.
- Hittmeir et al. (2019) Markus Hittmeir, Andreas Ekelhart, and Rudolf Mayer. On the utility of synthetic data: An empirical evaluation on machine learning tasks. In Proceedings of the 14th International Conference on Availability, Reliability and Security, pp. 1–6, 2019.
- Huang et al. (2022) Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
- Kanabar & Gastpar (2025) Millen Kanabar and Michael Gastpar. Minimax discrete distribution estimation with self-consumption. arXiv preprint arXiv:2501.19273, 2025.
- Klochkov & Zhivotovskiy (2021) Yegor Klochkov and Nikita Zhivotovskiy. Stability and deviation optimal risk bounds with convergence rate O(1/n). In Advances in Neural Information Processing Systems, 2021.
- Lei (2023) Yunwen Lei. Stability and generalization of stochastic optimization with nonconvex and nonsmooth problems. In The Thirty Sixth Annual Conference on Learning Theory, pp. 191–227. PMLR, 2023.
- Lei & Ying (2020) Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pp. 5809–5819, 2020.
- Li & Liu (2022) Shaojie Li and Yong Liu. High probability generalization bounds with fast rates for minimax problems. In International Conference on Learning Representations, 2022.
- Li et al. (2023) Yingcong Li, Muhammed Emrullah Ildiz, Dimitris Papailiopoulos, and Samet Oymak. Transformers as algorithms: Generalization and stability in in-context learning. In International Conference on Machine Learning, pp. 19565–19594. PMLR, 2023.
- Liang (2021) Tengyuan Liang. How well generative adversarial networks learn distributions. Journal of Machine Learning Research, 22(228):1–41, 2021.
- Marchi et al. (2024) Matteo Marchi, Stefano Soatto, Pratik Chaudhari, and Paulo Tabuada. Heat death of generative models in closed-loop learning. arXiv preprint arXiv:2404.02325, 2024.
- Martínez et al. (2023) Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, and Rik Sarkar. Towards understanding the interplay of generative artificial intelligence and the internet. In International Workshop on Epistemic Uncertainty in Artificial Intelligence, pp. 59–73. Springer, 2023.
- Mohri & Rostamizadeh (2010) Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11(2), 2010.
- Sadasivan et al. (2023) Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can ai-generated text be reliably detected? arXiv preprint arXiv:2303.11156, 2023.
- Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
- Seddik et al. (2024) Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, and Merouane Debbah. How bad is training on synthetic data? a statistical analysis of language model collapse. arXiv preprint arXiv:2404.05090, 2024.
- Shalev-Shwartz et al. (2010) Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635–2670, 2010.
- Shumailov et al. (2024) Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. Ai models collapse when trained on recursively generated data. Nature, 631(8022):755–759, 2024.
- Tao et al. (2024) Zhengwei Tao, Ting-En Lin, Xiancai Chen, Hangyu Li, Yuchuan Wu, Yongbin Li, Zhi Jin, Fei Huang, Dacheng Tao, and Jingren Zhou. A survey on self-evolution of large language models. arXiv preprint arXiv:2404.14387, 2024.
- Villalobos et al. (2022) Pablo Villalobos, Jaime Sevilla, Lennart Heim, Tamay Besiroglu, Marius Hobbhahn, and Anson Ho. Will we run out of data? an analysis of the limits of scaling datasets in machine learning. arXiv preprint arXiv:2211.04325, 2022.
- Wang et al. (2024) Peng Wang, Li Shen, Zerui Tao, Shuaida He, and Dacheng Tao. Generalization analysis of stochastic weight averaging with general sampling. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=XwVkqvyziD.
- Wyllie et al. (2024) Sierra Wyllie, Ilia Shumailov, and Nicolas Papernot. Fairness feedback loops: training on synthetic data amplifies bias. In The 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 2113–2147, 2024.
- Xing et al. (2025) Xiaodan Xing, Fadong Shi, Jiahao Huang, Yinzhe Wu, Yang Nan, Sheng Zhang, Yingying Fang, Michael Roberts, Carola-Bibiane Schönlieb, Javier Del Ser, et al. On the caveats of ai autophagy. Nature Machine Intelligence, pp. 1–9, 2025.
- Xu et al. (2023) Shirong Xu, Will Wei Sun, and Guang Cheng. Utility theory of synthetic data generation. arXiv preprint arXiv:2305.10015, 2023.
- Yang (2022) Hongkang Yang. A mathematical framework for learning probability distributions. arXiv preprint arXiv:2212.11481, 2022.
- Yu (1994) Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pp. 94–116, 1994.
- Zhang et al. (2022) Yikai Zhang, Wenjia Zhang, Sammy Bald, Vamsi Pingali, Chao Chen, and Mayank Goswami. Stability of sgd: Tightness analysis and improved bounds. In Uncertainty in artificial intelligence, pp. 2364–2373. PMLR, 2022.
- Zhang et al. (2023) Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization. arXiv preprint arXiv:2305.19420, 2023.
- Zheng et al. (2023) Chenyu Zheng, Guoqiang Wu, and Chongxuan Li. Toward understanding generative data augmentation. Advances in Neural Information Processing Systems, 36:54046–54060, 2023.
- Zhu et al. (2024) Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, and Bowen Zhou. How to synthesize text data without model collapse? arXiv preprint arXiv:2412.14689, 2024.
Appendix A Appendix
A.1 Auxiliary Definitions
Below, we present some essential definitions.
Definition 3.
(Lipschitz and Smoothness). Let constants . Consider the function . We define the following properties:
- Lipschitz Continuity: The loss is said to be -Lipschitz continuous if for any .
- Smoothness: The loss is said to be -smooth if for any .
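As a small numerical illustration of these two properties, the sketch below checks the logistic loss, viewed as a function of a scalar margin `t`. In that parameterization its derivative is bounded by 1 in absolute value and its second derivative by 1/4. Treating the loss as a function of the margin rather than of the model parameters is a simplifying assumption made here, and the helper names are illustrative.

```python
import math


def logistic_loss(t):
    """Logistic loss log(1 + exp(-t)) of the margin t, in a stable form."""
    if t >= 0:
        return math.log1p(math.exp(-t))
    return -t + math.log1p(math.exp(t))


def logistic_grad(t):
    """Derivative of the logistic loss: -1 / (1 + exp(t))."""
    if t >= 0:
        e = math.exp(-t)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(t))


def max_slopes(points):
    """Largest finite-difference slope of the loss and of its gradient."""
    lipschitz_est, smooth_est = 0.0, 0.0
    for a, b in zip(points, points[1:]):
        gap = b - a
        lipschitz_est = max(
            lipschitz_est, abs(logistic_loss(b) - logistic_loss(a)) / gap
        )
        smooth_est = max(
            smooth_est, abs(logistic_grad(b) - logistic_grad(a)) / gap
        )
    return lipschitz_est, smooth_est


# Grid of margins in [-20, 20] with spacing 0.01.
grid = [-20.0 + 0.01 * k for k in range(4001)]
lipschitz_est, smooth_est = max_slopes(grid)
```

The estimated slopes stay below 1 and 1/4 respectively (up to floating-point error), consistent with the logistic loss being both Lipschitz and smooth in the margin.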
A.2 Expansion to Gaussian Mixture Models
We adopt the setup from prior work (Zheng et al., 2023) and consider a binary classification task where . Given a vector with and noise variance , the data distribution is specified as follows: and . We define the conditional generative model using parameters and , where and . For data points, let represent the number of samples in class . The parameters of the Gaussian mixture model are then learned as:
Then we can generate new samples from the distribution: and , where . Additionally, the learning algorithm functions as a linear classifier, parameterized by , with predictions given by: . The loss function is defined as:
Thus, the output is
In this setting, we demonstrate recursive stability for the Gaussian mixture model as follows:
Theorem 5.
Let denote two initial real datasets differing by a single example. Let represent the sample size of the mixed dataset , where for . Choose . Consider the previously described sampling and learning steps, where real data samples are drawn from the Gaussian Mixture Model distribution , and the synthetic data for the -th generation is generated from the learned Gaussian Mixture distribution of the -th generation. Then with probability at least , we have:
(3)
where the measure for the recursive stability parameter is taken as the KL divergence.
As approaches 0, indicating that no real data is incorporated during each generation of training, we observe
which suggests a linear accumulation of errors. This finding aligns closely with the theoretical insights presented in Shumailov et al. (2024); Alemohammad et al. (2024a), where a Gaussian model trained without real data demonstrated a linear divergence in variance. Thus, this underscores the validity of our theoretical results, confirming that the derived bound is meaningful and not vacuous.
Moreover, by leveraging the generalization error bound established in Theorem 1, we derive the following:
Theorem 6.
Consider the Gaussian Mixture Model in the setting outlined above. Let represent the sample size of the mixed dataset , where for . Suppose the loss function is defined as . Let denote the output of applying the linear classifier described above to the mixed dataset . Then, for any , with probability at least , the following holds:
(4)
We observe that when is set to a constant (e.g., ), the generalization error can be effectively controlled, preventing model collapse. This result aligns with the experimental findings in Alemohammad et al. (2024a) for Gaussian models.
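The qualitative behavior discussed in this subsection can be explored with a small simulation. The sketch below is a simplified one-dimensional version of the setting and rests on several assumptions that are not taken from the paper: the true parameters are fixed to mean 1 and variance 1, the class means are tied as plus or minus a single estimate, the estimators are pooled sample averages, and only the fitted variance is tracked (neither the KL divergence of Theorem 5 nor the classifier of Theorem 6 is computed). All function names are illustrative.

```python
import random


def sample_gmm(n, mu, sigma, rng):
    """Draw n pairs (x, y) with y uniform on {-1, +1} and x ~ N(y * mu, sigma^2)."""
    data = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        data.append((rng.gauss(y * mu, sigma), y))
    return data


def fit_gmm(data):
    """Pooled estimates: mu_hat = mean(y * x), sigma2_hat = mean((x - y * mu_hat)^2)."""
    n = len(data)
    mu_hat = sum(y * x for x, y in data) / n
    sigma2_hat = sum((x - y * mu_hat) ** 2 for x, y in data) / n
    return mu_hat, sigma2_hat


def run_stl(alpha, n=500, generations=10, seed=0):
    """Return the fitted variance after each generation of a toy STL.

    Each generation trains on round(alpha * n) points from the initial real
    dataset plus synthetic points drawn from the previous generation's fit.
    """
    rng = random.Random(seed)
    real = sample_gmm(n, 1.0, 1.0, rng)
    mu_hat, sigma2_hat = fit_gmm(real)
    history = [sigma2_hat]
    n_real = round(alpha * n)
    for _ in range(generations):
        synthetic = sample_gmm(n - n_real, mu_hat, sigma2_hat ** 0.5, rng)
        mu_hat, sigma2_hat = fit_gmm(real[:n_real] + synthetic)
        history.append(sigma2_hat)
    return history


if __name__ == "__main__":
    print("alpha=0.0:", [round(v, 3) for v in run_stl(0.0)])
    print("alpha=0.5:", [round(v, 3) for v in run_stl(0.5)])
```

Comparing the two printed trajectories gives a rough feel for the role of the real-data proportion: with `alpha = 0.5` the fitted variance is repeatedly pulled back toward the value estimated from real data, whereas with `alpha = 0.0` estimation errors are free to accumulate across generations.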
A.3 Additional Comparison with Related Work on Theorem 1
Dohmatob et al. (2024a) examined a linear regression setting, focusing solely on statistical approximation error without addressing the functional approximation error described in Shumailov et al. (2024). They did not consider incorporating real data to prevent collapse and demonstrated a linear dependency of degradation on the generation number in the case of fully synthetic data. Similarly, Alemohammad et al. (2024a) and Shumailov et al. (2024) provided theoretical insights using simple Gaussian models without incorporating real data, proving that the variance diverges linearly with the generation number. Seddik et al. (2024) explored a linear softmax classifier and, while also neglecting functional approximation error, demonstrated that adding real data can mitigate model collapse. Marchi et al. (2024) used asymptotic analysis to study parameter variance, assuming an infinite number of training generations and considering scenarios where the generative model is controlled via a “temperature” parameter. They proved that parameter variance is bounded under these conditions.
In contrast, our work addresses a much more complex and realistic scenario by introducing the novel concept of recursive stability and providing the first generalization analysis for STLs. Our analysis accounts for statistical approximation error, functional approximation error, and optimization error during the training of generative models. Unlike the settings explored in prior theoretical works, such as linear regression (Dohmatob et al., 2024a; Gerstgrasser et al., 2024), Gaussian models (Alemohammad et al., 2024a; Shumailov et al., 2024), or asymptotic assumptions (Marchi et al., 2024), our framework accommodates more complex generative model architectures, such as transformers. Specifically, we reveal how both model architecture and the ratio of real to synthetic data influence the success of STLs. For example, in Theorem 3, we demonstrate how our general generalization bound applies to transformer-based generative models, providing a theoretical framework that aligns with practical and more sophisticated use cases.
Additionally, while Marchi et al. (2024) assumed an infinite number of training generations for their asymptotic analysis, we consider finite generations, which is more practical since most experimental setups limit generations to fewer than 10 (as noted in Shumailov et al. (2024)). Moreover, our results confirm that when (i.e., no real data is used), the last term in our bound, representing the Cumulative Distribution Shift (), grows linearly. This finding aligns with the theoretical results of Dohmatob et al. (2024a); Alemohammad et al. (2024a); Shumailov et al. (2024); Fu et al. (2024b). Furthermore, we show that introducing even a constant proportion of real data significantly mitigates model collapse, aligning with experimental findings by Alemohammad et al. (2024a) and Bertrand et al. (2024).
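The contrast between linear growth without real data and controlled growth with a constant proportion of real data can be illustrated with a toy recursion. The recursion below is an assumption made purely for illustration and is not the bound of Theorem 1: each generation adds a fixed per-generation error `eps`, and mixing in a fraction `alpha` of real data scales the accumulated shift by `1 - alpha`.

```python
def cumulative_shift(alpha, eps, generations):
    """Toy recursion s_i = (1 - alpha) * (s_{i-1} + eps) with s_0 = 0.

    For alpha = 0 this gives s_i = i * eps (linear growth). For alpha > 0
    it is a geometric series bounded by eps * (1 - alpha) / alpha.
    """
    shifts = [0.0]
    for _ in range(generations):
        shifts.append((1.0 - alpha) * (shifts[-1] + eps))
    return shifts


if __name__ == "__main__":
    for alpha in (0.0, 0.1, 0.5):
        final = cumulative_shift(alpha, eps=0.01, generations=10)[-1]
        print(f"alpha={alpha}: shift after 10 generations = {final:.4f}")
```

Under this toy model, a constant `alpha` keeps the accumulated shift bounded regardless of the number of generations, while `alpha = 0` lets it grow in proportion to the generation count.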
A.4 Additional Comparison with Related Work on Theorem 4
Gerstgrasser et al. (2024) also explored the use of accumulating data to prevent model collapse. They considered a simple linear regression setting without accounting for the dynamic process of training generative models, focusing solely on statistical approximation error. They demonstrated that under the assumption of fixed synthetic data quality matching the original real data, statistical approximation error can be controlled.
By contrast, our work addresses a much more complex and realistic scenario, incorporating the dynamic behavior of transformer-based generative models, learning algorithms, and both statistical and functional approximation errors. Additionally, we allow for dynamic regulation of synthetic data size via a coefficient, enabling us to identify the optimal synthetic dataset size for avoiding model collapse in these more challenging settings.
A.5 Auxiliary Lemmas
In this section, we introduce a set of auxiliary lemmas that will be used in the subsequent proofs.
Lemma 7 (McDiarmid’s Inequality).
Consider independent random variables and a mapping . If, for all , and for all , the function satisfies
then,
Furthermore, for any ,
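As a quick sanity check of this type of concentration bound, the sketch below compares the standard two-sided McDiarmid bound with an empirical tail frequency for the sample mean of `n` independent uniform variables on [0, 1]. Changing one coordinate moves the mean by at most 1/n, so the bound becomes `2 * exp(-2 * n * eps**2)`. The choice of function, distribution, and parameter values is an assumption made for illustration.

```python
import math
import random


def mcdiarmid_bound(n, eps):
    """Two-sided McDiarmid bound for the mean of n variables in [0, 1].

    Each bounded difference is c_i = 1 / n, so sum(c_i^2) = 1 / n and the
    bound on P(|mean - E[mean]| >= eps) is 2 * exp(-2 * n * eps^2).
    """
    return 2.0 * math.exp(-2.0 * n * eps * eps)


def empirical_tail(n, eps, trials, seed=0):
    """Fraction of trials where the uniform sample mean deviates from 0.5 by >= eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) >= eps:
            hits += 1
    return hits / trials


n, eps = 100, 0.1
bound = mcdiarmid_bound(n, eps)
tail = empirical_tail(n, eps, trials=2000)
```

The empirical frequency `tail` sits well below `bound`, as expected: McDiarmid's inequality uses only the bounded-difference property and is therefore conservative for well-behaved distributions.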
Lemma 8.
(Bousquet et al., 2020). Let be a vector of independent random variables each taking values in , and let be some functions such that the following holds for any :
- ,
- ,
- has a bounded difference with respect to all variables except the -th variable, that is, for all and , we have .
Then, for any ,
Lemma 9.
If for any , then for any , with probability at least ,
In addition, we introduce the definition of the Total Variation (TV) distance as follows:
Definition 4 (Total Variation Distance).
Given two probability distributions and over a multidimensional space , the Total Variation Distance between and is:
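For distributions on a finite set, the TV distance equals half the L1 distance between the probability vectors. The sketch below computes it for dictionaries mapping outcomes to probabilities and checks numerically that mixing a distribution `q` with a fraction `alpha` of the reference distribution `p` shrinks the TV distance to `p` by the factor `1 - alpha`. The discrete setting and the helper names `tv_distance` and `mix` are assumptions made for illustration.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions.

    p and q map outcomes to probabilities; missing outcomes count as 0.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def mix(p, q, alpha):
    """Mixture alpha * p + (1 - alpha) * q."""
    keys = set(p) | set(q)
    return {k: alpha * p.get(k, 0.0) + (1.0 - alpha) * q.get(k, 0.0) for k in keys}


real = {"a": 0.7, "b": 0.3}
synthetic = {"a": 0.4, "b": 0.6}
alpha = 0.25

gap = tv_distance(real, synthetic)
mixed_gap = tv_distance(real, mix(real, synthetic, alpha))
```

Here `mixed_gap` equals `(1 - alpha) * gap`, which is the elementary reason why retaining a proportion of real data in the mixed distribution damps the shift introduced by synthetic data.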
A.6 Proof of Theorem 1
In this section, we prove Theorem 1 by first decomposing the generalization error into two components: the Cumulative Distribution Shift Across Generations and the Generalization Error on Mixed Distributions. We then bound the Cumulative Distribution Shift Across Generations by leveraging the properties of the generative model and recursive techniques. For the Generalization Error on Mixed Distributions, we follow the framework of Zheng et al. (2023), leveraging the fact that within the mixed dataset , the set satisfies the conditional i.i.d. assumption when is fixed. Combined with moment bounds, this allows us to effectively bound the generalization error.
The main proof is as follows:
Proof of Theorem 1.
We begin by decomposing the generalization error as follows:
Upper Bounding Cumulative Distribution Shift Term
For the term , we first note that . Therefore, we obtain:
(5)
We can further decompose it as follows:
(6)
By substituting inequality 6 into inequality 5, we obtain:
(7)
Then, for the term , we have:
(8)
Incorporating inequality 8 into inequality 7, we arrive at:
(9)
Next, we apply recursive techniques to address the problem further. First, we obtain
(10)
Plugging inequality 10 into inequality 9, we obtain:
By recursion, we obtain:
Let represent the sample size of the real dataset , and let denote the sample size of the mixed dataset in the -th generation. Thus, can be written as a function of . Assuming that the sample size for each generation’s dataset is identical, i.e., , and that the TV distance for each generation is of the same order, denoted by , we can derive the following result:
(11)
Then we obtain:
(12)
Upper Bounding Generalization Error on Mixed Distributions Term
Next, we turn our attention to the term . Our primary objective is to establish a moment bound for this expression.
(13)
The newly sampled dataset, denoted as , is a subset of the original dataset , where and its size is . Specifically, contains a proportion of the data points in , resulting in a total of data points. Similarly, is a subset of the synthetic dataset , where , and its size is . Specifically, contains a proportion of the data points in , resulting in data points.
We observe that for any function , if there exists a bound for some subset , then we have the following:
Fix ; then the data in are independent. We use this property and Lemma 8 to bound Term 2. We introduce functions , which play the same role as the ’s in Lemma 8, as
where is the -th data point in , and is obtained by replacing by . Next, we prove that satisfies the three conditions outlined in Lemma 8. First, we demonstrate condition .
We then continue by proving conditions :
Finally, we prove that has a bounded difference with respect to all variables except the -th variable. Let , then we obtain:
Therefore, for any fixed , by Lemma 8, for any , we have
(14)
We note that the difference between Term 2 and is minimal. Consequently, for any fixed , we can bound Term 2 using inequality 14 as follows.
Next, for Term 2, we derive the following result:
(15)
Now, we use a similar idea to bound Term 1. We decompose it as follows.
(16)
We proceed by bounding each term. Specifically, Term 3 can be bounded using McDiarmid’s inequality, as outlined in Lemma 7, and Term 4 can be bounded by applying Lemma 8.
To bound Term 3, we begin by fixing and utilizing the conditional independence property of once again. In order to apply Lemma 7 (McDiarmid’s inequality), we must show that exhibits a bounded difference with respect to when is fixed. This can be formulated as follows.
Thus, by McDiarmid’s inequality, we have
(17)
We now introduce a set of functions and apply Lemma 8 once more to bound Term 4. Specifically, we define , which serves a similar role to the ’s in Lemma 8, as follows:
(18)
where denotes the -th data point in , and represents the dataset obtained by replacing with . Additionally, it is important to note that indicates that is the synthetic dataset generated after the self-consuming loop, following -generations, and obtained by modifying a single data point from the initial real dataset . This complex scenario can be addressed using the recursive stability we have defined for the self-consuming loop in Definition 2. Moreover, similar to the process above, we observe that and . It remains to prove that exhibits a bounded difference, which we demonstrate as follows.
(19)
(20)
We can bound equation 19 by applying the concept of uniform stability, resulting in an upper bound of . Regarding equation 20, for ease of notation, let us represent as . Consequently, we obtain the following:
(21)
Thus, exhibits a bounded difference of with respect to all variables except the -th variable. By applying Lemma 8, we obtain the following:
We observe that the difference between Term 4 and is negligible. Thus, we can bound Term 4 as follows:
By substituting the above inequality and inequality 17 into the decomposition 16, we obtain:
(22)
Plugging inequalities 22 and 15 into inequality 57, we obtain:
(23)
By applying Lemma 8, we can derive a bound on the generalization error with respect to the mixed distribution as follows.
Finally, we conclude that:
(24)
∎
A.7 Proof of Theorem 2
In this section, we prove that transformers in in-context learning exhibit recursive stability. Specifically, we utilize the framework and lemmas from Li et al. (2023), combined with recursive techniques, to establish the proof.
Proof of Theorem 2.
Let and be the input and perturbation matrices respectively. Given that the tokens lie in the unit ball, and assuming also lies in the unit ball, we can proceed with the following. For a matrix, let denote the -norm of the vector formed by the -norms of its rows. Therefore, we obtain and 1. Let the attention outputs be defined as and . Define the perturbation as .
Let us examine the attention output difference , which can be further decomposed as follows:
(25) |
We first observe that preserves the norm, meaning that satisfies and . Moreover, for any pair of tokens, it holds that . Applying Lemma 10, we can therefore derive the following:
(26) |
Subsequently, for , we establish the following expression
To advance the analysis, we introduce the -scaled perturbation , where is constrained within . Our approach involves first bounding the derivative as , and then integrating this bound along the path of , effectively covering the interval from to . Notably, as , the quadratic terms proportional to diminish, simplifying the analysis at this limit.
To bound the latter, we focus on each row separately. Consider a row from and its perturbed version , represented by the pair . It follows that for any cross product, we have the guarantees and . Additionally, the bounds and hold. Applying the perturbation bounds provided by Lemma 10, we obtain the desired result
Summing across all rows, we obtain the following:
By integrating the derivative over the interval to , we obtain the final expression,
(27) |
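The derivative-then-integrate argument above can be checked numerically on the softmax itself. The following is a minimal sketch, not the statement of Lemma 10: it assumes the generic (and loose) bound that the change in softmax output in the 1-norm is at most twice the change in the input in the max-norm, which follows from bounding the softmax Jacobian and integrating along the straight path between the two inputs. The names `softmax` and `worst_ratio`, the dimension 10, and the perturbation size 0.1 are illustrative choices.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
worst_ratio = 0.0
for _ in range(1000):
    a = rng.normal(size=10)                    # attention logits (hypothetical)
    delta = rng.uniform(-0.1, 0.1, size=10)    # small perturbation of the logits
    lhs = np.abs(softmax(a + delta) - softmax(a)).sum()  # 1-norm output change
    rhs = 2.0 * np.abs(delta).max()                      # 2 * max-norm input change
    worst_ratio = max(worst_ratio, lhs / rhs)

# The assumed bound holds if the worst observed ratio does not exceed 1.
print("worst lhs/rhs ratio:", worst_ratio)
```

A ratio that stays below 1 across all trials is consistent with the assumed bound; the constants in the paper's Lemma 10 may differ.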
By substituting inequality 27 and inequality 26 into the decomposition 25, we derive the following result:
(28) |
To continue, we aim to control the output for a specific index where the input perturbation remains small, specifically . To address this, we will apply the same argument, focusing on the -th token. For the -th token (omitting subscripts for clarity), let the inputs be denoted as , and the corresponding outputs as . Similar to the previous decomposition, we can derive the following:
(29) |
By leveraging the fact that for all , and applying Lemma 10, we can establish a bound similar to that in equation 26. Specifically, we can constrain the terms involved as follows:
(30) |
Similarly, for , we can derive the following:
Now, considering the perturbation , and letting , we apply the triangle inequality to obtain the following result:
(31) |
In a similar manner to the previous steps, we can derive the following:
(32) |
Next, we examine the effect of the MLP layer on the model’s behavior. Let represent the weights of the parallel MLPs that follow the self-attention mechanism. Given that , we denote the MLP outputs corresponding to the self-attention results and as and , respectively. From this, we can derive the following relationship.
Let denote the ReLU function, which is a 1-Lipschitz continuous activation function with . First, observe that each row of is given by , where represents the weights of the MLPs. Given the properties of the ReLU function, we can derive the following bound:
Next, we consider the difference between the perturbed and original outputs. We can express the difference as , which, due to the 1-Lipschitz property of , is further bounded by . Finally, we obtain:
(33) |
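The row-wise control of the MLP output can be illustrated with a small numerical check. This is a sketch under stated assumptions: the weight matrix `W` is rescaled to have spectral norm 1 (the exact norm constraint used in the proof is not reproduced here), and the sizes `n` and `d` are hypothetical. Because ReLU is 1-Lipschitz, the change in each output row should not exceed the change in the corresponding input row.

```python
import numpy as np

def relu(z):
    # ReLU is 1-Lipschitz and maps 0 to 0.
    return np.maximum(z, 0.0)

def mlp_rows(A, W):
    # Apply the same MLP weights W to every row (token) of A.
    return relu(A @ W.T)

rng = np.random.default_rng(0)
n, d = 16, 8                      # hypothetical number of tokens and width
W = rng.normal(size=(d, d))
W /= np.linalg.norm(W, 2)         # assumption: spectral norm of W equals 1
A = rng.normal(size=(n, d))       # stand-in for the self-attention output
E = 1e-2 * rng.normal(size=(n, d))  # row-wise perturbation

row_in = np.linalg.norm(E, axis=1)
row_out = np.linalg.norm(mlp_rows(A + E, W) - mlp_rows(A, W), axis=1)

# Each output row moves by at most as much as its input row.
lipschitz_holds = bool(np.all(row_out <= row_in + 1e-12))
print("row-wise bound holds:", lipschitz_holds)
```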
Thus, we conclude that the perturbations in the rows of are controlled by the corresponding perturbations in . Consequently, we establish the bound
Thus, from inequality 28, we derive the following result:
(34) |
Furthermore, for any such that , it holds that
where represents the -th row of . With this, we have addressed the stability of the single-layer transformer. Next, we extend the analysis to the stability of the -layer transformer. First, we can derive the following:
where represents the number of layers in the transformer. Then, for the -layer transformer, we have the following:
What remains is to perform induction on the difference between the last tokens . We claim that, for all layers,
This claim holds at because the change in the last token is at most . By induction, the claim holds for all layers, and we conclude the proof by setting , covering the entire depth of the -layer transformer. Finally, we obtain:
(35) |
Next, we further analyze the self-consuming process. Let and represent two initial real datasets that differ only in their inputs, specifically and , where . Since , we have the following:
(36) | ||||
Then, and are used as in-context examples, and i.i.d. queries are sampled from . These queries, along with the in-context examples and , are processed through the transformer model to predict their respective labels. As a result, the first generation of synthetic datasets, and , is produced. Then we obtain:
(37) |
Given the mixed dataset , where for , we can proceed with further analysis based on the specified combination of the original dataset and the synthetic dataset .
(38) |
By reintroducing the mixed datasets and as in-context examples into the transformer model, and considering the query set as i.i.d. samples from the distribution , we can derive the transformer’s output according to Equation 36:
(39) |
From the above expression, we can further derive that
Thus,
Similarly, for the second generation, following analogous steps, we can derive that
(40) |
Building on the above expression, we can further deduce that
(41) |
The discrepancy between the mixed datasets is as follows:
(42) |
Utilizing recursive techniques, we can obtain the following:
(43) |
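As a generic illustration of the unrolling used in such recursive arguments (not the exact constants of this proof), suppose a nonnegative sequence $a_i$ satisfies a one-step recursion with hypothetical constants $\rho \ge 0$, $\rho \neq 1$, and $c \ge 0$. Here $a_i$, $\rho$, and $c$ are stand-ins introduced only for this example. Unrolling the recursion gives:

```latex
a_i \;\le\; \rho\, a_{i-1} + c
    \;\le\; \rho^{2} a_{i-2} + (1+\rho)\,c
    \;\le\; \cdots
    \;\le\; \rho^{i} a_0 + c \sum_{j=0}^{i-1} \rho^{j}
    \;=\; \rho^{i} a_0 + c\,\frac{1-\rho^{i}}{1-\rho}.
```

When $\rho < 1$, the right-hand side is bounded by $a_0 + c/(1-\rho)$ uniformly in $i$, so the discrepancy does not grow without bound across generations.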
Ultimately, the discrepancy between the transformer outputs after generations of the self-consuming loop for and can be obtained as follows:
Subsequently, given that , we define the measure as the -norm to quantify the output discrepancy of the generative transformer model after iterations of the self-consuming loop, starting from the initial real datasets and . In this context, the recursive stability parameter , as described in Definition 2, can be bounded by the following expression, providing a formal measure of the model’s stability across iterations:
The proof is complete.
∎
A.8 Proof of Theorem 3
In this section, building on the general theoretical framework established in Theorem 1, we provide the proof of Theorem 3 by analyzing the terms and , leveraging recent advancements in SGD (Zhang et al., 2022) and ICL (Zhang et al., 2023). The recursive stability parameter is derived from Theorem 2.
Lemma 11.
(Uniform stability of SGD in the non-convex case (Zhang et al., 2022)). Assume is -smooth and -Lipschitz. Run iterations of SGD with step size . Then the stability of SGD satisfies
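The notion of uniform stability in Lemma 11 can be probed empirically. The sketch below is only an illustration and makes several assumptions that are not in the lemma: it uses a convex least-squares loss rather than a non-convex one, a decaying step size of the form `c / t`, and it measures the largest change in prediction on held-out points rather than the change in loss. The functions `sgd_least_squares` and `stability_gap` are hypothetical helpers.

```python
import numpy as np

def sgd_least_squares(X, y, steps=200, c=0.1, seed=0):
    # Plain SGD on squared loss with step size c / t (assumed schedule).
    rng = np.random.default_rng(seed)  # same seed => same sampling order
    w = np.zeros(X.shape[1])
    for t in range(1, steps + 1):
        i = rng.integers(len(y))
        w -= (c / t) * (X[i] @ w - y[i]) * X[i]
    return w

def stability_gap(n, d=5, seed=0):
    # Train on two datasets that differ in a single example and
    # report the largest prediction change on held-out inputs.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = X @ rng.normal(size=d)
    X2, y2 = X.copy(), y.copy()
    X2[0] = rng.normal(size=d)   # replace one data point
    y2[0] = rng.normal()
    X_test = rng.normal(size=(100, d))
    w = sgd_least_squares(X, y)
    w2 = sgd_least_squares(X2, y2)
    return float(np.max(np.abs(X_test @ w - X_test @ w2)))

for n in (20, 200, 2000):
    print(n, stability_gap(n))
```

Sharing the random seed across the two runs couples the sampling order, which is the standard device for comparing SGD trajectories on neighboring datasets.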
Lemma 12.
(Zhang et al., 2023) Let represent the probability distribution induced by the transformer with parameter . Additionally, the model is pretrained by the algorithm:
where . Furthermore, we consider the realizable setting, where the ground-truth probability distribution and are consistent for some . Then, with probability at least , the following inequality holds:
(44) |
where denotes that we omit constants that are independent of and .
Proof of Theorem 3.
First, we note that in the setting where the transformer generates data through in-context learning, the generalization error of the self-consuming loop is given by:
(45) |
Now, we are ready to prove Theorem 3. The main idea is to bound the uniform stability parameter , the recursive stability parameter , and the learnability of the generative model through the total variation distance, as stated in Theorem 1. We first consider the bound on the total variation distance in Theorem 1. Equation 8 in the proof of Theorem 1 can be rewritten in the in-context learning setting as follows:
(46) |
where the second equality holds because, in the -th generation of the self-consuming loop, the mixed data distribution from the -th generation is reintroduced as the ground truth distribution to train the transformer. As a result, the transformer outputs the synthetic data distribution for the -th generation. Thus, corresponds to in Theorem 1. Finally, the bound for the total variation distance follows from Lemma 12.
(47) |
Similarly, for the recursive stability parameter in the self-consuming loop, we rederive Equation 21 from the proof of Theorem 1 under the in-context learning setting:
(48) |
For the uniform stability parameter of the SGD algorithm, we can derive the bound from Lemma 11. Substituting the above results into Theorem 1, we obtain the following conclusion:
(49) |
∎
A.9 Proof of Theorem 4
In this section, we prove Theorem 4. The proof follows a similar approach to that of Theorem 3; however, it is more intricate due to the fact that the mixed dataset in Theorem 4 contains synthetic data from all previous generations. Each generation’s synthetic dataset depends on the synthetic datasets of previous generations, leading to a more complex non-i.i.d. setting. Similar to Theorem 3, we begin by decomposing the generalization error into two components: the Cumulative Distribution Shift Across Generations and the Generalization Error on Mixed Distributions.
The main proof is as follows:
Proof of Theorem 4.
We begin by decomposing the generalization error as follows:
Upper Bounding Cumulative Distribution Shift Term
For the term , we first note that . Therefore, we obtain:
(50) |
We can further decompose it as follows:
(51) |
By substituting inequality 51 into inequality 50, we obtain:
(52) |
Thus, from equation 46 in the proof of Theorem 3 and Lemma 12, we obtain:
(53) |
Incorporating inequality 53 into inequality 52, we arrive at:
(54) |
Let . Then, we obtain:
Similarly, we get:
Then, we have
(55) |
Thus, by applying recursive techniques, we obtain the following result:
(56) |
Upper Bounding Generalization Error on Mixed Distributions Term
Next, we turn our attention to the term . Our primary objective is to establish a moment bound for this expression.
(57) |
Fixing , the data in are independent. Following a similar approach to the proof of Theorem 1, we utilize this property along with Lemma 8 to bound Term i. Consequently, from Equation 15 in the proof of Theorem 1, we obtain:
(58) |
Next, we consider Term 0. Similar to the proof of Theorem 3, we first introduce a set of functions and apply Lemma 8 to bound Term 0. Specifically, we define , which serves a similar role to the ’s in Lemma 8, as follows:
(59) |
where denotes the -th data point in , and represents the dataset obtained by replacing with . Moreover, following the procedure above, we observe that and . It remains to prove that exhibits a bounded difference. Note, however, that all depend on , so when a single data point in is changed, the corresponding datasets also change. We denote these modified datasets as , and consequently we have the following:
(60) |
We now apply the recursive stability established in Theorem 2. Note first that in Theorem 2 the mixed dataset is defined as , whereas in this theorem the mixed dataset is defined as . Therefore, by following the proof steps outlined in Theorem 2, we can derive the following:
Thus, we apply Lemma 8:
We observe that the difference between Term 0 and is negligible. Thus, we can bound Term 0 as follows:
(61) |
Using the same method, for Term , where , we can derive the following:
(62) |
In summary, we can finally bound the Generalization Error on the Mixed Distributions term as follows:
Then, according to Lemma 9, we obtain, with probability at least :
Combining the above inequality with inequality 56, we obtain:
The proof is complete.
∎
Appendix B Experiments
In this section, we present some experimental results. Specifically, we trained transformer models to in-context learn linear functions within STLs.
In these experiments, we considered the class of linear functions:
in dimensions. We sampled , and independently from the isotropic Gaussian distribution . For each , we computed and constructed the prompt as:
We employed a 12-layer, 8-head GPT-2 model with a hidden size of 256, trained on an linear regression task with 40 in-context examples. Two cases were considered:
• Mixed Case: Fresh data and generated data were mixed in a 0.5 ratio.
• Full Synthetic Case: No fresh data was used.
The results of these experiments are summarized below:
As observed, the error accumulates progressively with more self-consuming loops, particularly in the full synthetic case, where the error grows rapidly. In contrast, maintaining a constant-sized proportion of real data effectively reduces the loss, which is consistent with our theoretical findings.
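The qualitative behavior of the two cases can be explored with a toy simulation. This is a minimal sketch and not the experiment reported above: an ordinary least-squares regressor stands in for the GPT-2 model, Gaussian label noise of scale `noise` is added to generated labels to mimic imperfect generation, and the dimension `d=5`, the noise level, and the number of generations are hypothetical choices. Only the 40 examples per generation and the 0.5 mixing ratio are taken from the setup described in this section.

```python
import numpy as np

def run_loop(real_fraction, generations=10, n=40, d=5, noise=0.1, seed=0):
    # Toy self-consuming loop: each generation refits a linear model on a
    # mix of fresh real labels and labels produced by the previous model.
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)

    # Generation 0: fit on noisy real data only.
    X = rng.normal(size=(n, d))
    y = X @ w_true + noise * rng.normal(size=n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    errors = [float(np.linalg.norm(w_hat - w_true))]

    n_real = int(round(real_fraction * n))
    for _ in range(generations):
        X = rng.normal(size=(n, d))
        # Synthetic labels come from the current model plus generation noise.
        y = X @ w_hat + noise * rng.normal(size=n)
        # The first n_real examples receive fresh real (noiseless) labels.
        y[:n_real] = X[:n_real] @ w_true
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        errors.append(float(np.linalg.norm(w_hat - w_true)))
    return errors

mixed = run_loop(real_fraction=0.5)       # Mixed Case
synthetic = run_loop(real_fraction=0.0)   # Full Synthetic Case
print("mixed:    ", np.round(mixed, 3))
print("synthetic:", np.round(synthetic, 3))
```

In this simplified setting one would expect the parameter error in the fully synthetic run to drift like a random walk, while the mixed run is repeatedly pulled back toward the true parameters; the exact numbers depend on the seed and on the assumed noise level.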