TRACE for Tracking the Emergence of Semantic Representations in Transformers
Abstract
Modern transformer models exhibit phase transitions during training, distinct shifts from memorisation to abstraction, but the mechanisms underlying these transitions remain poorly understood. Prior work has often focused on endpoint representations or isolated signals like curvature or mutual information, typically in symbolic or arithmetic domains, overlooking the emergence of linguistic structure. We introduce TRACE (Tracking Representation Abstraction and Compositional Emergence), a diagnostic framework combining geometric, informational, and linguistic signals to detect phase transitions in Transformer-based LMs. TRACE leverages a frame-semantic data generation method, ABSynth, that produces synthetic corpora with controllable complexity, lexical distributions, and structural entropy, fully annotated with linguistic categories, enabling precise analysis of abstraction emergence. Experiments reveal that (i) phase transitions align with clear intersections between curvature collapse and dimension stabilisation; (ii) these geometric shifts coincide with emerging syntactic and semantic accuracy; (iii) abstraction patterns persist across architectural variants, with components like feedforward networks affecting optimisation stability rather than fundamentally altering trajectories. This work advances our understanding of how linguistic abstractions emerge in LMs, offering insights into model interpretability, training efficiency, and compositional generalisation that could inform more principled approaches to LM development.
1 Introduction
Transformer models exhibit evolving internal representations during training, with recent work showing that these representations undergo phase transitions: abrupt shifts in representational structure, generalisation behaviour, and learning dynamics [31, 36, 41]. (Throughout this paper, we use the term "transformer" to refer to Transformer-based architectures as implemented in Vaswani et al. [50].) These transitions mark critical points where models reorganise internal representations and develop increasingly abstract, structured encodings [4, 49, 31].
Understanding the mechanisms and timing of these shifts is essential for interpretability, model steering, and failure mode detection [20]. While prior studies have characterised models’ behaviour at convergence or via final-layer probes, less is known about how internal linguistic structures form over the course of training. Furthermore, much of the literature on transformer interpretability and representation focuses on algorithmic or mathematical tasks [33, 53, 54], or examines geometric properties at the level of individual tokens or local concepts [39, 49]. These works leave open the question of how holistic, sentence-level semantic representations arise in transformers.
We address this gap by introducing TRACE: Tracking Representation Abstraction and Compositional Emergence (Fig. 1), a diagnostic framework that combines geometric, linguistic, and information-theoretic signals to characterise how transformers transition from memorisation to abstraction. (We use "compositional emergence" to denote the formation of structured internal representations, e.g., roles and syntactic categories, rather than formal compositional generalisation.)
Our central hypothesis is that abstraction emerges through a measurable phase transition, marked by: (i) characteristic rise-then-stabilise patterns in intrinsic dimensionality of hidden representations; (ii) transient spikes in loss curvature; (iii) surges in linguistic category alignment, particularly for part-of-speech and semantic accuracy; and (iv) decrease in mutual information between input and hidden representations. We test whether these phenomena, each informative in isolation, exhibit coordinated temporal dynamics that can serve as reliable sentence-level representation/abstraction indicators.

To isolate model dynamics from data confounds, we introduce a synthetic data generation framework, ABSynth, based on formal frame semantics [19, 5]. Unlike template-based approaches, ABSynth samples from abstract event frames with predefined semantic roles (agent, patient, etc.), producing corpora with transparent syntactic and semantic structure. We instantiate this framework with ABSynth25K, enabling precise tracking of how theoretically-motivated linguistic abstractions emerge across layers and training iterations.
This work addresses the following research questions:
- What geometric and statistical signals accompany the transition from memorisation to abstraction w.r.t. sentence-level representations?
- When do syntactic and semantic categories emerge in transformer representations over training?
- What mechanisms and training dynamics trigger these phase transitions, and how do architectural and optimisation factors influence their onset?
This paper makes three key contributions. First, we introduce TRACE, a unified diagnostic framework that jointly tracks abstraction and early representational structure formation using coordinated geometric (curvature), statistical (mutual information, intrinsic dimensionality), and linguistic (probing-based) signals throughout training. Second, we propose an original spectral curvature complexity measure that characterises loss landscape properties. Third, we present ABSynth, a synthetic sentence generation framework grounded in frame semantics, from which we derive the supporting ABSynth25K corpus for controlled analysis of representational dynamics.
Across all experiments, we observe a consistent phase transition, indicated by coordinated shifts in curvature flattening, intrinsic dimensionality, and probe performance, which marks the onset of abstraction. These patterns persist across model scales and ablation variants, pointing to structural regularities in transformer learning dynamics. Understanding these transitions supports designing more interpretable, adaptive, and resource-efficient language models.
2 Related Work
Phase Transitions in Training Dynamics.
Representational transitions during training have been documented in small-scale settings like grokking [36, 41], where models shift from memorisation to generalisation. Lee et al. [31] attributed these to geometric reorganisation, while Clauw et al. [11] identified emergent information-theoretic structure. Nakkiran et al. [35] described double-descent phenomena, where generalisation is preceded by changes in spectral behaviour. Stagewise development has also been observed in attention heads and induction circuits [37].
Loss Landscape Geometry and Generalisation.
Curvature properties help illuminate learning dynamics in neural networks [7, 51]. Early work linked sharp minima to overfitting [26], while later studies found flatter regions correspond to better generalisation [21, 44]. Spectral metrics like Hessian trace and effective rank capture curvature anisotropy [2, 7], with recent transformer analyses showing systematic evolution of these metrics during training [23, 51]. Empirically, models trained in overparameterised regimes often exhibit flat Hessian spectra with many near-zero eigenvalues [46], corresponding to improved generalisation, indicative of abstract representation formation [2].
Intrinsic Dimensionality and Representation Compression.
Intrinsic dimensionality (ID) serves as a proxy for the representational complexity of neural networks [1, 17, 4]. Under the manifold hypothesis, high-dimensional activations are assumed to lie on lower-dimensional submanifolds [8], with the ID reflecting the necessary degrees of freedom to explain observed variation. During training, representations typically exhibit a rise–fall pattern—initially increasing as features entangle, then compressing as abstraction emerges [4, 10, 49]. While Aghajanyan et al. [1] showed pre-trained models can be fine-tuned in low-dimensional reparameterisations, Cheng et al. [9] and Lee et al. [31] correlate geometric compression with linguistic information acquisition. Recent work confirms linguistic features occupy low-dimensional subspaces [42] and that compressibility enables compositional generalisation [16]. ID estimation methods range from PCA [34] to geometric approaches like TWO-NN [17] and GRIDE [15].
Intermediate Layers and Representation Studies.
Recent studies highlight intermediate layers’ role in shaping model representations [38, 18, 47]. These layers often show stronger linguistic alignment than final layers [47]. Lepori et al. [32] introduced structural compositionality, revealing that models can decompose complex tasks into modular subroutines, with intermediate layers playing a crucial role in this decomposition process. Related work on symbolic domains examines abstraction in transformers trained on code or algorithmic tasks [33, 54, 53], though these focus on task-specific behaviours rather than semantic abstraction in linguistic representations.
Synthetic Datasets.
Several synthetic benchmarks have been developed to study abstraction and generalisation in neural models, though most focus on algorithmic or symbolic reasoning tasks rather than grounded linguistic structure. SCAN [29] tests systematic compositional skills through command-to-action mappings, while PCFG-based datasets [24] probe models’ syntactic generalisation abilities using controlled linguistic commands. Mathematical reasoning datasets [45] and algorithmic tasks [41] provide controlled environments for studying learning dynamics but lack linguistic structure.
Unlike prior work that investigates abstraction using symbolic or algorithmic datasets with limited linguistic grounding, our approach targets sentence-level semantic abstraction in transformer models trained on structured, English-like input. We introduce a synthetic corpus generator grounded in frame semantics, enabling precise control over contextual entropy, role structure, and token distributions. This design reflects key properties of natural language while retaining full annotation and sampling transparency. While previous studies provide valuable insights, they often examine a single diagnostic signal in isolation, and are typically restricted to tasks or domains that do not generalise to realistic linguistic settings or scale to larger models. By contrast, our multi-signal approach offers a holistic view of how abstraction emerges during training. This principled integration enables more transferable and interpretable analysis of representation learning in modern transformers.
3 Methodology
As shown in Fig. 1, our method integrates the following signals to detect phase transitions during transformer training: spectral curvature of the loss landscape, intrinsic dimensionality of representations, and linguistic category alignment. Our intuition is that these metrics capture complementary aspects of representation learning: optimisation dynamics reflect updates in model weights, geometric measures and probing reveal reorganisation of semantic relationships, and linguistic alignment reflects emergent structure in the model’s outputs. We define semantic abstraction as the model’s ability to internalise role-based generalisations that extend beyond surface lexical forms—for example, recognising the ARG2 role regardless of whether it is realised as "noun3", or "location22". In this setting, abstraction is evidenced when internal representations align with underlying semantic functions rather than memorised token identities. The following sections detail each aspect, along with our synthetic data generator and experimental setup.
3.1 Spectral Signals of Loss Landscape Geometry
We characterise loss landscape geometry using Hessian-based curvature metrics to detect structural shifts in representation learning. We adopt a scalable approximation via the Lanczos algorithm [30], which estimates the top eigenvalues of the Hessian using efficient Hessian-vector products. Let $\mathcal{L}(\theta)$ be the training loss with gradient $\nabla_\theta \mathcal{L}$ and Hessian $H = \nabla^2_\theta \mathcal{L}$. We compute the truncated spectrum $\{\lambda_i\}_{i=1}^{k}$ of the $k$ largest eigenvalues, motivated by observations that curvature information concentrates in dominant modes [44]. Our spectral metrics include:
- •
- •
To unify curvature magnitude and directional concentration, we define a curvature complexity score:
(1) |
This measure increases with both the overall curvature and its spectral concentration. High values correspond to sharp, anisotropic curvature, often reflecting representational reorganisation or memorisation. In contrast, low values denote flatter, more isotropic landscapes, typically aligned with abstraction and generalisation.
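As a rough illustration of the pipeline described above, the sketch below runs a Lanczos iteration that touches the curvature operator only through matrix-vector products, then combines the resulting top-$k$ eigenvalues into a single magnitude-times-concentration number. Everything named here is an assumption made for illustration: `lanczos_top_eigs` and `illustrative_curvature_score` are hypothetical helpers, the explicit matrix `H` stands in for a real Hessian (in practice `matvec` would be a Hessian-vector product from automatic differentiation), and the particular score (sum of top-$k$ eigenvalue magnitudes times one minus normalised spectral entropy) is only one plausible choice, not claimed to be the score of Eq. (1).

```python
import numpy as np

def lanczos_top_eigs(matvec, dim, k, num_steps, seed=0):
    """Estimate the k largest eigenvalues of a symmetric operator
    using only matrix-vector products (Lanczos iteration)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(dim)
    basis = [q / np.linalg.norm(q)]
    alphas, betas = [], []
    for j in range(num_steps):
        w = matvec(basis[j])
        alphas.append(float(w @ basis[j]))
        # Full reorthogonalisation: affordable at this size, keeps the sketch stable.
        for v in basis:
            w = w - (w @ v) * v
        beta = float(np.linalg.norm(w))
        if j == num_steps - 1 or beta < 1e-10:
            break
        betas.append(beta)
        basis.append(w / beta)
    # Eigenvalues of the small tridiagonal matrix approximate the extreme eigenvalues.
    tri = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
    return np.linalg.eigvalsh(tri)[::-1][:k]

def illustrative_curvature_score(eigs):
    """Hypothetical score: large when curvature is both strong and concentrated.
    Requires at least two eigenvalues."""
    lam = np.abs(np.asarray(eigs, dtype=float))
    p = lam / lam.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    concentration = 1.0 - entropy / np.log(len(lam))
    return float(lam.sum() * concentration)

# Toy symmetric matrix with three dominant eigenvalues standing in for a Hessian.
rng = np.random.default_rng(1)
true_eigs = np.concatenate(([10.0, 5.0, 2.0], np.linspace(0.0, 0.5, 47)))
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
H = Q @ np.diag(true_eigs) @ Q.T

top = lanczos_top_eigs(lambda v: H @ v, dim=50, k=3, num_steps=25)
score = illustrative_curvature_score(top)
```

Only 25 matrix-vector products are used for a 50-dimensional operator, which is the property that makes this family of methods usable when the Hessian cannot be formed explicitly.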
3.2 Intrinsic Dimensionality
We characterise abstraction in transformer representations through the lens of intrinsic dimensionality (ID), motivated by the manifold hypothesis [8]. Given hidden representations $h \in \mathbb{R}^{D}$, ID is defined as the minimal number of degrees of freedom required to locally parameterise the data distribution [17]. That is, although $h$ may lie in a high-dimensional space, it may concentrate around a lower-dimensional manifold of dimension $d \ll D$. We adopt the TWO-NN estimator [17], a non-parametric, maximum likelihood estimator based on local geometry. Given a batch of activation vectors $\{h_i\}_{i=1}^{N}$, the intrinsic dimension is estimated as:
$$\hat{d} = \left[ \frac{1}{N} \sum_{i=1}^{N} \log \frac{r_2(h_i)}{r_1(h_i)} \right]^{-1} \tag{2}$$
where $r_1(h_i)$ and $r_2(h_i)$ denote distances to the first and second nearest neighbours of $h_i$, respectively. This approach requires no tuning parameters and assumes only uniform local density. To provide a more holistic view, we average the ID over the $L$ model layers as $\overline{\mathrm{ID}} = \frac{1}{L} \sum_{\ell=1}^{L} \hat{d}^{(\ell)}$, and use $\overline{\mathrm{ID}}$ in our analysis. By computing ID across training steps and network layers, we capture the dynamic evolution of representational structure. We hypothesise a characteristic trajectory aligned with other metrics: low initial ID during early training, rising ID during transition, and stabilisation or decrease as the model projects data onto semantically coherent structures, signalling abstraction.
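The estimator described above can be sketched in a few lines. This is a minimal version assuming the standard maximum-likelihood form of TWO-NN (the number of points divided by the summed log-ratio of second to first nearest-neighbour distances) and brute-force pairwise distances; the function name `two_nn_id` and the toy "layers" are hypothetical and stand in for real per-layer activations.

```python
import numpy as np

def two_nn_id(X):
    """TWO-NN maximum-likelihood estimate of intrinsic dimension.
    X has shape (num_points, ambient_dim); duplicate points are assumed absent."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)          # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)      # exclude each point's distance to itself
    # After partitioning, columns 0 and 1 hold the two smallest distances in order.
    nearest = np.partition(d2, 1, axis=1)[:, :2]
    r1 = np.sqrt(nearest[:, 0])
    r2 = np.sqrt(nearest[:, 1])
    return len(X) / np.sum(np.log(r2 / r1))

# Toy check: a 2-D sheet linearly embedded in a 10-D ambient space.
rng = np.random.default_rng(0)
sheet = rng.uniform(size=(2000, 2)) @ rng.standard_normal((2, 10))
estimate = two_nn_id(sheet)

# Averaging over layers, with two synthetic "layers" standing in for activations.
other_layer = rng.uniform(size=(2000, 2)) @ rng.standard_normal((2, 10))
mean_id = float(np.mean([two_nn_id(sheet), two_nn_id(other_layer)]))
```

The ambient dimension is 10, yet the estimate stays close to 2 because only nearest-neighbour distance ratios enter the computation.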
3.3 Linguistic Category Alignment
We evaluate abstraction emergence through two complementary approaches: internal representation probing and output generation analysis. For each, we examine both semantic roles (e.g., AGENT, PATIENT) for event structure understanding, and part-of-speech (POS) categories for syntactic abstraction.
Internal Probing.
We apply diagnostic probes to hidden states at layer $\ell$ to measure category-specific confidence scores during training. These probes quantify how well internal representations encode linguistic structures at each training step $t$:
$$\mathrm{Conf}_c^{(\ell)}(t) = \frac{1}{N} \sum_{i=1}^{N} f_c\big(h_i^{(\ell)}(t)\big) \tag{3}$$
where $h_i^{(\ell)}(t)$ is the hidden representation, $N$ is the evaluation batch size and $f$ is a linear classifier trained on frozen model checkpoints, with $f_c(\cdot)$ its predicted probability for category $c$. Probes are used to capture the evolving alignment between internal features and abstract linguistic categories. To validate that observed linguistic alignment reflects learned abstraction rather than randomness, we trained probes on randomly initialised models. These probes performed at or near chance, confirming that linguistic features are not encoded prior to training. This supports the view that abstraction emerges progressively and is localised in stages as training evolves. Full results are reported in Appendix B.9.
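To make the notion of a category-specific confidence score concrete, the following sketch computes, for each category, the mean softmax probability that a linear probe assigns to the gold label. This is one reading of the description above rather than the exact procedure of Appendix B.6; `probe_confidence`, the weight matrix `W`, and the toy hidden states are all hypothetical.

```python
import numpy as np

def probe_confidence(hidden, labels, W, b):
    """Mean probability a linear probe assigns to the gold category, per category.
    hidden: (N, d) frozen hidden states; labels: (N,) integer category ids."""
    logits = hidden @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    scores = {}
    for c in np.unique(labels):
        mask = labels == c
        scores[int(c)] = float(probs[mask, c].mean())
    return scores

# Toy hidden states in which each category occupies its own direction.
hidden = np.eye(3)[[0, 1, 2, 0]]
labels = np.array([0, 1, 2, 0])

trained = probe_confidence(hidden, labels, W=10.0 * np.eye(3), b=np.zeros(3))
# An uninformative probe gives uniform probabilities, i.e. chance level.
chance = probe_confidence(hidden, labels, W=np.zeros((3, 3)), b=np.zeros(3))
```

The second call mirrors the control described above: when the probe cannot read the category from the representation, confidence sits at one over the number of categories.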
Output Generation Analysis.
We also assess whether generated tokens respect linguistic constraints. For each generated token $\hat{y}_j$, we compute category-specific accuracy:
$$\mathrm{Acc}_c = \frac{1}{|S_c|} \sum_{s \in S_c} \mathbb{1}\big[\hat{y}_j^{(s)} \in V_c(j)\big] \tag{4}$$
where $S_c$ contains sequences with category $c$, and $V_c(j)$ denotes the set of valid tokens for the expected category at position $j$. This metric reveals whether abstract patterns learned internally are successfully deployed during generation. By jointly analysing internal representation alignment and output conformity alongside geometric metrics, we identify precisely when models transition from memorising token associations to acquiring structured abstractions. Divergence between internal and output measures reveals intermediate states where models have partially acquired abstract representations but cannot yet reliably deploy them in generation. The complete methodology for token categorisation, probe architecture, and training procedures is detailed in Appendix B.6.
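A small sketch of the output-side check follows. It counts, per expected category, how often a generated token belongs to the set of tokens valid for that category. The helper `category_accuracy`, the toy vocabulary, and the choice to aggregate over token positions rather than whole sequences are assumptions for illustration.

```python
from collections import defaultdict

def category_accuracy(generated, expected_categories, valid_tokens):
    """Fraction of generated tokens that fall in the valid set of the
    category expected at their position, reported per category."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for tokens, categories in zip(generated, expected_categories):
        for token, category in zip(tokens, categories):
            totals[category] += 1
            if token in valid_tokens[category]:
                hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}

# Hypothetical vocabulary partition and two generated sequences.
valid_tokens = {"NOUN": {"noun1", "noun3"}, "VERB": {"verb2"}}
generated = [["noun1", "verb2", "noun3"], ["verb2", "verb2", "noun1"]]
expected = [["NOUN", "VERB", "NOUN"], ["NOUN", "VERB", "NOUN"]]

accuracy = category_accuracy(generated, expected, valid_tokens)
```

Here the second sequence opens with a verb where a noun is expected, so three of the four noun positions are filled correctly while both verb positions are.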
3.4 Information Compression via Mutual Information
While we explored mutual information (MI) as a potential signal of abstraction, we found that MI estimates were highly volatile and did not consistently align with the phase transitions observed through other metrics. This behaviour likely stems from two issues: (i) MI estimation is inherently noisy in high-dimensional settings, and (ii) abstraction in transformers involves structural reorganisation rather than pure information compression. These patterns were consistent across both types of MI we measured: (i) $I(X; H)$, the information retained about the input $X$ in the hidden states $H$; and (ii) $I(H^{(\ell)}; H^{(\ell+1)})$, the information shared between adjacent layers. Due to this instability, MI lacked the resolution to serve as a reliable diagnostic. We include full experimental results, estimation procedures, and MI trajectories in Appendix C for completeness, but do not consider MI a core component of our results.
3.5 Synthetic Data Generator
To isolate representational dynamics from confounding factors in natural data, we employ ABSynth, our controllable synthetic data generation framework grounded in frame semantics [19, 5]. ABSynth controllably builds synthetic corpora by sampling from structured sentence frame representations with predefined semantic roles (e.g., AGENT, PATIENT, THEME). The framework supports precise manipulation of structural properties, including vocabulary size, token frequency, syntactic/semantic complexity, and contextual entropy.
The generation pipeline consists of three modular components: (1) a lexicon module that assigns words to categories under a Zipfian frequency distribution [55, 40], augmented with variable-strength collocations; (2) a frame-based sentence constructor that assembles grammatically well-formed sentences across three levels of structural complexity; and (3) an entropy-aware token selector that modulates predictability by adjusting sampling probabilities for the corpus. For this study, we use ABSynth to generate ABSynth25K, a dataset of 25,000 sentences with complete frame-semantic annotations. Each example includes ground-truth semantic roles and syntactic categories (POS tags) derived directly from the underlying frame structure, enabling precise investigation of how linguistic abstractions emerge in neural representations. ABSynth25K follows an 80/10/10 training/validation/test split. Complete generation procedures and frame specifications are detailed in Appendix A.
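The first two pipeline stages and the data split can be sketched as follows. This is a toy approximation, not the ABSynth implementation: the single frame, the lexicon sizes, the Zipf exponent of 1.0, and the names `FRAME`, `LEXICON`, `zipf_weights`, and `sample_sentence` are all assumptions, and collocations and the entropy-aware selector are omitted.

```python
import random
from collections import Counter

# One hypothetical frame: each slot pairs a semantic role with a POS tag.
FRAME = [("AGENT", "NOUN"), ("PREDICATE", "VERB"),
         ("PATIENT", "NOUN"), ("LOCATION", "NOUN")]

# Hypothetical lexicon; word order within a list defines frequency rank.
LEXICON = {
    "AGENT": [f"noun{i}" for i in range(1, 21)],
    "PREDICATE": [f"verb{i}" for i in range(1, 11)],
    "PATIENT": [f"noun{i}" for i in range(21, 41)],
    "LOCATION": [f"location{i}" for i in range(1, 31)],
}

def zipf_weights(n, exponent):
    """Unnormalised Zipfian weights: weight of rank r is 1 / r**exponent."""
    return [1.0 / rank ** exponent for rank in range(1, n + 1)]

def sample_sentence(rng, exponent=1.0):
    """Fill every slot of the frame and keep role and POS annotations aligned."""
    tokens, roles, pos_tags = [], [], []
    for role, pos in FRAME:
        words = LEXICON[role]
        weights = zipf_weights(len(words), exponent)
        tokens.append(rng.choices(words, weights=weights)[0])
        roles.append(role)
        pos_tags.append(pos)
    return {"tokens": tokens, "roles": roles, "pos": pos_tags}

rng = random.Random(0)
corpus = [sample_sentence(rng) for _ in range(2000)]

# 80/10/10 split using integer arithmetic to avoid rounding surprises.
n_train = len(corpus) * 8 // 10
n_val = len(corpus) // 10
train = corpus[:n_train]
val = corpus[n_train:n_train + n_val]
test = corpus[n_train + n_val:]

agent_counts = Counter(sentence["tokens"][0] for sentence in corpus)
```

Because annotations are produced by the same loop that picks the tokens, every example carries gold roles and POS tags by construction, which is the property the probing analysis relies on.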
3.6 Model Architectures and Training Setup
Transformer Architectures.
We train three decoder-only transformer models of increasing capacity: a small model (2 heads, FFN size 128, $d_{\text{model}}$ = 64, 1 layer), a medium model (3 heads, FFN size 384, $d_{\text{model}}$ = 96, 2 layers), and a large model (4 heads, FFN size 512, $d_{\text{model}}$ = 128, 3 layers). Models share the same positional encoding scheme, tokenisation, and vocabulary. A fixed sequence length (16) and batch size (128) are used across runs to standardise training dynamics. We save dense checkpoints throughout training and compute our metrics on them. Complete training details, downstream task formalisations and model hyperparameters are reported in Appendix B.
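For reference, the three configurations can be written down as a small config object. The sketch assumes that the values 64, 96, and 128 are the model (embedding) widths; the class `ModelConfig` and its field names are hypothetical and not taken from the released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    name: str
    n_heads: int
    d_ffn: int
    d_model: int        # assumed to be the embedding width
    n_layers: int
    seq_len: int = 16       # fixed across runs
    batch_size: int = 128   # fixed across runs

    @property
    def head_dim(self) -> int:
        """Width of each attention head; requires d_model divisible by n_heads."""
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads

CONFIGS = [
    ModelConfig("small", n_heads=2, d_ffn=128, d_model=64, n_layers=1),
    ModelConfig("medium", n_heads=3, d_ffn=384, d_model=96, n_layers=2),
    ModelConfig("large", n_heads=4, d_ffn=512, d_model=128, n_layers=3),
]

head_dims = {config.name: config.head_dim for config in CONFIGS}
```

Under this reading all three models use the same per-head width, so capacity grows through the number of heads, the FFN size, and depth rather than through wider heads.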
Ablation of Architectural Components.
To isolate the architectural factors driving abstraction, we ablate key transformer components by removing feed-forward (FFN) blocks and reducing the number of attention heads. These interventions are designed to test whether the capacity for abstraction depends on transformation depth or attention expressivity, and to determine which mechanisms are necessary for triggering representational phase transitions.
4 Results & Analysis
4.1 Coordinated Phase Shift in Training Dynamics
Across all model configurations, we observe a robust two-phase training dynamic: an initial regime of rising ID and elevated curvature, followed by a transition into flatter curvature and stabilised representational complexity (Fig. 2). This transition is marked by a consistent intersection between the Hessian curvature score (blue) and ID trajectories (red), which we interpret as a phase shift in learning dynamics.



In larger models (right panel), this intersection occurs early (around step 5000) and sharply, with curvature rapidly collapsing while ID plateaus at higher values, indicating the emergence of high-capacity, structured representations. Medium-scale models (centre panel) follow qualitatively similar transitions but exhibit notable periodic spikes in curvature throughout training. These persistent oscillations suggest recurring reorganisation events where the model temporarily revisits higher-curvature regions of the loss landscape, reflecting optimisation instabilities with limited architectural capacity.
Smaller models (left panel) exhibit delayed transitions (after step 30000), noisier trajectories, and lower equilibrium ID values, demonstrating clear architectural bottlenecks in abstraction capacity. Despite variations in timing, amplitude, and stability, the fundamental pattern of curvature-ID intersection remains consistent, suggesting a universal geometric signature of abstraction emergence that scales with model capacity but preserves its essential character.
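One simple way to locate such an intersection from logged trajectories is sketched below. Since curvature and ID live on different scales, the sketch min-max normalises each series before comparing them; that normalisation, the helper names `min_max` and `first_crossing`, and the synthetic trajectories are assumptions for illustration, not the procedure used to produce Fig. 2.

```python
import math

def min_max(values):
    """Rescale a series to [0, 1]; assumes the series is not constant."""
    low, high = min(values), max(values)
    return [(value - low) / (high - low) for value in values]

def first_crossing(steps, curvature, intrinsic_dim):
    """First logged step at which normalised curvature is at or below
    normalised ID, or None if the curves never cross."""
    curve = min_max(curvature)
    dim = min_max(intrinsic_dim)
    for step, c, d in zip(steps, curve, dim):
        if c <= d:
            return step
    return None

# Synthetic trajectories: decaying curvature and a rising, saturating ID.
steps = list(range(0, 10001, 500))
curvature = [math.exp(-step / 2000) for step in steps]
intrinsic_dim = [1 - math.exp(-step / 1000) for step in steps]

crossing = first_crossing(steps, curvature, intrinsic_dim)
```

The resolution of the detected step is limited by the checkpoint spacing, so denser checkpoints give a sharper estimate of when the two trajectories meet.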
4.2 Differential Impact of Architectural Components
Our ablation experiments examine how specific architectural components influence abstraction dynamics, revealing subtle but informative effects. Fig. 2 presents curvature and ID trajectories across three architectural variants: standard models (top row), models without feed-forward networks (FFNs) (middle row), and models with a single attention head (bottom row).
All variants preserve the fundamental pattern of an initial curvature peak followed by a decline, concurrent with rising ID that eventually stabilises. This consistency demonstrates that abstraction emergence is surprisingly robust to architectural modifications and may be an inherent property of transformer-based optimisation.
Removing FFNs (middle row) increases curvature volatility, especially in small and medium models, with persistent oscillations but reduced spike amplitudes. This suggests FFNs contribute to optimisation stability and smooth representational development, functioning as distributed lookup structures [14]. Despite volatility, the phase transition timeline remains preserved, indicating FFNs enhance rather than enable abstraction emergence.
Single attention head models (bottom row) show scale-dependent effects. Medium and large models exhibit more frequent but lower-amplitude spikes, revealing a stability-smoothness trade-off. Small models show greater impact: reduced ID values and delayed phase transitions, indicating attention capacity constraints affect smaller architectures more severely. Nevertheless, the fundamental curvature-ID intersection pattern persists across all configurations.
These results point to the architectural resilience of core learning dynamics in transformers. The geometric signature preserved across configurations suggests that abstraction emergence is a fundamental property of gradient-based learning in transformers on sequential data, rather than a consequence of specific architectural features. This resilience may also help explain why transformers with varying configurations achieve comparable language task performance.

4.3 Linguistic Alignment with Geometric Transitions
The intersection of curvature and ID trajectories coincides with key transitions in linguistic abstraction emergence. Fig. 3 shows how internal representations evolve across decoder layers, measured by probe confidence scores for linguistic categories (semantics). Layer 1 exhibits an interesting pattern of representational reorganisation, evidenced by a temporary dip and increased volatility in confidence scores, occurring precisely at the curvature-ID intersection shown in Fig. 2. This suggests global geometric shifts correspond to evolving category structure.
Layer 2 shows increased category confidence around the same intersection region, mirroring Layer 1’s earlier drop, indicating a hand-off dynamic between middle and upper layers. This temporal complementarity reflects upward abstraction shifts, whereby higher layers specialise in more abstract linguistic features. Unlike Layer 0’s stable dominance and Layer 1’s volatility, Layer 2 maintains a fragmented, dynamic profile throughout training, supporting the formation of higher-order abstractions or interpretability-relevant circuits.
Medium-sized models (Appendix B.9) show more harmonised behaviour across layers, demonstrating tighter coupling between abstraction capacity and model scale, with similar behaviour to layer 1 in the large model. These results support the view that evolving probe confidence reflects internal reorganisation aligned temporally with geometric transitions.
This internal development diverges from output predictions (Fig. 4), where semantic classification accuracy rapidly improves and stabilises early in training. This indicates that while models quickly generate syntactically appropriate tokens, internal representations continue restructuring long after. This dissociation implies a two-phase developmental process: in the first phase, output behaviour reflects coarse category distinctions likely driven by surface-level statistical regularities; in the second, deeper abstraction is gradually encoded into the model’s internal geometry, as indicated by evolving probe confidence. Full results across models (Appendix B.6) show consistent probe dynamics for both syntactic and semantic categories.
4.4 Limitations of Statistical Diagnostics
Despite theoretical appeal, MI analysis failed to yield reliable insights in our experimental setting. The observed MI dynamics exhibited high variance and showed minimal alignment with phase transitions identified through geometric and linguistic metrics. This instability aligns with concerns raised by Aljaafari et al. [3], who argue that such patterns in transformers may reflect stochastic fluctuations in early representation formation rather than meaningful abstraction signals. Given these limitations, we do not report MI in the main body of this paper, though complete MI trajectories are reported in Appendix C.

5 Discussion and Conclusion
Our results suggest that transformers undergo structured representational reorganisation during training. Rather than emerging gradually, abstraction is evident through phase transitions, coordinated shifts across geometric, information-theoretic, and linguistic signals. These transitions mark a distinct boundary between memorisation and generalisation, where linguistic representations begin to stabilise.
We observe consistent alignment between curvature flattening, ID rise and stabilisation, and increased probe accuracy. This coordination suggests certain geometric signals may serve as markers of emerging abstraction. Low-rank curvature and stable ID appear to signal when models begin internalising structure beyond surface patterns. The phenomenon remains robust across model scales and architectural variants, suggesting that abstraction emergence follows predictable patterns.
While our experiments use synthetic corpora, TRACE is compatible with broader domains. Metrics such as curvature and ID are model-agnostic and can be applied to pre-trained transformers or fine-tuning regimes. Probing-based signals can be approximated using weak supervision or automated annotation tools. Extending TRACE to large-scale pretraining could reveal whether similar phase transitions emerge in noisier, real-world settings. Finally, integrating TRACE with mechanistic interpretability tools could help localise where and how abstraction-related circuits emerge.
6 Limitations
Despite its interpretability, our synthetic corpus does not fully capture the ambiguity and richness of natural language. While probe-based diagnostics offer valuable insights, they provide a static view of representation content and may not reflect the dynamic computational mechanisms that transformers deploy at inference time. Finally, while TRACE establishes strong correlations across geometric, informational, and linguistic signals, it does not establish causal relationships or quantify the relative contribution of each factor.
7 Impact Statement
This work reveals that abstract reasoning in language models emerges through predictable phase transitions rather than gradual accumulation. Identifying these critical transitions could enable more efficient training strategies, yielding more interpretable models with reduced computational costs. While this understanding may help detect harmful behaviours, it also presents the usual interpretability trade-off of potentially facilitating model manipulation. Our frame-semantic data generation framework provides a reusable tool for studying abstraction and learning dynamics in language models, with fine-grained control over linguistic properties and transparent evaluation capabilities.
References
- Aghajanyan et al. [2021] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319–7328, Online, 2021. Association for Computational Linguistics.
- Ahn et al. [2024] Kwangjun Ahn, Ali Jadbabaie, and Suvrit Sra. How to escape sharp minima with random perturbations. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024.
- Aljaafari et al. [2025] Nura Aljaafari, Danilo S Carvalho, and André Freitas. Carma: Enhanced compositionality in llms via advanced regularisation and mutual information alignment. arXiv preprint arXiv:2502.11066, 2025.
- Ansuini et al. [2019] Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- Baker et al. [1998] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90, Montreal, Quebec, Canada, August 1998. Association for Computational Linguistics. doi: 10.3115/980845.980860. URL https://aclanthology.org/P98-1013/.
- Belghazi et al. [2018] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 531–540. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/belghazi18a.html.
- Böttcher and Wheeler [2024] Lucas Böttcher and Gregory Wheeler. Visualizing high-dimensional loss landscapes with hessian directions. Journal of Statistical Mechanics: Theory and Experiment, 2024(2):023401, 2024.
- Cayton et al. [2005] Lawrence Cayton et al. Algorithms for manifold learning. eScholarship, University of California, 2005.
- Cheng et al. [2023] Emily Cheng, Corentin Kervadec, and Marco Baroni. Bridging information-theoretic and geometric compression in language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12397–12420, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.762. URL https://aclanthology.org/2023.emnlp-main.762/.
- Cheng et al. [2025] Emily Cheng, Diego Doimo, Corentin Kervadec, Iuri Macocco, Lei Yu, Alessandro Laio, and Marco Baroni. Emergence of a high-dimensional abstraction phase in language transformers. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=0fD3iIBhlV.
- Clauw et al. [2024] Kenzo Clauw, Daniele Marinazzo, and Sebastiano Stramaglia. Information-theoretic progress measures reveal grokking is an emergent phase transition. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=Q4NH6hEPIX.
- Csordás et al. [2021] Róbert Csordás, Kazuki Irie, and Jürgen Schmidhuber. The devil is in the detail: Simple tricks improve systematic generalization of transformers. In Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic, November 2021.
- Cunningham et al. [2023] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
- Dai et al. [2021] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696, 2021.
- Denti et al. [2022] Francesco Denti, Diego Doimo, Alessandro Laio, and Antonietta Mira. The generalized ratios intrinsic dimension estimator. Scientific Reports, 12(1):20005, 2022.
- Elmoznino et al. [2024] Eric Elmoznino, Thomas Jiralerspong, Yoshua Bengio, and Guillaume Lajoie. A complexity-based theory of compositionality. arXiv preprint arXiv:2410.14817, 2024.
- Facco et al. [2017] Elena Facco, Maria d’Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific reports, 7(1):12140, 2017.
- Fan et al. [2024] Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, and Zhongyuan Wang. Not all layers of llms are necessary during inference. CoRR, abs/2403.02181, 2024. URL https://doi.org/10.48550/arXiv.2403.02181.
- Fillmore [1982] Charles J. Fillmore. Frame semantics. In Linguistics in the Morning Calm, pages 111–137. Hanshin Publishing Co., Seoul, 1982.
- Grosse [2024] Roger Grosse. Studying large language model generalization with influence functions. In Proceedings of the 38th Conference on Neural Information Processing Systems, NeurIPS ’24, 2024. Workshop on Scalable Continual Learning for Lifelong Foundation Models.
- Hao et al. [2019] Yaru Hao, Li Dong, Furu Wei, and Ke Xu. Visualizing and understanding the effectiveness of bert. arXiv preprint arXiv:1908.05620, 2019.
- Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of compute-optimal large language model training. Advances in neural information processing systems, 35:30016–30030, 2022.
- Hoogland et al. [2024] Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei, and Daniel Murfet. The developmental landscape of in-context learning. arXiv preprint arXiv:2402.02364, 2024.
- Hupkes et al. [2020] Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795, 2020.
- Kantamneni et al. [2025] Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda. Are sparse autoencoders useful? a case study in sparse probing. arXiv preprint arXiv:2502.16681, 2025.
- Keskar et al. [2016] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
- Khadivi et al. [2018] Pejman Khadivi, Ravi Tandon, and Naren Ramakrishnan. Flow of information in feed-forward denoising neural networks. In 2018 IEEE 17th International Conference on Cognitive Informatics and Cognitive Computing (ICCI*CC), pages 166–173, 2018. doi: 10.1109/ICCI-CC.2018.8482098.
- Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
- Lake and Baroni [2018] Brenden Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International conference on machine learning, pages 2873–2882. PMLR, 2018.
- Lanczos [1950] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of research of the National Bureau of Standards, 45(4):255–282, 1950.
- Lee et al. [2024] Jin Hwa Lee, Thomas Jiralerspong, Lei Yu, Yoshua Bengio, and Emily Cheng. Geometric signatures of compositionality across a language model’s lifetime. arXiv preprint arXiv:2410.01444, 2024.
- Lepori et al. [2023] Michael Lepori, Thomas Serre, and Ellie Pavlick. Break it down: Evidence for structural compositionality in neural networks. Advances in Neural Information Processing Systems, 36:42623–42660, 2023.
- Li and McClelland [2023] Yuxuan Li and James McClelland. Systematic generalization and emergent structures in transformers trained on structured tasks, 2023. URL https://openreview.net/forum?id=pXDmbfVL_SB.
- Little et al. [2017] Anna V Little, Mauro Maggioni, and Lorenzo Rosasco. Multiscale geometric methods for data sets i: Multiscale svd, noise and curvature. Applied and Computational Harmonic Analysis, 43(3):504–567, 2017.
- Nakkiran et al. [2021] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021.
- Nanda et al. [2023] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW.
- Olsson et al. [2022] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
- Pan et al. [2024] Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 57018–57049. Curran Associates, Inc., 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/687163285b8affc8ee933bdca8e75747-Paper-Conference.pdf.
- Park et al. [2024] Kiho Park, Yo Joong Choe, Yibo Jiang, and Victor Veitch. The geometry of categorical and hierarchical concepts in large language models. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=KXuYjuBzKo.
- Piantadosi [2014] Steven T Piantadosi. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic bulletin & review, 21:1112–1130, 2014.
- Power et al. [2022] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.
- Razzhigaev et al. [2024] Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Ivan Oseledets, Denis Dimitrov, and Andrey Kuznetsov. The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models. In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics: EACL 2024, pages 868–874, St. Julian’s, Malta, March 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.findings-eacl.58/.
- Roy and Vetterli [2007] Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 2007 15th European signal processing conference, pages 606–610. IEEE, 2007.
- Sankar et al. [2021] Adepu Ravi Sankar, Yash Khasbage, Rahul Vigneswaran, and Vineeth N Balasubramanian. A deeper look at the hessian eigenspectrum of deep neural networks and its applications to regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9481–9488, 2021.
- Saxton et al. [2019] David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1gR5iR5FX.
- Singh et al. [2021] Sidak Pal Singh, Gregor Bachmann, and Thomas Hofmann. Analytic insights into structure and rank of neural network hessian maps. Advances in Neural Information Processing Systems, 34:23914–23927, 2021.
- Skean et al. [2025] Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013, 2025.
- Smith et al. [2025] Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, János Kramár, Tom Lieberum, Rohin Shah, and Neel Nanda. Negative results for saes on downstream tasks and deprioritising sae research (gdm mech interp team progress update #2), March 2025. URL https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks. LessWrong post.
- Valeriani et al. [2023] Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, and Alberto Cazzaniga. The geometry of hidden representations of large transformer models. Advances in Neural Information Processing Systems, 36:51234–51252, 2023.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Wang et al. [2024] George Wang, Matthew Farrugia-Roberts, Jesse Hoogland, Liam Carroll, Susan Wei, and Daniel Murfet. Loss landscape geometry reveals stagewise development of transformers. In High-dimensional Learning Dynamics: The Emergence of Structure and Reasoning, 2024. URL https://openreview.net/forum?id=2JabyZjM5H.
- Zhao et al. [2022] Yang Zhao, Hao Zhang, and Xiuyuan Hu. Penalizing gradient norm for efficiently improving generalization in deep learning. In International conference on machine learning, pages 26982–26992. PMLR, 2022.
- Zhong and Andreas [2024] Ziqian Zhong and Jacob Andreas. Algorithmic capabilities of random transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=plH8gW7tPQ.
- Zhou et al. [2024] Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Joshua M. Susskind, Samy Bengio, and Preetum Nakkiran. What algorithms can transformers learn? a study in length generalization. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=AssIuHnmHX.
- Zipf [1949] George Kingsley Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
Appendix A Frame-Semantic Data Generation Framework
This appendix describes our controllable synthetic data generation framework ABSynth. Unlike template-based synthetic datasets [29] or task-specific benchmarks [45], ABSynth is grounded in formal frame semantics [19, 5] and is able to generate English-like corpora by sampling from abstract event frames with predefined semantic roles, enabling mechanistic study of how linguistic abstractions emerge during transformer training.

ABSynth operationalises frame semantics by representing events as structured predicate-argument frames. Each frame specifies: (i) frame elements; (ii) core semantic roles (e.g., AGENT, PATIENT, INSTRUMENT); (iii) corresponding syntactic categories (e.g., NOUN, VERB), and (iv) complexity constraints based on sequence length and entropy calibration.
This grounding ensures that generated sentences exhibit genuine compositional structure rather than arbitrary token sequences. The frame-based approach enables direct tracking of how neural models learn to represent abstract semantic categories that underlie surface forms.
As illustrated in Figure 5, ABSynth generates datasets through a multi-stage pipeline that includes: (i) semantic frame selection with role specification, (ii) lexical realization following Zipfian frequency distributions and semantic clustering, (iii) syntactic construction respecting grammatical constraints and frame-to-syntax mappings, and (iv) entropy calibration to control contextual predictability. The resulting corpora exhibit theoretically-grounded compositional structure while maintaining naturalistic statistical properties.
We detail the generation process below using our ABSynth25K instantiation as an example, emphasising that each component is modular and configurable for different research objectives. This flexibility enables systematic exploration of how specific linguistic properties influence abstraction emergence in neural models.
A.1 Lexicon Construction: Scaling and Semantic Clustering
The lexicon is built by assigning tokens to POS and semantic role categories, then applying Zipfian scaling and semantic clustering. Each token is associated with a tuple [POS, SRL, Zipf_rank, ClusterID], ensuring interpretability and alignment across analysis stages.
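To mimic natural lexical statistics, token frequencies follow a Zipfian distribution [55, 40]:
$$f(r) = \frac{C}{r^{\alpha}}, \qquad r = 1, \dots, V \tag{5}$$
where $r$ is a token's frequency rank, $\alpha$ is the Zipf exponent, $C$ is a normalising constant, and $V$ is the vocabulary size. This maintains a realistic long-tail frequency spectrum.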
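As a minimal sketch of Zipfian frequency assignment, the snippet below normalises rank-based power-law weights over a vocabulary and samples token ranks from them. The exponent `alpha=1.0`, the vocabulary size of 1000, and the helper names `zipf_probabilities` and `sample_ranks` are illustrative assumptions, not the settings used for ABSynth25K.

```python
import random

def zipf_probabilities(vocab_size, alpha=1.0):
    """Return p(r) proportional to r^(-alpha) for ranks r = 1..vocab_size."""
    weights = [rank ** -alpha for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_ranks(probs, n, seed=0):
    """Draw n token ranks (1-indexed) according to the given probabilities."""
    rng = random.Random(seed)
    ranks = list(range(1, len(probs) + 1))
    return rng.choices(ranks, weights=probs, k=n)

# Assumed values, chosen only for illustration.
probs = zipf_probabilities(vocab_size=1000, alpha=1.0)
sample = sample_ranks(probs, n=10_000)
```

With `alpha=1.0`, the rank-1 token is exactly twice as probable as the rank-2 token, which gives the long-tail spectrum described above.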
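Semantic clustering introduces collocational structure by forming token groups with variable intra- and inter-cluster association strengths. For tokens $w_i$ and $w_j$ belonging to clusters $c_i$ and $c_j$:
$$A(w_i, w_j) = \begin{cases} s_{\text{intra}}, & c_i = c_j \\ s_{\text{inter}}, & c_i \neq c_j \end{cases} \tag{6}$$
where $s_{\text{intra}}$ and $s_{\text{inter}}$ are tuned to create structured but noisy associations, emulating semantic co-occurrence patterns. Intra-cluster associations are drawn from a stronger range, while cross-cluster links are drawn from a weaker one.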
Each word receives a naturalised name (e.g., result1, noun5), allowing transparent reverse-mapping for analysis.
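The lexicon tuples and cluster-dependent associations described in this subsection could be sketched as follows. The uniform sampling ranges, the example semantic roles, and the helper name `association` are hypothetical and chosen only to illustrate "stronger within a cluster, weaker across clusters"; they are not the values used in ABSynth.

```python
import random

# Hypothetical ranges: the actual sampling ranges are not reproduced here.
INTRA_RANGE = (0.6, 0.9)   # assumed: stronger links inside a cluster
INTER_RANGE = (0.05, 0.3)  # assumed: weaker links across clusters

def association(cluster_a, cluster_b, rng):
    """Sample an association strength for two tokens given their cluster IDs."""
    low, high = INTRA_RANGE if cluster_a == cluster_b else INTER_RANGE
    return rng.uniform(low, high)

rng = random.Random(0)
# Lexicon entries as (POS, SRL, Zipf_rank, ClusterID) tuples keyed by
# naturalised token name; the role assignments here are illustrative.
lexicon = {
    "noun1": ("NOUN", "AGENT", 1, 0),
    "noun5": ("NOUN", "PATIENT", 5, 0),
    "verb3": ("VERB", "ACTION", 3, 1),
}
same = association(lexicon["noun1"][3], lexicon["noun5"][3], rng)
cross = association(lexicon["noun1"][3], lexicon["verb3"][3], rng)
```

Because the token name and tuple are stored together, any token in a generated sentence can be mapped back to its category, rank, and cluster during analysis.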
A.2 Frame-Based Syntactic Realisation and Entropy Control
Frame-to-syntax mappings define how semantic frames are realised as surface forms while controlling contextual predictability through entropy calibration. All realisations follow valid English grammatical constructions, respecting part-of-speech ordering, agreement patterns, and canonical phrase structure. This grounding enables the study of syntactic and semantic abstraction within structurally coherent input sequences.
Each frame component is annotated with expected entropy based on its role in the frame:
- Low entropy (0.5–1.5 bits): Grammatically determined positions (e.g., determiners required by nouns)
- Medium entropy (1.5–3.0 bits): Semantically constrained positions with multiple valid fillers (e.g., theme roles that accept various object types)
- High entropy (3.0–4.5 bits): Optional frame elements with high variability (e.g., adverbial modifiers)
Frames are instantiated according to a target complexity distribution (55% simple, 35% medium, 10% complex), which guides the global entropy profile of the corpus. Example frame realisations include:
- Simple TRANSFER: [AGENT=NOUN] [ACTION=VERB] [THEME=NOUN] → "noun2 verb3 noun5"
- Medium CREATION: [AGENT=NOUN] [ACTION=VERB] [THEME=NOUN] [PURPOSE=PREP+NOUN] → "noun1 verb2 noun4 prep3 noun7"
- Complex MOTION: [AGENT=NOUN] [REL] [ACTION=VERB] [SOURCE=NOUN] [ACTION=VERB] [GOAL=NOUN] → "noun3 rel1 verb5 noun6 verb7 noun9"
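The frame instantiation step can be illustrated with a small sketch that draws a complexity tier from the 55/35/10 distribution and fills each syntactic slot of the corresponding example frame. Filling slots uniformly at random, the limit of nine tokens per category, and the helper name `realise` are simplifying assumptions; the actual generator also applies Zipfian sampling, clustering, and entropy calibration.

```python
import random

# Slot sequences taken from the three example frames; everything else is assumed.
FRAMES = {
    "simple": ("TRANSFER", ["NOUN", "VERB", "NOUN"]),
    "medium": ("CREATION", ["NOUN", "VERB", "NOUN", "PREP", "NOUN"]),
    "complex": ("MOTION", ["NOUN", "REL", "VERB", "NOUN", "VERB", "NOUN"]),
}
COMPLEXITY_WEIGHTS = {"simple": 0.55, "medium": 0.35, "complex": 0.10}

def realise(rng, tokens_per_category=9):
    """Pick a complexity tier, then fill each syntactic slot with a token name."""
    tiers = list(COMPLEXITY_WEIGHTS)
    tier = rng.choices(tiers, weights=list(COMPLEXITY_WEIGHTS.values()))[0]
    frame_name, slots = FRAMES[tier]
    tokens = [f"{pos.lower()}{rng.randint(1, tokens_per_category)}" for pos in slots]
    return tier, frame_name, " ".join(tokens)

rng = random.Random(0)
corpus = [realise(rng) for _ in range(2000)]
share_simple = sum(tier == "simple" for tier, _, _ in corpus) / len(corpus)
```

Over a few thousand samples, `share_simple` should settle close to the 55% target for simple frames.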
A.3 Dynamic Entropy Adjustment Algorithm
To enforce statistical balance across complexity levels, the sentence generator incorporates an entropy-aware sampling mechanism. During generation, the system maintains a global entropy profile, defined as the frequency distribution of sentence positions assigned to low, medium, and high entropy tiers. This profile is updated in real time and compared to the desired target distribution.
If the observed distribution diverges from the target (e.g., too many low-entropy tokens have been sampled), the system increases the sampling weight for frames or token categories that contribute to underrepresented tiers. This feedback mechanism modulates the difficulty of the dataset without sacrificing grammaticality.
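One way to realise such a feedback loop is sketched below: the sampling weight of each entropy tier is raised when its observed share falls below the target and lowered when it exceeds it. The target profile, the `strength` parameter, and the function names are hypothetical; the source does not specify the exact update rule.

```python
import random

TARGET = {"low": 0.5, "medium": 0.3, "high": 0.2}  # assumed target profile

def sampling_weights(counts, target=TARGET, strength=2.0):
    """Up-weight tiers whose observed share is below target, and vice versa."""
    total = sum(counts.values())
    weights = {}
    for tier, goal in target.items():
        observed = counts[tier] / total if total else goal
        deficit = goal - observed  # > 0 means the tier is under-represented
        weights[tier] = max(goal + strength * deficit, 1e-6)
    return weights

def generate_tiers(n, seed=0):
    """Sample n entropy tiers with feedback toward the target distribution."""
    rng = random.Random(seed)
    counts = {tier: 0 for tier in TARGET}
    for _ in range(n):
        weights = sampling_weights(counts)
        tier = rng.choices(list(weights), weights=list(weights.values()))[0]
        counts[tier] += 1
    return counts

counts = generate_tiers(5000)
profile = {tier: c / 5000 for tier, c in counts.items()}
```

The correction only changes which tier is sampled next, so grammaticality is left to the frame templates themselves.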
The final vocabulary consists of 9,000 tokens distributed across categories as shown in Table 1.
A.4 Output Format and Probing Supervision
Each sentence is stored with a naturalised token sequence and associated structured annotations, including:
- POS labels (e.g., NOUN, ADJ, VERB)
- Semantic roles (e.g., AGENT, PATIENT, RESULT)
- Entropy tier
- Other contextual complexity metadata
These annotations enable direct supervision for probing tasks. During model training, hidden states are extracted layer-wise and evaluated using linear probes trained on these annotations. This setup facilitates fine-grained analysis of how and when compositional representations emerge, and how they correlate with curvature, intrinsic dimensionality, and mutual information.
Table 1: Vocabulary size by token category.
Token Category | Vocabulary Size |
---|---|
Noun | 2,780 |
Transitive Verb | 694 |
Intransitive Verb | 694 |
Communication Verb | 347 |
Motion Verb | 347 |
Adjective | 1,388 |
Adverb | 555 |
Location | 694 |
Temporal | 694 |
Preposition | 416 |
Determiner | 111 |
Conjunction | 277 |
Result | 277 |
Total | 9,000 |
Appendix B Technical Implementation Details
B.1 Main Model Architecture
We implemented decoder-only transformer architectures based on the original design of Vaswani et al. [50]. Each model consists of $L$ layers, each with hidden size $d_{\text{model}}$, $h$ attention heads (where $d_{\text{head}} = d_{\text{model}} / h$), and a feedforward dimension $d_{\text{ff}}$. We focus on decoder-only models given their increasing prevalence in production LLMs and recent arguments that causal architectures provide a cleaner demonstration of emergence [47].
Our architectural choices are guided by recent Transformer scaling laws, notably those articulated by Hoffmann et al. [22]. Namely, we evaluate three configurations that adhere to the Chinchilla scaling law (see below), with approximate parameter counts and architectural specifications as presented in Table 2. All models are trained with a maximum sequence length of tokens and a dropout rate of .
We use a simple whitespace-based tokeniser, where each token corresponds to a space-separated word or symbol. This choice allows us to maintain interpretability and simplify downstream representational analyses, while aligning with our controlled, low-scale experimental setup.
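A whitespace tokeniser of this kind can be sketched in a few lines. The special tokens `<pad>` and `<unk>` and the function names `build_vocab` and `encode` are assumptions added for illustration; the source does not state how padding or unseen tokens are handled.

```python
def build_vocab(sentences, specials=("<pad>", "<unk>")):
    """Map each whitespace-separated token to an integer id."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Convert a sentence to ids, mapping unseen tokens to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(token, unk) for token in sentence.split()]

vocab = build_vocab(["noun2 verb3 noun5", "noun1 verb2 noun4 prep3 noun7"])
ids = encode("noun2 verb3 noun9", vocab)  # noun9 is unseen, so it maps to <unk>
```

Because every id corresponds to exactly one naturalised token name, hidden states can be aligned with POS and semantic-role annotations without any subword bookkeeping.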
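Table 2: Architectural specifications of the three model configurations.
Model | Layers ($L$) | Hidden size ($d_{\text{model}}$) | Heads ($h$) | FFN size ($d_{\text{ff}}$) |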
---|---|---|---|---|
Small | 1 | 64 | 2 | 128 |
Medium | 2 | 96 | 3 | 384 |
Large | 3 | 128 | 4 | 512 |
Scaling Laws and Training Budget.
We follow the Chinchilla scaling principles of Hoffmann et al. [22], which show that model size and training tokens should scale together: compute-optimal training requires approximately 20 tokens per parameter. Given our dataset size of approximately 360K tokens per epoch and our model sizes (110K, 339K, and 749K parameters), the 20:1 guideline implies minimum training requirements of 6, 19, and 42 epochs, respectively. In practice, however, we observed that phase transitions occurred at different points across scales and were not precisely aligned with these theoretical estimates. We therefore extended training beyond these minimums to allow sufficient time for representational transitions to emerge, as discussed in Section B.3, and to test for the grokking phenomenon [41].
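The minimum epoch counts follow directly from the 20:1 token-to-parameter guideline and the approximate per-epoch token count. The short calculation below reproduces them; the function name `min_epochs` is a hypothetical helper.

```python
TOKENS_PER_PARAM = 20        # Chinchilla guideline cited in the text
TOKENS_PER_EPOCH = 360_000   # approximate dataset size per epoch

def min_epochs(n_params):
    """Epochs needed so that total training tokens reach 20 x parameters."""
    return n_params * TOKENS_PER_PARAM / TOKENS_PER_EPOCH

model_sizes = {"small": 110_000, "medium": 339_000, "large": 749_000}
for name, n_params in model_sizes.items():
    print(name, round(min_epochs(n_params)))
```

For example, the small model needs 110,000 × 20 = 2.2M tokens, or about 6.1 epochs of 360K tokens each.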
B.2 Main Models Training Objectives and Formalisation of Next Token Prediction
Next-Token Prediction Task (NTP):
We formalise the NTP task, which we use to train our decoder-only language models. NTP involves predicting the next token $x_{t+1}$ given a preceding sequence $x_1, \dots, x_t$. In our setting, each $x_i$ belongs to a controlled lexicon from ABSynth, reflecting structured linguistic categories (e.g., noun3, verb2, etc.) and conforming to correct English grammar.
Formally, the objective is:
$$\hat{x}_{t+1} = \arg\max_{x \in \mathcal{V}} P_\theta(x \mid x_1, \dots, x_t) \tag{7}$$
Here, $\mathcal{V}$ denotes the model’s token vocabulary, which includes synthetic labels representing part-of-speech categories and their variants (e.g., noun1, adj2). The model autoregressively generates one token at a time, conditioned solely on the preceding tokens.
Example Prompt: Given the sequence: "noun1 verb2 adj1", the task requires predicting the next token, such as: "noun3", depending on the dataset’s underlying syntactic or semantic generation rules.
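To make the setup concrete, the sketch below turns a sentence into (context, target) training pairs and computes the average negative log-likelihood and perplexity for a set of per-step target probabilities. The probabilities are made-up placeholders rather than model outputs, and the helper names are hypothetical.

```python
import math

def next_token_pairs(tokens):
    """Return (context, target) pairs for autoregressive training."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

def nll(prob_of_target):
    """Average negative log-likelihood over per-step target probabilities."""
    return -sum(math.log(p) for p in prob_of_target) / len(prob_of_target)

pairs = next_token_pairs("noun1 verb2 adj1 noun3".split())
# The last pair is the example prompt: context "noun1 verb2 adj1", target "noun3".

# Toy probabilities a model might assign to each correct target (assumed).
loss = nll([0.5, 0.25, 0.8])
perplexity = math.exp(loss)
```

Perplexity is the exponential of the average loss, so a value near 1 means the model assigns almost all probability mass to the correct next token.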
We focus on NTP because it reflects the core objective used in many widely adopted pretrained language models (e.g., GPT-style models), making it a natural and effective setting for examining representational and interpretive behaviours in a controlled environment.
B.3 Main Models Training Configuration
All models are trained using next-token prediction with the Adam optimiser [28] with learning rate , and 1000 warm-up steps. Training was conducted for significantly extended epochs beyond the computed minimum requirements of the discussed scaling laws (Section B.1), consistent with the observations of Nanda et al. [36] on the "grokking" phenomenon, where generalisation emerges abruptly after an initial memorisation phase. We consider this an extension of the training strategies outlined in Csordás et al. [12], which emphasise the importance of avoiding early stopping to fully exploit the learning capacity of neural models.
Specifically, we trained:
- Small model: 500 epochs (82× the minimum of 6 epochs)
- Medium model: 500 epochs (26× the minimum of 19 epochs)
- Large model: 500 epochs (12× the minimum of 42 epochs)
We record dense checkpoints throughout training (every 500 training steps), extracting hidden states and gradients from all layers to compute our diagnostic metrics.
We applied minimal regularisation and continuously monitored loss, accuracy, gradient dynamics, representational similarity, and mutual information (MI) throughout training to detect potential phase transitions. This methodological rigour enables us to assess the relationship between model structure and task complexity under controlled and interpretable conditions, while ensuring that all models are trained in accordance with modern scaling principles. To ensure correctness and reproducibility, all experiments were repeated at least 5 times with different random seeds, and we report the averaged results.
B.4 Model performance
All models demonstrate strong performance on the downstream generation task after their respective phase transitions. Table 3 summarises the evaluation metrics across model scales.
Table 3: Evaluation metrics across model scales.
Model | Exact Match | Token Accuracy | BLEU Score | Perplexity |
---|---|---|---|---|
Small | 0.84 | 0.98 | 0.20 | 1.22 |
Medium | 0.98 | 0.99 | 0.21 | 1.08 |
Large | 0.98 | 0.99 | 0.21 | 1.07 |
B.6 Probing Framework and Label Construction
To investigate the interpretability and internal structure of our models, we implemented a probing framework that trains lightweight classifiers on the frozen hidden representations extracted from each layer of our trained models. Specifically, for every layer in the tested models, we trained and evaluated both a part-of-speech (POS) probe and a semantic role labelling (SRL) probe.
Each probe is implemented as a feedforward neural network comprising three linear layers interleaved with ReLU activations and dropout. The network is trained using binary cross-entropy loss, with sigmoid activations at the output layer to support multi-label classification. Given an input representation $h$, the probe computes:
$$\hat{y} = \sigma\big(W_3\,\mathrm{ReLU}(W_2\,\mathrm{ReLU}(W_1 h))\big)$$
where $W_1, W_2, W_3$ are trainable weight matrices and $\sigma$ denotes the element-wise sigmoid function.
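The probe's forward pass and loss can be sketched in plain Python as below. This is a dependency-free illustration rather than the training code: dropout and bias terms are omitted, the weights are random, and the number of labels (12) and all function names are assumptions. The hidden size of 256 follows the probe configuration reported in this appendix.

```python
import math
import random

def linear(weights, x):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def probe_forward(h, w1, w2, w3):
    """Three linear layers with ReLU in between and a sigmoid output."""
    return sigmoid(linear(w3, relu(linear(w2, relu(linear(w1, h))))))

def bce(pred, target, eps=1e-12):
    """Binary cross-entropy averaged over labels (multi-label setting)."""
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(pred, target)
    ) / len(target)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_hidden, n_labels = 64, 256, 12  # n_labels is illustrative
h = [rng.gauss(0.0, 1.0) for _ in range(d_model)]  # stand-in hidden state
w1 = random_matrix(d_hidden, d_model, rng)
w2 = random_matrix(d_hidden, d_hidden, rng)
w3 = random_matrix(n_labels, d_hidden, rng)
probs = probe_forward(h, w1, w2, w3)
loss = bce(probs, [1] + [0] * (n_labels - 1))
```

Each output is an independent per-label probability, which is why a sigmoid with binary cross-entropy is used instead of a softmax over labels.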
Table 4: POS probe performance for the small model.
Label | Count | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|---|
NOUN | 279984 | 1.000 | 1.000 | 1.000 | 1.000 |
TRANSITIVE_VERB | 151232 | 0.836 | 0.577 | 0.836 | 0.683 |
INTRANSITIVE_VERB | 59584 | 0.073 | 1.000 | 0.073 | 0.136 |
COMMUNICATION_VERB | 54352 | 0.238 | 1.000 | 0.238 | 0.384 |
MOTION_VERB | 68368 | 0.344 | 0.659 | 0.344 | 0.452 |
CHANGE_VERB | 14240 | 0.562 | 0.510 | 0.562 | 0.535 |
ADJ | 59072 | 0.239 | 0.535 | 0.239 | 0.331 |
LOCATION | 101984 | 0.315 | 0.858 | 0.315 | 0.461 |
TEMP | 37664 | 0.184 | 0.535 | 0.184 | 0.274 |
PREP | 115264 | 0.387 | 0.823 | 0.387 | 0.526 |
RESULT | 68592 | 0.292 | 0.746 | 0.292 | 0.420 |
CONJ | 51248 | 0.350 | 0.874 | 0.350 | 0.500 |
Probes are trained independently for each layer in the model, allowing us to analyse the emergence and distribution of linguistic and functional features across depth. Representations from each layer are frozen and taken from an already trained model following our training setting explained in Section B, and the underlying model weights are not updated during probing.
Due to the synthetic nature of our dataset, both POS and semantic labels are deterministically derived from token names. For example, a token such as noun3 is assigned the POS label NOUN and may additionally be annotated with semantic roles such as AGENT or ENTITY, depending on the symbolic structure of the task. This approach eliminates annotation ambiguity and ensures consistent supervision across examples.
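Deriving a label from a token name amounts to splitting off the numeric suffix and looking up the prefix. The sketch below shows this for POS labels; the prefix map is a hypothetical subset, and the `OTHER` fallback is an assumption rather than the full ABSynth mapping.

```python
import re

# Hypothetical prefix-to-label map; the full mapping is not reproduced here.
POS_BY_PREFIX = {"noun": "NOUN", "verb": "VERB", "adj": "ADJ", "prep": "PREP"}

def pos_label(token):
    """Derive the POS label from a naturalised token name such as 'noun3'."""
    match = re.fullmatch(r"([a-z_]+)(\d+)", token)
    if match is None:
        raise ValueError(f"unexpected token name: {token!r}")
    # Prefixes missing from the map fall back to OTHER in this sketch.
    return POS_BY_PREFIX.get(match.group(1), "OTHER")

labels = [pos_label(tok) for tok in "noun3 verb2 adj1 noun7".split()]
```

Since the mapping is a pure function of the token name, the same sentence always receives the same supervision signal.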
Table 5: POS probe performance for the medium model.
Label | Count | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|---|
NOUN | 279984 | 1.000 | 1.000 | 1.000 | 1.000 |
TRANSITIVE_VERB | 151232 | 0.779 | 0.655 | 0.779 | 0.711 |
INTRANSITIVE_VERB | 59584 | 0.291 | 1.000 | 0.291 | 0.451 |
COMMUNICATION_VERB | 54352 | 0.476 | 1.000 | 0.476 | 0.645 |
MOTION_VERB | 68368 | 0.718 | 0.678 | 0.718 | 0.697 |
CHANGE_VERB | 14240 | 0.875 | 0.510 | 0.875 | 0.644 |
ADJ | 59072 | 0.603 | 0.558 | 0.603 | 0.580 |
LOCATION | 101984 | 0.641 | 0.843 | 0.641 | 0.729 |
TEMP | 37664 | 0.344 | 0.535 | 0.344 | 0.419 |
PREP | 115264 | 0.856 | 0.808 | 0.856 | 0.831 |
RESULT | 68592 | 0.585 | 0.746 | 0.585 | 0.655 |
CONJ | 51248 | 0.545 | 1.000 | 0.545 | 0.705 |
Given the supervised design of our dataset, we employ linear probes due to their demonstrated effectiveness in similar contexts. While alternative methods such as Sparse Autoencoders (SAEs) have shown promise in unsupervised settings (Cunningham et al. [13]), linear probes remain a robust and interpretable choice for supervised feature probing (Kantamneni et al. [25]; Smith et al. [48]).
B.7 Probe Models Training Configuration
All probes were trained for 30 epochs using the Adam optimiser [28] with a learning rate of , a hidden dimension of 256, and a dropout rate of 0.5. Input dimensionality matched the model’s hidden state size.
B.8 Probing performance
We report the average performance of the POS probes in Tables 4, 5, and 6 for the small, medium, and large models, respectively. We report the average performance of the semantic probes in Tables 7, 8, and 9 for the small, medium, and large models, respectively.
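When reading these tables, it helps to recall that F1 is the harmonic mean of precision and recall. The short check below recomputes F1 for two rows of the first POS probe table and compares the result with the reported value; the function name `f1_score` is a local helper, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from two rows of the first POS table.
rows = {
    "TRANSITIVE_VERB": (0.577, 0.836, 0.683),
    "INTRANSITIVE_VERB": (1.000, 0.073, 0.136),
}
for label, (p, r, reported) in rows.items():
    print(label, round(f1_score(p, r), 3), reported)
```

The second row shows how a probe with perfect precision but very low recall still ends up with a low F1.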




B.9 Probing Extended Results
We present extended probing results across all model scales (Small, Medium, and Large) for both part-of-speech (POS) and semantic role categories. Figures 6, 7, and 8 show model confidence scores for each layer across training steps.

To assess whether abstraction emerges during training rather than being encoded by architecture alone, we also trained probes on frozen hidden states from randomly initialised models. These models exhibited near-zero performance across all linguistic categories, confirming the absence of structured representations at initialisation.
Figures 9, 10, and 11 show the performance of semantic and POS probes on randomly initialised models.






Table 6: POS probe performance for the large model.
Label | Count | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|---|
NOUN | 279984 | 1.000 | 1.000 | 1.000 | 1.000 |
TRANSITIVE_VERB | 151232 | 0.772 | 0.657 | 0.772 | 0.710 |
INTRANSITIVE_VERB | 59584 | 0.309 | 0.895 | 0.309 | 0.459 |
COMMUNICATION_VERB | 54352 | 0.476 | 1.000 | 0.476 | 0.645 |
MOTION_VERB | 68368 | 0.757 | 0.667 | 0.757 | 0.709 |
CHANGE_VERB | 14240 | 1.000 | 0.510 | 1.000 | 0.676 |
ADJ | 59072 | 0.603 | 0.558 | 0.603 | 0.580 |
LOCATION | 101984 | 0.654 | 0.829 | 0.654 | 0.731 |
TEMP | 37664 | 0.367 | 0.535 | 0.367 | 0.436 |
PREP | 115264 | 0.856 | 0.808 | 0.856 | 0.831 |
RESULT | 68592 | 0.572 | 0.754 | 0.572 | 0.650 |
CONJ | 51248 | 0.545 | 1.000 | 0.545 | 0.705 |
Table 7: Semantic probe performance for the small model.
Label | Count | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|---|
AGENT | 268576 | 1.000 | 0.959 | 1.000 | 0.979 |
PATIENT | 151232 | 0.852 | 0.576 | 0.852 | 0.687 |
ACTION | 279984 | 1.000 | 1.000 | 1.000 | 1.000 |
LOCATION | 101984 | 0.315 | 0.858 | 0.315 | 0.461 |
RELATION | 115264 | 0.387 | 0.823 | 0.387 | 0.526 |
CONNECTOR | 51248 | 0.321 | 0.950 | 0.321 | 0.480 |
RESULT | 68592 | 0.305 | 0.731 | 0.305 | 0.431 |
OTHER | 153248 | 0.874 | 0.642 | 0.874 | 0.740 |
Table 8: Semantic probe performance for the medium model.
Label | Count | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|---|
AGENT | 268576 | 1.000 | 0.959 | 1.000 | 0.979 |
PATIENT | 151232 | 0.785 | 0.652 | 0.785 | 0.713 |
ACTION | 279984 | 1.000 | 1.000 | 1.000 | 1.000 |
LOCATION | 101984 | 0.641 | 0.843 | 0.641 | 0.729 |
RELATION | 115264 | 0.856 | 0.808 | 0.856 | 0.831 |
CONNECTOR | 51248 | 0.574 | 0.944 | 0.574 | 0.714 |
RESULT | 68592 | 0.546 | 0.771 | 0.546 | 0.639 |
OTHER | 153248 | 0.893 | 0.818 | 0.893 | 0.854 |
Table 9: Semantic probe performance for the large model.
Label | Count | Accuracy | Precision | Recall | F1 Score |
---|---|---|---|---|---|
AGENT | 268576 | 1.000 | 0.959 | 1.000 | 0.979 |
PATIENT | 151232 | 0.762 | 0.659 | 0.762 | 0.707 |
ACTION | 279984 | 1.000 | 1.000 | 1.000 | 1.000 |
LOCATION | 101984 | 0.641 | 0.843 | 0.641 | 0.729 |
RELATION | 115264 | 0.856 | 0.808 | 0.856 | 0.831 |
CONNECTOR | 51248 | 0.545 | 1.000 | 0.545 | 0.705 |
RESULT | 68592 | 0.390 | 0.969 | 0.390 | 0.556 |
OTHER | 153248 | 0.893 | 0.818 | 0.893 | 0.854 |
B.10 Computational Resources and Software Environment
All experiments were run on an NVIDIA RTX A6000. We used Python (v3.9.21), PyTorch (v2.6.0), scikit-learn (v1.6.1), scipy (v1.12.0), numpy (v1.26.4), matplotlib (v3.9.4), and seaborn (v0.13.2). Training the small models took around 21 minutes without tracking and 116 minutes with tracking of all metrics. The medium models took around 28 minutes without tracking and 130 minutes with full tracking, and the large models around 34 minutes and 139 minutes, respectively. Training the probes required approximately 9, 19, and 28 minutes for the small, medium, and large models, respectively.
Appendix C Mutual Information Estimation with MINE
To quantify the flow and compression of information within transformer models, we estimate mutual information (MI) between input embeddings and internal layer representations. While several methods exist for MI estimation (e.g., k-nearest neighbours, contrastive approaches), we adopt Mutual Information Neural Estimation (MINE) [6] due to its scalability and effectiveness in high-dimensional settings.
In theory, a systematic decline in MI across layers and training steps should signal abstraction: the model progressively discards surface-level details while retaining task-relevant structure. This compression-based view of abstraction has been explored in other architectures [27, 16], and we examined whether similar dynamics emerge in transformer models during training.
Specifically, we tracked two MI quantities: (i) $I(X; H_\ell)$, the mutual information between the input embeddings $X$ and the hidden states $H_\ell$ at layer $\ell$, and (ii) $I(H_\ell; H_{\ell+1})$, the mutual information between consecutive hidden layers.
Despite the theoretical appeal, our empirical findings (see Figure 12) showed that MI was highly variable across training steps and did not consistently align with the phase transitions identified via curvature or intrinsic dimensionality. These results suggest that MI, while informative in principle, may lack the temporal resolution and stability needed to serve as a primary diagnostic in TRACE.
We report full implementation details, training settings, and estimator architecture below.
C.1 MINE Objective and Architecture
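MINE approximates the Donsker-Varadhan lower bound on mutual information using a neural critic function $T_\theta$, parameterised by a multilayer perceptron (MLP). Given joint samples $(x, h)$ and marginal samples $(x, \tilde{h})$ formed by pairing $x$ with independently sampled $\tilde{h}$, the MI is estimated via:
$$\hat{I}(X; H) = \sup_{\theta} \; \mathbb{E}_{(x, h) \sim p(x, h)}\left[T_\theta(x, h)\right] - \log \mathbb{E}_{(x, \tilde{h}) \sim p(x)\,p(h)}\left[e^{T_\theta(x, \tilde{h})}\right] \tag{8}$$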
Our implementation uses a 3-layer MLP with hidden dimensions and ReLU activations. We use MINE to estimate $I(X; H_\ell)$, the MI between the input embeddings and layer $\ell$, and also $I(H_\ell; H_{\ell+1})$, which captures how information is transmitted between adjacent layers. This allows us to analyse information bottlenecks, compression phases, and abstraction dynamics across the depth of the model.
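To make the objective concrete, the following is a minimal sketch of how a Donsker-Varadhan estimate is computed once critic scores are available. It relies on the standard form of the bound (mean critic score on joint pairs minus the log of the mean exponentiated score on marginal pairs). The function names `log_mean_exp` and `dv_lower_bound` are hypothetical, and the critic scores are assumed to be precomputed lists of floats; this is not the paper's implementation, which trains an MLP critic in PyTorch.

```python
import math


def log_mean_exp(values):
    """Numerically stable log(mean(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan estimate from critic scores (hypothetical helper).

    joint_scores:    critic outputs on pairs sampled from the joint
    marginal_scores: critic outputs on pairs sampled from the product
                     of marginals (e.g. obtained by shuffling one side)
    """
    joint_term = sum(joint_scores) / len(joint_scores)
    return joint_term - log_mean_exp(marginal_scores)


# A critic that cannot tell joint from marginal pairs yields an estimate of 0.
print(dv_lower_bound([0.0, 0.0], [0.0, 0.0]))
```

During MINE training, the critic parameters are updated to maximise this quantity, so the reported MI is the value reached at convergence. The max-subtraction in `log_mean_exp` avoids overflow when critic scores grow large.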
C.2 Training and Evaluation Protocol
For each chosen training step of the model and its layers, we train a separate MINE estimator to convergence. The training protocol is as follows:
- Batch size: 128 examples
- Optimiser: Adam [28] with learning rate 0.001
- Training steps: 200 iterations
- Positive samples: Joint pairs from the same forward pass
- Negative samples: $(x, \tilde{h})$, where $\tilde{h}$ is obtained by shuffling $h$ across the batch
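The sampling scheme in the protocol above can be sketched as follows. This is an illustrative, framework-free version: `make_mine_batch` is a hypothetical helper, and the inputs are assumed to be two aligned lists holding one batch of input embeddings and the matching layer representations.

```python
import random


def make_mine_batch(inputs, hiddens, rng):
    """Build positive (joint) and negative (marginal) pairs for one MINE step.

    inputs, hiddens: aligned sequences from the same forward pass
    rng:             a random.Random instance, passed in for reproducibility
    """
    if len(inputs) != len(hiddens):
        raise ValueError("inputs and hiddens must have the same length")
    # Positive samples: each input paired with its own representation.
    positives = list(zip(inputs, hiddens))
    # Negative samples: representations shuffled across the batch,
    # which breaks the input-representation correspondence.
    shuffled = list(hiddens)
    rng.shuffle(shuffled)
    negatives = list(zip(inputs, shuffled))
    return positives, negatives


rng = random.Random(0)
pos, neg = make_mine_batch([1, 2, 3, 4], ["a", "b", "c", "d"], rng)
```

A plain shuffle can leave some elements paired with their original partner. With a batch of 128 examples this affects only a small fraction of the negative pairs, so it is usually tolerated rather than corrected.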
C.3 MI Across Layers and Models
Figure 12 shows how mutual information evolves during training across model scales. In small models (top-left), MI between the embedding and subsequent layers drops rapidly and stabilises early, suggesting early-stage compression and limited representational differentiation.



In medium models (top-right), we observe a pronounced dip in MI that aligns temporally with the phase transition identified in our diagnostics. This suggests that representational compression may act as a precursor or trigger for abstraction. Following this dip, MI values remain volatile across layers, reflecting a noisier or less stable reconfiguration of internal representations.
Large models (bottom) follow a broadly similar trend, with one key difference: the MI dip appears for most layer-to-layer transitions but is less evident or absent for others (Layer 1 → Layer 2).
While the timing of these dips often aligns with the abstraction phase transition, the metric remains highly volatile. This instability suggests that mutual information, although partially correlated with representational restructuring, may be too noisy and inconsistent to serve as a reliable standalone indicator of abstraction onset.