TRACE for Tracking the Emergence of Semantic Representations in Transformers

Nura Aljaafari1†,  Danilo S. Carvalho3,  André Freitas1,2,3
1 Department of Computer Science, University of Manchester, United Kingdom
2 Idiap Research Institute, Switzerland
3 National Biomarker Centre, CRUK-MI, University of Manchester, United Kingdom
{firstname.lastname}@[postgrad.]manchester.ac.uk
Abstract

Modern transformer models exhibit phase transitions during training, distinct shifts from memorisation to abstraction, but the mechanisms underlying these transitions remain poorly understood. Prior work has often focused on endpoint representations or isolated signals like curvature or mutual information, typically in symbolic or arithmetic domains, overlooking the emergence of linguistic structure. We introduce TRACE (Tracking Representation Abstraction and Compositional Emergence), a diagnostic framework combining geometric, informational, and linguistic signals to detect phase transitions in Transformer-based LMs. TRACE leverages a frame-semantic data generation method, ABSynth, that produces synthetic corpora with controllable complexity, lexical distributions, and structural entropy, fully annotated with linguistic categories, enabling precise analysis of abstraction emergence. Experiments reveal that (i) phase transitions align with clear intersections between curvature collapse and dimension stabilisation; (ii) these geometric shifts coincide with emerging syntactic and semantic accuracy; and (iii) abstraction patterns persist across architectural variants, with components like feedforward networks affecting optimisation stability rather than fundamentally altering trajectories. This work advances our understanding of how linguistic abstractions emerge in LMs, offering insights into model interpretability, training efficiency, and compositional generalisation that could inform more principled approaches to LM development.

1 Introduction

Transformer models¹ exhibit evolving internal representations during training, with recent work showing that these representations undergo phase transitions, i.e., abrupt shifts in representational structure, generalisation behaviour, and learning dynamics [31, 36, 41]. These transitions mark critical points where models reorganise internal representations and develop increasingly abstract, structured encodings [4, 49, 31].

¹ Throughout this paper, we use the term "transformer" to refer to Transformer-based architectures as implemented in Vaswani et al. [50].

Understanding the mechanisms and timing of these shifts is essential for interpretability, model steering, and failure mode detection [20]. While prior studies have characterised models’ behaviour at convergence or via final-layer probes, less is known about how internal linguistic structures form over the course of training. Furthermore, much of the literature on transformer interpretability and representation focuses on algorithmic or mathematical tasks [33, 53, 54], or examines geometric properties at the level of individual tokens or local concepts [39, 49]. These works leave open the question of how holistic, sentence-level semantic representations arise in transformers.

We address this gap by introducing TRACE: Tracking Representation Abstraction and Compositional Emergence (Fig. 1), a diagnostic framework that combines geometric, linguistic, and information-theoretic signals to characterise how transformers transition from memorisation to abstraction².

² We use "compositional emergence" to denote the formation of structured internal representations (e.g., roles, syntactic categories), rather than formal compositional generalisation.

Our central hypothesis is that abstraction emerges through a measurable phase transition, marked by: (i) characteristic rise-then-stabilise patterns in intrinsic dimensionality of hidden representations; (ii) transient spikes in loss curvature; (iii) surges in linguistic category alignment, particularly for part-of-speech and semantic accuracy; and (iv) decrease in mutual information between input and hidden representations. We test whether these phenomena, each informative in isolation, exhibit coordinated temporal dynamics that can serve as reliable sentence-level representation/abstraction indicators.

Figure 1: Overview of the TRACE framework, which integrates the monitoring of (i) intrinsic dimensionality of hidden states, (ii) spectral curvature complexity of the loss landscape, and (iii) linguistic alignment via probing and output accuracy. Inputs are sampled from ABSynth, our proposed synthetically generated corpus grounded on frame-based representations and controlled distributions over entropy, frequency, and complexity.

To isolate model dynamics from data confounds, we introduce a synthetic data generation framework, ABSynth, based on formal frame semantics [19, 5]. Unlike template-based approaches, ABSynth samples from abstract event frames with predefined semantic roles (agent, patient, etc.), producing corpora with transparent syntactic and semantic structure. We instantiate this framework with ABSynth25K, enabling precise tracking of how theoretically-motivated linguistic abstractions emerge across layers and training iterations.

This work addresses the following research questions:

  • What geometric and statistical signals accompany the transition from memorisation to abstraction w.r.t. sentence-level representations?

  • When do syntactic and semantic categories emerge in transformer representations over training?

  • What mechanisms and training dynamics trigger these phase transitions, and how do architectural and optimisation factors influence their onset?

This paper makes three key contributions. First, we introduce TRACE, a unified diagnostic framework that jointly tracks abstraction and early representational structure formation using coordinated geometric (curvature), statistical (mutual information, intrinsic dimensionality), and linguistic (probing-based) signals throughout training. Second, we propose an original spectral curvature complexity measure $\mathcal{C}(H)$ characterising loss landscape properties. Third, we present ABSynth, a frame-semantics-grounded synthetic sentence generation framework, from which we derive the supporting ABSynth25K corpus for controlled analysis of representational dynamics.

Across all experiments, we observe a consistent phase transition, indicated by coordinated shifts in curvature flattening, intrinsic dimensionality, and probe performance, which marks the onset of abstraction. These patterns persist across model scales and ablation variants, pointing to structural regularities in transformer learning dynamics. Understanding these transitions supports designing more interpretable, adaptive, and resource-efficient language models.

2 Related Work

Phase Transitions in Training Dynamics.

Representational transitions during training have been documented in small-scale settings like grokking [36, 41], where models shift from memorisation to generalisation. Lee et al. [31] attributed these to geometric reorganisation, while Clauw et al. [11] identified emergent information-theoretic structure. Nakkiran et al. [35] described double-descent phenomena, where generalisation is preceded by changes in spectral behaviour. Stagewise development has also been observed in attention heads and induction circuits [37].

Loss Landscape Geometry and Generalisation.

Curvature properties help illuminate learning dynamics in neural networks [7, 51]. Early work linked sharp minima to overfitting [26], while later studies found flatter regions correspond to better generalisation [21, 44]. Spectral metrics like Hessian trace and effective rank capture curvature anisotropy [2, 7], with recent transformer analyses showing systematic evolution of these metrics during training [23, 51]. Empirically, models trained in overparameterised regimes often exhibit flat Hessian spectra with many near-zero eigenvalues [46], corresponding to improved generalisation, indicative of abstract representation formation [2].

Intrinsic Dimensionality and Representation Compression.

Intrinsic dimensionality (ID) serves as a proxy for the representational complexity of neural networks [1, 17, 4]. Under the manifold hypothesis, high-dimensional activations are assumed to lie on lower-dimensional submanifolds [8], with the ID reflecting the necessary degrees of freedom to explain observed variation. During training, representations typically exhibit a rise–fall pattern—initially increasing as features entangle, then compressing as abstraction emerges [4, 10, 49]. While Aghajanyan et al. [1] showed pre-trained models can be fine-tuned in low-dimensional reparameterisations, Cheng et al. [9] and Lee et al. [31] correlate geometric compression with linguistic information acquisition. Recent work confirms linguistic features occupy low-dimensional subspaces [42] and that compressibility enables compositional generalisation [16]. ID estimation methods range from PCA [34] to geometric approaches like TWO-NN [17] and GRIDE [15].

Intermediate Layers and Representation Studies.

Recent studies highlight intermediate layers’ role in shaping model representations [38, 18, 47]. These layers often show stronger linguistic alignment than final layers [47]. Lepori et al. [32] introduced structural compositionality, revealing that models can decompose complex tasks into modular subroutines, with intermediate layers playing a crucial role in this decomposition process. Related work on symbolic domains examines abstraction in transformers trained on code or algorithmic tasks [33, 54, 53], though these focus on task-specific behaviours rather than semantic abstraction in linguistic representations.

Synthetic Datasets.

Several synthetic benchmarks have been developed to study abstraction and generalisation in neural models, though most focus on algorithmic or symbolic reasoning tasks rather than grounded linguistic structure. SCAN [29] tests systematic compositional skills through command-to-action mappings, while PCFG-based datasets [24] probe models’ syntactic generalisation abilities using controlled linguistic commands. Mathematical reasoning datasets [45] and algorithmic tasks [41] provide controlled environments for studying learning dynamics but lack linguistic structure.

Unlike prior work that investigates abstraction using symbolic or algorithmic datasets with limited linguistic grounding, our approach targets sentence-level semantic abstraction in transformer models trained on structured, English-like input. We introduce a synthetic corpus generator grounded in frame semantics, enabling precise control over contextual entropy, role structure, and token distributions. This design reflects key properties of natural language while retaining full annotation and sampling transparency. While previous studies provide valuable insights, they often examine a single diagnostic signal in isolation, and are typically restricted to tasks or domains that do not generalise to realistic linguistic settings or scale to larger models. By contrast, our multi-signal approach offers a holistic view of how abstraction emerges during training. This principled integration enables more transferable and interpretable analysis of representation learning in modern transformers.

3 Methodology

As shown in Fig. 1, our method integrates the following signals to detect phase transitions during transformer training: spectral curvature of the loss landscape, intrinsic dimensionality of representations, and linguistic category alignment. Our intuition is that these metrics capture complementary aspects of representation learning: optimisation dynamics reflect updates in model weights, geometric measures and probing reveal reorganisation of semantic relationships, and linguistic alignment reflects emergent structure in the model’s outputs. We define semantic abstraction as the model’s ability to internalise role-based generalisations that extend beyond surface lexical forms; for example, recognising the ARG2 role regardless of whether it is realised as "noun3" or "location22". In this setting, abstraction is evidenced when internal representations align with underlying semantic functions rather than memorised token identities. The following sections detail each aspect, along with our synthetic data generator and experimental setup.

3.1 Spectral Signals of Loss Landscape Geometry

We characterise loss landscape geometry using Hessian-based curvature metrics to detect structural shifts in representation learning. We adopt a scalable approximation via the Lanczos algorithm [30], which estimates the top $K \ll N$ eigenvalues of the Hessian using efficient Hessian-vector products. Let $\mathcal{L}(\theta)$ be the training loss with gradient $g = \nabla_{\theta}\mathcal{L}(\theta)$ and Hessian $H_{\theta} = \nabla^{2}_{\theta}\mathcal{L}(\theta)$. We compute the truncated spectrum $\{\lambda_i\}_{i=1}^{K}$ where $K \ll N$, motivated by observations that curvature information concentrates in dominant modes [44]. Our spectral metrics include:

  • Hessian Trace: $\mathrm{Tr}(H_{\theta}) = \sum_{i=1}^{N} \lambda_i$, quantifying overall curvature magnitude. Decreasing trace indicates flatter minima associated with improved generalisation [44, 52, 2].

  • Entropy-Based Effective Rank: $r_{\text{eff}} = \exp\left(-\sum_{i} p_i \log p_i\right)$, where $p_i = \frac{|\lambda_i|}{\sum_{j} |\lambda_j|}$; this Shannon entropy-based measure [43] quantifies dominant curvature directions. Low values reflect curvature concentration (anisotropy), associated with abstraction; high values indicate distributed, isotropic curvature, associated with memorisation [16].
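As a rough illustration of the Hessian-vector products mentioned above, the following sketch approximates $H_{\theta} v$ by central finite differences of the gradient and estimates the dominant eigenvalue on a toy quadratic loss. It uses plain power iteration as a simpler stand-in for the Lanczos algorithm used in this work. The matrix `A` and the functions `grad`, `hvp`, and `top_eigenvalue` are hypothetical names introduced only for this example.

```python
import math

# Toy quadratic loss L(theta) = 0.5 * theta^T A theta, so the Hessian equals A.
# A is an illustrative assumption, not a quantity taken from the paper.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.0],
     [0.0, 0.0, 0.5]]

def grad(theta):
    """Gradient of the toy loss: A @ theta."""
    return [sum(a * t for a, t in zip(row, theta)) for row in A]

def hvp(theta, v, eps=1e-4):
    """Hessian-vector product via central differences of the gradient."""
    plus = grad([t + eps * x for t, x in zip(theta, v)])
    minus = grad([t - eps * x for t, x in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

def top_eigenvalue(theta, iters=200):
    """Power iteration on v -> Hv; returns the Rayleigh quotient v^T H v."""
    v = [1.0] * len(theta)
    for _ in range(iters):
        w = hvp(theta, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return sum(x * y for x, y in zip(v, hvp(theta, v)))

lam_max = top_eigenvalue([0.1, -0.2, 0.3])
```

In practice the Hessian is never materialised; automatic differentiation would supply the Hessian-vector product, and a Lanczos routine would return the top $K$ eigenvalues rather than only the largest one.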

To unify curvature magnitude and directional concentration, we define a curvature complexity score:

$$\mathcal{C}(H) = \frac{\mathrm{Tr}(H)}{\sqrt{r_{\text{eff}}}}. \tag{1}$$

This measure increases with both the overall curvature and its spectral concentration. High $\mathcal{C}(H)$ values correspond to sharp, anisotropic curvature, often reflecting representational reorganisation or memorisation. In contrast, low values denote flatter, more isotropic landscapes, typically aligned with abstraction and generalisation.
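The score in Eq. (1) can be sketched from a list of eigenvalues as follows. This is a minimal illustration that assumes the trace is approximated by the sum of the supplied (possibly truncated) spectrum; `effective_rank` and `curvature_complexity` are hypothetical helper names, and the two example spectra are made up.

```python
import math

def effective_rank(eigs):
    """exp of the Shannon entropy of the normalised |eigenvalue| distribution."""
    total = sum(abs(l) for l in eigs)
    probs = [abs(l) / total for l in eigs if l != 0.0]
    return math.exp(-sum(p * math.log(p) for p in probs))

def curvature_complexity(eigs):
    """C(H) = Tr(H) / sqrt(r_eff), with the trace taken over the given spectrum."""
    return sum(eigs) / math.sqrt(effective_rank(eigs))

# Two made-up spectra with (approximately) the same trace.
isotropic = [1.0, 1.0, 1.0, 1.0]      # curvature spread evenly across directions
anisotropic = [3.7, 0.1, 0.1, 0.1]    # curvature concentrated in one direction

c_iso = curvature_complexity(isotropic)
c_aniso = curvature_complexity(anisotropic)
```

With equal trace, the concentrated spectrum has a smaller effective rank and therefore a larger score, which matches the statement that the measure grows with spectral concentration.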

3.2 Intrinsic Dimensionality

We characterise abstraction in transformer representations through the lens of intrinsic dimensionality (ID), motivated by the manifold hypothesis [8]. Given hidden representations $Z \in \mathbb{R}^{D}$, ID is defined as the minimal number of degrees of freedom required to locally parameterise the data distribution [17]. That is, although $Z$ may lie in a high-dimensional space, it may concentrate around a lower-dimensional manifold $\mathcal{M} \subset \mathbb{R}^{D}$ of dimension $d \ll D$. We adopt the TWO-NN estimator [17], a non-parametric, maximum likelihood estimator based on local geometry. Given a batch of activation vectors $\{x_i\}_{i=1}^{N}$, the intrinsic dimension is estimated as:

$$\text{ID} = \left(\frac{1}{N}\sum_{i=1}^{N}\log\left(\frac{r_2(x_i)}{r_1(x_i)}\right)\right)^{-1}, \tag{2}$$

where $r_1(x_i)$ and $r_2(x_i)$ denote distances to the first and second nearest neighbours of $x_i$, respectively. This approach requires no tuning parameters and assumes only uniform local density. To provide a more holistic view, we average the ID over the model layers as $\overline{ID}^{(t)} = \frac{1}{L}\sum_{\ell=1}^{L} ID_{\ell}^{(t)}$, and use $\overline{ID}^{(t)}$ in our analysis. By computing ID across training steps and network layers, we capture the dynamic evolution of representational structure. We hypothesise a characteristic trajectory aligned with other metrics: low initial ID during early training, rising ID during transition, and stabilisation or decrease as the model projects data onto semantically coherent structures, signalling abstraction.
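A minimal sketch of Eq. (2) and the layer average is given below, using brute-force neighbour search in pure Python. The helper names `two_nn_id` and `mean_layer_id` and the synthetic test data are assumptions made for illustration; skipping exact duplicate points is also our choice, since the distance ratio is undefined for them.

```python
import math
import random

def two_nn_id(points):
    """Eq. (2): inverse of the mean log-ratio of 2nd to 1st neighbour distances."""
    log_ratios = []
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0.0:  # skip exact duplicates, which make the ratio undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)

def mean_layer_id(layer_activations):
    """Average the per-layer estimates, as in the layer-averaged ID."""
    ids = [two_nn_id(acts) for acts in layer_activations]
    return sum(ids) / len(ids)

# Illustrative data: a 2-D square embedded in 5-D space (three constant coordinates).
rng = random.Random(0)
plane = [[rng.random(), rng.random(), 0.0, 0.0, 0.0] for _ in range(400)]
estimate = two_nn_id(plane)
```

Although the points live in five dimensions, the estimate should land near 2, the dimension of the underlying manifold. The quadratic-time neighbour search is only suitable for small batches.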

3.3 Linguistic Category Alignment

We evaluate abstraction emergence through two complementary approaches: internal representation probing and output generation analysis. For each, we examine both semantic roles (e.g., AGENT, PATIENT) for event structure understanding, and part-of-speech (POS) categories for syntactic abstraction.

Internal Probing.

We apply diagnostic probes $p_c^{(\ell)}$ to hidden states at layer $\ell$ to measure category-specific confidence scores during training. These probes quantify how well internal representations encode linguistic structures at each training step $t$:

$$\text{Conf}_c^{(\ell,t)} = \frac{1}{|\mathcal{B}|}\sum_{x \in \mathcal{B}} p_c^{(\ell)}(h_{\ell}(x)), \tag{3}$$

where $h_{\ell}(x)$ is the hidden representation, $\mathcal{B}$ is the evaluation batch, and $p_c^{(\ell)}$ is a linear classifier trained on frozen model checkpoints. Probes are used to capture the evolving alignment between internal features and abstract linguistic categories. To validate that observed linguistic alignment reflects learned abstraction rather than randomness, we trained probes on randomly initialised models. These probes performed at or near chance, confirming that linguistic features are not encoded prior to training. This supports the view that abstraction emerges progressively and is localised in stages as training evolves. Full results are reported in Appendix B.9.
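The batch average in Eq. (3) can be sketched as follows. This example assumes the probe is a linear layer followed by a softmax and that the confidence for category $c$ is the probability assigned to $c$; the actual probe architecture is described in Appendix B.6. The weights, bias, and hidden states are made-up values, and `probe_confidence` is a hypothetical helper name.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def probe_confidence(hidden_states, weights, bias, category):
    """Mean probability a linear probe assigns to `category` over a batch (Eq. 3)."""
    total = 0.0
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) + b
                  for row, b in zip(weights, bias)]
        total += softmax(logits)[category]
    return total / len(hidden_states)

# Hypothetical 2-category probe over 2-D hidden states.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
batch = [[2.0, 0.0], [0.0, 2.0]]
conf_0 = probe_confidence(batch, W, b, category=0)
conf_1 = probe_confidence(batch, W, b, category=1)
```

Evaluating this quantity for each layer and each saved checkpoint gives one confidence trajectory per category, which is the kind of curve the probing analysis tracks.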

Output Generation Analysis.

We also assess whether generated tokens respect linguistic constraints. For each generated token $\hat{y}$, we compute category-specific accuracy:

$$\text{Acc}_c^{(t)} = \frac{1}{|\mathcal{D}_c|}\sum_{(x_i, y_i) \in \mathcal{D}_c} \mathbb{1}\left[\hat{y}_i \in f(y_i)\right], \tag{4}$$

where $\mathcal{D}_c$ contains sequences with category $c$, and $f(y_i)$ denotes the set of valid tokens for the expected category at position $i$. This metric reveals whether abstract patterns learned internally are successfully deployed during generation. By jointly analysing internal representation alignment and output conformity alongside geometric metrics, we identify precisely when models transition from memorising token associations to acquiring structured abstractions. Divergence between internal and output measures reveals intermediate states where models have partially acquired abstract representations but cannot yet reliably deploy them in generation. The complete methodology for token categorisation, probe architecture, and training procedures is detailed in Appendix B.6.
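A small sketch of the accuracy in Eq. (4) follows. The category-to-token map, the predictions, and the function name `category_accuracy` are hypothetical; the point is that a prediction counts as correct when it belongs to the valid set for the expected category, even if it differs from the gold token.

```python
def category_accuracy(predictions, expected_categories, valid_tokens):
    """Fraction of predicted tokens that fall in the valid set for their slot (Eq. 4)."""
    hits = sum(
        1 for token, cat in zip(predictions, expected_categories)
        if token in valid_tokens[cat]
    )
    return hits / len(predictions)

# Hypothetical category-to-token map and model outputs.
valid_tokens = {
    "NOUN": {"noun1", "noun2", "noun3"},
    "VERB": {"verb1", "verb2"},
}
preds = ["noun2", "verb1", "verb2", "noun1"]
gold_categories = ["NOUN", "VERB", "NOUN", "NOUN"]

acc = category_accuracy(preds, gold_categories, valid_tokens)  # 3 of 4 slots valid
```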

3.4 Information Compression via Mutual Information

While we explored mutual information (MI) as a potential signal of abstraction, we found that MI estimates were highly volatile and did not consistently align with the phase transitions observed through other metrics. This behaviour likely stems from two issues: (i) MI estimation is inherently noisy in high-dimensional settings, and (ii) abstraction in transformers involves structural reorganisation rather than pure information compression. These patterns were consistent across both types of MI we measured: (i) $I(X; Z_{\ell})$, the information retained about the input $X$ in the hidden states $Z_{\ell} = \phi_{\ell}(X)$; and (ii) $I(Z_{\ell}; Z_{\ell+1})$, the information shared between adjacent layers. Due to this instability, MI lacked the resolution to serve as a reliable diagnostic. We include full experimental results, estimation procedures, and MI trajectories in Appendix C for completeness, but do not consider MI a core component of our results.

3.5 Synthetic Data Generator

To isolate representational dynamics from confounding factors in natural data, we employ ABSynth, our controllable synthetic data generation framework grounded in frame semantics [19, 5]. ABSynth controllably builds synthetic corpora by sampling from structured sentence frame representations with predefined semantic roles (e.g., AGENT, PATIENT, THEME). The framework supports precise manipulation of structural properties, including vocabulary size, token frequency, syntactic/semantic complexity, and contextual entropy.

The generation pipeline consists of three modular components: (1) a lexicon module that assigns words to categories under a Zipfian frequency distribution [55, 40] ($\alpha = 1.05$), augmented with variable-strength collocations; (2) a frame-based sentence constructor that assembles grammatically well-formed sentences across three levels of structural complexity; and (3) an entropy-aware token selector that modulates predictability by adjusting sampling probabilities for the corpus. For this study, we use ABSynth to generate ABSynth25K, a dataset of 25,000 sentences with complete frame-semantic annotations. Each example includes ground-truth semantic roles and syntactic categories (POS tags) derived directly from the underlying frame structure, enabling precise investigation of how linguistic abstractions emerge in neural representations. ABSynth25K follows an 80/10/10 training/validation/test split. Complete generation procedures and frame specifications are detailed in Appendix A.
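To make the first two pipeline components concrete, the sketch below samples annotated sentences from a single frame using Zipf-weighted word choice. It is a simplified illustration under stated assumptions, not the ABSynth implementation: the lexicon, the frame, the role names, and the function names are hypothetical, and collocations, complexity levels, and entropy-aware selection are omitted. Only the exponent $\alpha = 1.05$ comes from the text.

```python
import random

ALPHA = 1.05  # Zipf exponent stated in the text

def zipf_weights(n, alpha=ALPHA):
    """Unnormalised Zipfian weights: rank r gets weight 1 / r**alpha."""
    return [1.0 / (rank ** alpha) for rank in range(1, n + 1)]

# Hypothetical lexicon and frame; the real ones are specified in Appendix A.
LEXICON = {
    "NOUN": [f"noun{i}" for i in range(1, 21)],
    "VERB": [f"verb{i}" for i in range(1, 11)],
}
FRAME = [("AGENT", "NOUN"), ("PREDICATE", "VERB"), ("PATIENT", "NOUN")]

def sample_sentence(rng):
    """Fill each role slot with a Zipf-weighted word; return (token, role, POS) triples."""
    sentence = []
    for role, pos in FRAME:
        words = LEXICON[pos]
        token = rng.choices(words, weights=zipf_weights(len(words)), k=1)[0]
        sentence.append((token, role, pos))
    return sentence

rng = random.Random(0)
corpus = [sample_sentence(rng) for _ in range(1000)]
```

Because every token is emitted together with its role and POS tag, the annotations come for free from the frame rather than from a separate tagging step.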

3.6 Model Architectures and Training Setup

Transformer Architectures.

We train three decoder-only transformer models of increasing capacity: a small model (2 heads, FFN dimension 128, $d_{model} = 64$, 1 layer), a medium model (3 heads, FFN dimension 384, $d_{model} = 96$, 2 layers), and a large model (4 heads, FFN dimension 512, $d_{model} = 128$, 3 layers). Models share the same positional encoding scheme, tokenisation, and vocabulary. A fixed sequence length (16) and batch size (128) are used across runs to standardise training dynamics. We save dense checkpoints throughout training to compute our metrics. Complete training details, downstream task formalisations and model hyperparameters are reported in Appendix B.
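The three configurations can be written down compactly as below. The `ModelConfig` class and its field names are hypothetical conveniences; the numeric values are those listed above. The `head_dim` property assumes that $d_{model}$ is split evenly across attention heads, which the text does not state explicitly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    name: str
    n_layers: int
    n_heads: int
    d_model: int
    d_ffn: int
    seq_len: int = 16      # fixed sequence length from the text
    batch_size: int = 128  # fixed batch size from the text

    @property
    def head_dim(self) -> int:
        """Per-head width, assuming d_model is split evenly across heads."""
        return self.d_model // self.n_heads

CONFIGS = [
    ModelConfig("small", n_layers=1, n_heads=2, d_model=64, d_ffn=128),
    ModelConfig("medium", n_layers=2, n_heads=3, d_model=96, d_ffn=384),
    ModelConfig("large", n_layers=3, n_heads=4, d_model=128, d_ffn=512),
]

head_dims = {cfg.name: cfg.head_dim for cfg in CONFIGS}
```

Under the even-split assumption, all three models have the same per-head width, so the scales differ in depth, number of heads, and FFN width rather than in head size.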

Ablation of Architectural Components.

To isolate the architectural factors driving abstraction, we ablate key transformer components by removing feed-forward (FFN) blocks and reducing the number of attention heads. These interventions are designed to test whether the capacity for abstraction depends on transformation depth or attention expressivity, and to determine which mechanisms are necessary for triggering representational phase transitions.

4 Results & Analysis

4.1 Coordinated Phase Shift in Training Dynamics

Across all model configurations, we observe a robust two-phase training dynamic: an initial regime of rising ID and elevated curvature, followed by a transition into flatter curvature and stabilised representational complexity (Fig. 2). This transition is marked by a consistent intersection between the Hessian curvature score (blue) and ID trajectories (red), which we interpret as a phase shift in learning dynamics.
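One possible way to locate such an intersection programmatically is sketched below. The text does not specify how the crossing point is computed, so this is purely an assumption: both curves are min-max normalised to a common scale, and the first step at which the normalised ID rises above the normalised curvature is reported. The trajectories are made-up numbers and `first_crossing` is a hypothetical helper.

```python
def min_max(series):
    """Rescale a non-constant series to [0, 1] so differently scaled curves are comparable."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]

def first_crossing(steps, curvature, intrinsic_dim):
    """Return the first step at which normalised ID rises above normalised curvature."""
    c, d = min_max(curvature), min_max(intrinsic_dim)
    for k in range(1, len(steps)):
        if d[k - 1] <= c[k - 1] and d[k] > c[k]:
            return steps[k]
    return None  # the curves never cross in this direction

# Made-up trajectories: curvature peaks early then collapses, ID rises then plateaus.
steps = [0, 1000, 2000, 3000, 4000, 5000]
curvature = [9.0, 10.0, 7.0, 3.0, 1.0, 0.5]
intrinsic_dim = [2.0, 3.0, 5.0, 8.0, 9.0, 9.0]

crossing = first_crossing(steps, curvature, intrinsic_dim)
```

The detected step depends on the normalisation, so any such procedure should be read as a heuristic marker of the transition region rather than an exact transition time.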

Figure 2: Coordinated dynamics of Hessian Curvature Score (blue) and Average Intrinsic Dimension (red) across training steps for three model architectures. Each row shows a different architectural variant: standard models (top), models without feed-forward networks (middle), and models with a single attention head (bottom). The intersection points between curvature and ID trajectories mark critical phase transitions in representational learning, with timing and stability varying across architectures but preserving the fundamental pattern.

In larger models (right panel), this intersection occurs early (around step 5000) and sharply, with curvature rapidly collapsing while ID plateaus at higher values, indicating the emergence of high-capacity, structured representations. Medium-scale models (centre panel) follow qualitatively similar transitions but exhibit notable periodic spikes in curvature throughout training. These persistent oscillations suggest recurring reorganisation events where the model temporarily revisits higher-curvature regions of the loss landscape, reflecting optimisation instabilities with limited architectural capacity.

Smaller models (left panel) exhibit delayed transitions (after step 30000), noisier trajectories, and lower equilibrium ID values, demonstrating clear architectural bottlenecks in abstraction capacity. Despite variations in timing, amplitude, and stability, the fundamental pattern of curvature-ID intersection remains consistent, suggesting a universal geometric signature of abstraction emergence that scales with model capacity but preserves its essential character.

4.2 Differential Impact of Architectural Components

Our ablation experiments examine how specific architectural components influence abstraction dynamics, revealing subtle but informative effects. Fig. 2 presents curvature and ID trajectories across three architectural variants: standard models (top row), models without feed-forward networks (FFNs) (middle row), and models with a single attention head (bottom row).

All variants preserve the fundamental pattern of an initial curvature peak followed by a decline, concurrent with rising ID that eventually stabilises. This consistency demonstrates that abstraction emergence is surprisingly robust to architectural modifications and may be an inherent property of transformer-based optimisation.

Removing FFNs (middle row) increases curvature volatility, especially in small and medium models, with persistent oscillations but reduced spike amplitudes. This suggests FFNs contribute to optimisation stability and smooth representational development, functioning as distributed lookup structures [14]. Despite volatility, the phase transition timeline remains preserved, indicating FFNs enhance rather than enable abstraction emergence.

Single attention head models (bottom row) show scale-dependent effects. Medium and large models exhibit more frequent but lower-amplitude spikes, revealing a stability-smoothness trade-off. Small models show greater impact: reduced ID values and delayed phase transitions, indicating attention capacity constraints affect smaller architectures more severely. Nevertheless, the fundamental curvature-ID intersection pattern persists across all configurations.

These results demonstrate the architectural resilience of core learning dynamics in transformers. The preserved geometric signature across configurations establishes that abstraction emergence is an observable and fundamental property of transformers' gradient-based learning on sequential data, rather than a consequence of specific architectural features. This resilience also explains why transformers with varying configurations achieve comparable language task performance.

Figure 3: Probe confidence scores over training steps for the large model. Each subplot corresponds to a different decoder layer, with curves representing average model confidence for the presence of specific linguistic tags.

4.3 Linguistic Alignment with Geometric Transitions

The intersection of curvature and ID trajectories coincides with key transitions in linguistic abstraction emergence. Fig. 3 shows how internal representations evolve across decoder layers, measured by probe confidence scores for linguistic categories (semantics). Layer 1 exhibits an interesting pattern of representational reorganisation, evidenced by a temporary dip and increased volatility in confidence scores, occurring precisely at the curvature-ID intersection shown in Fig. 2. This suggests global geometric shifts correspond to evolving category structure.

Layer 2 shows increased category confidence around the same intersection region, mirroring Layer 1’s earlier drop, indicating a hand-off dynamic between middle and upper layers. This temporal complementarity reflects upward abstraction shifts, whereby higher layers specialise in more abstract linguistic features. Unlike Layer 0’s stable dominance and Layer 1’s volatility, Layer 2 maintains a fragmented, dynamic profile throughout training, supporting the formation of higher-order abstractions or interpretability-relevant circuits.

Medium-sized models (Appendix B.9) show more harmonised behaviour across layers, resembling Layer 1 in the large model and demonstrating tighter coupling between abstraction capacity and model scale. These results support the view that evolving probe confidence reflects internal reorganisation aligned temporally with geometric transitions.

This internal development diverges from output predictions (Fig. 4), where semantic classification accuracy rapidly improves and stabilises early in training. This indicates that while models quickly generate syntactically appropriate tokens, internal representations continue restructuring long after. This dissociation implies a two-phase developmental process: in the first phase, output behaviour reflects coarse category distinctions likely driven by surface-level statistical regularities; in the second, deeper abstraction is gradually encoded into the model’s internal geometry, as indicated by evolving probe confidence. Full results across models (Appendix B.6) show consistent probe dynamics for both syntactic and semantic categories.

4.4 Limitations of Statistical Diagnostics

Despite its theoretical appeal, MI analysis failed to yield reliable insights in our experimental setting. The observed MI dynamics exhibited high variance and showed minimal alignment with phase transitions identified through geometric and linguistic metrics. This instability aligns with concerns raised by Aljaafari et al. [3], who argue that such patterns in transformers may reflect stochastic fluctuations in early representation formation rather than meaningful abstraction signals. Given these limitations, we do not report MI in the main body of this paper, though complete MI trajectories are reported in Appendix C.

Figure 4: SRL performance per label across models and training steps

5 Discussion and Conclusion

Our results suggest that transformers undergo structured representational reorganisation during training. Rather than emerging gradually, abstraction is evident through phase transitions, coordinated shifts across geometric, information-theoretic, and linguistic signals. These transitions mark a distinct boundary between memorisation and generalisation, where linguistic representations begin to stabilise.

We observe consistent alignment between curvature flattening, ID rise and stabilisation, and increased probe accuracy. This coordination suggests certain geometric signals may serve as markers of emerging abstraction. Low-rank curvature and stable ID appear to signal when models begin internalising structure beyond surface patterns. The phenomenon remains robust across model scales and architectural variants, suggesting that abstraction emergence follows predictable patterns.

While our experiments use synthetic corpora, TRACE is compatible with broader domains. Metrics such as curvature and ID are model-agnostic and can be applied to pre-trained transformers or fine-tuning regimes. Probing-based signals can be approximated using weak supervision or automated annotation tools. Extending TRACE to large-scale pretraining could reveal whether similar phase transitions emerge in noisier, real-world settings. Finally, integrating TRACE with mechanistic interpretability tools could help localise where and how abstraction-related circuits emerge.

6 Limitations

Despite its interpretability, our synthetic corpus does not fully capture the ambiguity and richness of natural language. While probe-based diagnostics offer valuable insights, they provide a static view of representation content and may not reflect the dynamic computational mechanisms that transformers deploy at inference time. Finally, while TRACE establishes strong correlations across geometric, informational, and linguistic signals, it does not establish causal relationships or quantify the relative contribution of each factor.

7 Impact Statement

This work reveals that abstract reasoning in language models emerges through predictable phase transitions rather than gradual accumulation. Identifying these critical transitions could enable more efficient training strategies, yielding more interpretable models with reduced computational costs. While this understanding may help detect harmful behaviours, it also presents the usual interpretability trade-off of potentially facilitating model manipulation. Our frame-semantic data generation framework provides a reusable tool for studying abstraction and learning dynamics in language models, with fine-grained control over linguistic properties and transparent evaluation capabilities.

References

  • Aghajanyan et al. [2021] Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319–7328, Online, 2021. Association for Computational Linguistics.
  • Ahn et al. [2024] Kwangjun Ahn, Ali Jadbabaie, and Suvrit Sra. How to escape sharp minima with random perturbations. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024.
  • Aljaafari et al. [2025] Nura Aljaafari, Danilo S Carvalho, and André Freitas. Carma: Enhanced compositionality in llms via advanced regularisation and mutual information alignment. arXiv preprint arXiv:2502.11066, 2025.
  • Ansuini et al. [2019] Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • Baker et al. [1998] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90, Montreal, Quebec, Canada, August 1998. Association for Computational Linguistics. doi: 10.3115/980845.980860. URL https://aclanthology.org/P98-1013/.
  • Belghazi et al. [2018] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 531–540. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/belghazi18a.html.
  • Böttcher and Wheeler [2024] Lucas Böttcher and Gregory Wheeler. Visualizing high-dimensional loss landscapes with hessian directions. Journal of Statistical Mechanics: Theory and Experiment, 2024(2):023401, 2024.
  • Cayton et al. [2005] Lawrence Cayton et al. Algorithms for manifold learning. eScholarship, University of California, 2005.
  • Cheng et al. [2023] Emily Cheng, Corentin Kervadec, and Marco Baroni. Bridging information-theoretic and geometric compression in language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12397–12420, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.762. URL https://aclanthology.org/2023.emnlp-main.762/.
  • Cheng et al. [2025] Emily Cheng, Diego Doimo, Corentin Kervadec, Iuri Macocco, Lei Yu, Alessandro Laio, and Marco Baroni. Emergence of a high-dimensional abstraction phase in language transformers. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=0fD3iIBhlV.
  • Clauw et al. [2024] Kenzo Clauw, Daniele Marinazzo, and Sebastiano Stramaglia. Information-theoretic progress measures reveal grokking is an emergent phase transition. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=Q4NH6hEPIX.
  • Csordás et al. [2021] Róbert Csordás, Kazuki Irie, and Jürgen Schmidhuber. The devil is in the detail: Simple tricks improve systematic generalization of transformers. In Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic, November 2021.
  • Cunningham et al. [2023] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
  • Dai et al. [2021] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696, 2021.
  • Denti et al. [2022] Francesco Denti, Diego Doimo, Alessandro Laio, and Antonietta Mira. The generalized ratios intrinsic dimension estimator. Scientific Reports, 12(1):20005, 2022.
  • Elmoznino et al. [2024] Eric Elmoznino, Thomas Jiralerspong, Yoshua Bengio, and Guillaume Lajoie. A complexity-based theory of compositionality. arXiv preprint arXiv:2410.14817, 2024.
  • Facco et al. [2017] Elena Facco, Maria d’Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific reports, 7(1):12140, 2017.
  • Fan et al. [2024] Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, and Zhongyuan Wang. Not all layers of llms are necessary during inference. CoRR, abs/2403.02181, 2024. URL https://doi.org/10.48550/arXiv.2403.02181.
  • Fillmore [1982] Charles J. Fillmore. Frame semantics. In Linguistics in the Morning Calm, pages 111–137. Hanshin Publishing Co., Seoul, 1982.
  • Grosse [2024] Roger Grosse. Studying large language model generalization with influence functions. In Proceedings of the 38th Conference on Neural Information Processing Systems, NeurIPS ’24, 2024. Workshop on Scalable Continual Learning for Lifelong Foundation Models.
  • Hao et al. [2019] Yaru Hao, Li Dong, Furu Wei, and Ke Xu. Visualizing and understanding the effectiveness of bert. arXiv preprint arXiv:1908.05620, 2019.
  • Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of compute-optimal large language model training. Advances in neural information processing systems, 35:30016–30030, 2022.
  • Hoogland et al. [2024] Jesse Hoogland, George Wang, Matthew Farrugia-Roberts, Liam Carroll, Susan Wei, and Daniel Murfet. The developmental landscape of in-context learning. arXiv preprint arXiv:2402.02364, 2024.
  • Hupkes et al. [2020] Dieuwke Hupkes, Verna Dankers, Mathijs Mul, and Elia Bruni. Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 67:757–795, 2020.
  • Kantamneni et al. [2025] Subhash Kantamneni, Joshua Engels, Senthooran Rajamanoharan, Max Tegmark, and Neel Nanda. Are sparse autoencoders useful? a case study in sparse probing. arXiv preprint arXiv:2502.16681, 2025.
  • Keskar et al. [2016] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  • Khadivi et al. [2018] Pejman Khadivi, Ravi Tandon, and Naren Ramakrishnan. Flow of information in feed-forward denoising neural networks. In 2018 IEEE 17th International Conference on Cognitive Informatics and Cognitive Computing (ICCI*CC), pages 166–173, 2018. doi: 10.1109/ICCI-CC.2018.8482098.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
  • Lake and Baroni [2018] Brenden Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International conference on machine learning, pages 2873–2882. PMLR, 2018.
  • Lanczos [1950] Cornelius Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of research of the National Bureau of Standards, 45(4):255–282, 1950.
  • Lee et al. [2024] Jin Hwa Lee, Thomas Jiralerspong, Lei Yu, Yoshua Bengio, and Emily Cheng. Geometric signatures of compositionality across a language model’s lifetime. arXiv preprint arXiv:2410.01444, 2024.
  • Lepori et al. [2023] Michael Lepori, Thomas Serre, and Ellie Pavlick. Break it down: Evidence for structural compositionality in neural networks. Advances in Neural Information Processing Systems, 36:42623–42660, 2023.
  • Li and McClelland [2023] Yuxuan Li and James McClelland. Systematic generalization and emergent structures in transformers trained on structured tasks, 2023. URL https://openreview.net/forum?id=pXDmbfVL_SB.
  • Little et al. [2017] Anna V Little, Mauro Maggioni, and Lorenzo Rosasco. Multiscale geometric methods for data sets i: Multiscale svd, noise and curvature. Applied and Computational Harmonic Analysis, 43(3):504–567, 2017.
  • Nakkiran et al. [2021] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003, 2021.
  • Nanda et al. [2023] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW.
  • Olsson et al. [2022] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
  • Pan et al. [2024] Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 57018–57049. Curran Associates, Inc., 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/687163285b8affc8ee933bdca8e75747-Paper-Conference.pdf.
  • Park et al. [2024] Kiho Park, Yo Joong Choe, Yibo Jiang, and Victor Veitch. The geometry of categorical and hierarchical concepts in large language models. In ICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum?id=KXuYjuBzKo.
  • Piantadosi [2014] Steven T Piantadosi. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic bulletin & review, 21:1112–1130, 2014.
  • Power et al. [2022] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.
  • Razzhigaev et al. [2024] Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Ivan Oseledets, Denis Dimitrov, and Andrey Kuznetsov. The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models. In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics: EACL 2024, pages 868–874, St. Julian’s, Malta, March 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.findings-eacl.58/.
  • Roy and Vetterli [2007] Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 2007 15th European signal processing conference, pages 606–610. IEEE, 2007.
  • Sankar et al. [2021] Adepu Ravi Sankar, Yash Khasbage, Rahul Vigneswaran, and Vineeth N Balasubramanian. A deeper look at the hessian eigenspectrum of deep neural networks and its applications to regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9481–9488, 2021.
  • Saxton et al. [2019] David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1gR5iR5FX.
  • Singh et al. [2021] Sidak Pal Singh, Gregor Bachmann, and Thomas Hofmann. Analytic insights into structure and rank of neural network hessian maps. Advances in Neural Information Processing Systems, 34:23914–23927, 2021.
  • Skean et al. [2025] Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013, 2025.
  • Smith et al. [2025] Lewis Smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, János Kramár, Tom Lieberum, Rohin Shah, and Neel Nanda. Negative results for saes on downstream tasks and deprioritising sae research (gdm mech interp team progress update #2), March 2025. URL https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks. LessWrong post.
  • Valeriani et al. [2023] Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, and Alberto Cazzaniga. The geometry of hidden representations of large transformer models. Advances in Neural Information Processing Systems, 36:51234–51252, 2023.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • Wang et al. [2024] George Wang, Matthew Farrugia-Roberts, Jesse Hoogland, Liam Carroll, Susan Wei, and Daniel Murfet. Loss landscape geometry reveals stagewise development of transformers. In High-dimensional Learning Dynamics: The Emergence of Structure and Reasoning, 2024. URL https://openreview.net/forum?id=2JabyZjM5H.
  • Zhao et al. [2022] Yang Zhao, Hao Zhang, and Xiuyuan Hu. Penalizing gradient norm for efficiently improving generalization in deep learning. In International conference on machine learning, pages 26982–26992. PMLR, 2022.
  • Zhong and Andreas [2024] Ziqian Zhong and Jacob Andreas. Algorithmic capabilities of random transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=plH8gW7tPQ.
  • Zhou et al. [2024] Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Joshua M. Susskind, Samy Bengio, and Preetum Nakkiran. What algorithms can transformers learn? a study in length generalization. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=AssIuHnmHX.
  • Zipf [1949] George Kingsley Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.

Appendix A Frame-Semantic Data Generation Framework

This appendix describes our controllable synthetic data generation framework, ABSynth. Unlike template-based synthetic datasets [29] or task-specific benchmarks [45], ABSynth is grounded in formal frame semantics [19, 5] and generates English-like corpora by sampling from abstract event frames with predefined semantic roles, enabling mechanistic study of how linguistic abstractions emerge during transformer training.

Figure 5: The frame-semantic data generation pipeline: (1) Frame selection with semantic roles, (2) Lexical realisation with Zipfian scaling, (3) Syntactic construction following grammatical constraints, (4) Entropy calibration for controlled predictability. Each generated sentence preserves ground-truth annotations from the underlying frame structure.

ABSynth operationalises frame semantics by representing events as structured predicate-argument frames. Each frame specifies: (i) frame elements; (ii) core semantic roles (e.g., AGENT, PATIENT, INSTRUMENT); (iii) corresponding syntactic categories (e.g., NOUN, VERB); and (iv) complexity constraints based on sequence length and entropy calibration.

This grounding ensures that generated sentences exhibit genuine compositional structure rather than arbitrary token sequences. The frame-based approach enables direct tracking of how neural models learn to represent abstract semantic categories that underlie surface forms.

As illustrated in Figure 5, ABSynth generates datasets through a multi-stage pipeline that includes: (i) semantic frame selection with role specification, (ii) lexical realisation following Zipfian frequency distributions and semantic clustering, (iii) syntactic construction respecting grammatical constraints and frame-to-syntax mappings, and (iv) entropy calibration to control contextual predictability. The resulting corpora exhibit theoretically grounded compositional structure while maintaining naturalistic statistical properties.

We detail the generation process below using our ABSynth25K instantiation as an example, emphasising that each component is modular and configurable for different research objectives. This flexibility enables systematic exploration of how specific linguistic properties influence abstraction emergence in neural models.

A.1 Lexicon Construction: Scaling and Semantic Clustering

The lexicon is built by assigning tokens to POS and semantic role categories, then applying Zipfian scaling and semantic clustering. Each token is associated with a tuple [POS, SRL, Zipf_rank, ClusterID], ensuring interpretability and alignment across analysis stages.

To mimic natural lexical statistics, token frequencies follow a Zipfian distribution [55, 40]:

\[ P(tok_i) = \frac{1/i^{\alpha} + \varepsilon_i}{\sum_{j=1}^{V} \left(1/j^{\alpha} + \varepsilon_j\right)}, \tag{5} \]

where $\alpha = 1.05$, $\varepsilon_i \sim \mathcal{N}(0, 0.05)$, and $V = 9000$ is the vocabulary size. This maintains a realistic long-tail frequency spectrum.
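
A minimal sketch of Eq. (5) is given below. Two points are assumptions on our part rather than details stated above: 0.05 is treated as the standard deviation of the noise term, and noisy weights that fall below zero are clipped to a small positive floor so that the result remains a valid distribution (additive noise can exceed $1/i^{\alpha}$ at high ranks). The function name and the `floor` and `seed` parameters are hypothetical.

```python
import random
from typing import List


def zipf_probabilities(vocab_size: int = 9000, alpha: float = 1.05,
                       noise_std: float = 0.05, floor: float = 1e-12,
                       seed: int = 0) -> List[float]:
    """Noisy Zipfian token probabilities following Eq. (5)."""
    rng = random.Random(seed)
    weights = []
    for rank in range(1, vocab_size + 1):
        noisy = 1.0 / rank ** alpha + rng.gauss(0.0, noise_std)
        # Assumption: clip non-positive weights so probabilities stay valid.
        weights.append(max(noisy, floor))
    total = sum(weights)
    return [w / total for w in weights]


probs = zipf_probabilities()                 # with noise
noise_free = zipf_probabilities(noise_std=0.0)  # pure 1 / i^alpha, normalised
```

Setting `noise_std=0.0` recovers the plain Zipfian curve, which is a convenient reference when checking how strongly the noise term reshapes the tail.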

Semantic clustering introduces collocational structure by forming token groups with variable intra- and inter-cluster association strengths. For tokens $tok_i$ and $tok_j$ belonging to clusters $c_i$ and $c_j$:

\[ S(tok_i, tok_j) = S_{base}(c_i, c_j) + \mathcal{U}\left(0, S_{range}(c_i, c_j)\right), \tag{6} \]

where $S_{base}$ and $S_{range}$ are tuned to create structured but noisy associations, emulating semantic co-occurrence patterns. Intra-cluster associations are drawn from $[0.4, 0.7]$, while cross-cluster links are weaker ($[0.05, 0.2]$).
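
The association rule in Eq. (6) can be sketched as follows. The base and range values are derived from the two intervals quoted above (0.4 + 0.3 for intra-cluster pairs, 0.05 + 0.15 for cross-cluster pairs); using a single pair of constants for all clusters, and the function name itself, are simplifying assumptions.

```python
import random

# (S_base, S_range) pairs implied by the intervals [0.4, 0.7] and [0.05, 0.2].
INTRA = (0.4, 0.3)
CROSS = (0.05, 0.15)


def association(cluster_i: int, cluster_j: int, rng: random.Random) -> float:
    """S(tok_i, tok_j) = S_base(c_i, c_j) + U(0, S_range(c_i, c_j))."""
    base, spread = INTRA if cluster_i == cluster_j else CROSS
    return base + rng.uniform(0.0, spread)


rng = random.Random(0)
same_cluster = [association(3, 3, rng) for _ in range(1000)]
cross_cluster = [association(3, 7, rng) for _ in range(1000)]
```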

Each word receives a naturalised name (e.g., result1, noun5), allowing transparent reverse-mapping for analysis.

A.2 Frame-Based Syntactic Realisation and Entropy Control

Frame-to-syntax mappings define how semantic frames are realised as surface forms while controlling contextual predictability through entropy calibration. All realisations follow valid English grammatical constructions, respecting part-of-speech ordering, agreement patterns, and canonical phrase structure. This grounding enables the study of syntactic and semantic abstraction within structurally coherent input sequences.

Each frame component is annotated with expected entropy based on its role in the frame:

  • Low entropy (0.5–1.5 bits): Grammatically determined positions (e.g., determiners required by nouns)

  • Medium entropy (1.5–3.0 bits): Semantically constrained positions with multiple valid fillers (e.g., theme roles that accept various object types)

  • High entropy (3.0–4.5 bits): Optional frame elements with high variability (e.g., adverbial modifiers)
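
To make the three tiers concrete, the sketch below computes the Shannon entropy (in bits) of a slot's filler distribution and maps it onto the ranges listed above. Treating the tier boundaries as half-open intervals, and labelling values outside 0.5–4.5 bits as "outside", are assumptions; the function names are hypothetical.

```python
import math
from typing import Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_tier(bits: float) -> str:
    """Map an entropy value onto the low / medium / high tiers."""
    if bits < 0.5 or bits > 4.5:
        return "outside"  # assumption: not covered by any listed tier
    if bits < 1.5:
        return "low"
    if bits < 3.0:
        return "medium"
    return "high"


# Uniform choice among 2, 4, and 16 fillers gives 1, 2, and 4 bits.
two_fillers = entropy_bits([0.5, 0.5])
four_fillers = entropy_bits([0.25] * 4)
sixteen_fillers = entropy_bits([1 / 16] * 16)
```

Under these thresholds, a slot with two equally likely fillers is low entropy, one with four is medium, and one with sixteen is high.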

Frames are instantiated according to a target complexity distribution (55% simple, 35% medium, 10% complex), which guides the global entropy profile of the corpus. Example frame realisations include:

Simple TRANSFER:    [AGENT=NOUN] [ACTION=VERB] [THEME=NOUN]
                   "noun2 verb3 noun5"

Medium CREATION:    [AGENT=NOUN] [ACTION=VERB] [THEME=NOUN]
                   [PURPOSE=PREP+NOUN]
                   "noun1 verb2 noun4 prep3 noun7"

Complex MOTION:     [AGENT=NOUN] [REL] [ACTION=VERB] [SOURCE=NOUN]
                   [ACTION=VERB] [GOAL=NOUN]
                   "noun3 rel1 verb5 noun6 verb7 noun9"

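The instantiation step above can be sketched as follows. The slot sequences are copied from the three example realisations, and the 55/35/10 weights are the target complexity distribution; drawing token indices uniformly from a tiny lexicon (rather than from the Zipfian, clustered lexicon of A.1) is a deliberate simplification, and `LEXICON_SIZE` and `generate_sentence` are hypothetical names.

```python
import random
from typing import Dict, List, Tuple

# Slot sequences taken from the simple, medium, and complex examples.
TEMPLATES: Dict[str, List[str]] = {
    "simple": ["noun", "verb", "noun"],
    "medium": ["noun", "verb", "noun", "prep", "noun"],
    "complex": ["noun", "rel", "verb", "noun", "verb", "noun"],
}
COMPLEXITY_WEIGHTS = {"simple": 0.55, "medium": 0.35, "complex": 0.10}
LEXICON_SIZE = 9  # hypothetical: number of tokens per category in this toy


def generate_sentence(rng: random.Random) -> Tuple[str, str]:
    """Pick a complexity level, then fill each slot with a naturalised token."""
    levels = list(COMPLEXITY_WEIGHTS)
    weights = [COMPLEXITY_WEIGHTS[lvl] for lvl in levels]
    level = rng.choices(levels, weights=weights)[0]
    tokens = [f"{pos}{rng.randint(1, LEXICON_SIZE)}" for pos in TEMPLATES[level]]
    return level, " ".join(tokens)


rng = random.Random(0)
corpus = [generate_sentence(rng) for _ in range(10000)]
share = {lvl: sum(1 for l, _ in corpus if l == lvl) / len(corpus)
         for lvl in TEMPLATES}
```

Because each token name encodes its category, the slot labels can be recovered from the surface string, which mirrors the reverse-mapping property described in A.1.
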
A.3 Dynamic Entropy Adjustment Algorithm

To enforce statistical balance across complexity levels, the sentence generator incorporates an entropy-aware sampling mechanism. During generation, the system maintains a global entropy profile, defined as the frequency distribution of sentence positions assigned to low, medium, and high entropy tiers. This profile is updated in real time and compared to the desired target distribution.

If the observed distribution diverges from the target (e.g., too many low-entropy tokens have been sampled), the system increases the sampling weight for frames or token categories that contribute to underrepresented tiers. This feedback mechanism modulates the difficulty of the dataset without sacrificing grammaticality.

Algorithm 1 Entropy-Calibrated Sentence Generation
1:  for each sentence to be generated do
2:     Select template $T$ with entropy tier annotations
3:     for each token slot $i$ in $T$ do
4:        Compute current global entropy profile $E_{current}$
5:        Compare to target profile $E_{target}$
6:        Derive correction factor $\alpha$ to adjust token sampling
7:        Sample token $tok_i$ from adjusted distribution
8:     end for
9:     Add sentence to corpus
10:     Update $E_{current}$ with entropy tags from new sentence
11:  end for
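
The feedback loop of Algorithm 1 can be illustrated with the simplified sketch below. It applies the correction at the level of template selection rather than per token slot, uses an exponential reweighting of each tier's deficit, and relies on a made-up target profile and made-up templates; all of these are assumptions chosen to keep the example short, not the ABSynth implementation.

```python
import math
import random
from collections import Counter
from typing import Dict, List

TIERS = ("low", "medium", "high")
TARGET = {"low": 0.5, "medium": 0.3, "high": 0.2}  # hypothetical target profile

# Hypothetical templates, each given as the entropy tier of every slot.
TEMPLATES: List[List[str]] = [
    ["low", "low", "medium"],
    ["low", "medium", "medium", "high"],
    ["medium", "high", "high", "low", "high"],
]


def profile(counts: Counter) -> Dict[str, float]:
    """Current global entropy profile: share of slots per tier."""
    total = sum(counts.values())
    if total == 0:
        return dict(TARGET)
    return {t: counts[t] / total for t in TIERS}


def template_weight(template: List[str], current: Dict[str, float],
                    strength: float) -> float:
    """Average correction factor; above 1 for under-represented tiers."""
    factors = [math.exp(strength * (TARGET[t] - current[t])) for t in template]
    return sum(factors) / len(factors)


def generate(n_sentences: int, strength: float, seed: int = 0) -> Dict[str, float]:
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(n_sentences):
        current = profile(counts)
        weights = [template_weight(t, current, strength) for t in TEMPLATES]
        chosen = rng.choices(TEMPLATES, weights=weights)[0]
        counts.update(chosen)  # update the profile with the new sentence's tags
    return profile(counts)


def l1_gap(p: Dict[str, float]) -> float:
    return sum(abs(p[t] - TARGET[t]) for t in TIERS)


calibrated = generate(5000, strength=20.0)
uncalibrated = generate(5000, strength=0.0)  # strength 0 gives uniform selection
```

With this soft reweighting the profile is pulled towards the target without matching it exactly; a larger `strength` tightens the fit at the cost of less varied template choices.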

The final vocabulary consists of 9,000 tokens distributed across categories as shown in Table 1.

A.4 Output Format and Probing Supervision

Each sentence is stored with a naturalised token sequence and associated structured annotations, including:

  • POS labels (e.g., NOUN, ADJ, VERB)

  • Semantic roles (e.g., AGENT, PATIENT, RESULT)

  • Entropy tier

  • Other contextual complexity metadata

These annotations enable direct supervision for probing tasks. During model training, hidden states are extracted layer-wise and evaluated using linear probes trained on these annotations. This setup facilitates fine-grained analysis of how and when compositional representations emerge, and how they correlate with curvature, intrinsic dimensionality, and mutual information.
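
As an illustration of the probing setup, the sketch below trains a softmax linear probe with plain gradient descent and reports the mean confidence assigned to the correct label. The two-dimensional Gaussian clusters stand in for extracted hidden states and are entirely synthetic; the optimiser, hyperparameters, and function names are assumptions and do not describe the probes used in our experiments.

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(z: Sequence[float]) -> List[float]:
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W, b, x) -> List[float]:
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def train_linear_probe(states, labels, n_classes: int,
                       lr: float = 0.5, epochs: int = 200) -> Tuple[list, list]:
    """Multinomial logistic regression trained by full-batch gradient descent."""
    dim, n = len(states[0]), len(states)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, y in zip(states, labels):
            p = softmax(logits(W, b, x))
            for c in range(n_classes):
                err = p[c] - (1.0 if c == y else 0.0)
                gb[c] += err
                for d in range(dim):
                    gW[c][d] += err * x[d]
        for c in range(n_classes):
            b[c] -= lr * gb[c] / n
            for d in range(dim):
                W[c][d] -= lr * gW[c][d] / n
    return W, b


def probe_confidence(W, b, x, label: int) -> float:
    """Probability the probe assigns to the annotated label."""
    return softmax(logits(W, b, x))[label]


# Synthetic stand-in for hidden states of two linguistic tags.
rng = random.Random(0)
states = ([[rng.gauss(-2.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(50)]
          + [[rng.gauss(2.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(50)])
labels = [0] * 50 + [1] * 50
W, b = train_linear_probe(states, labels, n_classes=2)
mean_conf = sum(probe_confidence(W, b, x, y)
                for x, y in zip(states, labels)) / len(states)
```

Repeating this per layer and per checkpoint would yield confidence curves of the kind plotted in Figure 3.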

Token Category         Vocabulary Size
Noun                   2,780
Transitive Verb        694
Intransitive Verb      694
Communication Verb     347
Motion Verb            347
Adjective              1,388
Adverb                 555
Location               694
Temporal               694
Preposition            416
Determiner             111
Conjunction            277
Result                 277
Total                  9,000
Table 1: Vocabulary distribution across syntactic categories

Appendix B Technical Implementation Details

B.1 Main Model Architecture

We implemented decoder-only transformer architectures based on the original design of Vaswani et al. [50]. Each model consists of $L$ layers, each with hidden size $d_{\text{model}}$, $H$ attention heads (where $d_{\text{head}} = d_{\text{model}}/H$), and a feedforward dimension $d_{\text{ffn}}$. We focus on decoder-only models given their increasing prevalence in production LLMs and recent arguments that causal architectures provide a cleaner demonstration for emergence [47].

Our architectural choices are guided by recent Transformer scaling laws, notably those articulated by Hoffmann et al. [22]. Namely, we evaluate three configurations that adhere to the Chinchilla scaling law (see below), with approximate parameter counts and architectural specifications as presented in Table 2. All models are trained with a maximum sequence length of $T = 16$ tokens and a dropout rate of $0.1$.

We use a simple whitespace-based tokeniser, where each token corresponds to a space-separated word or symbol. This choice allows us to maintain interpretability and simplify downstream representational analyses, while aligning with our controlled, low-scale experimental setup.

Table 2: Transformer model configurations used in this study
Model Layers ($L$) Hidden size ($d_{\text{model}}$) Heads ($H$) FFN size ($d_{\text{ffn}}$)
Small 1 64 2 128
Medium 2 96 3 384
Large 3 128 4 512

Scaling Laws and Training Budget.

We follow the Chinchilla scaling principles from Hoffmann et al. [22], which demonstrate that model size and training tokens should scale together. Specifically, the Chinchilla findings show that optimal training requires approximately 20 tokens per parameter. Given our dataset size of approximately 360K tokens per epoch, we designed our training regimen to respect these scaling principles. Based on our model sizes (110K, 339K, and 749K parameters), we estimated minimum training requirements of 6, 19, and 42 epochs, respectively, to ensure a sufficient token-to-parameter ratio, as implied by Chinchilla's 20:1 guideline. However, in practice, we observed that phase transitions occurred at different points across scales, not precisely aligned with these theoretical estimates. As such, we extended training durations beyond these minimums to allow sufficient time for representational transitions to emerge, as discussed in Section B.3, and to test whether the grokking phenomenon [41] occurs.
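
The minimum epoch counts quoted above follow from simple arithmetic, sketched below in Python. The function name `min_epochs` is hypothetical, and rounding to the nearest whole epoch is an assumption chosen because it reproduces the reported values; the parameter counts and tokens per epoch are the approximate figures given in the text.

```python
TOKENS_PER_PARAM = 20        # Chinchilla guideline cited in the text
TOKENS_PER_EPOCH = 360_000   # approximate dataset size per epoch

def min_epochs(n_params):
    """Epochs needed to see roughly 20 training tokens per parameter."""
    required_tokens = n_params * TOKENS_PER_PARAM
    # Rounding to the nearest epoch is an assumption (see lead-in).
    return round(required_tokens / TOKENS_PER_EPOCH)

param_counts = {"small": 110_000, "medium": 339_000, "large": 749_000}
minimums = {name: min_epochs(n) for name, n in param_counts.items()}
```

For example, the small model needs 110,000 × 20 = 2.2M tokens, which is about 6.1 passes over a 360K-token epoch.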

B.2 Main Models Training Objectives and Formalisation of Next Token Prediction

Next-Token Prediction Task (NTP):

We formalise the NTP task, which we use to train our decoder-only language models. NTP involves predicting the next token $x_{n+1}$ given a preceding sequence $X = \{x_1, x_2, \dots, x_n\}$. In our setting, each $x_i \in \mathcal{V}$ belongs to a controlled lexicon from ABSynth, reflecting structured linguistic categories (e.g., noun3, verb2, etc.) and conforming to correct English grammar.

Formally, the objective is:

$$x_{n+1} = \arg\max_{x \in \mathcal{V}} P(x \mid x_1, x_2, \dots, x_n) \tag{7}$$

Here, $\mathcal{V}$ denotes the model's token vocabulary, which includes synthetic labels representing part-of-speech categories and their variants (e.g., noun1, adj2). The model autoregressively generates one token at a time, conditioned solely on the preceding tokens.

Example Prompt: Given the sequence: "noun1 verb2 adj1", the task requires predicting the next token, such as: "noun3", depending on the dataset’s underlying syntactic or semantic generation rules.
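
The prediction rule in Eq. (7), applied to the example prompt, can be sketched as a lookup followed by an argmax. This is a toy illustration only: the conditional probabilities are invented, and `predict_next` is a hypothetical helper, whereas in the actual setup the distribution comes from the trained model's softmax output.

```python
def predict_next(prefix, cond_probs):
    """Return argmax_x P(x | prefix) from a toy conditional table."""
    dist = cond_probs[tuple(prefix)]
    return max(dist, key=dist.get)

# Hypothetical probabilities, for illustration only.
cond_probs = {
    ("noun1", "verb2", "adj1"): {"noun3": 0.6, "noun1": 0.3, "adv2": 0.1},
}

next_token = predict_next(["noun1", "verb2", "adj1"], cond_probs)
```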

We focus on NTP because it reflects the core objective used in many widely adopted pretrained language models (e.g., GPT-style models), making it a natural and effective setting for examining representational and interpretive behaviours in a controlled environment.

B.3 Main Models Training Configuration

All models are trained using next-token prediction with the Adam optimiser [28], a learning rate of $1 \times 10^{-3}$, and 1000 warm-up steps. Training ran for substantially more epochs than the minimum requirements computed from the scaling laws discussed in Section B.1, consistent with the observations of Nanda et al. [36] on the "grokking" phenomenon, where generalisation emerges abruptly after an initial memorisation phase. We consider this an extension of the training strategies outlined in Csordás et al. [12], which emphasise the importance of avoiding early stopping to fully exploit the learning capacity of neural models.

Specifically, we trained:

  • Small model: 500 epochs ($\sim$82× the minimum of 6 epochs)

  • Medium model: 500 epochs ($\sim$26× the minimum of 19 epochs)

  • Large model: 500 epochs ($\sim$12× the minimum of 42 epochs)
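
The learning-rate settings stated at the start of this subsection (base rate $1 \times 10^{-3}$ with 1000 warm-up steps) can be sketched as a schedule function. The shape of the warm-up is not specified in the text; the linear ramp followed by a constant rate below is an assumption, and `warmup_lr` is a hypothetical name.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Learning rate at a given optimiser step (0-indexed).

    Assumed shape: linear ramp over the warm-up steps, then constant.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

schedule = [warmup_lr(s) for s in (0, 499, 999, 5000)]
```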

We record dense checkpoints throughout training (every 500 training steps), extracting hidden states and gradients from all layers to compute our diagnostic metrics.

We applied minimal regularisation and continuously monitored loss, accuracy, gradient dynamics, representational similarity, and mutual information (MI) throughout training to detect potential phase transitions. This setup enables us to assess the relationship between model structure and task complexity under controlled and interpretable conditions, while ensuring that all models are trained in accordance with modern scaling principles. To ensure correctness and reproducibility, all experiments were repeated at least 5 times with different random seeds, and we report the averaged results.

B.4 Model performance

All models demonstrate strong performance on the downstream generation task after their respective phase transitions. Table 3 summarises the evaluation metrics across model scales.

Table 3: Performance metrics for small, medium, and large models after phase transition.
Model Exact Match Token Accuracy BLEU Score Perplexity
Small 0.84 0.98 0.20 1.22
Medium 0.98 0.99 0.21 1.08
Large 0.98 0.99 0.21 1.07

B.6 Probing Framework and Label Construction

To investigate the interpretability and internal structure of our models, we implemented a probing framework that trains lightweight classifiers on the frozen hidden representations extracted from each layer of our trained models. Specifically, for every layer in the tested models, we trained and evaluated both a part-of-speech (POS) probe and a semantic role labelling (SRL) probe.

Each probe is implemented as a feedforward neural network comprising three linear layers interleaved with ReLU activations and dropout. The network is trained using binary cross-entropy loss, with sigmoid activations at the output layer to support multi-label classification. Given an input representation $\mathbf{h} \in \mathbb{R}^{d}$, the probe computes:

$$\hat{\mathbf{y}} = \sigma\left(W_3 \cdot \text{ReLU}\left(W_2 \cdot \text{ReLU}\left(W_1 \cdot \mathbf{h}\right)\right)\right),$$

where $W_1, W_2, W_3$ are trainable weight matrices and $\sigma$ denotes the element-wise sigmoid function.
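
The probe's forward pass and loss can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: it mirrors the formula as written (no bias terms), omits dropout and the training loop, and uses toy dimensions and random weights. All function names are hypothetical, and the actual probes were presumably implemented in PyTorch.

```python
import math
import random

def linear(weights, x):
    """Matrix-vector product W x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def probe_forward(h, W1, W2, W3):
    """y_hat = sigmoid(W3 ReLU(W2 ReLU(W1 h))), as in the formula above."""
    return sigmoid(linear(W3, relu(linear(W2, relu(linear(W1, h))))))

def bce_loss(y_hat, y, eps=1e-12):
    """Mean binary cross-entropy over labels (multi-label setting)."""
    terms = [t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
             for p, t in zip(y_hat, y)]
    return -sum(terms) / len(terms)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, hidden, n_labels = 8, 16, 3   # toy sizes, chosen only for illustration
W1 = random_matrix(hidden, d, rng)
W2 = random_matrix(hidden, hidden, rng)
W3 = random_matrix(n_labels, hidden, rng)

h = [rng.gauss(0.0, 1.0) for _ in range(d)]   # stand-in for a frozen hidden state
y_hat = probe_forward(h, W1, W2, W3)          # one independent probability per label
loss = bce_loss(y_hat, [1.0, 0.0, 0.0])
```

Because each output passes through its own sigmoid, a token can receive several labels at once, which is what makes the binary cross-entropy objective appropriate here.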

Table 4: POS Probe Evaluation – Large Model, Layer 0
Label Count Accuracy Precision Recall F1 Score
NOUN 279984 1.000 1.000 1.000 1.000
TRANSITIVE_VERB 151232 0.836 0.577 0.836 0.683
INTRANSITIVE_VERB 59584 0.073 1.000 0.073 0.136
COMMUNICATION_VERB 54352 0.238 1.000 0.238 0.384
MOTION_VERB 68368 0.344 0.659 0.344 0.452
CHANGE_VERB 14240 0.562 0.510 0.562 0.535
ADJ 59072 0.239 0.535 0.239 0.331
LOCATION 101984 0.315 0.858 0.315 0.461
TEMP 37664 0.184 0.535 0.184 0.274
PREP 115264 0.387 0.823 0.387 0.526
RESULT 68592 0.292 0.746 0.292 0.420
CONJ 51248 0.350 0.874 0.350 0.500

Probes are trained independently for each layer in the model, allowing us to analyse the emergence and distribution of linguistic and functional features across depth. Representations from each layer are frozen and taken from an already trained model following our training setting explained in Section B, and the underlying model weights are not updated during probing.

Due to the synthetic nature of our dataset, both POS and semantic labels are deterministically derived from token names. For example, a token such as noun3 is assigned the POS label NOUN and may additionally be annotated with semantic roles such as AGENT or ENTITY, depending on the symbolic structure of the task. This approach eliminates annotation ambiguity and ensures consistent supervision across examples.
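
A deterministic mapping of this kind can be sketched as follows. The lookup tables and the helper `derive_labels` are hypothetical and cover only a few categories; the assumption is that token names consist of a lowercase category prefix followed by a numeric index, as in the examples given in the text.

```python
import re

# Hypothetical lookup tables; the actual ABSynth mappings may differ.
POS_BY_PREFIX = {"noun": "NOUN", "adj": "ADJ", "prep": "PREP", "conj": "CONJ"}
ROLES_BY_POS = {"NOUN": ["AGENT", "ENTITY"]}

def derive_labels(token):
    """Strip the numeric suffix, then look up the POS label and candidate roles."""
    match = re.fullmatch(r"([a-z_]+)(\d+)", token)
    if match is None:
        raise ValueError(f"unexpected token name: {token!r}")
    pos = POS_BY_PREFIX[match.group(1)]
    return pos, ROLES_BY_POS.get(pos, [])

pos, roles = derive_labels("noun3")
```

Which of the candidate roles applies to a given occurrence would depend on the sentence structure, which this sketch does not model.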

Table 5: POS Probe Evaluation – Large Model, Layer 1
Label Count Accuracy Precision Recall F1 Score
NOUN 279984 1.000 1.000 1.000 1.000
TRANSITIVE_VERB 151232 0.779 0.655 0.779 0.711
INTRANSITIVE_VERB 59584 0.291 1.000 0.291 0.451
COMMUNICATION_VERB 54352 0.476 1.000 0.476 0.645
MOTION_VERB 68368 0.718 0.678 0.718 0.697
CHANGE_VERB 14240 0.875 0.510 0.875 0.644
ADJ 59072 0.603 0.558 0.603 0.580
LOCATION 101984 0.641 0.843 0.641 0.729
TEMP 37664 0.344 0.535 0.344 0.419
PREP 115264 0.856 0.808 0.856 0.831
RESULT 68592 0.585 0.746 0.585 0.655
CONJ 51248 0.545 1.000 0.545 0.705

Given the supervised design of our dataset, we employ linear probes due to their demonstrated effectiveness in similar contexts. While alternative methods such as Sparse Autoencoders (SAEs) have shown promise in unsupervised settings (Cunningham et al. [13]), linear probes remain a robust and interpretable choice for supervised feature probing (Kantamneni et al. [25], Smith et al. [48]).

B.7 Probe Models Training Configuration

All probes were trained for 30 epochs using the Adam optimiser [28] with a learning rate of $1 \times 10^{-4}$, a hidden dimension of 256, and a dropout rate of 0.5. Input dimensionality matched the model's hidden state size.

B.8 Probing performance

We report the average POS probe models' performance in Tables 4, 5, and 6, for the small, medium, and large models, respectively. We also report the average semantic probe models' performance in Tables 7, 8, and 9, for the small, medium, and large models, respectively.

Figure 6: Small model — Semantic (left) and POS (right) probe confidence scores at Layer 0.
Figure 7: Medium model — Semantic (left) and POS (right) probe confidence scores across layers.

B.9 Probing Extended Results

We present extended probing results across all model scales (Small, Medium, and Large) for both part-of-speech (POS) and semantic role categories. Figures 6, 7, and 8 show model confidence scores for each layer across training steps.

Figure 8: Large model — POS probe confidence scores across layers.

To assess whether abstraction emerges during training rather than being encoded by architecture alone, we also trained probes on frozen hidden states from randomly initialised models. These models exhibited near-zero performance across all linguistic categories, confirming the absence of structured representations at initialisation.

Figures 9, 10, and 11 show the performance of semantic and POS probes on randomly initialised models.

Figure 9: Randomly initialised Small model — Semantic (left) and POS (right) probe confidence scores.
Figure 10: Randomly initialised Medium model — Semantic (left) and POS (right) probe confidence scores.
Figure 11: Randomly initialised Large model — Semantic (left) and POS (right) probe confidence scores.
Table 6: POS Probe Evaluation – Large Model, Layer 2
Label Count Accuracy Precision Recall F1 Score
NOUN 279984 1.000 1.000 1.000 1.000
TRANSITIVE_VERB 151232 0.772 0.657 0.772 0.710
INTRANSITIVE_VERB 59584 0.309 0.895 0.309 0.459
COMMUNICATION_VERB 54352 0.476 1.000 0.476 0.645
MOTION_VERB 68368 0.757 0.667 0.757 0.709
CHANGE_VERB 14240 1.000 0.510 1.000 0.676
ADJ 59072 0.603 0.558 0.603 0.580
LOCATION 101984 0.654 0.829 0.654 0.731
TEMP 37664 0.367 0.535 0.367 0.436
PREP 115264 0.856 0.808 0.856 0.831
RESULT 68592 0.572 0.754 0.572 0.650
CONJ 51248 0.545 1.000 0.545 0.705
Table 7: Semantic Probe Evaluation – Large Model, Layer 0
Label Count Accuracy Precision Recall F1 Score
AGENT 268576 1.000 0.959 1.000 0.979
PATIENT 151232 0.852 0.576 0.852 0.687
ACTION 279984 1.000 1.000 1.000 1.000
LOCATION 101984 0.315 0.858 0.315 0.461
RELATION 115264 0.387 0.823 0.387 0.526
CONNECTOR 51248 0.321 0.950 0.321 0.480
RESULT 68592 0.305 0.731 0.305 0.431
OTHER 153248 0.874 0.642 0.874 0.740
Table 8: Semantic Probe Evaluation – Large Model, Layer 1
Label Count Accuracy Precision Recall F1 Score
AGENT 268576 1.000 0.959 1.000 0.979
PATIENT 151232 0.785 0.652 0.785 0.713
ACTION 279984 1.000 1.000 1.000 1.000
LOCATION 101984 0.641 0.843 0.641 0.729
RELATION 115264 0.856 0.808 0.856 0.831
CONNECTOR 51248 0.574 0.944 0.574 0.714
RESULT 68592 0.546 0.771 0.546 0.639
OTHER 153248 0.893 0.818 0.893 0.854
Table 9: Semantic Probe Evaluation – Large Model, Layer 2
Label Count Accuracy Precision Recall F1 Score
AGENT 268576 1.000 0.959 1.000 0.979
PATIENT 151232 0.762 0.659 0.762 0.707
ACTION 279984 1.000 1.000 1.000 1.000
LOCATION 101984 0.641 0.843 0.641 0.729
RELATION 115264 0.856 0.808 0.856 0.831
CONNECTOR 51248 0.545 1.000 0.545 0.705
RESULT 68592 0.390 0.969 0.390 0.556
OTHER 153248 0.893 0.818 0.893 0.854

B.10 Computational Resources and Software Environment

All experiments were run on an NVIDIA RTX A6000. We used Python (v3.9.21), PyTorch (v2.6.0), scikit-learn (v1.6.1), scipy (v1.12.0), numpy (v1.26.4), seaborn (v0.13.2), and matplotlib (v3.9.4). Training the small models required around 21 minutes without tracking and 116 minutes with tracking of all metrics. The medium models required around 28 minutes without tracking and 130 minutes with full tracking. The large models required around 34 minutes without tracking and 139 minutes with full tracking. Training the probes required approximately 9, 19, and 28 minutes for the small, medium, and large models, respectively.

Appendix C Mutual Information Estimation with MINE

To quantify the flow and compression of information within transformer models, we estimate mutual information (MI) between input embeddings and internal layer representations. While several methods exist for MI estimation (e.g., k-nearest neighbours, contrastive approaches), we adopt Mutual Information Neural Estimation (MINE) [6] due to its scalability and effectiveness in high-dimensional settings.

In theory, a systematic decline in $I(X; Z_{\ell})$ across layers and training steps should signal abstraction: the model progressively discards surface-level details while retaining task-relevant structure. This compression-based view of abstraction has been explored in other architectures [27, 16], and we examined whether similar dynamics emerge in transformer models during training.

Specifically, we tracked two MI quantities: (i) $I(X; Z_{\ell})$, the mutual information between the input embeddings and the hidden states at layer $\ell$, and (ii) $I(Z_{\ell}; Z_{\ell+1})$, the mutual information between consecutive hidden layers.

Despite the theoretical appeal, our empirical findings (see Figure 12) showed that MI was highly variable across training steps and did not consistently align with the phase transitions identified via curvature or intrinsic dimensionality. These results suggest that MI, while informative in principle, may lack the temporal resolution and stability needed to serve as a primary diagnostic in TRACE.

We report full implementation details, training settings, and estimator architecture below.

C.1 MINE Objective and Architecture

MINE approximates the Donsker-Varadhan lower bound on mutual information using a neural critic function $T_{\theta}: \mathcal{X} \times \mathcal{Z} \rightarrow \mathbb{R}$, parameterised by a multilayer perceptron (MLP). Given joint samples $(x, z) \sim P_{XZ}$ and marginal samples formed by pairing $x \sim P_X$ with independently sampled $z' \sim P_Z$, the MI is estimated via:

$$\hat{I}_{\theta}(X; Z) = \mathbb{E}_{P_{XZ}}[T_{\theta}(x, z)] - \log \mathbb{E}_{P_X \otimes P_Z}[e^{T_{\theta}(x, z')}] \tag{8}$$

Our implementation uses a 3-layer MLP with hidden dimensions $[128, 128, 1]$ and ReLU activations. MINE is used to estimate $I(X; Z_{\ell})$, the MI between the input embeddings and layer $\ell$, and also to compute $I(Z_{\ell}; Z_{\ell+1})$, capturing how information is transmitted between adjacent layers. This allows us to analyse information bottlenecks, compression phases, and abstraction dynamics across the depth of the model.

C.2 Training and Evaluation Protocol

For each chosen training step $t$ of the model and its layers, we train a separate MINE estimator to convergence. The training protocol is as follows:

  • Batch size: 128 examples

  • Optimiser: Adam optimiser [28] with learning rate 0.001

  • Training steps: 200 iterations

  • Positive samples: Joint pairs $(x, z)$ from the same forward pass

  • Negative samples: $(x, z')$ where $z'$ is obtained by shuffling across the batch
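
The sampling scheme above, combined with the Donsker-Varadhan estimate in Eq. (8), can be sketched for a single batch as follows. This is an illustrative simplification: scalar toy data replaces embeddings and hidden states, and a fixed hand-written critic stands in for the trained MLP $T_{\theta}$, so the optimisation step (Adam, 200 iterations) is not shown. The names `dv_lower_bound`, `mine_batch_estimate`, and `toy_critic` are hypothetical.

```python
import math
import random

def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan estimate: E_joint[T] - log E_marginal[exp(T)]."""
    mean_joint = sum(joint_scores) / len(joint_scores)
    mean_exp = sum(math.exp(s) for s in marginal_scores) / len(marginal_scores)
    return mean_joint - math.log(mean_exp)

def mine_batch_estimate(xs, zs, critic, rng):
    """Score joint pairs (x, z) and negatives (x, z'), with z' shuffled in-batch."""
    z_shuffled = list(zs)
    rng.shuffle(z_shuffled)
    joint = [critic(x, z) for x, z in zip(xs, zs)]
    marginal = [critic(x, z) for x, z in zip(xs, z_shuffled)]
    return dv_lower_bound(joint, marginal)

def toy_critic(x, z):
    """Fixed stand-in critic: high score when x and z are close."""
    return -(x - z) ** 2

rng = random.Random(0)
batch_size = 128                                     # batch size from the protocol
xs = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
zs = [x + 0.1 * rng.gauss(0.0, 1.0) for x in xs]     # z is a noisy copy of x

estimate = mine_batch_estimate(xs, zs, toy_critic, rng)
```

In the full procedure the critic's parameters would be updated to maximise this quantity, and the value reached after training is taken as the MI estimate.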

C.3 MI Across Layers and Models

Figure 12 shows how mutual information evolves during training across model scales. In small models (top-left), MI between the embedding and subsequent layers drops rapidly and stabilises early, suggesting early-stage compression and limited representational differentiation.

Figure 12: Mutual Information (MI) between adjacent layers over training steps for small (top-left), medium (top-right), and large (bottom) models. Each line represents the MI between one pair of layers (e.g., embedding → Layer 0, Layer 0 → Layer 1). Higher MI values suggest greater information flow or redundancy; drops indicate compression.

In medium models (top-right), we observe a pronounced dip in MI that aligns temporally with the phase transition identified in our diagnostics. This suggests that representational compression may act as a precursor or trigger for abstraction. Following this dip, MI values remain volatile across layers, reflecting a noisier or less stable reconfiguration of internal representations.

Large models (bottom) follow a broadly similar trend, though with key differences: the MI dip appears in most transitions, but is less evident or absent in others (e.g., Layer 1 → Layer 2).

While the timing of these dips often aligns with the abstraction phase transition, the metric remains highly volatile. This instability suggests that mutual information, although partially correlated with representational restructuring, may be too noisy and inconsistent to serve as a reliable standalone indicator of abstraction onset.