Sampling the space of solutions of an artificial neural network

Alessandro Zambon, Department of Physics, University of Milan and INFN, via Celoria 16, 20133 Milano, Italy
Enrico M. Malatesta, Department of Computing Sciences and Bocconi Institute for Data Science and Analytics (BIDSA), Bocconi University, 20136 Milano, Italy
Guido Tiana, Department of Physics, University of Milan and INFN, via Celoria 16, 20133 Milano, Italy
Riccardo Zecchina, Department of Computing Sciences and Bocconi Institute for Data Science and Analytics (BIDSA), Bocconi University, 20136 Milano, Italy

(March 11, 2025)

Abstract

The weight space of an artificial neural network can be systematically explored using tools from statistical mechanics. We combine a hybrid Monte Carlo algorithm that performs long exploration steps, a ratchet-based algorithm to investigate connectivity paths, and coupled-replica simulations to study subdominant flat regions. Our analysis focuses on one-hidden-layer networks and spans a range of energy levels and constrained density regimes.

Near the interpolation threshold, the low-energy manifold shows a spiky topology. While these spikes aid gradient descent, they can trap general sampling algorithms at low temperatures; they remain connected by complex paths in a confined, non-flat region.

In the overparameterized regime, however, the low-energy manifold becomes entirely flat, forming an extended complex structure that is easy to sample. These numerical results are supported by an analytical study of the training error landscape, and we show numerically that the qualitative features of the loss landscape are robust across different data structures. Our study aims to provide new methodological insights for developing scalable methods for large networks.

I Introduction

Understanding the geometry of the loss landscape in deep neural network models is a critical challenge epf , as it directly influences the design and optimization of learning algorithms. The non-convexity of the loss landscape introduces significant complexity that often precludes the application of analytical methods.

Empirical evidence suggests that first-order optimization methods, such as gradient descent (GD) and variants like stochastic gradient descent (SGD), can effectively navigate certain regions of these landscapes LeCun et al. (2015, 1998); Bottou (2010); Kingma and Ba (2014); Wilson et al. (2017). However, the insights derived from such methods are closely tied to the specifics of the algorithms, limiting their usefulness in providing a comprehensive understanding of the underlying geometric structures. For example, large language models are typically not trained to optimality on their data Kaplan et al. (2020); Hoffmann et al. (2022), linking their generalization properties to high-loss configurations, a relationship that remains poorly understood but holds promise for improving learning efficiency.

Analytical progress has been made in shallow, non-convex networks under simplified assumptions Gardner and Derrida (1988); Baldassi et al. (2019), using statistical physics techniques such as the replica and cavity methods. These studies reveal a highly intricate structure of minima Baldassi et al. (2020a, 2015, 2021); Annesi et al. (2023), characterized by features such as the overlap gap that renders some minima inaccessible Huang and Kabashima (2014), as well as rare, broad, and accessible minima. These minima have also been found to exhibit a diverse and broad range of generalization abilities Baldassi et al. (2022, 2020b).

On the empirical side, simple low-dimensional visualizations of the loss landscape have been used to gain insight into the general properties of these minima and their arrangement within the loss landscape Li et al. (2018); Huang et al. (2020). In particular, linear and piecewise linear paths Draxler et al. (2018); Garipov et al. (2018); Fort and Jastrzebski (2019) have been used to show that weight configurations found by SGD are generally not linearly connected but are nevertheless connected through a piecewise path Pittorino et al. (2022); Entezari et al. (2022).

The problem of studying the geometry of the solutions of a deep neural network can be easily formulated in terms of statistical mechanics. While an exhaustive search for the regions associated with a low value of the loss $\mathcal{L}$ in the high-dimensional parameter space is infeasible, one can explore the parameter space requiring the average loss $\langle\mathcal{L}\rangle$ to be small, while allowing its instantaneous value to fluctuate around it.

We investigate the loss landscape using a Hybrid Monte Carlo (HMC) approach Duane et al. (1987), which enables large exploratory steps guided by gradients, in contrast to the purely random updates of standard Monte Carlo methods.

We also introduce a Ratchet Hybrid Monte Carlo (RHMC) algorithm, which steers sampling along complex paths while maintaining low loss values. This approach is inspired by a similar technique shown to be efficient in protein folding simulations Camilloni et al. (2011) and in identifying their transition states Tiana and Camilloni (2012).

To demonstrate the effectiveness of HMC and RHMC, we compare their results with analytical predictions for shallow non-convex networks and find agreement between the numerical results and these predictions. Additionally, we uncover previously unreported theoretical insights into the geometry of minima in single–hidden–layer neural networks with generic activation functions Baldassi et al. (2019), including configurations that pose challenges to conventional optimization approaches.

Although these results underscore the efficacy of HMC and RHMC as powerful tools for probing neural networks with intricate architectures, extending these findings to large-scale networks remains a challenge. Our work provides new insights in this regard.

The architecture we have chosen to study the solution space of artificial neural networks is a tree committee machine Barkai et al. (1992); Engel et al. (1992); Baldassi et al. (2019). It can be considered the simplest non-convex and non-linear toy model of a neural network. It has the advantages of being analytically treatable with the replica method and of having optimized parameters that are not related by trivial permutational symmetries. Classical works in the 1990s studied this model with the replica method Barkai et al. (1990); Engel et al. (1992) for the sign activation function, in the thermodynamic limit $N\to\infty$ and $P\to\infty$ with $\alpha\equiv P/N$ fixed and for $K=O(1)$ . The typical and atypical states of this model were studied in Baldassi et al. (2019) in the limit of large width $K$ (but with $K/N\to 0$ ) for a generic activation function. The same work provided a determination of the SAT/UNSAT transition, i.e. the maximum number of samples that the model can in principle store, in the Replica Symmetric (RS) and 1-step Replica Symmetry Breaking (1RSB) schemes. Recently, the exact SAT/UNSAT transition has been computed in Annesi et al. (2024) using a numerical solution of the full Replica Symmetry Breaking (fRSB) equations, and it has been compared with the maximal capacity reached by Gradient Descent; this unveiled the presence of a hard phase for Gradient Descent.

Our results are presented as follows. We shall first introduce the model and the algorithms to sample the loss space (Sect. II), then we shall describe the numerical results obtained close to the interpolation threshold (Sect. III.1). In the case of the overparametrized regime (Sect. III.2), we shall first present the numerical results of the sampling and then compare them with the analytical calculations. Finally, we will discuss how these results change when realistic, correlated data are used as input (Sect. III.3), and we will draw the overall conclusions (Sect. IV).

II The model and the Methods for sampling and connecting solutions

II.1 The model and main definitions

The tree committee machine we consider consists of a two layer neural network with a generic non-linear activation function, in which each of the $K$ neurons of the hidden layer is connected only to a subset of $N/K$ elements of the $N$ –dimensional input vector (Fig. 1).

For any input $\boldsymbol{x}$ , the output of the tree committee machine can be written as

\hat{y}(\boldsymbol{w})=\text{sign}[\Delta(\boldsymbol{w})]=\text{sign}\left[\frac{1}{\sqrt{K}}\sum_{l=1}^{K}c_{l}\,\varphi\left(\sqrt{\frac{K}{N}}\sum_{i=1}^{N/K}w_{li}x_{li}\right)\right],

(1)

where $K$ is the width of the hidden layer and $\varphi$ is a non-linear activation function. The first layer is parameterized by a set of weights $\boldsymbol{w}\in\mathbb{R}^{N}$ , which will be trained to learn a dataset. The weights of the second layer, $c_{l}$ , are fixed (not learned) to $\pm 1$ with equal probability. Thus, the number of learned weights is exactly equal to $N$ , the input dimension of the network.
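As an illustration of Eq. (1), the forward pass can be sketched in a few lines of NumPy. This is a minimal sketch and not the authors' code: the function names, the array layout (one row of weights per hidden unit) and the toy sizes are assumptions made for the example, while the choice of $c_{l}$ (first half $+1$, second half $-1$) follows Fig. 1.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def preactivation(w, x, c, phi=relu):
    """Output preactivation Delta(w) of Eq. (1) for a single input.

    w, x : arrays of shape (K, N // K); row l holds the weights (inputs)
           of the disjoint receptive field of hidden unit l.
    c    : array of shape (K,), fixed second-layer weights (+1 or -1).
    """
    K, n_per_unit = w.shape
    N = K * n_per_unit
    hidden = phi(np.sqrt(K / N) * np.sum(w * x, axis=1))  # shape (K,)
    return np.dot(c, hidden) / np.sqrt(K)

def predict(w, x, c, phi=relu):
    return np.sign(preactivation(w, x, c, phi))

# Toy example (sizes are illustrative only)
rng = np.random.default_rng(0)
K, N = 6, 60
w = rng.standard_normal((K, N // K))
x = rng.standard_normal((K, N // K))
c = np.array([1.0] * (K // 2) + [-1.0] * (K // 2))
delta = preactivation(w, x, c)
label = predict(w, x, c)
```

Storing the weights as a $K\times N/K$ array makes the tree structure explicit: each hidden unit only sees its own row of the input.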

Figure 1: Example of a tree committee machine with $K=6$ , activation function $\varphi$ for the neurons in the hidden layer and $\sigma$ for the neuron in the output layer. The weights of the output neuron have been set to $\pm 1$ for $i=1,...,\frac{K}{2}$ and $i=\frac{K}{2}+1,...,K$ respectively.

We consider a synthetic dataset $\mathcal{D}$ made of $P$ input vectors $\boldsymbol{x}^{\mu}$ , whose elements are extracted from a standard normal distribution, and the associated labels $y^{\mu}=\pm 1$ , also generated randomly with equal probability. The ratio $\alpha=P/N$ is called constrained density.

The task is to find an $N$ -dimensional vector $\boldsymbol{w}$ that correctly classifies each input $\boldsymbol{x}^{\mu}$ in the dataset to the corresponding label $y^{\mu}$ , for every $\mu\in[P]$ , i.e.

y^{\mu}\Delta^{\mu}(\boldsymbol{w})>0\,,\qquad\forall\mu\in[P]\,.

(2)

A weight vector $\boldsymbol{w}$ satisfying Eq. (2) will therefore be called a solution of the classification problem in the following. Equivalently, this means that the training error of $\boldsymbol{w}$ , defined as the number of misclassified input-output associations

\begin{split}&\epsilon_{t}=\sum_{\mu=1}^{P}\ell_{\mathrm{NE}}\left(y^{\mu}\Delta^{\mu}(\boldsymbol{w})\right)\\ &\ell_{\mathrm{NE}}\left(y^{\mu}\Delta^{\mu}(\boldsymbol{w})\right)\equiv\Theta\left(-y^{\mu}\Delta^{\mu}(\boldsymbol{w})\right)\end{split}

(3)

is equal to zero, where we have denoted by $\Theta(\cdot)$ the Heaviside function, and by $\ell_{\mathrm{NE}}$ the “error counting” loss. Since this loss is not differentiable, it is not used for optimization in machine learning, which typically relies on gradient-based algorithms. Instead, the error counting loss usually serves to evaluate whether a solution has been found, rather than guiding the optimization process itself.

A loss function that is commonly used to actually find solutions, and on which we will focus in this paper, is the so-called cross-entropy loss, which in binary classification reads

\ell_{\mathrm{CE}}(y^{\mu}\Delta^{\mu}(\boldsymbol{w}))=\ln\left(1+e^{-y^{\mu}\Delta^{\mu}(\boldsymbol{w})}\right).

(4)
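The two per-pattern losses of Eqs. (3) and (4) can be written as functions of the stability $y^{\mu}\Delta^{\mu}(\boldsymbol{w})$ . The following is a minimal sketch with hypothetical function names. Counting a stability of exactly zero as an error is a convention assumed here to match the strict inequality of Eq. (2); `np.logaddexp` is used only as a numerically stable way of evaluating $\ln(1+e^{-s})$ .

```python
import numpy as np

def error_counting_loss(stability):
    """Eq. (3): 1 for a misclassified pattern, 0 otherwise.

    Assumption: a stability of exactly zero counts as an error,
    consistent with the strict inequality of Eq. (2).
    """
    return (np.asarray(stability) <= 0).astype(float)

def cross_entropy_loss(stability):
    """Eq. (4): ln(1 + exp(-s)), evaluated without overflow."""
    return np.logaddexp(0.0, -np.asarray(stability))

# Toy stabilities y * Delta for three patterns (illustrative values)
stabilities = np.array([2.0, 0.5, -0.3])
training_error = error_counting_loss(stabilities).sum()
total_cross_entropy = cross_entropy_loss(stabilities).sum()
```

Unlike the error counting loss, the cross-entropy is strictly positive even for correctly classified patterns and decreases smoothly as the stability grows, which is what makes it usable with gradient-based algorithms.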

In the following we will say that a solution is “typical” if it is extracted from the flat measure over the set of all solutions. Sometimes one requires not only that $\boldsymbol{w}$ is a solution of the classification problem, but also that it satisfies a certain degree of robustness. This can be enforced by ensuring that $\Delta^{\mu}(\boldsymbol{w})$ aligns with the corresponding label $y^{\mu}$ for any $\mu\in[P]$ , within a specified confidence level $\kappa$

y^{\mu}\Delta^{\mu}(\boldsymbol{w})>\kappa\,,\qquad\forall\mu\in[P]

(5)

$\kappa$ is also called the “margin”. Imposing a margin $\kappa>0$ ensures that the sampled solution not only has zero training error, but is also robust to perturbations of the inputs. Both typical (i.e. $\kappa=0$ ) and atypically robust solutions with a positive margin $\kappa$ can be obtained using a loss function that generalizes the one defined in Eq. (3),

\ell(y^{\mu}\Delta^{\mu}(\boldsymbol{w}))=\Theta(\kappa-y^{\mu}\Delta^{\mu}(\boldsymbol{w}))\,.

(6)

Indeed, in the limit $\beta\to\infty$ the Boltzmann measure of Eq. (7), equipped with the loss function of Eq. (6), focuses on solutions satisfying Eq. (5). We stress that solutions obtained by optimizing a loss other than the error counting loss are therefore to be considered "atypical".

II.2 The canonical–ensemble framework

We shall look for the solutions of the network using the formalism of the canonical ensemble, thus sampling the regions of parameter space that have, on average, a given value of the loss. The solutions of the network are then found by sampling the parameters at low temperature, where the average loss is minimal.

The posterior distribution corresponding to a given loss function (or log-likelihood) $\mathcal{L}(\boldsymbol{w};\mathcal{D})$ is given by the Boltzmann-like distribution

p(\boldsymbol{w},\beta;\mathcal{D})=\frac{e^{-\beta\mathcal{L}(\boldsymbol{w};\mathcal{D})}p(\boldsymbol{w})}{Z(\beta;\mathcal{D})}

(7)

where $\beta=1/T$ is the inverse temperature, $p(\boldsymbol{w})$ is the prior distribution of the weights. For simplicity, a Boltzmann constant $k_{B}=1$ is assumed in every equation. The factor $Z(\beta;\mathcal{D})$ is a normalization factor that is called evidence in Bayesian statistics and partition function in statistical physics and it reads

Z(\beta;\mathcal{D})=\int d\boldsymbol{w}\,p(\boldsymbol{w})\,e^{-\beta\mathcal{L}(\boldsymbol{w};\mathcal{D})}

(8)

The loss function usually considered in the machine learning literature is factorized over the $P$ elements of the dataset, i.e.

\mathcal{L}(\boldsymbol{w};\mathcal{D})\equiv\sum_{\mu=1}^{P}\ell(y^{\mu}\Delta^{\mu}(\boldsymbol{w}))

(9)

where $\ell$ is a loss function per pattern and $\Delta^{\mu}(\boldsymbol{w})$ denotes the preactivation of the output node for the $\mu$ -th pattern. We consider as a prior a Gaussian distribution, or equivalently an $L^{2}$ regularization term with parameter $\lambda$

p(\boldsymbol{w})=e^{-\frac{\beta\lambda}{2}|\boldsymbol{w}|^{2}}

(10)

We can incorporate the regularization term and the loss term in a function $U(\boldsymbol{w})$

U(\boldsymbol{w})=\mathcal{L}(\boldsymbol{w};\mathcal{D})+\frac{\lambda}{2}|\boldsymbol{w}|^{2}

(11)

that can be thought of as the (potential) energy associated with the neural network parameters $\boldsymbol{w}$ . Equation (7) can then be rewritten as

p(\boldsymbol{w},T)=\frac{e^{-\frac{U(\boldsymbol{w})}{T}}}{Z_{U}(T)}

(12)

where we have dropped for simplicity the dependence on the dataset $\mathcal{D}$ in both the posterior distribution and the partition function $Z_{U}(T)=\int d\boldsymbol{w}\,e^{-\frac{U(\boldsymbol{w})}{T}}$ .
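To make Eq. (11) concrete, the energy $U(\boldsymbol{w})$ for the cross-entropy loss of Eq. (4) can be evaluated on a whole dataset at once. This is a sketch under stated assumptions: ReLU activation, inputs stored as an array of shape $(P,K,N/K)$ , and hypothetical function names; none of these choices is prescribed by the text.

```python
import numpy as np

def preactivations(w, X, c):
    """Delta^mu(w) of Eq. (1) for all patterns, with ReLU activation.

    w : (K, N // K) first-layer weights
    X : (P, K, N // K) inputs, split into the K receptive fields
    c : (K,) fixed second-layer weights
    """
    K, n_per_unit = w.shape
    N = K * n_per_unit
    fields = np.sqrt(K / N) * np.einsum("li,pli->pl", w, X)  # (P, K)
    return np.maximum(fields, 0.0) @ c / np.sqrt(K)          # (P,)

def potential_energy(w, X, y, c, lam):
    """Eq. (11): cross-entropy loss summed over patterns plus L2 term."""
    stabilities = y * preactivations(w, X, c)
    loss = np.logaddexp(0.0, -stabilities).sum()
    return loss + 0.5 * lam * np.sum(w ** 2)

# Toy dataset (sizes are illustrative only)
rng = np.random.default_rng(0)
P, K, N = 20, 4, 16
X = rng.standard_normal((P, K, N // K))
y = rng.choice([-1.0, 1.0], size=P)
c = np.array([1.0, 1.0, -1.0, -1.0])
U_at_zero = potential_energy(np.zeros((K, N // K)), X, y, c, lam=1e-3)
```

At $\boldsymbol{w}=0$ every preactivation vanishes, so the energy reduces to $P\ln 2$ , which is a convenient sanity check for an implementation.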

II.3 Hybrid Monte Carlo algorithm

One of the main goals of statistical inference is to be able to sample from the posterior distribution of Eq. (12). Monte Carlo–based methods achieve this by simulating a Markovian stochastic process over $\boldsymbol{w}$ that converges to the distribution of Eq. (12). In the Metropolis implementation, the transition rate of the stochastic process is

r_{M}(\boldsymbol{w}_{i}\rightarrow\boldsymbol{w}_{i+1})=r_{0}\,p_{ap}(\boldsymbol{w}_{i+1}|\boldsymbol{w}_{i})\,\mathrm{min}\biggl[1,e^{-\frac{U(\boldsymbol{w}_{i+1})-U(\boldsymbol{w}_{i})}{T}}\biggr]

(13)

where $r_{0}$ is a rate constant, $p_{ap}(\boldsymbol{w}_{i+1}|\boldsymbol{w}_{i})$ is the a priori conditional probability of proposing a move and the last term is the acceptance probability. If $p_{ap}(\boldsymbol{w}_{i+1}|\boldsymbol{w}_{i})$ is symmetric upon exchange of the two states, the condition of detailed balance holds and the process converges to Eq. (12). Often a uniform a priori probability is used; however, the higher the dimension of the system, the lower the probability that a uniformly chosen move points in a favorable direction.
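The acceptance rule of Eq. (13) with a symmetric proposal can be sketched as follows. The Gaussian random-walk proposal, the step size and the quadratic toy energy are assumptions chosen for illustration; the text only requires the a priori probability to be symmetric.

```python
import numpy as np

def metropolis_step(w, energy, T, step_size, rng):
    """One Metropolis move with a symmetric Gaussian proposal."""
    proposal = w + step_size * rng.standard_normal(w.shape)
    dU = energy(proposal) - energy(w)
    # Accept downhill moves always, uphill moves with probability exp(-dU/T)
    if dU <= 0 or rng.random() < np.exp(-dU / T):
        return proposal, True
    return w, False

def toy_energy(w):
    # Quadratic energy: the target of Eq. (12) is a Gaussian of variance T
    return 0.5 * np.sum(w ** 2)

rng = np.random.default_rng(1)
w = np.zeros(1)
samples = []
for step in range(20000):
    w, _ = metropolis_step(w, toy_energy, T=1.0, step_size=1.0, rng=rng)
    samples.append(w[0])
variance = np.var(samples[2000:])  # should be close to T = 1
```

In one dimension this random walk equilibrates quickly; in high dimension a random direction almost never lowers the energy, which is the limitation discussed above.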

A way of mitigating this problem is to choose the a priori probability exploiting the knowledge of the energy gradient Duane et al. (1987), using a Hamiltonian formalism. Defining $\boldsymbol{p}\in\mathbb{R}^{N}$ as the momenta associated to the weights $\boldsymbol{w}$ , the Hamiltonian of the system is

H(\boldsymbol{w},\boldsymbol{p})=\frac{1}{2}\boldsymbol{p}^{T}\mathcal{M}^{-1}\boldsymbol{p}+U(\boldsymbol{w})

(14)

where $\mathcal{M}^{-1}$ is the inverse of a (fictitious) mass matrix.

The solutions to the equations of motion resulting from this Hamiltonian can be used as a preliminary step in the HMC algorithm. The time reversal invariance of Hamiltonian systems guarantees that the detailed balance holds.

In general, the equations of motion are solved with a numerical integrator, and detailed balance is satisfied only in the limit of time step $\delta t\rightarrow 0$ . However, some integrators like the velocity Verlet algorithm

\begin{split}\boldsymbol{w}^{(k+1)}&=\boldsymbol{w}^{(k)}+\mathcal{M}^{-1}\boldsymbol{p}^{(k)}\delta t-\mathcal{M}^{-1}\nabla U(\boldsymbol{w}^{(k)})\frac{\delta t^{2}}{2}\\ \boldsymbol{p}^{(k+1)}&=\boldsymbol{p}^{(k)}-\frac{\delta t}{2}\bigl[\nabla U(\boldsymbol{w}^{(k+1)})+\nabla U(\boldsymbol{w}^{(k)})\bigr]\end{split}

(15)

are intrinsically symmetric under time reversal, and thus detailed balance holds for any choice of $\delta t$ , even if the energy is not exactly conserved. In fact, rearranging the second of Eqs. (15) one obtains

\boldsymbol{p}^{(k+1)}+\frac{\delta t}{2}\nabla U(\boldsymbol{w}^{(k+1)})=\boldsymbol{p}^{(k)}-\frac{\delta t}{2}\nabla U(\boldsymbol{w}^{(k)}),

(16)

and substituting it in the former,

\begin{split}\boldsymbol{w}^{(k)}&=\boldsymbol{w}^{(k+1)}-\mathcal{M}^{-1}\boldsymbol{p}^{(k+1)}\delta t-\mathcal{M}^{-1}\nabla U(\boldsymbol{w}^{(k+1)})\frac{\delta t^{2}}{2}\\ \boldsymbol{p}^{(k)}&=\boldsymbol{p}^{(k+1)}+\frac{\delta t}{2}\bigl[\nabla U(\boldsymbol{w}^{(k)})+\nabla U(\boldsymbol{w}^{(k+1)})\bigr],\end{split}

which is the step of Eqs. (15) with $\delta t\to-\delta t$ , i.e. a backward trajectory with respect to that of Eqs. (15).
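The time-reversal symmetry derived above can be checked numerically: integrating forward with step $\delta t$ and then applying the same update with $-\delta t$ should return the initial point up to rounding errors. The sketch below assumes an identity mass matrix and uses an arbitrary anharmonic toy energy; both are choices made for the example only.

```python
import numpy as np

def verlet_step(w, p, grad_U, dt):
    """One velocity Verlet step, Eqs. (15), with identity mass matrix."""
    g_old = grad_U(w)
    w_new = w + p * dt - 0.5 * dt ** 2 * g_old
    p_new = p - 0.5 * dt * (grad_U(w_new) + g_old)
    return w_new, p_new

def grad_U(w):
    # Gradient of the toy energy U(w) = sum(w^4) / 4 + sum(w^2) / 2
    return w ** 3 + w

rng = np.random.default_rng(0)
w0 = rng.standard_normal(5)
p0 = rng.standard_normal(5)

w, p = w0.copy(), p0.copy()
for _ in range(50):                       # forward trajectory
    w, p = verlet_step(w, p, grad_U, dt=0.05)
w_forward = w.copy()
for _ in range(50):                       # same update with dt -> -dt
    w, p = verlet_step(w, p, grad_U, dt=-0.05)

reversal_error = max(np.abs(w - w0).max(), np.abs(p - p0).max())
```

The reversal error stays at the level of floating-point rounding even though the time step is finite and the energy along the trajectory is not exactly conserved.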

In practice, at each iteration of the Metropolis algorithm, the momenta are extracted from a Maxwell–Boltzmann distribution at temperature $T$ and a trajectory in phase space is generated by solving Eqs. (15). The final point of the trajectory is then accepted with probability $\min(1,\exp[-\Delta H/T])$ , where $\Delta H=H(\boldsymbol{w}_{i+1},\boldsymbol{p}_{i+1})-H(\boldsymbol{w}_{i},\boldsymbol{p}_{i})$ . Under detailed balance, the kinetic-energy term arising from the Metropolis acceptance probability and that coming from the Maxwell–Boltzmann term in the a priori probability $p_{ap}$ cancel out, and the algorithm converges to Eq. (12).

The power of this scheme is that it can produce large moves in a direction along which the energy of the system changes smoothly, and thus the Metropolis acceptance rate can remain high.
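Putting the pieces together, a single HMC move (momentum refresh, velocity Verlet trajectory, Metropolis test on $\Delta H$ ) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes $\mathcal{M}=\mathbb{I}$ , a fixed number of integration steps, and a quadratic toy energy for which equipartition gives $\langle U\rangle=NT/2$ ; the function name and parameter values are hypothetical.

```python
import numpy as np

def hmc_step(w, U, grad_U, T, dt, n_steps, rng):
    """One Hybrid Monte Carlo move at temperature T (mass matrix = identity)."""
    p = np.sqrt(T) * rng.standard_normal(w.shape)   # Maxwell-Boltzmann momenta
    H_old = 0.5 * np.dot(p, p) + U(w)
    w_new, p_new = w.copy(), p.copy()
    g = grad_U(w_new)
    for _ in range(n_steps):                        # velocity Verlet, Eqs. (15)
        w_new = w_new + p_new * dt - 0.5 * dt ** 2 * g
        g_next = grad_U(w_new)
        p_new = p_new - 0.5 * dt * (g + g_next)
        g = g_next
    dH = 0.5 * np.dot(p_new, p_new) + U(w_new) - H_old
    if dH <= 0 or rng.random() < np.exp(-dH / T):   # Metropolis test on H
        return w_new, True
    return w, False

def U(w):
    return 0.5 * np.dot(w, w)

def grad_U(w):
    return w

rng = np.random.default_rng(0)
T, N, n_moves = 0.5, 10, 3000
w = np.zeros(N)
energies, n_accepted = [], 0
for move in range(n_moves):
    w, accepted = hmc_step(w, U, grad_U, T, dt=0.3, n_steps=5, rng=rng)
    n_accepted += accepted
    if move >= 500:                                 # discard burn-in
        energies.append(U(w))
mean_energy = np.mean(energies)                     # expected near N * T / 2
acceptance = n_accepted / n_moves
```

For the network studied here, `U` and `grad_U` would be replaced by the energy of Eq. (11) and its gradient; the trade-off between `dt` and `n_steps` then controls how far a move travels while keeping the acceptance rate high.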

II.4 Double ratchet

In deep learning, characterizing the structure of neural network loss landscapes is crucial for understanding both optimization dynamics and generalization properties. A key line of research explores pathways between distinct weight configurations $\boldsymbol{w}$ and $\boldsymbol{w}^{\prime}$ , which typically correspond to solutions with low loss. The simplest approach examines how the loss behaves along the linear (or straight) path connecting these points. As shown in Draxler et al. (2018), overparameterized neural networks trained with stochastic gradient descent (SGD) often exhibit a loss barrier along the linear path. The barrier can be notably decreased by removing the symmetries that the neural network possesses, such as the permutation symmetry of the hidden units Pittorino et al. (2022); Entezari et al. (2022).

While linear paths provide an intuitive and computationally efficient visualization of the loss landscape, they do not explicitly search for paths between solutions. To address these limitations, the authors of Garipov et al. (2018) introduced the use of piecewise linear trajectories, or “polygonal chains”, where the path is optimized by introducing $k$ intermediate pivot points. Although this method allows for more flexible connections, it is computationally demanding, requiring multiple training runs to optimize the pivot locations. Furthermore, the number of required pivots can grow significantly, particularly in less overparameterized settings, making the approach increasingly impractical in such regimes.

As an alternative way to generate trajectories from $\boldsymbol{w}$ to $\boldsymbol{w}^{\prime}$ , we make use of a modified HMC algorithm that damps fluctuations in the direction opposite to $\boldsymbol{w}^{\prime}$ . This is inspired by the ratchet–and–pawl mechanical system and has given good results in molecular dynamics simulations Camilloni et al. (2011). It can be implemented by calculating the dynamics of two points $\boldsymbol{w}_{1}(t)$ and $\boldsymbol{w}_{2}(t)$ starting in $\boldsymbol{w}$ and $\boldsymbol{w}^{\prime}$ , respectively, and evolving under an HMC with energy

U_{rt}\bigl[\boldsymbol{w}_{1}(t),\boldsymbol{w}_{2}(t)\bigr]=U\bigl[\boldsymbol{w}_{1}(t)\bigr]+U\bigl[\boldsymbol{w}_{2}(t)\bigr]+\widetilde{U}\bigl[\boldsymbol{w}_{1}(t),\boldsymbol{w}_{2}(t)\bigr]

(17)

where $U$ is the potential energy of Eq. (11) and

\widetilde{U}\bigl[\boldsymbol{w}_{1}(t),\boldsymbol{w}_{2}(t)\bigr]=\begin{cases}\frac{k}{2}\bigl[|\boldsymbol{w}_{1}(t)-\boldsymbol{w}_{2}(t)|-d_{\min}(t)\bigr]^{2}&\text{if }|\boldsymbol{w}_{1}(t)-\boldsymbol{w}_{2}(t)|>d_{\min}(t)\\ 0&\text{if }|\boldsymbol{w}_{1}(t)-\boldsymbol{w}_{2}(t)|\leq d_{\min}(t)\end{cases}

(18)

where $d_{\mathrm{min}}(t)\equiv\min_{t^{\prime}<t}|\boldsymbol{w}_{1}(t^{\prime})-\boldsymbol{w}_{2}(t^{\prime})|$ is the minimum distance observed along the trajectories of the two points up to time $t$ . The time–dependent term of Eq. (18) favors the moves that bring the two weight vectors closer to each other, without exerting work to push them. In this way, the two points can autonomously find the minimum–energy path that connects them.
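The ratchet term of Eq. (18) only needs the running minimum of the distance between the two points. A minimal sketch is given below; the class name, the explicit `update` call after each accepted move and the numerical values are assumptions of this example rather than details given in the text.

```python
import numpy as np

class RatchetCoupling:
    """Half-harmonic penalty of Eq. (18) with a running minimum distance."""

    def __init__(self, w1, w2, k):
        self.k = k
        self.d_min = np.linalg.norm(w1 - w2)   # distance at the start

    def energy(self, w1, w2):
        """Zero if the points are no farther apart than d_min, harmonic otherwise."""
        excess = max(np.linalg.norm(w1 - w2) - self.d_min, 0.0)
        return 0.5 * self.k * excess ** 2

    def update(self, w1, w2):
        """Record a new minimum distance; call after each accepted move."""
        self.d_min = min(self.d_min, np.linalg.norm(w1 - w2))

w1 = np.array([0.0, 0.0])
ratchet = RatchetCoupling(w1, np.array([3.0, 4.0]), k=2.0)   # d_min = 5

e_start = ratchet.energy(w1, np.array([3.0, 4.0]))    # at d_min: no penalty
e_away = ratchet.energy(w1, np.array([6.0, 8.0]))     # moving apart is penalized
e_closer = ratchet.energy(w1, np.array([0.0, 3.0]))   # approaching costs nothing
ratchet.update(w1, np.array([0.0, 3.0]))              # d_min is now 3
e_back = ratchet.energy(w1, np.array([3.0, 4.0]))     # old distance is now penalized
```

Because the penalty vanishes whenever the distance does not exceed its previous minimum, the coupling never pulls the two points together; it only resists fluctuations that would separate them again.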

During each simulation, the two vectors are updated sequentially at each time step according to Eq. (17), until their cosine–similarity (or normalized overlap)

q(\boldsymbol{w}_{1},\boldsymbol{w}_{2})=\frac{\boldsymbol{w}_{1}\cdot\boldsymbol{w}_{2}}{|\boldsymbol{w}_{1}||\boldsymbol{w}_{2}|}

(19)

is equal to 1. We will call the point $\boldsymbol{s}^{\star}$ where the two trajectories $\boldsymbol{w}_{1}(t)$ and $\boldsymbol{w}_{2}(t)$ meet the “anchor” weight of $\boldsymbol{w}_{1}$ , $\boldsymbol{w}_{2}$ .

II.5 Coupled replica simulations

Besides sampling the Boltzmann–like probability of Eq. (12), it can be useful to bias the sampling so as to give more statistical weight to wider energy minima, which were shown to have better generalization properties than narrow ones Baldassi et al. (2015, 2020b, 2021, 2022).

We did this using an entropy–driven search algorithm inspired by Ref. Baldassi et al. (2016). We used a system composed of $y$ coupled replicas $\{\boldsymbol{w}_{i}(t)\}_{i=1}^{y}$ , each starting from a different solution previously found with HMC. The replicas are coupled to the barycenter of the system

\boldsymbol{w}_{c}(t)=\frac{1}{y}\,\frac{\sum_{i=1}^{y}|\boldsymbol{w}_{i}(t)|}{|\sum_{i=1}^{y}\boldsymbol{w}_{i}(t)|}\,\sum_{i=1}^{y}\boldsymbol{w}_{i}(t)

(20)

with a potential

U_{rp}(\{\boldsymbol{w}_{i}(t)\}_{i=1}^{y})=\sum_{i=1}^{y}U\bigl[\boldsymbol{w}_{i}(t)\bigr]+\frac{\gamma T}{y}\sum_{i=1}^{y}|\boldsymbol{w}_{i}(t)-\boldsymbol{w}_{c}(t)|,

(21)

where $\gamma$ is a Lagrange multiplier regulating the mean distance between the replicas and the barycenter $\boldsymbol{w}_{c}$ .

Algorithmically, the weights $\{\boldsymbol{w}_{i}\}_{i=1}^{y}$ are updated sequentially according to Eq. (21), whereas the barycenter is recomputed each time one of the replicas has moved. The value of $\gamma$ is increased during the simulations, until the $y$ replicas $\boldsymbol{w}_{i}$ all collapse onto a single high–entropy weight $\boldsymbol{c}$ , which we will refer to in the rest of the paper as the “center” weight found by the coupled replica simulation.

Note that in the definition of the barycenter of Eq. (20), the norm is rescaled to match the mean norm of the replicas. This prevents the replicas from being driven toward a lower norm point, which would undesirably increase their energy $U\bigl[\boldsymbol{w}_{i}(t)\bigr]$ .
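The rescaled barycenter of Eq. (20) and the coupled energy of Eq. (21) can be sketched as follows. Function names and the two-replica toy configuration are assumptions of this example; the single-replica energy `U` is passed in as a callable.

```python
import numpy as np

def barycenter(replicas):
    """Eq. (20): mean direction of the replicas, rescaled to their mean norm.

    replicas : array of shape (y, N), one weight vector per row.
    """
    total = replicas.sum(axis=0)
    mean_norm = np.linalg.norm(replicas, axis=1).mean()
    return mean_norm * total / np.linalg.norm(total)

def coupled_energy(replicas, U, gamma, T):
    """Eq. (21): sum of replica energies plus coupling to the barycenter."""
    w_c = barycenter(replicas)
    distances = np.linalg.norm(replicas - w_c, axis=1)
    y = len(replicas)
    return sum(U(w) for w in replicas) + gamma * T / y * distances.sum()

# Two orthogonal unit vectors: the plain average would have norm 1/sqrt(2)
replicas = np.array([[1.0, 0.0], [0.0, 1.0]])
w_c = barycenter(replicas)
plain_mean_norm = np.linalg.norm(replicas.mean(axis=0))
coupling_only = coupled_energy(replicas, U=lambda w: 0.0, gamma=2.0, T=1.0)
```

In this toy case the plain average has norm $1/\sqrt{2}$ , while the rescaled barycenter keeps unit norm, which illustrates why the rescaling avoids dragging the replicas toward smaller norms.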

III Results

Using the numerical techniques listed in the previous section, we explored the learning loss landscape of the tree committee machine at different energy levels and in different regimes of constrained density. Our main results are the following:

• Close to the interpolation threshold, the low-energy manifold has a spiky structure, i.e. it is composed of sharp protrusions (targeted by GD) from which HMC is not able to escape at low temperatures. We nevertheless show that those protrusions are connected by complex paths through a more compact, narrow region that is not completely flat (see Sect. III.1).

• In the overparametrized regime $\alpha\ll 1$ we show that the spikes no longer trap HMC. Furthermore, the low-energy manifold is entirely flat in the bulk, giving it a star-shaped structure. We confirm this numerical evidence by studying analytically the error landscape, showing analogous results (see Sect. III.2).

• Preliminary experiments on highly correlated, real-world datasets indicate that our findings remain robust even when the data exhibit significant structural properties.

Unless stated otherwise, in all the numerical simulations we have used a tree committee machine with $N=1000$ input neurons, $K=50$ hidden neurons, and ReLU activation functions $\varphi(x)=\max(0,x)$ . In HMC simulations we have used the cross-entropy loss function in Eq. (4). The norm of the weights is controlled by the Lagrange multiplier $\lambda=2P\cdot 10^{-7}$ , with $P$ being the size of the dataset. The mass matrix in Eq. (14) required by the algorithm is set to $\mathcal{M}=\mathbb{I}$ .

III.1 The model close to the interpolation threshold

We first study the loss landscape of the model in the underparametrized regime, i.e. just below the threshold at which full–batch GD can no longer find weights with zero training error and which is usually called "interpolation threshold" Engel and Van den Broeck (2001); Belkin et al. (2019). This happens approximately around $\alpha=1.8$ (Fig. 2a), which is very far from the SAT/UNSAT transition $\alpha_{c}\simeq 2.65$ where solutions cease to exist and which was computed in Annesi et al. (2024). All the numerical studies of this section therefore refer to the case $\alpha=1.8$ .

III.1.1 The weight space displays three regimes with respect to the temperature

First, we trained the model with GD for approximately $10^{7}$ epochs with a fixed learning rate $\eta=1.0$ . The solutions $\boldsymbol{s}_{i}$ found with GD, from which we started the sampling, have mean energy $\langle U\rangle_{\boldsymbol{s}}=(8.5\pm 0.04)\cdot 10^{-2}$ and mean training error $\langle\epsilon\rangle_{\boldsymbol{s}}=0$ (Fig. 2a). The similarity between them is peaked at $q\approx 0.6$ (see upper panel in Fig. 2c), which means that GD finds solutions with an almost fixed mutual distance.

Starting from GD solutions, we explored the space of parameters of the network with the HMC algorithm at different temperatures (Fig. 2b). For $T<1.8\cdot 10^{-1}$ the average energy $\langle U\rangle_{T}$ starts to flatten, suggesting that here the system freezes in the lowest–energy available states. The average energy is comparable with that of the solutions $\boldsymbol{s}_{i}$ and also the mean training error remains $\langle\epsilon\rangle=0$ . At these temperatures the sampling of the space of weights is difficult and the system gets trapped in local minima of the loss. The similarity distribution $\rho(q)$ (see Eq. (19)) calculated on weights sampled from HMC trajectories starting from the same initial condition (intra-state overlap distribution) is strongly peaked around 1, see Fig. 2c, middle panel. This is markedly different from the inter-state overlap distribution, which is obtained by measuring the overlap between weights along HMC trajectories starting from strictly different initial conditions and is peaked at $q\approx 0.63$ . Thus, the system is not able to equilibrate at these temperatures but stays close to the initial condition.
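The intra-state and inter-state overlap distributions discussed above are built from pairwise values of the similarity of Eq. (19). A minimal sketch of how such overlaps could be collected from stored samples is given below; the function names and the tiny arrays are purely illustrative and do not reproduce any data from the simulations.

```python
import numpy as np

def normalize(samples):
    """Rescale each row (one sampled weight vector) to unit norm."""
    return samples / np.linalg.norm(samples, axis=1, keepdims=True)

def intra_state_overlaps(samples):
    """Overlaps q between all distinct pairs of samples from one trajectory."""
    unit = normalize(samples)
    q = unit @ unit.T
    rows, cols = np.triu_indices(len(samples), k=1)
    return q[rows, cols]

def inter_state_overlaps(samples_a, samples_b):
    """Overlaps q between samples of two trajectories with different starts."""
    return (normalize(samples_a) @ normalize(samples_b).T).ravel()

# Illustrative toy "trajectories" in two dimensions
chain_a = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
chain_b = np.array([[1.0, 1.0]])
q_intra = intra_state_overlaps(chain_a)
q_inter = inter_state_overlaps(chain_a, chain_b)
```

A histogram of `q_intra` and `q_inter` (for instance with `np.histogram`) then gives an estimate of the two distributions $\rho(q)$ .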

In the temperature range $1.8\cdot 10^{-1}<T<1.8\cdot 10^{2}$ the average energy increases almost linearly, $\langle U\rangle_{T}\sim T^{\delta}$ , with a scaling exponent $\delta\approx 0.989\pm 0.019$ . This is typical of a glassy thermodynamic phase with a sub-exponential number of accessible states. In this intermediate temperature range, there is no marked difference between the inter- and intra-state overlap distributions (Fig. 2c, lower panel). This suggests that the system is not trapped in local minima as in the low–temperature phase. The average similarity between states displays a single peak centered at $q\approx 0.6$ , similar to that observed at low temperature, but with a larger variance.

In the high temperature phase, the average energy deviates from the linear behavior and the training error further increases. The distribution of similarities $q$ is very broad. However, here the HMC algorithm becomes inefficient because the high velocities extracted before each HMC move require a very small step $\delta t$ for a correct integration of the equations of motion and thus for an acceptable acceptance rate.

We summarize these results in Fig. 2d, where we show a sketch of the space of parameters at fixed norm. On this hypersphere, the solutions found by the GD define energy basins which are explored by the system at low and intermediate temperatures. These basins are disconnected at low temperature, since HMC remains confined near the initialization and cannot find pathways connecting the different basins while remaining at low energy and at zero training error.

III.1.2 The low–energy basins are connected by complex paths

To explore the connectivity of energy basins at low temperatures, we performed double-ratchet simulations, hoping to find paths which are not found by the HMC algorithm. This method is designed to identify low-barrier pathways between two weight configurations by following the minimal gradient in the direction that brings them closer together (Fig. 3a).

The double ratchets are initialized on different pairs of solutions $(\boldsymbol{s}_{i},\boldsymbol{s}_{j})$ found after a short thermalization of the system of approximately $10^{4}$ HMC steps. Along each simulation, the weights do not cross significant energy barriers, as compared to the average energy at this temperature, keeping the training error equal to zero. Conversely, the linear interpolation between pairs of points along the double–ratchet trajectory results in barriers that are substantially higher than the average energy (Fig. 3b). These results suggest that, although the solutions found by gradient descent are linearly mode disconnected, low-energy tortuous paths joining them still exist.
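The barrier along a linear interpolation between two weight vectors can be estimated as sketched below. How the norm is handled along the path is an assumption of this example (the interpolated point is rescaled so that its norm interpolates linearly between the endpoint norms), as are the function names and the two-dimensional toy energy.

```python
import numpy as np

def linear_path_energies(w_a, w_b, U, n_points=51):
    """Energy along the straight segment from w_a to w_b, with rescaled norm."""
    norm_a, norm_b = np.linalg.norm(w_a), np.linalg.norm(w_b)
    ts = np.linspace(0.0, 1.0, n_points)
    energies = np.empty(n_points)
    for idx, t in enumerate(ts):
        w = (1.0 - t) * w_a + t * w_b
        # Assumption: the norm interpolates linearly between the endpoints
        w = w * ((1.0 - t) * norm_a + t * norm_b) / np.linalg.norm(w)
        energies[idx] = U(w)
    return ts, energies

def barrier_height(energies):
    """Maximum along the path minus the higher of the two endpoint energies."""
    return energies.max() - max(energies[0], energies[-1])

def toy_energy(w):
    # Two zero-energy "solutions" on the unit circle, at (1, 0) and (0, 1)
    return (w[0] * w[1]) ** 2

ts, energies = linear_path_energies(
    np.array([1.0, 0.0]), np.array([0.0, 1.0]), toy_energy
)
barrier = barrier_height(energies)
```

The rescaling fails if the segment passes through the origin (antipodal endpoints), a case that would need separate handling. Applying the same function to consecutive points of a double–ratchet trajectory gives the kind of comparison described above.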

The anchor point $\boldsymbol{s}^{\star}$ obtained at the end of the double–ratchet trajectory has an energy that is slightly lower than that of the initial points $(\boldsymbol{s}_{i},\boldsymbol{s}_{j})$ , and $\langle U\rangle_{\boldsymbol{s}^{\star}}=(6.80\pm 0.09)\cdot 10^{-2}$ is compatible with the average $\langle U\rangle_{T_{L}}$ ; that is, the constrained dynamics arising from the double–ratchet simulation helps the equilibration of the system. The anchor weights $\boldsymbol{s}^{\star}$ obtained from several double–ratchet simulations initialized from different weight pairs $(\boldsymbol{s}_{i},\boldsymbol{s}_{j})$ exhibit larger similarity than the respective initial vectors and are separated by energy barriers along their linear interpolation (Fig. 3c). This suggests the presence of a compact and narrow region of the solution space that must be traversed to move from one gradient descent solution to another. The similarity between the anchor weights $\boldsymbol{s}^{\star}$ and the associated initial weights is also centered around $0.6$ , like that between the starting solutions.

In summary (Fig. 3d), the basins identified by GD solutions are connected by non–linear paths pointing towards a denser, more compact region with slightly lower energy. The low–energy manifold of the space of weights has thus a spiky shape, with the GD solutions lying on its spikes.

III.1.3 The center is not flat

We then studied the geometry of the center of the spiky low energy manifold in more detail using coupled replica simulations, starting with five GD solutions coupled to their barycenter and increasing the harmonic couplings until all the running points converge to a single vector $\boldsymbol{c}$ (Fig. 4a).
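The coupling scheme described above can be sketched on a toy problem. The code below is a minimal illustration under explicit assumptions: the network loss is replaced by a toy double-well energy, and plain gradient descent is used in place of HMC sampling to keep the example short and deterministic; the function names and the coupling schedule are hypothetical. Each replica feels its own energy plus a harmonic term $(k/2)|\boldsymbol{w}_{a}-\boldsymbol{b}|^{2}$ towards the barycenter $\boldsymbol{b}$, and $k$ is increased until all replicas collapse onto a single vector.

```python
import numpy as np


def toy_energy_grad(w):
    """Gradient of a toy double-well energy U(w) = sum_i (w_i^2 - 1)^2 / 4.

    Stand-in for the gradient of the network loss, which is not shown here.
    """
    return w ** 3 - w


def coupled_replica_descent(replicas, couplings, steps_per_coupling=100, lr=0.01):
    """Relax replicas under U(w_a) + (k/2) |w_a - barycenter|^2 while k grows.

    Plain gradient descent replaces the HMC sampling of the paper (assumption
    made to keep the sketch short and deterministic).
    """
    w = np.array(replicas, dtype=float)          # shape (n_replicas, dim)
    for k in couplings:
        for _ in range(steps_per_coupling):
            barycenter = w.mean(axis=0)
            # The dependence of the barycenter on w_a cancels because
            # sum_a (w_a - barycenter) = 0, so the coupling force is k * (w_a - b).
            grad = toy_energy_grad(w) + k * (w - barycenter)
            w -= lr * grad
    return w


def spread(w):
    """Largest distance of any replica from the barycenter."""
    return np.linalg.norm(w - w.mean(axis=0), axis=1).max()


rng = np.random.default_rng(0)
# Five replicas started in different wells (components near +1 or -1).
start = rng.choice([-1.0, 1.0], size=(5, 8)) + 0.05 * rng.normal(size=(5, 8))
final = coupled_replica_descent(start, couplings=np.linspace(0.0, 20.0, 41))
center = final.mean(axis=0)                      # plays the role of the vector c
```

Here `spread(final)` measures how far the replicas still are from their barycenter; once it is negligible, `center` can be taken as the common configuration.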

We compared the properties of the centers found in different coupled replica simulations with the solutions $\boldsymbol{s}$ found by GD and the anchor points $\boldsymbol{s}^{\star}$ found by double ratchets initialized on different GD solutions at $T=T_{L}$ . The final configuration $\boldsymbol{c}$ always has an energy that is lower by several standard deviations than the mean energy at the simulation temperature (Fig. 4b,c). The centers obtained from the coupled replica simulations are confined to a very narrow region of the solution space (Fig. 4b), even narrower than the region spanned by the anchor points (middle panel of Fig. 3c).

In addition, the similarity between centers and anchor points is distributed at larger values than that between centers and GD solutions. This ordering suggests that the centers lie deep within the bulk of the solution manifold, with anchor points positioned more peripherally and GD solutions even farther away. This “nested overlap” structure has already been observed in the negative perceptron problem, see Annesi et al. (2024).

Despite the high similarity between the $\boldsymbol{c}$ vectors ($q\approx 0.90\pm 0.03$), they are not linearly connected at fixed norm (Fig. 4b, upper panel, green curves), even though the barrier between them is significantly lower than the one between $\boldsymbol{c}$ and the GD solutions $\boldsymbol{s}$, or between $\boldsymbol{c}$ and the anchor points $\boldsymbol{s}^{\star}$ (Fig. 4b, upper panel, blue and red curves, respectively).

Finally, we performed double–ratchet simulations between pairs of $\boldsymbol{c}$ vectors and between centers and GD solutions. In the latter case, at $T=T_{L}$ the weight initialized on the center $\boldsymbol{c}$ stays very close to it along the double–ratchet dynamics (and its energy stays within two standard deviations of the average at the simulation temperature); the weight starting from the GD solution instead slowly lowers its energy until the coupled $\boldsymbol{c}$ vector is reached (Fig. 4c, blue curves).

On the contrary, when the same simulation is performed between pairs of center points at $T<T_{L}$ , the latter are able to connect to each other through paths whose energy is lower than the average $\langle U\rangle_{T_{L}}$ (Fig. 4c, green curves).

From this last computational analysis, we conclude that the center of the spiky low–energy manifold is not barrier–free. Although there are low–energy paths connecting all low–energy regions of the structure, these are not as simple as linear paths. Moreover, there is a small slope of the energy towards the center.

III.1.4 The intermediate temperature regime

By increasing the temperature to intermediate values (cf. Fig. 2), the system reaches a phase where the HMC algorithm can easily sample the space of weights. In this phase the training error is small but non–negligible (up to $\epsilon\approx 0.3$), and the system can still learn some of the data we present.

The similarity distribution between states shows a single peak centered at $q\approx 0.6$, indicating that most of the sampled states are equivalent to each other. This is different from what is found at low temperatures, where there is a difference between points at the periphery of the structure and weights in the center of the low–energy manifold. Such a single peak in $\rho(q)$ is compatible with what is known in the language of complex systems as replica-symmetric behavior of the system Mézard et al. (1987).

The states sampled during simulations in this intermediate regime ($T=1.8\cdot 10^{1}$) show an average similarity $\langle q\rangle=0.54\pm 0.03$ with respect to GD solutions $\boldsymbol{s}_{i}$ and $\langle q\rangle=0.58\pm 0.04$ with respect to both the $\boldsymbol{s}^{\star}$ and $\boldsymbol{c}$ vectors found in the low–temperature phase. Consequently, here the system is never close to any specific region of the subspace explored at low temperatures. In particular, the large mean distance between the center of the manifold and the typical states sampled at intermediate temperatures suggests that the former is entropically unfavorable. This fact, along with the unimodal shape of the cosine–similarity distribution, suggests that the visited manifold at intermediate temperatures still retains a (less rugged) spiky shape, but with a “hollow” center characterized by high free energy.
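As an illustration of how a similarity distribution $\rho(q)$ of this kind can be estimated from a set of sampled weights, the following minimal sketch computes the cosine similarity of all distinct pairs and builds a normalized histogram. The sampled states are replaced here by synthetic vectors (a common direction plus independent noise), which is purely an assumption made to obtain a single-peaked distribution; the function names are hypothetical.

```python
import numpy as np


def pairwise_cosine_similarity(samples):
    """Cosine similarity q_ab for every distinct pair of sampled weight vectors."""
    w = np.asarray(samples, dtype=float)
    unit = w / np.linalg.norm(w, axis=1, keepdims=True)
    gram = unit @ unit.T
    upper = np.triu_indices(len(w), k=1)         # each pair counted once
    return gram[upper]


def similarity_histogram(q_values, bins=20):
    """Normalized histogram approximating rho(q) on [-1, 1]."""
    density, edges = np.histogram(q_values, bins=bins, range=(-1.0, 1.0), density=True)
    return density, edges


# Synthetic stand-in for sampled states: a common direction plus noise, so
# that all pairs share a similar overlap and rho(q) is single-peaked.
rng = np.random.default_rng(1)
common = rng.normal(size=500)
samples = common + 0.8 * rng.normal(size=(40, 500))
q = pairwise_cosine_similarity(samples)
density, edges = similarity_histogram(q)
```

With real samples, the same two functions could be applied separately to pairs taken within one trajectory and to pairs taken across trajectories, in order to compare the two resulting distributions.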

III.2 The overparametrized regime

III.2.1 Similarities and differences with the underparametrized regime

We then compared the properties of the space of weights that we found at the interpolation threshold with that of the overparametrized regime.

We performed various HMC simulations at $\alpha=0.2$ for a wide range of temperatures, none of which indicates frozen behavior: similarly to the intermediate–temperature case at $\alpha=1.8$, the intra- and inter-state overlap distributions coincide (see Fig. 5a). Moreover, the energy profile along the linear paths between the sampled states shows a central free energy barrier, both at $\alpha=1.8$ and at most temperatures at $\alpha=0.2$. In the latter case, the central barrier disappears for vanishing temperatures and the energy profile assumes a convex shape (Fig. 5b).

In conclusion, the intermediate–temperature configurations visited by the system are similarly distributed both near the interpolation threshold and in the overparametrized regime, where in both scenarios the manifold exhibits a symmetric shape, in the sense that the system populates a single kind of state, belonging to the spiky periphery of the space. Only at low temperatures do the two cases differ, since in the former the explored subspace maintains a spiky shape, where part of the center of the manifold and its periphery are populated, whilst in the latter it becomes convex.

III.2.2 The barrier along linear paths between sampled solutions can be computed analytically

In the overparametrized regime, we can compute analytically both the energy (loss) and the training error along the geodesic path connecting two weights $\boldsymbol{w}^{1}$, $\boldsymbol{w}^{2}$ in the thermodynamic limit, extending the previous analysis performed in ref. Annesi et al. (2023) to the tree committee machine case. In full generality, we assume that the two weights are sampled from two different Boltzmann distributions $\boldsymbol{w}^{1}\sim p_{1}(\boldsymbol{w};\mathcal{D})$, $\boldsymbol{w}^{2}\sim p_{2}(\boldsymbol{w};\mathcal{D})$, each of which has the same form as in equation (7) but with its own choice of the loss function and the prior, i.e.


	$\displaystyle p_{1}(\boldsymbol{w};\mathcal{D})=\frac{e^{-\beta\mathcal{L}_{1}(\boldsymbol{w};\mathcal{D})}p_{1}(\boldsymbol{w})}{Z^{1}_{\mathcal{D}}}$		(22a)
	$\displaystyle p_{2}(\boldsymbol{w};\mathcal{D})=\frac{e^{-\beta\mathcal{L}_{2}(\boldsymbol{w};\mathcal{D})}p_{2}(\boldsymbol{w})}{Z^{2}_{\mathcal{D}}}$		(22b)

Notice, moreover, that in equations (22) the training data $\mathcal{D}$ is the same for both Boltzmann distributions. As in Section II we choose Gaussian priors $p_{1}(\boldsymbol{w})$, $p_{2}(\boldsymbol{w})$, with $L^{2}$ regularization parameters that we denote respectively by $\lambda_{1}$ and $\lambda_{2}$. In the large $N$ limit the choice of the regularization parameter induces a non-trivial value of the norm of the weights $\boldsymbol{w}^{1}$ and $\boldsymbol{w}^{2}$. In the following we suppose that $\lambda_{1}$ and $\lambda_{2}$ are chosen such that the norm of $\boldsymbol{w}^{1}$ is the same as that of $\boldsymbol{w}^{2}$, i.e. $|\boldsymbol{w}^{1}|^{2}=|\boldsymbol{w}^{2}|^{2}\equiv Q$.

We are interested in computing the average training error and training loss landscape on the geodesic path joining $\boldsymbol{w}^{1}$ and $\boldsymbol{w}^{2}$ at fixed squared norm $Q$ . This can be obtained as follows. Given $\boldsymbol{w}^{1}$ and $\boldsymbol{w}^{2}$ we define the interpolating weight $\boldsymbol{w}(\gamma)$ with parameter $\gamma\in[0,1]$ as the vector whose components are the linear interpolation of the components of $\boldsymbol{w}^{1}$ and $\boldsymbol{w}^{2}$

w_{li}(\gamma)\equiv\gamma w_{li}^{1}+(1-\gamma)w_{li}^{2}\,.

(23)

In order to compute the geodesic path at the same squared norm $Q$ , we finally need to project the whole straight path on the hypersphere of radius $\sqrt{Q}$ . This defines the weight

\widetilde{w}_{li}(\gamma)\equiv\frac{w_{li}(\gamma)}{c_{\gamma}}

(24)

where

c_{\gamma}\equiv\frac{\lvert\boldsymbol{w}(\gamma)\rvert}{\sqrt{Q}}\equiv\sqrt{\frac{1}{QN}\sum_{l=1}^{K}\sum_{i=1}^{N/K}w^{2}_{li}(\gamma)}=\sqrt{1-2\gamma(1-\gamma)\left(1-\frac{p}{Q}\right)}

(25)

In the previous expression we have introduced the quantity

p\equiv\mathbb{E}_{\mathcal{D}}\left\langle\frac{1}{N}\sum_{li}w_{li}^{1}w_{li}^{2}\right\rangle_{\boldsymbol{w}^{1}\sim p_{1}(\cdot;\mathcal{D}),\boldsymbol{w}^{2}\sim p_{2}(\cdot;\mathcal{D})}

(26)

which corresponds to the overlap between $\boldsymbol{w}^{1}$ and $\boldsymbol{w}^{2}$ . In the previous equation we have denoted by $\langle\cdot\rangle$ the average over the Boltzmann distribution (7) and by $\mathbb{E}_{\mathcal{D}}$ the average over the dataset.
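The last equality in (25) follows from expanding the squared norm of the interpolated weight (23): if both endpoints have squared norm $Q$ (normalized by $N$, as in the second expression of (25)) and overlap $p$, then $\frac{1}{N}\lvert\boldsymbol{w}(\gamma)\rvert^{2}=\gamma^{2}Q+(1-\gamma)^{2}Q+2\gamma(1-\gamma)p=Q\left[1-2\gamma(1-\gamma)\left(1-\frac{p}{Q}\right)\right]$. The following minimal numerical sketch checks this identity for a single pair of vectors, using their empirical overlap in place of the averaged quantity (26). The endpoints are random vectors rather than Boltzmann samples, the weights are stored as flat arrays, and all function names are hypothetical.

```python
import numpy as np


def geodesic_point(w1, w2, gamma, Q):
    """Interpolate as in Eq. (23) and rescale to squared norm Q as in Eq. (24)."""
    n = w1.size
    w = gamma * w1 + (1.0 - gamma) * w2
    c_gamma = np.linalg.norm(w) / np.sqrt(Q * n)   # second expression of Eq. (25)
    return w / c_gamma, c_gamma


def c_gamma_closed_form(gamma, p, Q):
    """Last equality of Eq. (25)."""
    return np.sqrt(1.0 - 2.0 * gamma * (1.0 - gamma) * (1.0 - p / Q))


rng = np.random.default_rng(0)
N, Q = 1000, 1.5
# Two hypothetical endpoints, rescaled so that |w|^2 / N = Q for both.
w1 = rng.normal(size=N)
w2 = 0.5 * w1 + rng.normal(size=N)
w1 *= np.sqrt(Q * N) / np.linalg.norm(w1)
w2 *= np.sqrt(Q * N) / np.linalg.norm(w2)
p = w1 @ w2 / N                                    # empirical overlap, cf. Eq. (26)

gammas = np.linspace(0.0, 1.0, 11)
direct = np.array([geodesic_point(w1, w2, g, Q)[1] for g in gammas])
closed = c_gamma_closed_form(gammas, p, Q)
midpoint, _ = geodesic_point(w1, w2, 0.5, Q)
```

The arrays `direct` and `closed` agree up to rounding, and `midpoint` has the same normalized squared norm $Q$ as the endpoints.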

We are interested in finding what is the average value of a loss $\widetilde{\mathcal{L}}$ of the projected interpolated weight $\widetilde{\boldsymbol{w}}(\gamma)$ as a function of $\gamma$ , and how this profile depends on the choice of the endpoints $\boldsymbol{w}^{1}$ and $\boldsymbol{w}^{2}$ via the probability density functions in (22). In formulas, what we compute is

\begin{split}E(\gamma)&=\frac{1}{P}\,\mathbb{E}_{\mathcal{D}}\,\mathbb{E}_{\boldsymbol{w}^{1}\sim p_{1}(\cdot\,;\mathcal{D})}\,\mathbb{E}_{\boldsymbol{w}^{2}\sim p_{2}(\cdot\,;\mathcal{D})}\,\widetilde{\mathcal{L}}\left(\widetilde{\boldsymbol{w}}(\gamma);\mathcal{D}\right)\\ &=\frac{1}{P}\,\mathbb{E}_{\mathcal{D}}\frac{\int d\boldsymbol{w}^{1}d\boldsymbol{w}^{2}\,p_{1}(\boldsymbol{w}^{1})p_{2}(\boldsymbol{w}^{2})\,e^{-\beta\mathcal{L}_{1}(\boldsymbol{w}^{1};\mathcal{D})-\beta\mathcal{L}_{2}(\boldsymbol{w}^{2};\mathcal{D})}\,\widetilde{\mathcal{L}}\left(\widetilde{\boldsymbol{w}}(\gamma);\mathcal{D}\right)}{Z_{\mathcal{D}}^{1}Z_{\mathcal{D}}^{2}}\end{split}

(27)

Using a general loss $\widetilde{\mathcal{L}}$ gives us access to both the training loss and the training error profiles. Equation (27) can be computed by the replica method in the large $N$ limit. In the following we focus only on the infinite width limit $K\to\infty$ with the ratio $K/N\to 0$, as done in Baldassi et al. (2019). The full calculation is reported in Appendix B. Here we only mention that the result of the calculation, assuming no replica symmetry breaking of the solution space (i.e. in the so-called Replica Symmetric ansatz), depends on simple geometrical quantities. These are the typical overlap between two weights $\boldsymbol{w}^{a}$, $\boldsymbol{w}^{b}$ both extracted from $p_{1}$, $\boldsymbol{w}^{a},\boldsymbol{w}^{b}\sim p_{1}(\boldsymbol{w};\mathcal{D})$ (respectively both from $p_{2}$, i.e. $\boldsymbol{w}^{a},\boldsymbol{w}^{b}\sim p_{2}(\boldsymbol{w};\mathcal{D})$), i.e.


$\displaystyle q_{1}$	$\displaystyle=\mathbb{E}_{\mathcal{D}}\left\langle\frac{1}{N}\sum_{li}w_{li}^{a}w_{li}^{b}\right\rangle_{\boldsymbol{w}^{a},\boldsymbol{w}^{b}\sim p_{1}(\cdot;\mathcal{D})}$	(28a)
$\displaystyle q_{2}$	$\displaystyle=\mathbb{E}_{\mathcal{D}}\left\langle\frac{1}{N}\sum_{li}w_{li}^{a}w_{li}^{b}\right\rangle_{\boldsymbol{w}^{a},\boldsymbol{w}^{b}\sim p_{2}(\cdot;\mathcal{D})}$	(28b)

as well as the typical overlap $p$ between the endpoints $\boldsymbol{w}^{1}$ and $\boldsymbol{w}^{2}$, which was introduced in equation (26). In the Replica Symmetric ansatz all those overlaps concentrate in the large $N$ limit. Since the weights $\boldsymbol{w}^{1}$ and $\boldsymbol{w}^{2}$ are in general samples of two different probability distributions (22), the overlaps $q_{1}$ and $q_{2}$ can be obtained simply by computing the corresponding partition functions $Z^{1}_{\mathcal{D}}$ and $Z^{2}_{\mathcal{D}}$ with the replica method. We refer to Baldassi et al. (2019, 2023) for this calculation, and for the reader's convenience we report its outcome in Appendix A. The overlap $p$ is instead slightly more difficult to compute, because it amounts to studying an elastically coupled system of weights $\boldsymbol{w}^{1}$, $\boldsymbol{w}^{2}$ in the limit where the coupling is sent to zero, see Annesi et al. (2023) for details. We present in Appendix D an alternative and complementary calculation based on the Franz-Parisi potential Franz and Parisi (1995).

III.2.3 Comparison between theory and simulations

We compare here our analytical predictions with the numerical results obtained using the HMC algorithm. We have considered the case $K/N=0.005$ , a constraint density $\alpha=0.2$ , temperature $T=0.02$ and L2 regularization $\beta\lambda=0.02$ .

The plot of Figure 6 shows the cross-entropy loss along the geodesic path interpolating between two samples of the Boltzmann distribution. We note that the cross-entropy profile has a non-trivial, non-monotonic behavior: starting from one of the two configurations, the loss first decreases and then increases again, reaching a local maximum in the middle of the path. This same non-trivial behavior is also observed in the numerical estimate. We emphasize that achieving the asymptotic limit predicted by the theory is nontrivial, as it requires accounting for finite $N$ and $K$ corrections while operating in the scaling regime $K/N\to 0$. Nevertheless, by keeping the ratio $K/N$ small and increasing $N$, we show that the simulations approach the predictions given by our infinite-size theory.

III.2.4 The star-shaped property of the solution space

Thus far, in both the underparametrized and overparametrized regimes we have focused on the energy landscape induced by the cross-entropy loss. Here, we broaden our scope to examine the entire solution space, i.e. we consider the loss function (6). Notice that this is a larger set of weights that includes, but is not limited to, those configurations selected by optimizing the cross-entropy loss.

In Figure 7 we consider the case in which the endpoints are sampled from the large $\beta$ limit of the Boltzmann distributions corresponding to the theta loss (6), with two equal margins $\kappa_{1}=\kappa_{2}=\kappa$. We plot the training error (3) (i.e. $\widetilde{\mathcal{L}}(\widetilde{\boldsymbol{w}};\mathcal{D})=\sum_{\mu=1}^{P}\Theta(-y^{\mu}\Delta^{\mu}(\widetilde{\boldsymbol{w}}))$) on the geodesic path joining such solutions for several values of $\kappa$ in the low $\alpha$ regime, i.e. for a rather overparameterized network. For $\kappa=0$, i.e. for typical solutions of the learning task, the training error is strictly positive as soon as one moves away from the endpoints, meaning that $\left.\frac{dE(\gamma)}{d\gamma}\right|_{\gamma=0}>0$ and $\left.\frac{dE(\gamma)}{d\gamma}\right|_{\gamma=1}<0$. As $\kappa$ increases, a small neighborhood of the endpoints appears where the training error vanishes. Overall, the whole training error curve decreases monotonically as $\kappa$ increases. For $\kappa>\kappa^{\star}$ the two endpoints become linear mode connected. This has been called in Annesi et al. (2023) the “geodesically convex” component of the manifold of solutions, as any two solutions sampled within this region are linear mode connected. In the two panels of Figure 7 we also compare two activation functions, the ReLU and the sign activation function. Overall, the barrier for a fixed $\kappa$ is smaller in the ReLU case. This is consistent with what was found in Baldassi et al. (2019), where it has been argued that the training error landscape corresponding to the ReLU activation possesses wider and flatter minima than other activation choices such as the sign.
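To make the measured quantity concrete, the following minimal sketch evaluates the fraction of misclassified patterns along the fixed-norm path between two weight configurations of a toy tree committee machine with ReLU units. It only illustrates the measurement and does not reproduce the curves of Figure 7: the endpoints are not sampled from a Boltzmann measure (the first one is a random weight that generates the labels, so it has zero error by construction, and the second one is an unrelated random weight), and the normalization of the pre-activations, the second-layer weights `c` and the convention that a vanishing output counts as correct are assumptions of the sketch.

```python
import numpy as np


def tree_committee_output(w, X, c):
    """Pre-sign output of a toy tree committee machine with ReLU units.

    w: (K, N/K) weights, X: (P, K, N/K) inputs split into K disjoint
    receptive fields, c: (K,) fixed second-layer weights. The normalization
    of the pre-activations is an assumption of this sketch.
    """
    K, n_in = w.shape
    pre = np.einsum("pki,ki->pk", X, w) / np.sqrt(n_in)
    return np.maximum(pre, 0.0) @ c / np.sqrt(K)


def error_profile(w1, w2, X, y, c, gammas):
    """Fraction of misclassified patterns along the fixed-norm path."""
    radius = np.linalg.norm(w1)                    # endpoints share this norm
    errors = []
    for g in gammas:
        w = g * w1 + (1.0 - g) * w2                # straight path, Eq. (23)
        w *= radius / np.linalg.norm(w)            # projection onto the sphere
        # Strict inequality: a vanishing output is counted as correct (assumption).
        errors.append(np.mean(y * tree_committee_output(w, X, c) < 0))
    return np.array(errors)


rng = np.random.default_rng(0)
K, n_in, P = 5, 40, 40                             # N = 200, alpha = P/N = 0.2
c = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
X = rng.normal(size=(P, K, n_in))
w1 = rng.normal(size=(K, n_in))
w2 = rng.normal(size=(K, n_in))
w2 *= np.linalg.norm(w1) / np.linalg.norm(w2)      # same norm as w1
out1 = tree_committee_output(w1, X, c)
y = np.where(out1 >= 0, 1.0, -1.0)                 # labels realized by w1

gammas = np.linspace(0.0, 1.0, 21)
profile = error_profile(w1, w2, X, y, c, gammas)
```

Replacing `w1` and `w2` with two actual solutions of the same dataset would give the barrier profile between them; the two endpoints are then said to be linear mode connected when `profile` vanishes for all values of `gammas`.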

In Figure 8 we consider the case in which the endpoints are sampled from the large $\beta$ limit of the Boltzmann distributions corresponding to the loss (6), but with two different margins $\kappa_{1}$ and $\kappa_{2}$. We take the first endpoint to be a typical solution of the classification task, i.e. to have margin $\kappa_{1}=0$. As before, we plot the training error, i.e. $\widetilde{\mathcal{L}}(\widetilde{\boldsymbol{w}};\mathcal{D})=\sum_{\mu=1}^{P}\Theta(-y^{\mu}\Delta^{\mu}(\widetilde{\boldsymbol{w}}))$, on the geodesic path joining $\boldsymbol{w}^{1}$ to $\boldsymbol{w}^{2}$ for several values of $\kappa_{2}$ and for the same $\alpha$ considered in Figure 7. The maximum of the barrier is always closer to the less robust solution, and the whole curve decreases monotonically as $\kappa_{2}$ increases. For $\kappa_{2}>\kappa_{\text{krn}}$ the two solutions eventually become linear mode connected. This means that typical solutions, despite not being linear mode connected, are connected by a piecewise path passing through a solution with a rather large margin $\kappa$. There therefore exists a subset of solutions, called the kernel, that are geodesically connected to any other solution of the learning task. (Indeed, if $\kappa_{2}>\kappa_{\text{krn}}$, then this solution is linear mode connected not only to a typical solution with $\kappa_{1}=0$, but also to any other solution with margin $\kappa_{1}>0$ extracted from the Boltzmann measure induced by (6).) This implies that the space of solutions is star-shaped in the overparameterized regime. Similar plots hold for other activation functions; we report the case of the Erf activation function in the appendix. This conclusion is consistent with what was found in reference Annesi et al. (2023) for a non-convex but simpler linear model called the negative perceptron. Recently in Lin et al. (2024); Sonthalia et al. (2024), numerical evidence has been presented suggesting that the solution space of deep networks possesses a star-shaped geometry. Here we provide first theoretical support for this claim in the case of simple, overparameterized one-hidden-layer networks with general activation functions.

We refer to Appendix B.4.3 for a discussion of the training error and loss along the path connecting a typical solution of the cross entropy loss and the error counting loss, which shows that the low temperature configurations sampled from the cross-entropy loss are solutions located deep into the geodesically convex component of the manifold of solutions. Note that this is consistent with what has been observed numerically in Fig. 5. As we have numerically observed, however, this is not true in the underparametrized regime, as typical cross-entropy loss solutions become linear mode disconnected.

III.3 The geometrical structure can be robust with respect to highly correlated data

To further assess the applicability of our techniques, we repeated the numerical study on the same architecture, using correlated data for both the training phase and the exploration of the weight space via ratchet and coupled replica simulations. Again, we treated a case of binary classification, where the dataset is composed of $32\times 32$ images of cats and dogs from the CIFAR10 repository, labeled as $+1$ and $-1$, respectively. Furthermore, each component of an input vector $\boldsymbol{x}^{\mu}$, representing a given image in the training dataset, has been scaled so that $x_{i}^{\mu}\in[-1,1]$ $\forall i,\mu$ (cf. Fig. 9a).

Due to the nature of the task, we again adopted the binary cross–entropy function with an $L^{2}$ regularization term as the potential energy of the system (see Eq. (11)), with the Lagrange multiplier $\lambda=2P\cdot 10^{-8}$. On the other hand, we slightly modified the network parameters because of the different size of the input vectors $\boldsymbol{x}^{\mu}$. In this case, $N/K=K=32$, so that the $i$-th neuron in the hidden layer takes as input the $i$-th row of each input image. Once again, the value of $\alpha=P/N$ is chosen to be just below the threshold where full–batch GD can no longer find weights with zero training error. In this case, that is $\alpha=0.8$ (or equivalently $P\approx 820$).
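The sizes quoted above and the assignment of one image row to each hidden unit can be made explicit with a short sketch. It is only illustrative: the images are replaced by random single-channel arrays with integer pixel values in $[0,255]$, and the affine rescaling to $[-1,1]$ is one possible choice; neither is taken from the actual preprocessing pipeline.

```python
import numpy as np

K = 32                      # hidden units, one per image row
n_in = 32                   # inputs per hidden unit (N / K)
N = K * n_in                # total number of input weights
alpha = 0.8
P = int(alpha * N)          # number of training images

# Hypothetical stand-in for single-channel 32x32 images with values in [0, 255].
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(P, 32, 32)).astype(float)

# Rescale every component to [-1, 1] (assumed affine map).
X = images / 127.5 - 1.0

# Tree structure: hidden unit l only sees row l of each image.
weights = rng.normal(size=(K, n_in))
pre_activations = np.einsum("pli,li->pl", X, weights)
```

With these values $N=1024$ and $\alpha N=819.2$, consistent with $P\approx 820$; `pre_activations` has one column per hidden unit.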

Similarly to the case of random inputs, the energy of the sampled state decreases as we move from the GD solutions to the center (upper panel in Fig. 9b). An important difference is that now the distribution of similarities between the center and the GD solutions partially overlaps with the similarity between the center and the points $\boldsymbol{s}^{\star}$ found by the double ratchet to connect the GD solutions. In turn, this distribution partially overlaps with that of similarity between the different central points (lower panel in Fig. 9b). This suggests that the spiky manifold has a larger center for correlated data.

Moreover, the energy profile seems to display higher peaks and to be more rugged than in the random–label case, as apparent from the shape of the energy along the paths generated by the double ratchet (Fig. 9c) and from the irregular relation between the height of the peaks found along the linear paths (compare the upper panel of Fig. 9b with that of Fig. 4b).

The central points are still connected by low–energy paths (Fig. 9c), even though the mean similarity between them is lower than in the random–label case ( $\langle q\rangle\approx 0.84\pm 0.06$ ).

Summing up, the low–energy states of a tree committee machine trained with correlated data still have a spiky shape, but their energy profile is more rugged and irregular than in the random–label case. Moreover, the center is more bulky (see sketch in Fig. 9d).

IV Discussion and Conclusions

In this work, we analyzed the loss landscape of a simple one-hidden-layer artificial neural network with a tree-like structure in both the underparameterized and overparameterized regimes. We employed various numerical techniques, including (Hamiltonian) Monte Carlo methods and biased dynamics that were originally developed in statistical mechanics and have been used to identify native-like conformations in protein folding molecular dynamics simulations. This approach allowed us to sample the weight space manifold at different loss levels and investigate low-energy paths connecting distinct weights.

Close to the interpolation threshold, the numerical exploration of the weight space by HMC starting from GD solutions identified two main regimes as a function of temperature. At low temperatures, the system is in a frozen state with zero training error and is unable to move sufficiently far away from initialization (the inter-state overlap distribution is peaked near 1), even though the overlap between HMC trajectories starting from different GD solutions (intra-state overlap) is strictly less than 1. The linear interpolation between GD solutions also shows a loss barrier. Despite this, GD solutions can be connected by low-energy paths, which are tortuous and difficult to find with an unbiased Monte Carlo algorithm. We identified them by using a double–ratchet hybrid Monte Carlo algorithm, which penalizes moves that cause the two gradient descent solutions to drift apart. These results suggest that the manifold of low-energy weights has a spiky topology, with gradient descent solutions located along its protruding rays. We have also shown that the center of the manifold of solutions has a complex pattern of valleys and barriers.

At intermediate temperatures, the training error is small but not zero. Differently from the low temperature regime, the HMC dynamics is ergodic, since there is no difference between inter- and intra-state overlaps. This means that the shape of the populated state is particularly symmetric and corresponds, in the language of replica calculations, to a replica–symmetric solution. The energy always displays a maximum at the center of the straight lines between states populated at intermediate temperatures.

The symmetric, hollow structure of states at intermediate temperature is similar to that displayed by the system in the overparametrized regime, suggesting that the two manifolds are quite similar. The main difference between the overparametrized regime and that at the interpolation threshold is at low temperature: while in the latter case we have the spiky shape of the manifold discussed above, in the former the populated states are more spherical and located at the center of the manifold.

In the overparametrized regime we have also resorted to replica computations to study both the loss and the training error landscape along the linear interpolation between two weights sampled from different Boltzmann probability distributions. Our work shows that the training error manifold is star-shaped: there exists a subset of robust solutions with a large margin $\kappa$ that are linear mode connected to solutions with lower (even zero) margin, similarly to what was found in Annesi et al. (2023). Typical low temperature weights extracted from the Boltzmann measure equipped with the cross entropy loss function tend to concentrate in the inner core of the star-shaped manifold, which we called the geodesically convex component, as any two solutions in this region are linear mode connected. This result also agrees with numerical simulations. Our theory also reproduces the non-monotonic behavior of the energy along the linear interpolation between Boltzmann samples at larger temperatures.

The use of a realistic training dataset, displaying correlated data, instead of random data, does not change substantially the properties of the space of weights at low temperature. The center of the spiky structure becomes more bulky and the barriers more heterogeneous, but the overall geometry does not change.

References

(1) Loss Landscape of Neural Networks: theoretical insights and practical implications, EPFL Virtual Symposium.
LeCun et al. (2015) Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Proceedings of the IEEE 86, 2278 (1998).
Bottou (2010) L. Bottou, in Proceedings of COMPSTAT’2010 (Springer, 2010) pp. 177–186.
Kingma and Ba (2014) D. P. Kingma and J. Ba, CoRR abs/1412.6980 (2014).
Wilson et al. (2017) A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, in Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Curran Associates, Inc., 2017) pp. 4148–4158.
Kaplan et al. (2020) J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, arXiv preprint arXiv:2001.08361 (2020).
Hoffmann et al. (2022) J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre, in Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22 (Curran Associates Inc., Red Hook, NY, USA, 2022).
Gardner and Derrida (1988) E. Gardner and B. Derrida, Journal of Physics A: Mathematical and General 21, 271 (1988).
Baldassi et al. (2019) C. Baldassi, E. M. Malatesta, and R. Zecchina, Phys. Rev. Lett. 123, 170602 (2019).
Baldassi et al. (2020a) C. Baldassi, F. Pittorino, and R. Zecchina, Proceedings of the National Academy of Sciences 117, 161 (2020a).
Baldassi et al. (2015) C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Phys. Rev. Lett. 115, 128101 (2015).
Baldassi et al. (2021) C. Baldassi, C. Lauditi, E. M. Malatesta, G. Perugini, and R. Zecchina, Phys. Rev. Lett. 127, 278301 (2021).
Annesi et al. (2023) B. L. Annesi, C. Lauditi, C. Lucibello, E. M. Malatesta, G. Perugini, F. Pittorino, and L. Saglietti, Phys. Rev. Lett. 131, 227301 (2023).
Huang and Kabashima (2014) H. Huang and Y. Kabashima, Phys. Rev. E 90, 052813 (2014).
Baldassi et al. (2022) C. Baldassi, C. Lauditi, E. M. Malatesta, R. Pacelli, G. Perugini, and R. Zecchina, Phys. Rev. E 106, 014116 (2022).
Baldassi et al. (2020b) C. Baldassi, E. M. Malatesta, M. Negri, and R. Zecchina, Journal of Statistical Mechanics: Theory and Experiment 2020, 124012 (2020b).
Li et al. (2018) H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, in Advances in Neural Information Processing Systems, Vol. 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Curran Associates, Inc., 2018).
Huang et al. (2020) W. R. Huang, Z. Emam, M. Goldblum, L. Fowl, J. K. Terry, F. Huang, and T. Goldstein, in Proceedings on "I Can’t Believe It’s Not Better!" at NeurIPS Workshops, Proceedings of Machine Learning Research, Vol. 137, edited by J. Zosa Forde, F. Ruiz, M. F. Pradier, and A. Schein (PMLR, 2020) pp. 87–97.
Draxler et al. (2018) F. Draxler, K. Veschgini, M. Salmhofer, and F. Hamprecht, in Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, edited by J. Dy and A. Krause (PMLR, 2018) pp. 1309–1318.
Garipov et al. (2018) T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, in Advances in Neural Information Processing Systems, Vol. 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Curran Associates, Inc., 2018).
Fort and Jastrzebski (2019) S. Fort and S. Jastrzebski, Advances in Neural Information Processing Systems 32 (2019).
Pittorino et al. (2022) F. Pittorino, A. Ferraro, G. Perugini, C. Feinauer, C. Baldassi, and R. Zecchina, Journal of Statistical Mechanics: Theory and Experiment 2022, 114007 (2022).
Entezari et al. (2022) R. Entezari, H. Sedghi, O. Saukh, and B. Neyshabur, in International Conference on Learning Representations (2022).
Duane et al. (1987) S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, Physics Letters B 195, 216 (1987).
Camilloni et al. (2011) C. Camilloni, R. A. Broglia, and G. Tiana, The Journal of Chemical Physics 134, 45105 (2011).
Tiana and Camilloni (2012) G. Tiana and C. Camilloni, The Journal of Chemical Physics 137, 235101 (2012).
Barkai et al. (1992) E. Barkai, D. Hansel, and H. Sompolinsky, Phys. Rev. A 45, 4146 (1992).
Engel et al. (1992) A. Engel, H. M. Köhler, F. Tschepke, H. Vollmayr, and A. Zippelius, Phys. Rev. A 45, 7590 (1992).
Barkai et al. (1990) E. Barkai, D. Hansel, and I. Kanter, Phys. Rev. Lett. 65, 2312 (1990).
Annesi et al. (2024) B. L. Annesi, E. M. Malatesta, and F. Zamponi, arXiv preprint arXiv:2410.06717 (2024).
Baldassi et al. (2016) C. Baldassi, C. Borgs, J. T. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Proceedings of the National Academy of Sciences 113, E7655 (2016).
Engel and Van den Broeck (2001) A. Engel and C. Van den Broeck, Statistical mechanics of learning (Cambridge University Press, 2001).
Belkin et al. (2019) M. Belkin, D. Hsu, S. Ma, and S. Mandal, Proceedings of the National Academy of Sciences 116, 15849 (2019).
Mézard et al. (1987) M. Mézard, G. Parisi, and M. Virasoro, Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, Vol. 9 (World Scientific Publishing Company, 1987).
Baldassi et al. (2023) C. Baldassi, E. M. Malatesta, G. Perugini, and R. Zecchina, Phys. Rev. E 108, 024310 (2023).
Franz and Parisi (1995) S. Franz and G. Parisi, Journal de Physique I 5, 1401 (1995).
Lin et al. (2024) Z. Lin, P. Li, and L. Wu, arXiv preprint arXiv:2404.06391 (2024).
Sonthalia et al. (2024) A. Sonthalia, A. Rubinstein, E. Abbasnejad, and S. J. Oh, arXiv preprint arXiv:2403.07968 (2024).
Neal (1996) R. M. Neal, “Priors for infinite networks,” in Bayesian Learning for Neural Networks (Springer New York, New York, NY, 1996) pp. 29–53.

Appendix A Equilibrium measure

We report here the outcome of the equilibrium calculation, through the evaluation of the free entropy

\phi=\lim_{N\to\infty}\frac{1}{N}\mathbb{E}_{\mathcal{D}}\ln Z(\beta;\mathcal{D})

(29)

where $Z(\beta;\mathcal{D})$ is the partition function defined in equation (8) with the Gaussian prior (10), and $\mathbb{E}_{\mathcal{D}}$ denotes the average over the random dataset $\mathcal{D}$. This average can be performed by using the replica trick Mézard et al. (1987) and the saddle point method in the large $N$ limit. Here we report the final result, which is a slight variation of the one reported in Baldassi et al. (2019)

\phi=\lim_{N\to\infty}\lim_{n\to 0}\frac{1}{nN}\ln\mathbb{E}_{\mathcal{D}}Z^{n}=\mathrm{extr}_{q,Q}\left[\mathcal{G}_{S}(q,Q)+\alpha\mathcal{G}_{E}(q,Q)\right]

(30)

where we have defined the entropic and the energetic terms respectively as

\begin{split}\mathcal{G}_{S}(q,Q)&=\frac{Q}{2(Q-q)}-\frac{\beta\lambda}{2}Q+\frac{1}{2}\ln(2\pi(Q-q))\\ \mathcal{G}_{E}(q,Q)&=\int\prod_{l=1}^{K}Dx_{l}\ln\int\prod_{l=1}^{K}D\lambda_{l}\,e^{-\beta\ell\left(\frac{1}{\sqrt{K}}\sum_{l}c_{l}g\left(\sqrt{q}x_{l}+\sqrt{Q-q}\lambda_{l}\right)\right)}\end{split}

(31)

Recall that $\lambda$ is the Lagrange multiplier (the regularization strength, in machine learning jargon) that fixes the squared norm $Q$ of the weights $\boldsymbol{w}$. The order parameters $q$ and $Q$ are obtained by extremizing the right-hand side of (30): $q$ represents the typical overlap of two weights $\boldsymbol{w}^{a}$, $\boldsymbol{w}^{b}$ sampled from the Boltzmann distribution (7) (see also equation (28)), while $Q$ represents the typical squared norm of a sample of (7). In the large $K$ limit (with $K/N\to 0$), the energetic term can be simplified by using the central limit theorem, as shown in Baldassi et al. (2019); Annesi et al. (2024)

\begin{split}\mathcal{G}_{E}(q,Q)&=\int Dz_{0}\ln\int Dz_{1}\,e^{-\beta\ell\left(\sqrt{\Delta_{Q}(q)-\Delta_{Q}(0)}z_{0}+\sqrt{\Delta_{Q}(Q)-\Delta_{Q}(q)}z_{1}-\kappa\right)}\,,\end{split}

(32)

where

\Delta_{Q}(q)\equiv\int Dx\left[\int Dy\,\varphi\left(\sqrt{q}x+\sqrt{Q-q}y\right)\right]^{2}

(33)

is an effective order parameter Barkai et al. (1990); Engel et al. (1992) whose expression depends on the choice of the activation function $\varphi$. This kernel also has the same expression as the Neural Network Gaussian Process (NNGP) kernel that appears in neural networks learning a finite number of examples in the large width limit Neal (1996).
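As an illustration, the double Gaussian integral in (33) can be evaluated numerically with Gauss-Hermite quadrature. The following is a minimal sketch, not the code used for the paper: the function names, the number of quadrature nodes and the choice of the Erf activation are assumptions made for the example. For the Erf activation the result can be compared with the known arcsine closed form of the corresponding Gaussian kernel.

```python
import math
import numpy as np

# Probabilists' Gauss-Hermite rule (weight exp(-x^2/2)), normalised so that
# the weights sum to one, i.e. they represent the Gaussian measure Dx.
NODES, WEIGHTS = np.polynomial.hermite_e.hermegauss(80)  # 80 nodes: arbitrary choice
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)


def delta_Q(phi, q, Q):
    """Quadrature estimate of int Dx [ int Dy phi(sqrt(q) x + sqrt(Q-q) y) ]^2."""
    x = NODES[:, None]  # outer integration variable
    y = NODES[None, :]  # inner integration variable
    arg = math.sqrt(q) * x + math.sqrt(max(Q - q, 0.0)) * y
    inner = (phi(arg) * WEIGHTS[None, :]).sum(axis=1)  # integral over y, for each x
    return float((inner ** 2 * WEIGHTS).sum())         # integral over x


# Example activation (an assumption of this sketch): phi = erf.
erf = np.vectorize(math.erf)


def delta_Q_erf_closed_form(q, Q):
    """Arcsine formula for E[erf(u) erf(v)] with Var(u)=Var(v)=Q, Cov(u,v)=q."""
    return (2.0 / math.pi) * math.asin(2.0 * q / (1.0 + 2.0 * Q))


if __name__ == "__main__":
    print(delta_Q(erf, 0.5, 1.0), delta_Q_erf_closed_form(0.5, 1.0))
```

For a linear activation the same routine returns $\Delta_{Q}(q)=q$, which is a quick sanity check of the normalisation of the weights.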

A.1 Large $\beta$ limit

In the large $\beta$ limit we have the following scaling

q=Q-\frac{\delta q}{\beta}

(34)

This induces the following scaling on the effective order parameter difference

\Delta_{Q}(Q)-\Delta_{Q}(q)\simeq\Delta^{\prime}_{Q}(Q)\frac{\delta q}{\beta}

(35)

where

\Delta_{Q}^{\prime}(q)\equiv\frac{\partial\Delta_{Q}(q)}{\partial q}=\int Dx\left[\int Dy\,\varphi^{\prime}\left(\sqrt{q}x+\sqrt{Q-q}y\right)\right]^{2}\,.

(36)
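Equations (35) and (36) can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses the Erf activation as an example, hypothetical helper names, and arbitrary values of $Q$, $q$, $\delta q$ and $\beta$. It compares a finite-difference derivative of $\Delta_{Q}(q)$ with the integral of $\varphi^{\prime}$ in (36), and then tests the first-order expansion (35) at large $\beta$.

```python
import math
import numpy as np

# Gauss-Hermite nodes and weights for the standard Gaussian measure.
NODES, WEIGHTS = np.polynomial.hermite_e.hermegauss(80)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)


def gauss_kernel(f, q, Q):
    """int Dx [ int Dy f(sqrt(q) x + sqrt(Q-q) y) ]^2 for a generic function f."""
    x = NODES[:, None]
    y = NODES[None, :]
    arg = math.sqrt(q) * x + math.sqrt(max(Q - q, 0.0)) * y
    inner = (f(arg) * WEIGHTS[None, :]).sum(axis=1)
    return float((inner ** 2 * WEIGHTS).sum())


# Example activation (assumption): phi = erf, with derivative 2/sqrt(pi) exp(-z^2).
phi = np.vectorize(math.erf)


def dphi(z):
    return 2.0 / math.sqrt(math.pi) * np.exp(-z ** 2)


# Check of (36): derivative of Delta_Q versus the phi' integral.
Q, q, eps = 1.0, 0.6, 1e-4
finite_diff = (gauss_kernel(phi, q + eps, Q) - gauss_kernel(phi, q - eps, Q)) / (2 * eps)
price = gauss_kernel(dphi, q, Q)

# Check of (35): Delta_Q(Q) - Delta_Q(Q - dq/beta) ~ Delta'_Q(Q) dq / beta.
dq, beta = 0.5, 1.0e3
lhs = gauss_kernel(phi, Q, Q) - gauss_kernel(phi, Q - dq / beta, Q)
rhs = gauss_kernel(dphi, Q, Q) * dq / beta

if __name__ == "__main__":
    print(finite_diff, price)
    print(lhs, rhs)
```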

We therefore have that the free energy of the system is


$\displaystyle-f$	$\displaystyle\equiv\lim_{\beta\to\infty}\phi=\mathrm{extr}_{\delta q,Q}\left[\mathcal{G}_{S}(\delta q,Q)+\alpha\mathcal{G}_{E}(\delta q,Q)\right]$	(37a)
$\displaystyle\mathcal{G}_{S}(\delta q,Q)$	$\displaystyle=\frac{Q}{2\delta q}-\frac{\lambda}{2}Q$	(37b)
$\displaystyle\mathcal{G}_{E}(\delta q,Q)$	$\displaystyle=\int Dz_{0}\,z_{\star}(z_{0})\,,$	(37c)

where we have defined the function $z_{\star}(z_{0})$ as

\begin{split}z_{\star}(z_{0})&=\operatorname*{argmax}_{z_{1}}\left[-\frac{z_{1}^{2}}{2}-\ell_{1}\left(\sqrt{\Delta_{Q}(Q)-\Delta_{Q}(0)}\,z_{0}+\sqrt{\Delta_{Q}^{\prime}(Q)\delta q_{1}}\,z_{1}\right)\right]\end{split}

(38)
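The maximisation in (38) is one-dimensional and, for a convex loss, the objective is concave in $z_{1}$, so it can be solved by a simple bracketing search. The sketch below is an illustration only: the function name, the search interval and the two example losses (a quadratic loss and a hinge loss with margin 1) are assumptions. The argument `a_z0` stands for $\sqrt{\Delta_{Q}(Q)-\Delta_{Q}(0)}\,z_{0}$ and `b` for $\sqrt{\Delta_{Q}^{\prime}(Q)\delta q_{1}}$.

```python
def z_star(loss, a_z0, b, lo=-50.0, hi=50.0, iters=200):
    """argmax over z1 of -z1^2/2 - loss(a_z0 + b*z1), by ternary search.

    Valid when the objective is concave, e.g. for a convex loss.
    The interval [lo, hi] is an assumption and must contain the maximiser.
    """
    def objective(z1):
        return -0.5 * z1 * z1 - loss(a_z0 + b * z1)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1  # the maximiser lies to the right of m1
        else:
            hi = m2  # the maximiser lies to the left of m2
    return 0.5 * (lo + hi)


def quadratic_loss(h, kappa=1.0):
    """Example loss (assumption): 0.5 * (kappa - h)^2."""
    return 0.5 * (kappa - h) ** 2


def hinge_loss(h, kappa=1.0):
    """Example loss (assumption): max(0, kappa - h)."""
    return max(0.0, kappa - h)


if __name__ == "__main__":
    # For the quadratic loss the maximiser is b * (kappa - a_z0) / (1 + b^2).
    print(z_star(quadratic_loss, 0.2, 0.7), 0.7 * 0.8 / (1.0 + 0.7 ** 2))
    print(z_star(hinge_loss, 0.2, 0.7))
```

For the quadratic example the stationarity condition gives the closed form quoted in the comment, which provides a direct check of the search.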

Appendix B Loss Landscape on the linear interpolation between weights

In this section we want to compute the average loss $\widetilde{\mathcal{L}}$ of $\widetilde{\boldsymbol{w}}(\gamma)$ as defined in (27). $\widetilde{\boldsymbol{w}}(\gamma)$ is the interpolation of $\boldsymbol{w}^{1}$ and $\boldsymbol{w}^{2}$ , see equation (24). As stated in the main text, both weights $\boldsymbol{w}^{1}$ and $\boldsymbol{w}^{2}$ and their interpolation $\widetilde{\boldsymbol{w}}(\gamma)$ are considered to have the same squared norm $Q$ . Equation (27) can be written also in terms of the corresponding loss per pattern $\widetilde{\ell}$

\begin{split}E(\gamma)&=\mathbb{E}_{\mathcal{D}}\frac{\int d\boldsymbol{w}^{1}d\boldsymbol{w}^{2}\,p(\boldsymbol{w}^{1})p(\boldsymbol{w}^{2})\,e^{-\beta\mathcal{L}_{1}(\boldsymbol{w}^{1};\mathcal{D})-\beta\mathcal{L}_{2}(\boldsymbol{w}^{2};\mathcal{D})}\,\tilde{\ell}\left(\Delta^{\mu}(\widetilde{\boldsymbol{w}}(\gamma))\right)}{Z_{\mathcal{D}}^{1}Z_{\mathcal{D}}^{2}}\end{split}

(39)

Here we have focused on a particular pattern $\mu$, since in the thermodynamic limit all input patterns give the same contribution on average.

B.1 Replica approach

The computation proceeds as usual by introducing replicas via the identity $(Z_{\mathcal{D}}^{r})^{-1}=\lim\limits_{n\to 0}(Z_{\mathcal{D}}^{r})^{n-1}$ with $r=1,2$ . We have (denoting by $r,s=1,2$ the index that runs over the real replicas $\boldsymbol{w}^{1}$ and $\boldsymbol{w}^{2}$ )

\begin{split}E(\gamma)&=\int\prod_{arl\mu}\frac{d\lambda_{lr}^{\mu a}d\hat{% \lambda}_{lr}^{\mu a}}{2\pi}e^{i\lambda_{lr}^{\mu a}\hat{\lambda}_{lr}^{\mu a}% }\int\prod_{ar}d\boldsymbol{w}^{ra}\,\prod_{ra\mu}e^{-\beta\ell_{r}\left(\frac% {1}{\sqrt{K}}\sum_{l}c_{l}\varphi\left(\lambda^{\mu a}_{lr}\right)\right)}\,% \tilde{\ell}\left[\frac{1}{\sqrt{K}}\sum_{l}c_{l}\,\varphi\left(\frac{\gamma% \lambda_{l1}^{\mu 1}+(1-\gamma)\lambda_{l2}^{\mu 1}}{c_{\gamma}}\right)\right]% \\ &\times e^{-\frac{\lambda_{1}}{2}\sum_{lia}(w_{li}^{ar=1})^{2}-\frac{\lambda_{% 2}}{2}\sum_{lia}(w_{li}^{ar=2})^{2}}\prod_{li\mu}\mathbb{E}_{\xi_{li}^{\mu}}e^% {-i\frac{\xi_{li}^{\mu}}{\sqrt{N}}\sum_{ar}w_{li}^{ar}\hat{\lambda}_{lr}^{\mu a% }}\\ &=\int\prod_{ar,bs,l}\frac{dq_{rsl}^{ab}d\hat{q}_{rsl}^{ab}}{2\pi}e^{-N\sum_{% ar,bsl}q_{rsl}^{ab}\hat{q}_{rsl}^{ab}-\frac{\lambda_{1}}{2}\sum_{la}q^{aa}_{11% l}-\frac{\lambda_{2}}{2}\sum_{rla}q^{aa}_{22l}+NG_{S}+(N\alpha-1)G_{E}}\\ &\times\int\prod_{arl}\frac{d\lambda_{lr}^{a}d\hat{\lambda}_{lr}^{a}}{2\pi}\,e% ^{i\lambda_{lr}^{a}\hat{\lambda}_{lr}^{a}}\prod_{ra}e^{-\beta\ell_{r}\left(% \frac{1}{\sqrt{K}}\sum_{l}c_{l}\varphi\left(\lambda^{\mu a}_{lr}\right)\right)% }\,\tilde{\ell}\left[\frac{1}{\sqrt{K}}\sum_{l}c_{l}\,\varphi\left(\frac{% \gamma\lambda_{l1}^{\mu 1}+(1-\gamma)\lambda_{l2}^{\mu 1}}{c_{\gamma}}\right)% \right]e^{-\frac{1}{2}\sum_{ar,bs}q^{ab}_{rs}\hat{\lambda}_{r}^{a}\hat{\lambda% }_{s}^{b}}\end{split}

(40)

where we recall that $c_{\gamma}$ is the quantity that appears in equation (24). We have also introduced the usual entropic and energetic terms Baldassi et al. (2019) as

\begin{split}G_{S}&\equiv\ln\int\prod_{ar}dw^{ra}_{l}\,e^{\sum_{ar,bsl}\hat{q}% ^{ab}_{rsl}w_{l}^{ra}w_{l}^{sb}}\\ G_{E}&\equiv\ln\int\prod_{arl}\frac{d\lambda_{rl}^{a}d\hat{\lambda}_{rl}^{a}}{% 2\pi}\,e^{-\beta\sum_{ra}\ell_{r}\left(\frac{1}{\sqrt{K}}\sum_{l}c_{l}\,% \varphi\left(\lambda^{a}_{rl}\right)\right)}\,e^{i\sum_{rla}\lambda_{rl}^{a}% \hat{\lambda}_{rl}^{a}-\frac{1}{2}\sum_{ar,bsl}q^{ab}_{rsl}\hat{\lambda}_{rl}^% {a}\hat{\lambda}_{sl}^{b}}\,.\end{split}

(41)

B.2 Replica Symmetric ansatz

We impose the Replica Symmetric (RS) ansatz over order parameters


$\displaystyle q_{rsl}^{aa}$	$\displaystyle\equiv Q\delta_{rs}+(1-\delta_{rs})p$	$\displaystyle a\in[n]\,,\forall l$	(42a)
$\displaystyle q_{rsl}^{ab}$	$\displaystyle\equiv t_{rs}=q_{r}\delta_{rs}+(1-\delta_{rs})p$	$\displaystyle a\neq b\,,\forall l$	(42b)

Notice that we denote by $q_{1}$ and $q_{2}$ the typical overlap between solutions extracted from the distribution of the endpoints $\gamma=1$ and $\gamma=0$ respectively. The overlap $p$ represents the typical overlap between the two endpoints. A similar ansatz is imposed on the conjugate order parameters $\hat{q}_{rsl}^{ab}$ . However, in the $n\to 0$ limit the conjugate order parameters do not appear explicitly in the expression of $E(\gamma)$ . We need to decompose the term

\begin{split}-\frac{1}{2}\sum_{abrsl}q_{rsl}^{ab}\hat{\lambda}_{lr}^{a}\hat{% \lambda}_{ls}^{b}&=-\frac{1}{2}\sum_{r}(Q-q_{r})\sum_{al}\left(\hat{\lambda}_{% rl}^{a}\right)^{2}-\frac{1}{2}\sum_{rs}t_{rs}\sum_{abl}\hat{\lambda}_{rl}^{a}% \hat{\lambda}_{sl}^{b}\\ &=-\frac{1}{2}\sum_{r}(Q-q_{r})\sum_{al}\left(\hat{\lambda}_{rl}^{a}\right)^{2% }-\frac{1}{2}\sum_{rl}\left(\sum_{as}\mathcal{T}_{rs}\hat{\lambda}^{a}_{sl}% \right)^{2}\end{split}

(43)

where $\mathcal{T}_{rs}$ is the $(r,s)$ element of the square root of the matrix $t_{rs}$ defined in (42b). The computation proceeds in a standard way by using a Hubbard-Stratonovich transformation and integrating over $\hat{\lambda}_{rl}^{a}$ . We get

\begin{split}E(\gamma)&=\int\prod_{arl}\frac{d\lambda_{rl}^{a}d\hat{\lambda}_{% rl}^{a}}{2\pi}\,e^{-\beta\ell_{r}\left(\frac{1}{\sqrt{K}}\sum_{l}c_{l}\varphi% \left(\lambda^{a}_{lr}\right)\right)}\,\tilde{\ell}\left[\frac{1}{\sqrt{K}}% \sum_{l}c_{l}\,\varphi\left(\frac{\gamma\lambda_{l1}^{1}+(1-\gamma)\lambda_{l2% }^{1}}{c_{\gamma}}\right)\right]\\ &\times e^{i\sum_{arl}\lambda_{rl}^{a}\hat{\lambda}_{rl}^{a}-\frac{1}{2}\sum_{% r}(Q-q_{r})\sum_{al}\left(\hat{\lambda}_{rl}^{a}\right)^{2}-\frac{1}{2}\sum_{% rl}\left(\sum_{as}\mathcal{T}_{rs}\hat{\lambda}^{a}_{sl}\right)^{2}}\\ &=\int\!\prod_{rl}Dx_{rl}\frac{\int\prod_{rl}\frac{d\lambda_{rl}d\hat{\lambda}% _{rl}}{2\pi}\,e^{i\hat{\lambda}_{rl}\left(\lambda_{rl}+\sum_{s}\mathcal{T}_{rs% }x_{sl}\right)-\frac{1}{2}(Q-q_{r})\hat{\lambda}_{rl}^{2}-\beta\sum_{r}\ell_{r% }\left(\frac{1}{\sqrt{K}}\sum_{l}c_{l}\varphi\left(\lambda_{lr}\right)\right)}% \,\tilde{\ell}\left[\frac{1}{\sqrt{K}}\sum_{l}c_{l}\,\varphi\left(\frac{\gamma% \lambda_{l1}+(1-\gamma)\lambda_{l2}}{c_{\gamma}}\right)\right]}{\prod_{r}\int% \prod_{l}\frac{d\lambda_{l}d\hat{\lambda}_{l}}{2\pi}\,e^{-\beta\ell_{r}\left(% \frac{1}{\sqrt{K}}\sum_{l}c_{l}\varphi\left(\lambda_{l}\right)\right)}\,e^{i% \sum_{l}\hat{\lambda}_{l}\left(\lambda_{l}+\sum_{s}\mathcal{T}_{sr}\,x_{rl}% \right)-\frac{1}{2}\sum_{r}(Q-q_{r})\sum_{l}\hat{\lambda}_{l}^{2}}}\\ &=\int\!\prod_{rl}Dx_{rl}\frac{\int\prod_{rl}D\lambda_{rl}\,e^{-\beta\sum_{r}% \ell_{r}\left(\frac{1}{\sqrt{K}}\sum_{l}c_{l}\varphi\left(\sqrt{Q-q_{r}}% \lambda_{lr}-\sum_{s}\mathcal{T}_{rs}x_{sl}\right)\right)}\,\tilde{\ell}\left[% \frac{1}{\sqrt{K}}\sum_{l}c_{l}\,\varphi\left(\sum_{r}\gamma_{r}\left(\frac{% \sqrt{Q-q_{r}}\lambda_{rl}-\sum_{s}\mathcal{T}_{rs}x_{sl}}{c_{\gamma}}\right)% \right)\right]}{\prod_{r}\int\prod_{l}D\lambda_{l}\,e^{-\beta\ell_{r}\left(% \frac{1}{\sqrt{K}}\sum_{l}c_{l}\varphi\left(\sqrt{Q-q_{r}}\lambda_{l}-\sum_{s}% \mathcal{T}_{rs}x_{sl}\right)\right)}}\end{split}

(44)

where in the last expression we have defined $\gamma_{1}\equiv\gamma$ and $\gamma_{2}=1-\gamma$ for convenience.

B.3 Large $K$ limit

We now take the large $K$ limit, in order to further simplify the expression in (44) by repeated use of the central limit theorem. We denote by $I$ the numerator of the last expression in (44).

The numerator of the fraction can be written as

\begin{split}I&\equiv\int\prod_{rl}D\lambda_{rl}\,e^{-\beta\sum_{r}\ell_{r}% \left(\frac{1}{\sqrt{K}}\sum_{l}c_{l}\varphi\left(\sqrt{Q-q_{r}}\lambda_{lr}-% \sum_{s}\mathcal{T}_{rs}x_{sl}\right)\right)}\,\tilde{\ell}\left[\frac{1}{% \sqrt{K}}\sum_{l}c_{l}\,\varphi\left(\sum_{r}\gamma_{r}\left(\frac{\sqrt{Q-q_{% r}}\lambda_{rl}-\sum_{s}\mathcal{T}_{rs}x_{sl}}{c_{\gamma}}\right)\right)% \right]\\ &=\int\prod_{r}\frac{dh_{r}d\hat{h}_{r}}{2\pi}\,e^{i\sum_{r}h_{r}\hat{h}_{r}-% \beta\sum_{r}\ell_{r}\left(h_{r}\right)}\int\frac{dvd\hat{v}}{2\pi}e^{i\hat{v}% v}\,\tilde{\ell}\left[v\right]\\ &\times\int\prod_{rl}D\lambda_{rl}\,e^{-\frac{i}{\sqrt{K}}\sum_{r}\hat{h}_{r}% \sum_{l}c_{l}\varphi\left(\sqrt{Q-q_{r}}\lambda_{lr}-\sum_{s}\mathcal{T}_{rs}x% _{sl}\right)-\frac{i\hat{v}}{\sqrt{K}}\sum_{l}c_{l}\,\varphi\left(\sum_{r}% \gamma_{r}\left(\frac{\sqrt{Q-q_{r}}\lambda_{rl}-\sum_{s}\mathcal{T}_{rs}x_{sl% }}{c_{\gamma}}\right)\right)}\end{split}

(45)

Expanding the exponential up to second order, averaging over $\lambda_{lr}$ and re-exponentiating we get

\begin{split}&\int\prod_{rl}D\lambda_{rl}\,e^{-\frac{i}{\sqrt{K}}\sum_{r}\hat{% h}_{r}\sum_{l}c_{l}\varphi\left(\sqrt{Q-q_{r}}\lambda_{lr}-\sum_{s}\mathcal{T}% _{rs}x_{sl}\right)-\frac{i\hat{v}}{\sqrt{K}}\sum_{l}c_{l}\,\varphi\left(\sum_{% r}\gamma_{r}\left(\frac{\sqrt{Q-q_{r}}\lambda_{rl}-\sum_{s}\mathcal{T}_{rs}x_{% sl}}{c_{\gamma}}\right)\right)}\\ &\simeq e^{-i\sum_{r}\hat{h}_{r}M_{r}^{(0)}-i\hat{v}N^{(0)}-\frac{1}{2}\sum_{r% }\Delta_{r}^{(0)}\hat{h}_{r}^{2}-\frac{\Xi^{(0)}}{2}\hat{v}^{2}-\hat{v}\sum_{r% }\hat{h}_{r}\Omega_{r}^{(0)}}\end{split}

(46)

where we have introduced the notation


$\displaystyle\varphi_{rl}(\lambda_{r},x)$	$\displaystyle\equiv\varphi\left(\sqrt{Q-q_{r}}\lambda_{lr}-\sum_{s}\mathcal{T}% _{rs}x_{sl}\right)$	(47a)
$\displaystyle\tilde{\varphi}_{l}(\gamma;\{\lambda_{r}\}_{r=1}^{y},x)$	$\displaystyle\equiv\varphi\left(\sum_{r}\gamma_{r}\left(\frac{\sqrt{Q-q_{r}}% \lambda_{rl}-\sum_{s}\mathcal{T}_{rs}x_{sl}}{c_{\gamma}}\right)\right)$	(47b)

to define the following quantities


$\displaystyle M^{(0)}_{r}$	$\displaystyle\equiv\frac{1}{\sqrt{K}}\sum_{l}c_{l}\,\left\langle\varphi\left(% \sqrt{Q-q_{r}}\lambda_{r}-\sum_{s}\mathcal{T}_{rs}x_{sl}\right)\right\rangle_{% \lambda_{r}}\equiv\frac{1}{\sqrt{K}}\sum_{l}c_{l}\,\left\langle\varphi_{rl}% \right\rangle_{\lambda}$	(48a)
$\displaystyle N^{(0)}$	$\displaystyle\equiv\frac{1}{\sqrt{K}}\sum_{l}c_{l}\,\left\langle\varphi\left(% \sum_{r}\gamma_{r}\left(\frac{\sqrt{Q-q_{r}}\lambda_{rl}-\sum_{s}\mathcal{T}_{% rs}x_{sl}}{c_{\gamma}}\right)\right)\right\rangle_{\{\lambda_{r}\}_{r=1}^{y}}% \equiv\frac{1}{\sqrt{K}}\sum_{l}c_{l}\,\left\langle\tilde{\varphi}_{l}\right% \rangle_{\lambda}$	(48b)
$\displaystyle\Delta^{(0)}_{r}$	$\displaystyle\equiv\frac{1}{K}\sum_{l}c_{l}^{2}\,\left[\langle\varphi^{2}_{rl}% \rangle_{\lambda}-\langle\varphi_{rl}\rangle^{2}_{\lambda}\right]$	(48c)
$\displaystyle\Xi^{(0)}$	$\displaystyle\equiv\frac{1}{K}\sum_{l}c_{l}^{2}\left[\langle\tilde{\varphi}^{2% }_{l}\rangle_{\lambda}-\langle\tilde{\varphi}_{l}\rangle^{2}_{\lambda}\right]$	(48d)
$\displaystyle\Omega_{r}^{(0)}$	$\displaystyle\equiv\frac{1}{K}\sum_{l}c_{l}^{2}\left[\langle\tilde{\varphi}_{l% }\varphi_{lr}\rangle_{\lambda}-\langle\tilde{\varphi}_{l}\rangle_{\lambda}% \langle\varphi_{lr}\rangle_{\lambda_{r}}\right]$	(48e)

Notice that $\Delta^{(0)}_{rs}=\frac{1}{K}\sum_{l}c_{l}^{2}\,\left[\langle\varphi_{rl}\varphi_{sl}\rangle_{\lambda}-\langle\varphi_{rl}\rangle_{\lambda}\langle\varphi_{sl}\rangle_{\lambda}\right]=0$ if $r\neq s$ , since the Gaussian variable $\lambda_{lr}$ is independent of $\lambda_{ls}$ . We then insert (46) back into (45) and perform the integrals over $\hat{h}_{r}$ and $\hat{v}$ ; in general the integral is of the form

\begin{split}&\int\prod_{r}\frac{dh_{r}d\hat{h}_{r}}{2\pi}\frac{dvd\hat{v}}{2% \pi}e^{i\hat{h}_{r}(h_{r}-M_{r})+i\hat{v}(v-N)-\frac{1}{2}\sum_{rs}\Delta_{rs}% \hat{h}_{r}\hat{h}_{s}-\frac{\Xi}{2}\hat{v}^{2}-\hat{v}\sum_{r}\Omega_{r}\hat{% h}_{r}}f(\{h_{r}\},v)\\ &\simeq\int\prod_{r}\frac{dh_{r}d\hat{h}_{r}}{2\pi}Dv\,e^{i\sum_{r}\hat{h}_{r}% (h_{r}-M_{r}-\frac{v}{\sqrt{\Xi}}\Omega_{r})-\frac{1}{2}\sum_{rs}\left(\Delta_% {rs}-\frac{\Omega_{r}\Omega_{s}}{\Xi}\right)\hat{h}_{r}\hat{h}_{s}}f(\{h_{r}\}% ,N+\sqrt{\Xi}v)\\ &=\int Dv\prod_{r}Dh_{r}\,f\left(\left\{M_{r}+\frac{\Omega_{r}}{\sqrt{\Xi}}\,v% +\sum_{s}\left(\sqrt{A}\right)_{rs}h_{s}\right\},N+\sqrt{\Xi}v\right)\end{split}

(49)

with $A_{rs}\equiv\Delta_{rs}-\frac{\Omega_{r}\Omega_{s}}{\Xi}$ . Specializing this identity to our case, i.e. using $f(\left\{h_{r}\right\},v)=e^{-\beta\sum_{r}\ell_{r}(h_{r})}\,\widetilde{\ell}(v)$ we get

\begin{split}E(\gamma)&=\int\prod_{l}Dx_{rl}\frac{\int\prod_{r}D\lambda_{r}Ds% \,\tilde{\ell}\left(N^{(0)}+\sqrt{\Xi^{(0)}}s\right)\,e^{-\beta\sum_{r}\ell% \left(M_{r}^{(0)}+\frac{\Omega_{r}^{(0)}}{\sqrt{\Xi^{(0)}}}s+\sum_{s}\left[% \sqrt{\Lambda^{(0)}}\right]_{rs}\lambda_{s}\right)}}{\prod_{r}\int D\lambda\,e% ^{-\beta\ell_{r}\left(M_{r}^{(0)}+\sqrt{\Delta^{(0)}_{r}}\lambda\right)}}\end{split}

(50)

with $\Lambda_{rs}^{(0)}\equiv\Delta^{(0)}_{r}\delta_{rs}-\frac{\Omega_{r}^{(0)}\Omega_{s}^{(0)}}{\Xi^{(0)}}$ . Notice that the same steps have also been carried out in the denominator of the fraction.
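The identity (49) amounts to rewriting jointly Gaussian variables $(h_{1},h_{2},v)$, with covariance blocks $\Delta_{rs}$, $\Omega_{r}$ and $\Xi$, as linear combinations of independent unit Gaussians. A minimal numerical check of this covariance structure is sketched below; the numerical values of $\Delta$, $\Omega$ and $\Xi$ are illustrative assumptions and the helper names are hypothetical.

```python
import numpy as np


def sqrt_psd(mat):
    """Symmetric square root of a positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def mixing_matrix(Delta, Omega, Xi):
    """Linear map from independent unit Gaussians (v', h'_1, h'_2) to (h_1, h_2, v).

    h_r = Omega_r / sqrt(Xi) * v' + sum_s sqrt(A)_rs h'_s,   v = sqrt(Xi) * v',
    with A_rs = Delta_rs - Omega_r Omega_s / Xi (means M_r and N omitted).
    """
    A = Delta - np.outer(Omega, Omega) / Xi
    B = np.zeros((3, 3))
    B[:2, 0] = Omega / np.sqrt(Xi)
    B[:2, 1:] = sqrt_psd(A)
    B[2, 0] = np.sqrt(Xi)
    return B


# Illustrative values (assumptions); Delta is diagonal as noted in the text.
Delta = np.diag([0.5, 0.4])
Omega = np.array([0.2, 0.1])
Xi = 0.3

B = mixing_matrix(Delta, Omega, Xi)
# Covariance that the quadratic form in (49) assigns to (h_1, h_2, v).
target = np.block([[Delta, Omega[:, None]], [Omega[None, :], np.array([[Xi]])]])

if __name__ == "__main__":
    print(np.allclose(B @ B.T, target))
```

The product `B @ B.T` is the covariance of the transformed variables, so agreement with `target` confirms that the change of variables reproduces the Gaussian weight in the first line of (49).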

Finally we apply the central limit theorem again to simplify the integrals over the $x_{l}$ variables. For all the variance terms, i.e. $\Delta_{r}^{(0)}$ , $\Xi^{(0)}$ and $\Omega_{r}^{(0)}$ , we can simply compute their mean with respect to $x_{r}$ , their variance being subleading in $K$ . The only new terms come from the variance of the mean terms, i.e. the parameters $M_{r}^{(0)}$ and $N^{(0)}$ . We get a term of the type

\begin{split}&\int\prod_{rl}Dx_{rl}\int\prod_{r}\frac{dh_{r}d\hat{h}_{r}}{2\pi% }\frac{dvd\hat{v}}{2\pi}\,e^{ih_{r}\hat{h}_{r}+iv\hat{v}-i\hat{h}_{r}M_{r}^{(0% )}-i\hat{v}N^{(0)}}f(\{h_{r}\},v)\\ &\simeq\int\prod_{r}\frac{dh_{r}d\hat{h}_{r}}{2\pi}\frac{dvd\hat{v}}{2\pi}e^{i% \hat{h}_{r}(h_{r}-M)+i\hat{v}(v-N)-\frac{1}{2}\sum_{rs}\Psi_{rs}\hat{h}_{r}% \hat{h}_{s}-\frac{T}{2}\hat{v}^{2}-\hat{v}\sum_{r}U_{r}\hat{h}_{r}}f(\{h_{r}\}% ,v)\\ &=\int Dx\prod_{r}Dy_{r}\,f\left(\left\{M+\frac{U_{r}}{\sqrt{T}}\,x+\sum_{s}% \sqrt{H}_{rs}y_{s}\right\},N+\sqrt{T}x\right)\end{split}

(51)

where in the last step we have used the general identity (49). The final result is

\begin{split}E(\gamma)&=\int\prod_{l}Dx_{rl}\frac{\int\prod_{r}D\lambda_{r}Ds% \,\tilde{\ell}\left(N^{(0)}+\sqrt{\Xi^{(0)}}s\right)\,e^{-\beta\sum_{r}\ell% \left(M_{r}^{(0)}+\frac{\Omega_{r}^{(0)}}{\sqrt{\Xi^{(0)}}}s+\sum_{s}\left[% \sqrt{\Lambda^{(0)}}\right]_{rs}\lambda_{s}\right)}}{\prod_{r}\int D\lambda\,e% ^{-\beta\ell_{r}\left(M_{r}^{(0)}+\sqrt{\Delta^{(0)}_{r}}\lambda\right)}}\\ &=\int Dx\prod_{r}Dy_{r}\frac{\int\prod_{r}D\lambda_{r}Ds\,\tilde{\ell}\left(% \sqrt{T}x+\sqrt{\Xi}s\right)\,e^{-\beta\sum_{r}\ell_{r}\left(\frac{U_{r}}{% \sqrt{T}}x+\sum_{s}\sqrt{H}_{rs}y_{s}+\frac{\Omega_{r}}{\sqrt{\Xi}}s+\sum_{s}% \left[\sqrt{\Lambda}\right]_{rs}\lambda_{s}\right)}}{\prod_{r}\int D\lambda\,e% ^{-\beta\ell_{r}\left(\frac{U_{r}}{\sqrt{T}}x+\sum_{s}\sqrt{H}_{rs}y_{s}+\sqrt% {\Delta_{r}}\lambda\right)}}\end{split}

(52)

where we have defined the quantities


$\displaystyle H_{rs}$	$\displaystyle\equiv\Psi_{rs}-\frac{U_{r}U_{s}}{T}$	(53a)
$\displaystyle\Lambda_{rs}$	$\displaystyle\equiv\Delta_{r}\delta_{rs}-\frac{\Omega_{r}\Omega_{s}}{\Xi}$	(53b)

and


$\displaystyle\Psi_{rs}$	$\displaystyle\equiv\langle\langle\varphi_{r}\rangle_{\lambda}\langle\varphi_{s% }\rangle_{\lambda}\rangle_{x}-\langle\varphi_{r}\rangle_{\lambda,x}\langle% \varphi_{s}\rangle_{\lambda,x}$	(54a)
$\displaystyle\Delta_{r}$	$\displaystyle\equiv\langle\varphi^{2}\rangle_{\lambda,x}-\langle\langle\varphi% _{r}\rangle_{\lambda}^{2}\rangle_{x}$	(54b)
$\displaystyle T$	$\displaystyle\equiv\langle\langle\tilde{\varphi}\rangle^{2}_{\lambda}\rangle_{% x}-\langle\tilde{\varphi}\rangle_{\lambda,x}^{2}$	(54c)
$\displaystyle\Xi$	$\displaystyle\equiv\langle\tilde{\varphi}^{2}\rangle_{\lambda,x}-\langle% \langle\tilde{\varphi}\rangle^{2}_{\lambda}\rangle_{x}$	(54d)
$\displaystyle U_{r}$	$\displaystyle\equiv\langle\langle\varphi_{r}\rangle_{\lambda}\langle\tilde{% \varphi}\rangle_{\lambda}\rangle_{x}-\langle\varphi\rangle_{\lambda,x}\langle% \tilde{\varphi}\rangle_{\lambda,x}$	(54e)
$\displaystyle\Omega_{r}$	$\displaystyle\equiv\langle\langle\tilde{\varphi}\varphi_{r}\rangle_{\lambda}% \rangle_{x}-\langle\langle\tilde{\varphi}\rangle_{\lambda}\langle\varphi_{r}% \rangle_{\lambda}\rangle_{x}$	(54f)

Note that in the last step of equation (52) we have used the fact that $M=N=0$ , since they are both proportional to $\sum_{l}c_{l}$ . We show in Appendix C that the quantities above can all be written in terms of the following function, which depends on the activation function $\varphi$

\Delta_{Q}(q)\equiv\int Dx\left[\int Dy\,\varphi\left(\sqrt{q}x+\sqrt{Q-q}y\right)\right]^{2}

(55)


$\displaystyle\Psi_{rs}$	$\displaystyle=\Delta_{Q}(t_{rs})-\Delta_{Q}(0)$	(56a)
$\displaystyle\Delta_{r}$	$\displaystyle=\Delta_{Q}(Q)-\Delta_{Q}(t_{rr})$	(56b)
$\displaystyle T$	$\displaystyle=\Delta_{Q}\left(\frac{\gamma^{2}q_{1}+(1-\gamma)^{2}q_{2}+2% \gamma(1-\gamma)p}{c_{\gamma}^{2}}\right)-\Delta_{Q}(0)$	(56c)
$\displaystyle\Xi$	$\displaystyle=\Delta_{Q}(Q)-\Delta_{Q}\left(\frac{\gamma^{2}q_{1}+(1-\gamma)^{% 2}q_{2}+2\gamma(1-\gamma)p}{c_{\gamma}^{2}}\right)$	(56d)
$\displaystyle U_{1}$	$\displaystyle=\Delta_{Q}\left(\frac{\gamma q_{1}+(1-\gamma)p}{c_{\gamma}}% \right)-\Delta_{Q}(0)$	(56e)
$\displaystyle U_{2}$	$\displaystyle=\Delta_{Q}\left(\frac{(1-\gamma)q_{2}+\gamma p}{c_{\gamma}}% \right)-\Delta_{Q}(0)$	(56f)
$\displaystyle\Omega_{1}$	$\displaystyle=\Delta_{Q}\left(\frac{\gamma Q+(1-\gamma)p}{c_{\gamma}}\right)-% \Delta_{Q}\left(\frac{\gamma q_{1}+(1-\gamma)p}{c_{\gamma}}\right)$	(56g)
$\displaystyle\Omega_{2}$	$\displaystyle=\Delta_{Q}\left(\frac{(1-\gamma)Q+\gamma p}{c_{\gamma}}\right)-% \Delta_{Q}\left(\frac{(1-\gamma)q_{2}+\gamma p}{c_{\gamma}}\right)$	(56h)
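The expressions (56) are straightforward to evaluate once $\Delta_{Q}$ is available. The following sketch collects them in one function. It is an illustration under stated assumptions: the function and key names are hypothetical, $c_{\gamma}$ is taken as an input because its definition (equation (24)) lies outside this appendix, and the Erf arcsine closed form is used only as an example of $\Delta_{Q}$.

```python
import math


def make_delta_Q_erf(Q):
    """Example Delta_Q for the Erf activation (arcsine closed form, an assumption)."""
    return lambda q: (2.0 / math.pi) * math.asin(2.0 * q / (1.0 + 2.0 * Q))


def effective_order_parameters(delta_Q, gamma, q1, q2, p, Q, c_gamma):
    """Evaluate the quantities in (56) for a given callable delta_Q(q)."""
    d0, dQ = delta_Q(0.0), delta_Q(Q)
    # Argument of Delta_Q appearing in T and Xi, equations (56c)-(56d).
    q_mix = (gamma ** 2 * q1 + (1 - gamma) ** 2 * q2
             + 2 * gamma * (1 - gamma) * p) / c_gamma ** 2
    # Arguments appearing in U_r and Omega_r, equations (56e)-(56h).
    u1 = (gamma * q1 + (1 - gamma) * p) / c_gamma
    u2 = ((1 - gamma) * q2 + gamma * p) / c_gamma
    o1 = (gamma * Q + (1 - gamma) * p) / c_gamma
    o2 = ((1 - gamma) * Q + gamma * p) / c_gamma
    return {
        "Psi": [[delta_Q(q1) - d0, delta_Q(p) - d0],
                [delta_Q(p) - d0, delta_Q(q2) - d0]],
        "Delta": [dQ - delta_Q(q1), dQ - delta_Q(q2)],
        "T": delta_Q(q_mix) - d0,
        "Xi": dQ - delta_Q(q_mix),
        "U": [delta_Q(u1) - d0, delta_Q(u2) - d0],
        "Omega": [delta_Q(o1) - delta_Q(u1), delta_Q(o2) - delta_Q(u2)],
    }


if __name__ == "__main__":
    dQ_erf = make_delta_Q_erf(1.0)
    # Endpoint gamma = 1 with c_gamma = 1 (illustrative values for q1, q2, p).
    print(effective_order_parameters(dQ_erf, 1.0, 0.6, 0.5, 0.3, 1.0, 1.0))
```

Two consistency checks follow directly from (56): $T+\Xi=\Delta_{Q}(Q)-\Delta_{Q}(0)$ for every $\gamma$, and at the endpoint $\gamma=1$ with $c_{\gamma}=1$ one finds $T=U_{1}=\Psi_{11}$, $\Xi=\Omega_{1}=\Delta_{1}$ and $\Omega_{2}=0$.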

We can simplify (52) by performing 2-dimensional rotations of the integration over the $\lambda$ variables

\begin{split}E(\gamma)&=\int DxDy_{1}Dy_{2}Ds\,\tilde{\ell}\left(\frac{U_{1}}{% \sqrt{\Psi_{11}}}x+\frac{U_{2}\Psi_{11}-U_{1}\Psi_{12}}{\sqrt{\Psi_{11}\det% \Psi}}y_{1}+\sqrt{\frac{T\det H}{\det\Psi}}y_{2}+\sqrt{\Xi}s\right)\\ &\frac{\int D\lambda_{1}D\lambda_{2}\,e^{-\beta\ell_{1}\left(\sqrt{\Psi_{11}}x% +\frac{\Omega_{1}}{\sqrt{\Xi}}s+\sqrt{\Lambda_{11}}\lambda_{1}\right)}e^{-% \beta\ell_{2}\left(\frac{\Psi_{12}}{\sqrt{\Psi_{11}}}x+\sqrt{\frac{\det\Psi}{% \Psi_{11}}}y_{1}+\frac{\Omega_{2}}{\sqrt{\Xi}}s+\frac{\Lambda_{12}}{\sqrt{% \Lambda_{11}}}\lambda_{1}+\sqrt{\frac{\det\Lambda}{\Lambda_{11}}}\lambda_{2}% \right)}}{\int D\lambda\,e^{-\beta\ell_{1}\left(\sqrt{\Psi_{11}}x+\sqrt{\Delta% _{1}}\lambda\right)}\int D\lambda\,e^{-\beta\ell_{2}\left(\frac{\Psi_{12}}{% \sqrt{\Psi_{11}}}x+\sqrt{\frac{\det\Psi}{\Psi_{11}}}y_{1}+\sqrt{\Delta_{2}}% \lambda\right)}}\end{split}

(57)

Notice how the variable $y_{2}$ appears only in $\tilde{\ell}$ . Performing a rotation over $\lambda_{1}$ and $s$

\begin{split}\epsilon_{t}(\gamma)&=\int DxDy_{1}Dy_{2}\int D\lambda_{1}Ds\,% \tilde{\ell}\left(\frac{U_{1}}{\sqrt{\Psi_{11}}}x+\frac{U_{2}\Psi_{11}-U_{1}% \Psi_{12}}{\sqrt{\Psi_{11}\det\Psi}}y_{1}+\sqrt{\frac{T\det H}{\det\Psi}}y_{2}% +\frac{\Omega_{1}}{\sqrt{\Delta_{1}}}\lambda_{1}+\sqrt{\frac{\Lambda_{11}\Xi}{% \Delta_{1}}}s\right)\\ &\frac{e^{-\beta\ell_{1}\left(\sqrt{\Psi_{11}}x+\sqrt{\Delta_{1}}\lambda_{1}% \right)}\int D\lambda_{2}\,e^{-\beta\ell_{2}\left(\frac{\Psi_{12}}{\sqrt{\Psi_% {11}}}x+\sqrt{\frac{\det\Psi}{\Psi_{11}}}y_{1}+\sqrt{\frac{\Delta_{1}}{\Xi% \Lambda_{11}}}\Omega_{2}s+\sqrt{\frac{\det\Lambda}{\Lambda_{11}}}\lambda_{2}% \right)}}{\int D\lambda\,e^{-\beta\ell_{1}\left(\sqrt{\Psi_{11}}x+\sqrt{\Delta% _{1}}\lambda\right)}\int D\lambda\,e^{-\beta\ell_{2}\left(\frac{\Psi_{12}}{% \sqrt{\Psi_{11}}}x+\sqrt{\frac{\det\Psi}{\Psi_{11}}}y_{1}+\sqrt{\Delta_{2}}% \lambda\right)}}\end{split}

(58)

Notice how $\lambda_{1}$ disappears from the argument of the second loss function $\ell_{2}$ . Finally, (57) can be further simplified by making the same argument of $\ell_{2}$ appear in the numerator and in the denominator. This is obtained by performing a rotation over the Gaussian variables $s$ and $\lambda_{2}$ . Afterwards one can integrate explicitly over $y_{2}$ , obtaining

\begin{split}E(\gamma)&=\int DxDy_{1}\int D\lambda_{1}D\lambda_{2}\,\frac{e^{-% \beta\ell_{1}\left(\sqrt{\Psi_{11}}x+\sqrt{\Delta_{1}}\lambda_{1}\right)}\,e^{% -\beta\ell_{2}\left(\frac{\Psi_{12}}{\sqrt{\Psi_{11}}}x+\sqrt{\frac{\det\Psi}{% \Psi_{11}}}y_{1}+\sqrt{\Delta_{2}}\lambda_{2}\right)}}{\int D\lambda\,e^{-% \beta\ell_{1}\left(\sqrt{\Psi_{11}}x+\sqrt{\Delta_{1}}\lambda\right)}\int D% \lambda\,e^{-\beta\ell_{2}\left(\frac{\Psi_{12}}{\sqrt{\Psi_{11}}}x+\sqrt{% \frac{\det\Psi}{\Psi_{11}}}y_{1}+\sqrt{\Delta_{2}}\lambda\right)}}\\ &\int Ds\,\tilde{\ell}\left(\frac{U_{1}}{\sqrt{\Psi_{11}}}x+\frac{U_{2}\Psi_{1% 1}-U_{1}\Psi_{12}}{\sqrt{\Psi_{11}\det\Psi}}y_{1}+\frac{\Omega_{1}}{\sqrt{% \Delta_{1}}}\lambda_{1}+\frac{\Omega_{2}}{\sqrt{\Delta_{2}}}\lambda_{2}+\sqrt{% \Xi-\frac{\Omega_{1}^{2}}{\Delta_{1}}-\frac{\Omega_{2}^{2}}{\Delta_{2}}+\frac{% T\det H}{\det\Psi}}s\right)\end{split}

(59)

B.4 Analytical predictions in some particular cases

In the following we will specialize equation (59) to several interesting cases depending on the Boltzmann distribution through which the endpoints $\gamma=0,1$ are sampled.

B.4.1 Error counting loss with a margin

Here we specialize (59) to the case $\ell_{1}(x)=\Theta(-x+\kappa_{1})$ and $\ell_{2}(x)=\Theta(-x+\kappa_{2})$ , where $\kappa_{1}\,,\kappa_{2}\geq 0$ impose a certain degree of robustness on $\boldsymbol{w}_{1}$ and $\boldsymbol{w}_{2}$ . We further consider $\tilde{\ell}(x)=\Theta(-x)$ and focus on the $\beta\to\infty$ limit for simplicity. Starting from (57) we have

\begin{split}\epsilon_{t}(\gamma)&=\int DxDy_{1}\,\\ &\times\frac{\int DsH\left(\frac{\frac{U_{1}}{\sqrt{\Psi_{11}}}x+\frac{U_{2}% \Psi_{11}-U_{1}\Psi_{12}}{\sqrt{\Psi_{11}\det\Psi}}y_{1}+\sqrt{\Xi}s}{\sqrt{% \frac{T\det H}{\det\Psi}}}\right)\int_{\frac{\kappa_{1}-\sqrt{\Psi_{11}}x-% \frac{\Omega_{1}}{\sqrt{\Xi}}s}{\sqrt{\Lambda_{11}}}}^{\infty}D\lambda_{1}\,H% \left(\frac{\sqrt{\Lambda_{11}}\left(\kappa_{2}-\frac{\Psi_{12}}{\sqrt{\Psi_{1% 1}}}x-\sqrt{\frac{\det\Psi}{\Psi_{11}}}y_{1}-\frac{\Omega_{2}}{\sqrt{\Xi}}s% \right)-\Lambda_{12}\lambda_{1}}{\sqrt{\det\Lambda}}\right)}{H\left(\frac{% \kappa_{1}-\sqrt{\Psi_{11}}x}{\sqrt{\Delta_{1}}}\right)H\left(\frac{\kappa_{2}% -\frac{\Psi_{12}}{\sqrt{\Psi_{11}}}x-\sqrt{\frac{\det\Psi}{\Psi_{11}}}y_{1}}{% \sqrt{\Delta_{2}}}\right)}\end{split}

(60)

This is the expression that we have used to produce the plots in Figure 7 of the main text. We show similar plots for the Erf activation function in Figure 10. If the endpoints are sampled with the same margin $\kappa_{1}=\kappa_{2}=\kappa$ then as stated before $q_{1}=q_{2}=p$ and this implies the relation $\sqrt{\frac{T\det H}{\det\Psi}}=\sqrt{T-\frac{U^{2}}{\Psi}}$ . In this case, the formula simplifies even further and reads

\begin{split}\epsilon_{t}(\gamma)&=\int Dx\frac{\int Ds\,H\left(\frac{\frac{U}% {\sqrt{\Psi}}x+\sqrt{\Xi}s}{\sqrt{T-\frac{U^{2}}{\Psi}}}\right)\,\int_{\frac{% \kappa-\sqrt{\Psi}x-\frac{\Omega_{1}}{\sqrt{\Xi}}s}{\sqrt{\Delta-\frac{\Omega_% {1}^{2}}{\Xi}}}}^{\infty}D\lambda_{1}\,H\left(\frac{\sqrt{\Delta-\frac{\Omega_% {1}^{2}}{\Xi}}\left(\kappa-\sqrt{\Psi}x-\frac{\Omega_{2}}{\sqrt{\Xi}}s\right)+% \frac{\Omega_{1}\Omega_{2}}{\Xi}\lambda_{1}}{\sqrt{\det\Lambda}}\right)}{H^{2}% \left(\frac{\kappa-\sqrt{\Psi}x}{\sqrt{\Delta}}\right)}\,.\end{split}

(61)

B.4.2 Equal general loss functions at finite temperature

We will here suppose that $\ell_{1}=\ell_{2}=\tilde{\ell}\equiv\ell$ and $\beta<\infty$ . In this case $q_{1}=q_{2}=p\equiv q$ , so that $\Psi_{rs}=\Delta_{Q}(q)-\Delta_{Q}(0)\equiv\Psi$ , $U_{1}=U_{2}\equiv U=\Delta_{Q}\left(\frac{q}{c_{\gamma}}\right)-\Delta_{Q}(0)$ and $\Delta_{1}=\Delta_{2}\equiv\Delta=\Delta_{Q}(Q)-\Delta_{Q}(q)$ . In this case we obtain

\begin{split}E(\gamma)&=\int DxD\lambda_{1}D\lambda_{2}\,\frac{e^{-\beta\ell% \left(\sqrt{\Psi}x+\sqrt{\Delta}\lambda_{1}\right)}\,e^{-\beta\ell\left(\sqrt{% \Psi}x+\sqrt{\Delta}\lambda_{2}\right)}}{\left[\int D\lambda\,e^{-\beta\ell% \left(\sqrt{\Psi}x+\sqrt{\Delta}\lambda\right)}\right]^{2}}\\ &\int Ds\,\ell\left(\frac{U}{\sqrt{\Psi}}x+\frac{\Omega_{1}}{\sqrt{\Delta}}% \lambda_{1}+\frac{\Omega_{2}}{\sqrt{\Delta}}\lambda_{2}+\sqrt{\Xi-\frac{\Omega% _{1}^{2}}{\Delta}-\frac{\Omega_{2}^{2}}{\Delta}+T-\frac{U^{2}}{\Psi}}s\right)% \end{split}

(62)

The previous equation is the one that we have used for the theoretical curve presented in Figure 6 of the main text. If one is interested in measuring the training error we have

\begin{split}E(\gamma)&=\int DxD\lambda_{1}D\lambda_{2}\,\frac{e^{-\beta\ell% \left(\sqrt{\Psi}x+\sqrt{\Delta}\lambda_{1}\right)}\,e^{-\beta\ell\left(\sqrt{% \Psi}x+\sqrt{\Delta}\lambda_{2}\right)}}{\left[\int D\lambda\,e^{-\beta\ell% \left(\sqrt{\Psi}x+\sqrt{\Delta}\lambda\right)}\right]^{2}}\,H\left(\frac{% \frac{U}{\sqrt{\Psi}}x+\frac{\Omega_{1}}{\sqrt{\Delta}}\lambda_{1}+\frac{% \Omega_{2}}{\sqrt{\Delta}}\lambda_{2}}{\sqrt{\Xi-\frac{\Omega_{1}^{2}}{\Delta}% -\frac{\Omega_{2}^{2}}{\Delta}+T-\frac{U^{2}}{\Psi}}}\right)\end{split}

(63)

where $H(x)\equiv\frac{1}{2}\text{Erfc}\left(\frac{x}{\sqrt{2}}\right)$ . In the previous formulas $T$ reduces to $T=\Delta_{Q}\left(\frac{q}{c_{\gamma}^{2}}\right)-\Delta_{Q}(0)$ .
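The step from (62) to (63) uses the fact that, for $\ell(x)=\Theta(-x)$ and $\sigma>0$, the Gaussian integral $\int Ds\,\Theta(-(a+\sigma s))$ equals $H(a/\sigma)$. The sketch below checks this numerically with a plain Riemann sum; the function names, the grid and the values of $a$ and $\sigma$ are assumptions chosen for illustration.

```python
import math


def H(x):
    """Gaussian tail function H(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def step_average(a, sigma, n=200001, cutoff=10.0):
    """Riemann-sum estimate of int Ds Theta(-(a + sigma*s)) for sigma > 0."""
    ds = 2.0 * cutoff / (n - 1)
    total = 0.0
    for i in range(n):
        s = -cutoff + i * ds
        if a + sigma * s < 0.0:  # the step function selects s < -a/sigma
            total += math.exp(-0.5 * s * s) * ds
    return total / math.sqrt(2.0 * math.pi)


if __name__ == "__main__":
    a, sigma = 0.3, 0.8
    print(step_average(a, sigma), H(a / sigma))
```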

B.4.3 Generic loss – error counting loss with a margin

The last case we consider is $\ell_{2}(x)=\Theta(-x+\kappa)$ , but $\ell_{1}(x)$ is considered to be a generic convex loss function. We will again consider the infinite $\beta$ limit. This imposes a scaling on the overlap $q_{1}$ that reads

q_{1}=Q-\frac{\delta q_{1}}{\beta}

(64)

This induces a non-trivial scaling on some of the effective order parameters (56), in particular $\Delta_{1}$ and $\Omega_{1}$ will be vanishingly small


$\displaystyle\Delta_{1}$	$\displaystyle=\Delta_{Q}(Q)-\Delta_{Q}(q_{1})\simeq\frac{\Delta_{Q}^{\prime}(Q% )\delta q_{1}}{\beta}$	(65a)
$\displaystyle\Omega_{1}$	$\displaystyle=\Delta_{Q}\left(\frac{\gamma Q+(1-\gamma)p}{c_{\gamma}}\right)-% \Delta_{Q}\left(\frac{\gamma q_{1}+(1-\gamma)p}{c_{\gamma}}\right)\simeq\Delta% _{Q}^{\prime}\left(\frac{\gamma Q+(1-\gamma)p}{c_{\gamma}}\right)\frac{\gamma% \delta q_{1}}{c_{\gamma}\beta}\equiv\delta\Omega_{1}\frac{\delta q_{1}}{\beta}$	(65b)

where we have introduced the quantity

\Delta_{Q}^{\prime}(q)\equiv\frac{\partial\Delta_{Q}}{\partial q}=\int Dx\left[\int Dy\,\varphi^{\prime}\left(\sqrt{q}x+\sqrt{Q-q}y\right)\right]^{2}

(66)

and

\delta\Omega_{1}\equiv\Delta_{Q}^{\prime}\left(\frac{\gamma Q+(1-\gamma)p}{c_{\gamma}}\right)\frac{\gamma}{c_{\gamma}}

(67)

Furthermore, we have that $\Psi_{11}=\Delta_{Q}(Q)-\Delta_{Q}(0)$ , $U_{1}=\Delta_{Q}\left(\frac{\gamma Q+(1-\gamma)p}{c_{\gamma}}\right)-\Delta_{Q}(0)$ , $\frac{\Lambda_{11}}{\Delta_{1}}\to 1$ and $\Lambda_{12}\to 0$ so that $\frac{\det\Lambda}{\Lambda_{11}}\to\Lambda_{22}$ . Using these relations in (58), rescaling $\lambda_{1}\to\beta\lambda_{1}$ and using a saddle point over $\lambda_{1}$ , one gets

\begin{split}\epsilon_{t}(\gamma)&=\int DxDy_{1}Dy_{2}\int Ds\,\tilde{\ell}% \left(\frac{U_{1}}{\sqrt{\Psi_{11}}}x+\frac{U_{2}\Psi_{11}-U_{1}\Psi_{12}}{% \sqrt{\Psi_{11}\det\Psi}}y_{1}+\sqrt{\frac{T\det H}{\det\Psi}}y_{2}+\frac{% \delta\Omega_{1}\sqrt{\delta q_{1}}}{\sqrt{\Delta_{Q}^{\prime}(Q)}}z_{\star}(x% )+\sqrt{\Xi}s\right)\\ &\times\frac{H\left(\frac{\kappa-\frac{\Psi_{12}}{\sqrt{\Psi_{11}}}x-\sqrt{% \frac{\det\Psi}{\Psi_{11}}}y_{1}-\frac{\Omega_{2}}{\sqrt{\Xi}}s}{\sqrt{\Lambda% _{22}}}\right)}{H\left(\frac{\kappa-\frac{\Psi_{12}}{\sqrt{\Psi_{11}}}x-\sqrt{% \frac{\det\Psi}{\Psi_{11}}}y_{1}}{\sqrt{\Delta_{2}}}\right)}\end{split}

(68)

where $z_{\star}(x)$ is the function defined in (38) that also appears in the equilibrium computation Baldassi et al. (2023). In the case we want to measure the training error, i.e. $\tilde{\ell}(x)=\Theta(-x)$ we have

\begin{split}\epsilon_{t}(\gamma)&=\int DxDy_{1}Ds\,H\left(\frac{\frac{U_{1}}{\sqrt{\Psi_{11}}}x+\frac{U_{2}\Psi_{11}-U_{1}\Psi_{12}}{\sqrt{\Psi_{11}\det\Psi}}y_{1}+\frac{\delta\Omega_{1}\sqrt{\delta q_{1}}}{\sqrt{\Delta_{Q}^{\prime}(Q)}}z_{\star}(x)+\sqrt{\Xi}s}{\sqrt{\frac{T\det H}{\det\Psi}}}\right)\frac{H\left(\frac{\kappa-\frac{\Psi_{12}}{\sqrt{\Psi_{11}}}x-\sqrt{\frac{\det\Psi}{\Psi_{11}}}y_{1}-\frac{\Omega_{2}}{\sqrt{\Xi}}s}{\sqrt{\Lambda_{22}}}\right)}{H\left(\frac{\kappa-\frac{\Psi_{12}}{\sqrt{\Psi_{11}}}x-\sqrt{\frac{\det\Psi}{\Psi_{11}}}y_{1}}{\sqrt{\Delta_{2}}}\right)}\end{split}

(69)

Figure 11 shows the training loss and error along the geodesics connecting solutions extracted from the error counting loss with a margin to the typical cross-entropy minimizer. Although the training error is very small along the path, the loss is much larger in the neighborhood of the endpoint corresponding to the solution extracted from the error counting loss with a margin. As the margin is increased, the training loss decreases.

Appendix C Effective order parameters

In this section we show that the effective order parameters defined in Eq. (54) reduce to the expressions given in (56).

We recall the notation


$\displaystyle\varphi_{r}(\lambda_{r},x)$	$\displaystyle\equiv\varphi\left(\sqrt{Q-q_{r}}\lambda_{r}-\sum_{s}\mathcal{T}_{rs}x_{s}\right)$	(70a)
$\displaystyle\tilde{\varphi}(\gamma;\lambda_{1},\lambda_{2},x_{1},x_{2})$	$\displaystyle\equiv\varphi\left(\sum_{r}\gamma_{r}\left(\frac{\sqrt{Q-q_{r}}\lambda_{r}-\sum_{s}\mathcal{T}_{rs}x_{s}}{c_{\gamma}}\right)\right)$	(70b)

Recall also that $\mathcal{T}$ is the square root of the matrix $t_{rs}=q_{r}\delta_{rs}+(1-\delta_{rs})p$, so that the following identities hold: $q_{1}=\mathcal{T}_{11}^{2}+\mathcal{T}_{12}^{2}$, $q_{2}=\mathcal{T}_{22}^{2}+\mathcal{T}_{12}^{2}$, $p=\mathcal{T}_{12}(\mathcal{T}_{11}+\mathcal{T}_{22})$ and $(\mathcal{T}_{11}\mathcal{T}_{22}-\mathcal{T}_{12}^{2})^{2}=q_{1}q_{2}-p^{2}$. The sign in front of the $\mathcal{T}$ term in (70) is immaterial, because the Gaussian measure over $x$ is even; in the integrals below it is written with a plus sign.
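As a quick numerical sanity check of these identities, the following minimal sketch builds the symmetric square root of $t$ by eigendecomposition and compares both sides. The function name `sqrt_overlap_matrix` and the numerical values of $q_1$, $q_2$, $p$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sqrt_overlap_matrix(q1, q2, p):
    """Symmetric square root of t = [[q1, p], [p, q2]] (assumed positive semidefinite)."""
    t = np.array([[q1, p], [p, q2]], dtype=float)
    evals, evecs = np.linalg.eigh(t)
    # Clip tiny negative eigenvalues caused by round-off before taking the root.
    return evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

# Illustrative (hypothetical) overlaps with q1 * q2 > p**2.
q1, q2, p = 0.7, 0.5, 0.3
T = sqrt_overlap_matrix(q1, q2, p)
T11, T12, T22 = T[0, 0], T[0, 1], T[1, 1]

# Each entry pairs the left-hand side with the right-hand side of one identity.
identities = {
    "q1": (T11**2 + T12**2, q1),
    "q2": (T22**2 + T12**2, q2),
    "p": (T12 * (T11 + T22), p),
    "det": ((T11 * T22 - T12**2) ** 2, q1 * q2 - p**2),
}
for name, (lhs, rhs) in identities.items():
    print(f"{name}: lhs={lhs:.12f} rhs={rhs:.12f}")
```

The first three identities are the entries of $\mathcal{T}^{2}=t$ for a symmetric $\mathcal{T}$, and the last one is $(\det\mathcal{T})^{2}=\det t$.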

We start by analyzing the terms $\Psi_{rs}\equiv\langle\langle\varphi_{r}\rangle_{\lambda}\langle\varphi_{s}\rangle_{\lambda}\rangle_{x}-\langle\varphi_{r}\rangle_{\lambda,x}\langle\varphi_{s}\rangle_{\lambda,x}$ and $\Delta_{r}\equiv\langle\varphi_{r}^{2}\rangle_{\lambda,x}-\langle\langle\varphi_{r}\rangle_{\lambda}^{2}\rangle_{x}$, which only involve $\varphi_{r}$. First notice that


$\displaystyle\langle\varphi_{r}\rangle_{\lambda,x}$	$\displaystyle=\int Dy\,\varphi(\sqrt{Q}y)=\sqrt{\Delta_{Q}(0)}\,,\qquad r=1,2$	(71a)
$\displaystyle\langle\varphi_{r}^{2}\rangle_{\lambda,x}$	$\displaystyle=\int Dy\,\varphi^{2}(\sqrt{Q}y)=\Delta_{Q}(Q)\,,\qquad r=1,2$	(71b)

and

\begin{split}\langle\langle\varphi_{1}\rangle_{\lambda}\langle\varphi_{2}\rangle_{\lambda}\rangle_{x}&=\int Dx_{1}Dx_{2}D\lambda_{1}D\lambda_{2}\,\varphi\left(\sqrt{Q-q_{1}}\lambda_{1}+\sum_{s}\mathcal{T}_{1s}x_{s}\right)\,\varphi\left(\sqrt{Q-q_{2}}\lambda_{2}+\sum_{s}\mathcal{T}_{2s}x_{s}\right)\\ &=\int Dx_{1}Dx_{2}D\lambda_{1}D\lambda_{2}\,\varphi\left(\sqrt{Q-q_{1}}\lambda_{1}+\sqrt{q_{1}}x_{1}\right)\,\varphi\left(\sqrt{Q-q_{2}}\lambda_{2}+\frac{p}{\sqrt{q_{1}}}x_{1}+\sqrt{q_{2}-\frac{p^{2}}{q_{1}}}x_{2}\right)\\ &=\int Dx_{1}Dx_{2}\,\varphi\left(\sqrt{Q}x_{1}\right)\,\varphi\left(\frac{p}{\sqrt{Q}}\,x_{1}+\sqrt{Q-\frac{p^{2}}{Q}}x_{2}\right)=\Delta_{Q}(p)\end{split}

(72)

Similarly, if $r=s$ we have

\begin{split}\langle\langle\varphi_{r}\rangle_{\lambda}\langle\varphi_{r}\rangle_{\lambda}\rangle_{x}=\langle\langle\varphi_{r}\rangle_{\lambda}^{2}\rangle_{x}=\Delta_{Q}(q_{r})\end{split}

(73)

so that $\Psi_{rs}=\Delta_{Q}(t_{rs})-\Delta_{Q}(0)$ and $\Delta_{r}=\Delta_{Q}(Q)-\Delta_{Q}(t_{rr})$ .
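The reduction above can be checked numerically. The sketch below evaluates $\Delta_{Q}(q)$ in two ways with Gauss-Hermite quadrature: from the nested form $\int Dx\,[\int Dy\,\varphi(\sqrt{q}x+\sqrt{Q-q}y)]^{2}$ used in (66) for $\Delta_{Q}^{\prime}$ (here applied to $\varphi$ itself), and from the two-variable form in the last line of (72). It then assembles $\Psi_{rs}$ and $\Delta_{r}$. The function names, the choice $Q=1$, the activation $\tanh$ and the overlap values are assumptions made for illustration only.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite nodes/weights, normalized so that sums approximate integrals over Dx.
NODES, WEIGHTS = hermegauss(60)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)

def delta_q(phi, q, Q=1.0):
    """Nested form: int Dx [ int Dy phi(sqrt(q) x + sqrt(Q - q) y) ]^2."""
    x = NODES[:, None]
    y = NODES[None, :]
    inner = (WEIGHTS[None, :] * phi(np.sqrt(q) * x + np.sqrt(Q - q) * y)).sum(axis=1)
    return float((WEIGHTS * inner**2).sum())

def delta_q_two_point(phi, q, Q=1.0):
    """Two-variable form, as in the last line of Eq. (72) with p replaced by q."""
    x1 = NODES[:, None]
    x2 = NODES[None, :]
    a = phi(np.sqrt(Q) * x1)
    b = phi(q / np.sqrt(Q) * x1 + np.sqrt(Q - q**2 / Q) * x2)
    return float((WEIGHTS[:, None] * WEIGHTS[None, :] * a * b).sum())

def psi_and_delta(phi, q1, q2, p, Q=1.0):
    """Psi_rs = Delta_Q(t_rs) - Delta_Q(0) and Delta_r = Delta_Q(Q) - Delta_Q(t_rr)."""
    t = np.array([[q1, p], [p, q2]])
    d0 = delta_q(phi, 0.0, Q)
    psi = np.array([[delta_q(phi, t[r, s], Q) - d0 for s in range(2)] for r in range(2)])
    delta = np.array([delta_q(phi, Q, Q) - delta_q(phi, t[r, r], Q) for r in range(2)])
    return psi, delta

# Illustrative (hypothetical) parameters.
psi, delta = psi_and_delta(np.tanh, q1=0.7, q2=0.5, p=0.3)
print("nested   :", delta_q(np.tanh, 0.3))
print("two-point:", delta_q_two_point(np.tanh, 0.3))
print("Psi =\n", psi, "\nDelta =", delta)
```

For polynomial activations the quadrature is exact, which makes them convenient test cases; for non-smooth activations such as ReLU the number of nodes may need to be increased.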

Next we analyze the terms which contain only $\tilde{\varphi}$, i.e. $T\equiv\langle\langle\tilde{\varphi}\rangle^{2}_{\lambda}\rangle_{x}-\langle\tilde{\varphi}\rangle_{\lambda,x}^{2}$ and $\Xi\equiv\langle\tilde{\varphi}^{2}\rangle_{\lambda,x}-\langle\langle\tilde{\varphi}\rangle^{2}_{\lambda}\rangle_{x}$. The terms $\langle\tilde{\varphi}\rangle_{\lambda,x}$ and $\langle\tilde{\varphi}^{2}\rangle_{\lambda,x}$ are easy to analyze, since the argument of $\varphi$ is a sum of four independent Gaussian variables, which is itself Gaussian with variance $Q$. We therefore get


$\displaystyle\langle\tilde{\varphi}\rangle_{\lambda,x}$	$\displaystyle=\int Dy\,\varphi(\sqrt{Q}y)=\sqrt{\Delta_{Q}(0)}\,,$	(74a)
$\displaystyle\langle\tilde{\varphi}^{2}\rangle_{\lambda,x}$	$\displaystyle=\int Dy\,\varphi^{2}(\sqrt{Q}y)=\Delta_{Q}(Q)\,.$	(74b)

Finally

\begin{split}\langle\langle\tilde{\varphi}\rangle^{2}_{\lambda}\rangle_{x}&=\int Dx_{1}Dx_{2}\left[\int D\lambda_{1}D\lambda_{2}\,\varphi\left(\sum_{r}\gamma_{r}\left(\frac{\sqrt{Q-q_{r}}\lambda_{r}+\sum_{s}\mathcal{T}_{rs}x_{s}}{c_{\gamma}}\right)\right)\right]^{2}\\ &=\int Dx\left[\int Dy\,\varphi\left(\frac{\sqrt{Q-2Q\gamma(1-\gamma)-\gamma^{2}q_{1}-(1-\gamma)^{2}q_{2}}}{c_{\gamma}}y+\frac{\sqrt{\gamma^{2}q_{1}+(1-\gamma)^{2}q_{2}+2\gamma(1-\gamma)p}}{c_{\gamma}}\,x\right)\right]^{2}\\ &=\Delta_{Q}\left(\frac{\gamma^{2}q_{1}+(1-\gamma)^{2}q_{2}+2\gamma(1-\gamma)p}{c_{\gamma}^{2}}\right)\end{split}

(75)

The last computation concerns the correlations between the functions $\varphi_{r}$ and $\tilde{\varphi}$, which appear in the variables $U_{r}\equiv\langle\langle\varphi_{r}\rangle_{\lambda}\langle\tilde{\varphi}\rangle_{\lambda}\rangle_{x}-\langle\varphi_{r}\rangle_{\lambda,x}\langle\tilde{\varphi}\rangle_{\lambda,x}$ and $\Omega_{r}\equiv\langle\langle\tilde{\varphi}\varphi_{r}\rangle_{\lambda}\rangle_{x}-\langle\langle\tilde{\varphi}\rangle_{\lambda}\langle\varphi_{r}\rangle_{\lambda}\rangle_{x}$. We therefore need to evaluate the two quantities $\langle\langle\tilde{\varphi}\varphi_{r}\rangle_{\lambda}\rangle_{x}$ and $\langle\langle\tilde{\varphi}\rangle_{\lambda}\langle\varphi_{r}\rangle_{\lambda}\rangle_{x}$ for $r=1,2$. We analyze the case $r=1$; the other one follows by symmetry. We have

\begin{split}\langle\langle\tilde{\varphi}\varphi_{1}\rangle_{\lambda}\rangle_{x}&=\int Dx_{1}Dx_{2}D\lambda_{1}D\lambda_{2}\,\varphi\left(\sum_{r}\gamma_{r}\left(\frac{\sqrt{Q-q_{r}}\lambda_{r}+\sum_{s}\mathcal{T}_{rs}x_{s}}{c_{\gamma}}\right)\right)\varphi\left(\sqrt{Q-q_{1}}\lambda_{1}+\sum_{s}\mathcal{T}_{1s}x_{s}\right)\\ &=\int Dx_{1}Dx_{2}D\lambda_{1}D\lambda_{2}\,\varphi(\sqrt{Q}x_{1})\\ &\times\varphi\left(\frac{\gamma Q+(1-\gamma)p}{\sqrt{Q}c_{\gamma}}x_{1}-\frac{(1-\gamma)p\sqrt{Q-q_{1}}}{\sqrt{q_{1}Q}c_{\gamma}}\lambda_{1}+\frac{(1-\gamma)\sqrt{Q-q_{2}}}{c_{\gamma}}\lambda_{2}+\frac{1-\gamma}{c_{\gamma}}\sqrt{q_{2}-\frac{p^{2}}{q_{1}}}x_{2}\right)\\ &=\int Dx_{1}Dx_{2}\,\varphi(\sqrt{Q}x_{1})\varphi\left(\frac{\gamma Q+(1-\gamma)p}{\sqrt{Q}c_{\gamma}}x_{1}+\frac{1-\gamma}{c_{\gamma}}\sqrt{Q-\frac{p^{2}}{Q}}x_{2}\right)=\Delta_{Q}\left(\frac{\gamma Q+(1-\gamma)p}{c_{\gamma}}\right)\end{split}

(76)

Notice that this depends on neither $q_{1}$ nor $q_{2}$. The case $r=2$ is obtained by sending $\gamma\to 1-\gamma$ and exchanging $q_{1}\leftrightarrow q_{2}$, which gives $\Delta_{Q}\left(\frac{(1-\gamma)Q+\gamma p}{c_{\gamma}}\right)$. We now analyze the term $\langle\langle\tilde{\varphi}\rangle_{\lambda}\langle\varphi_{r}\rangle_{\lambda}\rangle_{x}$ in the $r=1$ case

\begin{split}\langle\langle\tilde{\varphi}\rangle_{\lambda}\langle\varphi_{1}\rangle_{\lambda}\rangle_{x}&=\int Dx_{1}Dx_{2}D\lambda_{1}D\lambda_{2}D\lambda_{3}\,\varphi\left(\sqrt{Q-q_{1}}\lambda_{3}+\sum_{s}\mathcal{T}_{1s}x_{s}\right)\,\varphi\left(\sum_{r}\gamma_{r}\left(\frac{\sqrt{Q-q_{r}}\lambda_{r}+\sum_{s}\mathcal{T}_{rs}x_{s}}{c_{\gamma}}\right)\right)\\ &=\int Dx_{1}Dx_{2}\,\varphi(\sqrt{Q}x_{1})\varphi\left(\frac{\gamma q_{1}+(1-\gamma)p}{\sqrt{Q}c_{\gamma}}x_{1}+\sqrt{Q-\frac{(\gamma q_{1}+(1-\gamma)p)^{2}}{Qc_{\gamma}^{2}}}x_{2}\right)\\ &=\Delta_{Q}\left(\frac{\gamma q_{1}+(1-\gamma)p}{c_{\gamma}}\right)\end{split}

(77)

The case $r=2$ can be obtained by sending $\gamma\to 1-\gamma$ and $q_{1}\to q_{2}$ , i.e.

\begin{split}\langle\langle\tilde{\varphi}\rangle_{\lambda}\langle\varphi_{2}\rangle_{\lambda}\rangle_{x}=\Delta_{Q}\left(\frac{(1-\gamma)q_{2}+\gamma p}{c_{\gamma}}\right)\end{split}

(78)
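Since the arguments of $\varphi_{r}$ and $\tilde{\varphi}$ are linear combinations of independent standard Gaussians, the results (76)-(78) only depend on the variances and covariances of those combinations. The sketch below checks this bookkeeping numerically. It assumes that $c_{\gamma}$ is the factor that normalizes the interpolated weights to squared norm $Q$, i.e. $c_{\gamma}^{2}=\gamma^{2}+(1-\gamma)^{2}+2\gamma(1-\gamma)p/Q$, with $\gamma_{1}=\gamma$ and $\gamma_{2}=1-\gamma$; the numerical values and variable names are illustrative.

```python
import numpy as np

# Illustrative (hypothetical) parameters.
Q, q1, q2, p, gamma = 1.0, 0.7, 0.5, 0.3, 0.4

# ASSUMPTION: c_gamma normalizes the interpolated weights to squared norm Q.
c_gamma = np.sqrt(gamma**2 + (1 - gamma) ** 2 + 2 * gamma * (1 - gamma) * p / Q)

# Symmetric square root of t = [[q1, p], [p, q2]].
evals, evecs = np.linalg.eigh(np.array([[q1, p], [p, q2]]))
T = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

# Coefficients of each argument over the independent Gaussians (lambda_1, lambda_2, x_1, x_2).
arg_1 = np.array([np.sqrt(Q - q1), 0.0, T[0, 0], T[0, 1]])
arg_2 = np.array([0.0, np.sqrt(Q - q2), T[1, 0], T[1, 1]])
arg_tilde = (gamma * arg_1 + (1 - gamma) * arg_2) / c_gamma

# Same lambda in both factors: covariance entering Eq. (76).
full_cov = arg_tilde @ arg_1
# lambda averaged separately in each factor: only the x part correlates, Eq. (77).
x_only_cov = arg_tilde[2:] @ arg_1[2:]

print("var(arg_1)     =", arg_1 @ arg_1)
print("var(arg_tilde) =", arg_tilde @ arg_tilde)
print("full covariance:", full_cov, "vs", (gamma * Q + (1 - gamma) * p) / c_gamma)
print("x-only covariance:", x_only_cov, "vs", (gamma * q1 + (1 - gamma) * p) / c_gamma)
```

Both arguments have variance $Q$, so each average is of the form $\Delta_{Q}(\cdot)$ evaluated at the corresponding covariance, which is how the arguments of $\Delta_{Q}$ in (76) and (77) arise.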

Appendix D Computing the overlap between differently sampled solutions

The aim of this section is to find the typical overlap between two configurations $\boldsymbol{w}_{1}$ and $\boldsymbol{w}_{2}$ sampled from two (in principle different) distributions $p_{1}(\bullet\,;\mathcal{D})$ and $p_{2}(\bullet\,;\mathcal{D})$; see the definition in equation (22). A way of computing this overlap has been sketched in Annesi et al. (2023). Here we adopt a different approach, based on the Franz-Parisi entropy Franz and Parisi (1995). The Franz-Parisi entropy is defined as the average logarithm of the number of configurations $\boldsymbol{w}_{2}\sim p_{2}(\bullet;\mathcal{D})$ that are at a fixed overlap $S$ from a reference $\boldsymbol{w}_{1}\sim p_{1}(\bullet;\mathcal{D})$. In formulas


$\displaystyle\phi_{FP}(S)$	$\displaystyle\equiv\mathbb{E}_{\mathcal{D}}\int d\boldsymbol{w}_{1}\,p_{1}(\boldsymbol{w}_{1};\mathcal{D})\ln\mathcal{N}_{\mathcal{D}}(\boldsymbol{w}_{1};S)$	(79a)
$\displaystyle\mathcal{N}_{\mathcal{D}}(\boldsymbol{w}_{1};S)$	$\displaystyle\equiv\int d\boldsymbol{w}_{2}\,p_{2}(\boldsymbol{w}_{2};\mathcal{D})\,\delta\left(NS-\boldsymbol{w}_{1}\cdot\boldsymbol{w}_{2}\right)$	(79b)

In the following we call $\boldsymbol{w}_{1}$ the “reference” weight and $\boldsymbol{w}_{2}$ the “slaved” weight, since it is constrained to stay at a fixed overlap with the reference $\boldsymbol{w}_{1}$. We also suppose (as done in the main text) that the configuration $\boldsymbol{w}_{2}$ sampled from $p_{2}$ has the same squared norm $Q$ as the reference $\boldsymbol{w}_{1}$; this can be achieved by properly choosing the Lagrange multiplier $\lambda_{2}$.

The typical (i.e. the most probable) overlap is the one that maximizes the Franz-Parisi entropy

p=\operatorname*{argmax}_{S}\phi_{FP}(S)

(80)

The Franz-Parisi entropy can be computed with standard methods using a double replica trick. We refer to Baldassi et al. (2023, 2021) for the derivation in the case of the perceptron. For the tree committee machine in the large-width limit one gets

\phi_{FP}(S)=\text{extr}_{q_{2},\,t}\left[\mathcal{G}_{S}(q_{2},t,S)+\alpha\mathcal{G}_{E}(q_{2},t,S)\right]

(81)

where

\begin{split}\mathcal{G}_{S}&=\frac{(Q^{2}-S^{2})(Q-2q_{1})+Qq_{1}^{2}-2q_{1}St+Qt^{2}}{2(Q-q_{2})(Q-q_{1})^{2}}+\frac{1}{2}\ln(2\pi)+\frac{1}{2}\ln\left(Q-q_{2}\right)\end{split}

(82)

and


$\displaystyle\mathcal{G}_{E}$	$\displaystyle=\int Dz_{0}\,\frac{\int Dz_{1}Dz_{2}\,e^{-\beta\ell_{1}\left(\sqrt{\Delta_{Q}(q_{1})-\Delta_{Q}(0)}z_{0}+\frac{\Delta_{Q}(S)-\Delta_{Q}(t)}{\sqrt{\Gamma}}z_{1}+\sqrt{\eta}z_{2}\right)}}{\int Dz_{1}\,e^{-\beta\ell_{1}\left(\sqrt{\Delta_{Q}(q_{1})-\Delta_{Q}(0)}z_{0}+\sqrt{\Delta_{Q}(Q)-\Delta_{Q}(q_{1})}z_{1}\right)}}$	(83a)
	$\displaystyle\times\ln\int Dz_{3}\,e^{-\beta\ell_{2}\left(\sqrt{\Delta_{Q}(Q)-\Delta_{Q}(q_{2})}z_{3}+\frac{\Delta_{Q}(t)-\Delta_{Q}(0)}{\sqrt{\Delta_{Q}(q_{1})-\Delta_{Q}(0)}}z_{0}+\sqrt{\Gamma}z_{1}\right)}$	(83b)
$\displaystyle\eta$	$\displaystyle\equiv\Delta_{Q}(Q)-\Delta_{Q}(q_{1})-\frac{(\Delta_{Q}(S)-\Delta_{Q}(t))^{2}}{\Gamma}$	(83c)
$\displaystyle\Gamma$	$\displaystyle=\Delta_{Q}(q_{2})-\Delta_{Q}(0)-\frac{(\Delta_{Q}(t)-\Delta_{Q}(0))^{2}}{\Delta_{Q}(q_{1})-\Delta_{Q}(0)}$	(83d)

Notice that $q_{1}$ represents the typical overlap between reference configurations $\boldsymbol{w}_{1}\sim p_{1}(\boldsymbol{w}_{1};\mathcal{D})$. Imposing that the Franz-Parisi entropy has a maximum at the typical overlap, we have

\frac{\partial\phi_{FP}}{\partial S}=0\implies\frac{QS-2q_{1}S+q_{1}t}{(Q-q_{2})(Q-q_{1})^{2}}\simeq\frac{S}{(Q-q_{1})(Q-q_{2})}=\alpha\frac{\partial\mathcal{G}_{E}}{\partial S}

(84)

The approximate equality follows from the fact that, at the maximum of the Franz-Parisi entropy, the saddle-point equations impose $t=S$, so that the numerator reduces to $S(Q-q_{1})$. One can then compute the right-hand side explicitly, expanding the expression for $t\to S$. The typical overlap $p$ is finally obtained by solving the following implicit equation

\begin{split}\frac{p}{\Delta_{Q}^{\prime}(p)(Q-q_{1})(Q-q_{2})}&=\frac{\alpha}{\Delta_{Q}(p)-\Delta_{Q}(0)}\int Dz_{0}\,\left[\frac{\partial}{\partial z_{0}}\,\ln\int Dz_{1}e^{-\beta\ell_{1}\left(\sqrt{\Delta_{Q}(q_{1})-\Delta_{Q}(0)}z_{0}+\sqrt{\Delta_{Q}(Q)-\Delta_{Q}(q_{1})}z_{1}\right)}\right]\\ &\times\left[\int Dz_{1}\frac{\partial}{\partial z_{0}}\ln\int Dz_{3}\,e^{-\beta\ell_{2}\left(\sqrt{\Delta_{Q}(Q)-\Delta_{Q}(q_{2})}z_{3}+\frac{\Delta_{Q}(p)-\Delta_{Q}(0)}{\sqrt{\Delta_{Q}(q_{1})-\Delta_{Q}(0)}}z_{0}+\sqrt{\Gamma}z_{1}\right)}\right]\end{split}

(85)

where we have introduced the quantity $\Delta_{Q}^{\prime}(q)$ as in equation (36) and we have redefined $\Gamma$ to be $\Gamma=\Delta_{Q}(q_{2})-\Delta_{Q}(0)-\frac{(\Delta_{Q}(p)-\Delta_{Q}(0))^{2}}{\Delta_{Q}(q_{1})-\Delta_{Q}(0)}$. Notice that equation (85) depends non-trivially on $q_{1}$ and $q_{2}$, which are the typical overlaps between pairs of configurations sampled from $p_{1}(\bullet;\mathcal{D})$ and from $p_{2}(\bullet;\mathcal{D})$ respectively; both can be obtained by a standard equilibrium computation of the partition function in equation (8). In particular, when the solutions are sampled from the same distribution, $q_{1}=q_{2}$, and one can verify that equation (85) is satisfied by $p=q_{1}=q_{2}$.

In the following subsections we specialize equation (85) to several interesting sub-cases, in the large $\beta$ limit.

D.1 The error counting loss with a margin

When one is interested in the theta loss, i.e. $\ell_{1}(x)=\Theta(\kappa_{1}-x)$ and $\ell_{2}(x)=\Theta(\kappa_{2}-x)$, the integrals inside the logarithms in (85) can be performed, and in the infinite $\beta$ limit one gets

\begin{split}\frac{p}{\Delta_{Q}^{\prime}(p)(Q-q_{1})(Q-q_{2})}&=\alpha\int Dz_{0}Dz_{1}\,\frac{GH\left(\frac{\kappa_{1}+\sqrt{\Delta_{Q}(q_{1})-\Delta_{Q}(0)}z_{0}}{\sqrt{\Delta_{Q}(Q)-\Delta_{Q}(q_{1})}}\right)GH\left(\frac{\kappa_{2}+\frac{\Delta_{Q}(p)-\Delta_{Q}(0)}{\sqrt{\Delta_{Q}(q_{1})-\Delta_{Q}(0)}}z_{0}+\sqrt{\Gamma}z_{1}}{\sqrt{\Delta_{Q}(Q)-\Delta_{Q}(q_{2})}}\right)}{\sqrt{(\Delta_{Q}(Q)-\Delta_{Q}(q_{1}))(\Delta_{Q}(Q)-\Delta_{Q}(q_{2}))}}\,.\end{split}

(86)

This expression reduces to the one for the perceptron computed in Annesi et al. (2023) if one specializes it to the identity activation function $\varphi(x)=x$, for which $\Delta_{Q}(q)=q$ and $\Delta_{Q}^{\prime}(q)=1$.

D.2 Large $\beta$ limit: generic loss – error counting loss with a margin

Here we consider the case in which the reference solution is sampled from a generic convex loss function, whereas $\ell_{2}(x)=\Theta(\kappa_{2}-x)$. In the large $\beta$ limit, $q_{1}$ scales as $q_{1}=Q-\frac{\delta q_{1}}{\beta}$ and correspondingly $\Delta_{Q}(Q)-\Delta_{Q}(q_{1})\simeq\frac{\Delta_{Q}^{\prime}(Q)\delta q_{1}}{\beta}$. Rescaling $z_{1}\to\beta z_{1}$ and using the saddle-point method, we therefore have

\begin{split}\frac{p}{\Delta_{Q}^{\prime}(p)(Q-q_{2})}&=\frac{\alpha\sqrt{\delta q_{1}}}{\sqrt{(\Delta_{Q}(Q)-\Delta_{Q}(q_{2}))\Delta^{\prime}_{Q}(Q)}}\int Dz_{0}\,z_{\star}(z_{0})\int Dz_{1}\,GH\left(\frac{\kappa_{2}-\frac{\Delta_{Q}(p)-\Delta_{Q}(0)}{\sqrt{\Delta_{Q}(Q)-\Delta_{Q}(0)}}z_{0}-\sqrt{\Gamma}z_{1}}{\sqrt{\Delta_{Q}(Q)-\Delta_{Q}(q_{2})}}\right)\end{split}

(87)

where $z_{\star}(z_{0})$ is the same function defined in (38).
