Predicting Dynamical Systems across Environments via Diffusive Model Weight Generation

Ruikun Li, Huandong Wang, Jingtao Ding, Yuan Yuan, Qingmin Liao, Yong Li
Tsinghua University
Corresponding author ([email protected])
Abstract

Data-driven methods offer an effective equation-free solution for predicting physical dynamics. However, the same physical system can exhibit significantly different dynamic behaviors in various environments. This causes prediction functions trained for specific environments to fail when transferred to unseen environments. Therefore, cross-environment prediction requires modeling the dynamic functions of different environments. In this work, we propose a model weight generation method, EnvAd-Diff. EnvAd-Diff operates in the weight space of the dynamic function, generating suitable weights from scratch based on the environmental condition for zero-shot prediction. Specifically, we first train expert prediction functions on dynamic trajectories from a limited set of visible environments to create a model zoo, thereby constructing sample pairs of prediction function weights and their corresponding environments. Subsequently, we train a latent space diffusion model conditioned on the environment to model the joint distribution of weights and environments. Considering the lack of environmental prior knowledge in real-world scenarios, we propose a physics-informed surrogate label to distinguish different environments. Generalization experiments across multiple systems demonstrate that a 1M-parameter prediction function generated by EnvAd-Diff outperforms a pre-trained 500M-parameter foundation model.

1 Introduction

Figure 1: Weight-environment distribution.

Data-driven approaches have emerged as a powerful, equation-free paradigm for predicting physical dynamics [1], achieving considerable success across a diverse range of disciplines, including molecular dynamics [2], fluid mechanics [3], and climate science [4]. In these disciplines, dynamical systems governed by the same underlying equations can exhibit vastly different evolutionary behaviors under varying environmental conditions $e$, which can be formally expressed as

$$\frac{dx}{dt} = f(x, t, e). \tag{1}$$

For instance, fluid flows described by the Navier-Stokes equations can exhibit different vortex structures depending on conditions such as the Reynolds number and external driving forces. Consequently, a predictive model $f_{\theta,e_a}$ trained on observed trajectories from a specific environmental condition $e_a$ struggles to generalize to unseen environmental conditions $e_b$. Therefore, modeling a generalizable function $f$ beyond any specific environment remains a critical problem for scientific machine learning [5, 6].
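The generalization failure described above can be illustrated with a minimal sketch. It assumes a hypothetical scalar system $dx/dt = -e\,x$ (chosen only for illustration, not one of the systems studied in this paper) and a one-parameter predictor `theta` fitted by least squares; the names `simulate`, `fit_one_step_map`, and `rollout` are illustrative helpers.

```python
import math

def simulate(x0, e, dt, steps):
    """Exact trajectory of the toy system dx/dt = -e * x (an illustrative assumption)."""
    return [x0 * math.exp(-e * dt * k) for k in range(steps + 1)]

def fit_one_step_map(traj):
    """Least-squares fit of theta in x_{k+1} = theta * x_k from a single trajectory."""
    num = sum(a * b for a, b in zip(traj[:-1], traj[1:]))
    den = sum(a * a for a in traj[:-1])
    return num / den

def rollout(x0, theta, steps):
    """Autoregressive prediction with the learned one-step map."""
    traj = [x0]
    for _ in range(steps):
        traj.append(theta * traj[-1])
    return traj

def rmse(pred, true):
    """Root mean square error between two equally long trajectories."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

dt, steps = 0.1, 50
e_a, e_b = 0.5, 2.0  # two hypothetical environments
theta_a = fit_one_step_map(simulate(1.0, e_a, dt, steps))

# The model fitted in e_a is accurate in e_a but not in the unseen environment e_b.
err_seen = rmse(rollout(1.0, theta_a, steps), simulate(1.0, e_a, dt, steps))
err_unseen = rmse(rollout(1.0, theta_a, steps), simulate(1.0, e_b, dt, steps))
print(f"RMSE in e_a: {err_seen:.2e}, RMSE in e_b: {err_unseen:.2e}")
```

In this toy setting the fitted weight is tied to a single environment, which is why a mapping from environment to weights is needed for transfer.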

Significant efforts have been undertaken to enable cross-environment prediction, including meta-learning, foundation models, and in-context learning. Meta-learning approaches facilitate adaptation to unseen environments by simultaneously learning both environment-shared and environment-specific weights [7, 8, 9]. When applied to a new environment, the environment-specific weights are updated by finetuning on new data to derive a tailored predictive model. Another strategy involves training environment-unified foundation models through well-designed architectures and large-scale parameterization [10, 11, 12]. These models, pretrained on massive datasets, can be further refined by finetuning on data specific to a target environment. Furthermore, in-context learning methods aim for generalization by leveraging illustrative predictive examples from the new environment [13, 14]. However, these approaches emphasize acquiring generalizability from finetuning or contexts, yet overlook the intrinsic connection between the predictive model and the physical environment. If the governing equation $f$ is known, a numerical solver can provide zero-shot predictions for any environmental condition $e$. This highlights a crucial insight: explicitly modeling the conditional dependence $p(\theta|e)$ of the predictive model $f_{\theta,e}$ on environments $e$ is key to achieving effective cross-environment prediction.

Inspired by treating model weights as a data modality, this work focuses on generating environment-specific weights. Since weights parameterize environment-specific dynamics (Figure 1), generating them directly enables cross-environment prediction using advanced deep learning techniques. However, generating model weights for physical dynamics tailored to specific environmental conditions poses three challenges. First, model weights, interconnected by the network architecture, are inherently structured. Thus, naively flattening weights into sequences would lose crucial structural relationships [15]. Second, the high dimensionality of weights results in an exceptionally vast data space. Minor variations in the weights of even a single layer can be amplified into significant differences in predictive performance [16, 17]. Therefore, traditional metrics like MSE are inadequate for assessing weight similarity. Finally, practical applications typically lack explicit physical knowledge of the environment, resulting in a scarcity of supervisory signals to discern varying environments.

To address these challenges, we propose an Environment-Adaptive Dynamics Diffusion model, EnvAd-Diff. EnvAd-Diff represents predictive models as weight graphs, aggregating weights into node features to preserve their inherent connectivity and accommodate arbitrary model architectures (challenge 1). It employs a node-attention Variational Autoencoder (VAE) to learn latent representations for the diffusion model, and incorporates a functional loss for weight similarity awareness (challenge 2). We construct a high-quality model zoo via domain-adaptive initialization as EnvAd-Diff’s training data. For environments without prior knowledge, we design physics-informed surrogate labels to train a prompter that guides the weight generation of EnvAd-Diff (challenge 3).

Our contributions can be summarized as follows:

  • We propose modeling the conditional dependence of model weights on environments for cross-environment prediction, thereby generating expert model weights for new environments without finetuning.

  • We construct weight graphs based on model architecture to preserve connectivity and design a functional loss for weight similarity perception. This significantly enhances the representational model’s ability to model weight modalities.

  • Extensive experiments on simulated and real-world systems demonstrate that 1M-parameter models generated by EnvAd-Diff for specific environments predict more accurately than 500M-parameter foundation models.

  • We open-source the model zoo and checkpoints of four simulated PDE systems and one real-world system to foster further research in the dynamical system modeling community. (Footnote 1: Code will be available after review.)

2 Preliminary

2.1 Problem Definition

We consider time-varying dynamical systems, the general form of which is described by Equation 1. Given environmental conditions $e \in \mathcal{E}$, the system dynamics function is instantiated as $\frac{dx}{dt} = f(x, t, e) = f_e(x, t) \in \mathcal{F}$. The environment space $\mathcal{E}$ and the function space $\mathcal{F}$ are linked by the governing equations $f$, forming a joint set $\{e, f_e\}$. We employ a data-driven model $f_{\theta,e}$, parameterized by $\theta$, to learn $f_e$, thereby formalizing the function space $\mathcal{F}$ as the model’s weight space $\Theta$. The environment space is divided into an observed environment set $\mathcal{E}_{tr}$ and an unseen environment set $\mathcal{E}_{te}$, and consequently, the weight space is also partitioned into corresponding subspaces $\Theta_{tr}$ and $\Theta_{te}$.
Treating model weights as the modeling object, we learn the inherent joint distribution of environments and weights from the joint observation space $\{\mathcal{E}_{tr}, \Theta_{tr}\}$. Once learning is complete, we generate a corresponding predictive function $f_{\theta,e}$ for a new environment $e \in \mathcal{E}_{te}$.

Notably, we posit that even when sharing the same governing equations, each environment possesses a unique dynamical function (though it may not be complex). We directly sample the complete model weights $\theta \sim P(\theta|e)$ for a given environment without data for finetuning, which significantly differs from existing practices in dynamics prediction.

2.2 Conditional Diffusion

Diffusion models [18, 19] learn a probabilistic transformation from a Gaussian prior distribution $p_{prior} = \mathcal{N}(\mathbf{0}, \mathbf{I})$ to a target distribution $p_{target}$. They perturb the data distribution by adding noise and learn to reverse this process through denoising, demonstrating strong fitting capabilities for data across modalities such as images, language, and speech [20, 21]. We denote the original sample as $x_0$. The forward noising process in standard diffusion models is computed as $x_n = \sqrt{\overline{\alpha}_n}\, x_0 + \sqrt{1 - \overline{\alpha}_n}\, \epsilon$, where $\epsilon$ and $\{\overline{\alpha}_n\}$ represent the Gaussian noise and the noise schedule [22], respectively. The reverse process gradually denoises from Gaussian noise to sample data as

$$p_\theta(x_{n-1} | x_n) := \mathcal{N}\left(x_{n-1}; \mu_\theta(x_n, n), \sigma_n^2 \mathbf{I}\right), \tag{2}$$

where $\mu_\theta = \frac{1}{\sqrt{\alpha_n}}\left(x_n - \frac{1 - \alpha_n}{\sqrt{1 - \overline{\alpha}_n}}\, \epsilon_\theta(x_n, n)\right)$ and $\{\sigma_n\}$ are step-dependent constants. The noise $\epsilon_\theta$ is computed by a parameterized neural network, typically implemented as a UNet or Transformer architecture. The network’s parameters are optimized through an objective function [22]

$$L_n = \mathbb{E}_{n, \epsilon_n, x_0} \left\| \epsilon_n - \epsilon_\theta\left(\sqrt{\overline{\alpha}_n}\, x_0 + \sqrt{1 - \overline{\alpha}_n}\, \epsilon_n, n\right) \right\|^2 \tag{3}$$

to minimize the negative log-likelihood $\mathbb{E}_{x_0 \sim q(x_0)}[-\log p_\theta(x_0)]$. To model conditional distributions $p(x|c)$, state-of-the-art methods inject conditional information during noise prediction using techniques like adaptive layer normalization [23], as $\epsilon_\theta(x_n, n, c)$.
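The forward noising step and the noise-prediction objective above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the linear beta schedule and its endpoints are an assumed choice (the text only cites [22] for the schedule), and an oracle that returns the true noise stands in for the trained network $\epsilon_\theta$.

```python
import math
import random

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_n for a linear beta schedule (an assumed schedule)."""
    alpha_bars, prod = [], 1.0
    for n in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * n / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def forward_noise(x0, alpha_bar, eps):
    """x_n = sqrt(alpha_bar_n) * x_0 + sqrt(1 - alpha_bar_n) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

def noise_prediction_loss(eps, eps_pred):
    """Squared error ||eps - eps_pred||^2 for one sample (the integrand of Equation 3)."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

def recover_x0(x_n, alpha_bar, eps_pred):
    """Invert the forward process given a noise estimate."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(x_n, eps_pred)]

rng = random.Random(0)
alpha_bars = linear_alpha_bars(1000)
x0 = [0.3, -1.2, 0.7]
n = 500
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_n = forward_noise(x0, alpha_bars[n], eps)

# A perfect noise predictor has zero loss and recovers x_0 up to rounding error.
loss_oracle = noise_prediction_loss(eps, eps)
x0_hat = recover_x0(x_n, alpha_bars[n], eps)
```

In the conditional setting, the only change is that the noise predictor also receives the condition $c$ as input.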

3 Methodology

In this section, we first introduce the method for modeling and sampling the joint distribution of model weights and environments on a given model zoo and environmental conditions. Subsequently, we present a physics-informed prompter that operates without the need for environmental prior knowledge. Finally, we detail the construction of a domain-adaptive model zoo.

3.1 Environment Adaptive Dynamics Diffusion model

EnvAd-Diff first organizes the expert small model weights into a weight graph. It then pretrains a weight VAE, yielding a high-quality latent space. Finally, an environment-conditioned diffusion model is trained within this latent space.

3.1.1 Weight Graph

Figure 2: Illustration of weight graphs.

Model weights constitute a novel data modality, inherently structured by the network architecture. A straightforward approach is to flatten weights layer by layer into fixed-length token sequences for representation using sequence models like transformers. However, here we consider the inherent connection structure of the neural network [15]. Specifically, we aggregate layer weights based on the forward data flow through the network topology.

We focus on designing the weight organization method for the basic computational units of common architectures: linear layers and CNN layers. For a linear layer, the learnable parameters include weights $w \in \mathbb{R}^{D_{out} \times D_{in} \times 1}$ and bias $b \in \mathbb{R}^{D_{out} \times 1}$, where $D_{out}$ and $D_{in}$ are the output and input dimensions, respectively. A CNN layer similarly comprises weights $w \in \mathbb{R}^{C_{out} \times C_{in} \times h \times w}$ and bias $b \in \mathbb{R}^{C_{out} \times 1}$, where $C_{out}$ and $C_{in}$ are the numbers of output and input channels, respectively, and $h \times w$ is the kernel size.
We treat the output neurons of linear layers and the output channels of CNN layers as nodes of the weight graph, constructing the graph structure shown in Figure 2a. Centering on the output nodes, we flatten and concatenate the weights (and corresponding bias) associated with connections leading to each output node within a layer, forming the feature vector $w \oplus b$ for that output node. Thus, a linear layer’s weights are organized as $D_{out}$ nodes with $(D_{in} + 1)$-dimensional features, and a CNN layer’s weights are organized as $C_{out}$ nodes with $(C_{in} \times h \times w + 1)$-dimensional features.
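The node construction described above can be sketched as follows. This is a minimal illustration with nested Python lists; the helper names `linear_layer_nodes` and `conv_layer_nodes` and the toy layer sizes are assumptions made for the example, and skip connections and normalization are not covered.

```python
def linear_layer_nodes(weight, bias):
    """One node per output neuron; feature = incoming weights followed by the bias.

    weight: D_out x D_in nested list, bias: length-D_out list.
    """
    return [list(row) + [b] for row, b in zip(weight, bias)]

def conv_layer_nodes(weight, bias):
    """One node per output channel; feature = flattened (C_in x h x w) kernel + bias.

    weight: C_out x C_in x h x w nested list, bias: length-C_out list.
    """
    nodes = []
    for kernels, b in zip(weight, bias):
        flat = [v for kernel in kernels for row in kernel for v in row]
        nodes.append(flat + [b])
    return nodes

# Toy layers with hypothetical sizes: Linear(3 -> 2) and Conv(C_in=2, C_out=4, 3x3 kernel).
lin_w = [[0.1 * (i + j) for j in range(3)] for i in range(2)]
lin_b = [0.5, -0.5]
conv_w = [[[[0.01 * (o + c + r + k) for k in range(3)] for r in range(3)]
           for c in range(2)] for o in range(4)]
conv_b = [0.0, 0.1, 0.2, 0.3]

lin_nodes = linear_layer_nodes(lin_w, lin_b)
conv_nodes = conv_layer_nodes(conv_w, conv_b)
print(len(lin_nodes), len(lin_nodes[0]))    # 2 nodes, D_in + 1 = 4 features
print(len(conv_nodes), len(conv_nodes[0]))  # 4 nodes, C_in * h * w + 1 = 19 features
```

Because the two layers yield features of different lengths (4 and 19 here), the resulting node features are heterogeneous across layers.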

Considering the prevalence of skip connections in modern deep learning, we incorporate their weights. Following the data flow, we concatenate the weights of the skip connection path as additional features to the feature vector of the node where it merges with the main path, as depicted in Figure 2b. Consequently, the entire model weights are structured as a weight graph with heterogeneous node features, where the total number of nodes equals the sum of the output neurons/channels across all layers. We normalize weights based on input-output node pairs and biases based on nodes.

The proposed weight graph aggregates weights to nodes. This not only captures inherent connection relationships but also significantly reduces computational overhead compared to maintaining dense edge features. This organization method is applicable to models of any architecture, as demonstrated in Section 4.6.

3.1.2 Weight VAE

We now encode the heterogeneous graph of model weights. We train a node attention-based VAE with a loss function given by

$$L = -\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{w})}\left[\log p_\theta(\mathbf{w}|\mathbf{z})\right] + \beta\, \mathrm{KL}\left[q_\phi(\mathbf{z}|\mathbf{w}) \,\|\, p(\mathbf{z})\right], \tag{4}$$

where $\mathbf{w}$ represents the heterogeneous node features of the weight graph, $\mathbf{z} \in \mathbb{R}^d$ is the latent representation, and the KL divergence term is used to constrain the posterior distribution $q_\phi(\mathbf{z}|\mathbf{w})$. The VAE architecture first employs a layer-wise linear mapping for each layer’s nodes to project them into a common dimension. Subsequently, we utilize a multi-head attention mechanism to model inter-node relationships, capturing interactions among neurons within and across original model layers. The resulting latent representation is then passed through another layer-wise linear mapping, projecting it back to the original dimensions for reconstruction.
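As a rough sketch of how the two terms of Equation 4 are commonly evaluated, the snippet below assumes a diagonal Gaussian posterior, a standard normal prior, and a unit-variance Gaussian decoder. These are standard VAE choices but are assumptions here, since the text does not state them; the function names are illustrative.

```python
import math

def gaussian_kl(mu, logvar):
    """KL[N(mu, diag(exp(logvar))) || N(0, I)] in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

def beta_vae_loss(w, w_hat, mu, logvar, beta=1.0):
    """Reconstruction term plus beta-weighted KL term.

    Assumes a unit-variance Gaussian decoder, so -log p(w|z) reduces to
    0.5 * ||w - w_hat||^2 up to an additive constant.
    """
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(w, w_hat))
    return recon + beta * gaussian_kl(mu, logvar)

# Toy node features, reconstruction, and posterior parameters (arbitrary values).
loss = beta_vae_loss(w=[0.2, -0.1], w_hat=[0.25, -0.05],
                     mu=[0.1], logvar=[-0.2], beta=0.5)
print(f"loss = {loss:.6f}")
```

The KL term vanishes exactly when the posterior equals the prior, so $\beta$ controls how strongly the latent space is pulled toward a standard normal.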

We notice that prediction models exhibiting similar performance can possess distinct parameter values [17]. This observation motivates our approach to the reconstruction error term in the VAE objective. The similarity between model weights should be gauged by their functional consistency, rather than merely their identical absolute values. We introduce a function loss,

$$L_{func} = \mathbb{E}_{x_i \in X} \left\| f_{\mathbf{\hat{w}}}(x_i) - f_{\mathbf{w}}(x_i) \right\|_2^2, \tag{5}$$

where $f_{\mathbf{w}}(x_i)$ and $f_{\mathbf{\hat{w}}}(x_i)$ are the output values of the original and reconstructed weights, respectively, when applied to an input sample $x_i$. Intuitively, the function loss allows the VAE to reconstruct weights that may not appear identical to the originals but perform similarly. It relaxes the encoder’s optimization constraints, promoting the learning of a latent space characterized by functional semantics.
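A small example makes the gap between weight-space distance and functional distance concrete. The sketch below uses a hypothetical three-unit scalar MLP (not the FNO experts used in this paper) and exploits permutation symmetry: reordering hidden units changes the weight vector but not the function.

```python
import math

def mlp(params, x):
    """Tiny scalar MLP: y = sum_j v_j * tanh(u_j * x + c_j)."""
    return sum(v * math.tanh(u * x + c) for u, c, v in params)

def weight_mse(p, q):
    """Mean squared difference between two flattened parameter sets."""
    flat_p = [t for unit in p for t in unit]
    flat_q = [t for unit in q for t in unit]
    return sum((a - b) ** 2 for a, b in zip(flat_p, flat_q)) / len(flat_p)

def functional_loss(p, q, inputs):
    """Mean squared output difference on shared inputs (scalar case of Equation 5)."""
    return sum((mlp(p, x) - mlp(q, x)) ** 2 for x in inputs) / len(inputs)

# Each hidden unit is (input weight u, bias c, output weight v); values are arbitrary.
original = [(1.0, 0.1, 0.5), (-2.0, 0.3, 1.5), (0.7, -0.4, -1.0)]
permuted = [original[2], original[0], original[1]]  # same function, reordered units

inputs = [i / 10.0 for i in range(-20, 21)]
w_mse = weight_mse(original, permuted)
f_loss = functional_loss(original, permuted, inputs)
print(f"weight MSE = {w_mse:.3f}, functional loss = {f_loss:.1e}")
```

The weight MSE is large while the functional loss is zero up to rounding, which is the situation in which a purely weight-space reconstruction error would penalize a functionally perfect reconstruction.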

3.1.3 Weight Latent Diffusion Model

In the latent space, we instantiate the noise network $\epsilon_\theta$ using a transformer architecture. We inject the environmental information $e$ into the network as the condition $c$ using adaptive layer norm (adaLN) [23], forming $\epsilon_\theta(x_n, n, c)$. Compared to performing diffusive generation directly on the heterogeneous weight graph, the latent space offers significant dimensionality reduction, which alleviates the computationally intensive nature of the diffusion process and simplifies the generation of representations.
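The conditioning mechanism can be sketched as follows, assuming the common adaLN form in which a scale and a shift are regressed from the condition and applied to normalized features. The scalar condition and the per-feature weights `w_scale` and `w_shift` are hypothetical stand-ins for a learned embedding and regression network; the exact parameterization used by EnvAd-Diff may differ.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ada_layer_norm(x, cond, w_scale, w_shift):
    """Modulate normalized features with a scale and shift regressed from the condition.

    cond is a scalar condition embedding; w_scale and w_shift are per-feature
    regression weights (hypothetical stand-ins for a learned MLP).
    """
    normed = layer_norm(x)
    return [h * (1.0 + ws * cond) + wb * cond
            for h, ws, wb in zip(normed, w_scale, w_shift)]

x = [0.5, -1.0, 2.0, 0.0]             # one latent token (arbitrary values)
w_scale = [0.1, 0.2, -0.1, 0.3]
w_shift = [0.0, 0.5, -0.5, 1.0]

out_zero = ada_layer_norm(x, 0.0, w_scale, w_shift)  # reduces to plain layer norm
out_env = ada_layer_norm(x, 0.8, w_scale, w_shift)   # same token, different condition
```

With a zero condition the layer reduces to ordinary layer normalization, so the condition acts purely as a modulation of the normalized features.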

3.2 Physics-informed Prompter

In the previous section, we assume the environmental condition $e$ is given. However, real-world applications often lack prior knowledge of the physical environment generating the observation trajectories. Here, we propose using a physics-informed surrogate label $c$ to differentiate observation trajectories originating from unknown environments. Within the constructed model zoo, the parameterized prediction models $f_{\theta,e}$ capture the dynamical behavior of environment $e$, thus serving as reliable physical proxies. Inspired by the concept of function distance in functional analysis, we quantify the differences between environments by computing the L2 distance between the responses of their corresponding physical proxies $f_{\theta,e}$ to the same observation samples, calculated as

$$D_{ij} = \mathbb{E}_{x_k \in X} \left\| f_{\theta,i}(x_k) - f_{\theta,j}(x_k) \right\|_2^2. \tag{6}$$

By orderly stacking the distances between each environment’s physical proxy and those of other environments, we obtain the feature vector for each environment, $D_i = [D_{i0}, D_{i1}, D_{i2}, \ldots]$. We use these feature vectors to distinguish different physical environments.

To identify the environmental condition for new environments in the test set, we apply principal component analysis to the feature vectors and project them onto the first principal component, i.e., the eigenvector of their covariance matrix with the largest eigenvalue. This 1D principal component serves as the surrogate label $c$, which is used to train a regression model, termed "Prompter". Prompter takes the initial frame of a trajectory as input and predicts the surrogate label $c$ of the physical environment to which that trajectory belongs. We implement Prompter with classic support vector regression to keep the model simple.
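The surrogate-label construction can be sketched end to end on a toy problem. The example below assumes hypothetical linear proxies $f_i(x) = a_i x$ in place of trained expert models and computes the first principal component by power iteration; the regression step (Prompter) is not shown.

```python
import math

def function_distance(f_i, f_j, samples):
    """Mean squared response difference of two proxies on shared samples (Equation 6)."""
    return sum((f_i(x) - f_j(x)) ** 2 for x in samples) / len(samples)

def first_principal_component(rows, iters=500):
    """1D PCA scores via power iteration on the covariance of the centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)] for a in range(d)]
    v = [float(j + 1) for j in range(d)]  # deterministic starting vector
    for _ in range(iters):
        v = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(t * t for t in v))
        v = [t / norm for t in v]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]

# Hypothetical proxies f_i(x) = a_i * x standing in for the expert models of 5 environments.
coeffs = [0.0, 0.25, 0.5, 0.75, 1.0]
proxies = [(lambda x, a=a: a * x) for a in coeffs]
samples = [0.5, 1.0, 1.5, 2.0]

# Row i of D is the feature vector D_i of environment i.
D = [[function_distance(fi, fj, samples) for fj in proxies] for fi in proxies]
labels = first_principal_component(D)  # surrogate label c per environment
```

In this toy case the labels vary monotonically with the hidden coefficient $a_i$, so the 1D label orders the environments consistently even though $a_i$ is never given to the procedure. The sign of a principal component is arbitrary, so the ordering may be ascending or descending.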

3.3 Domain-adaptive Model Zoo

We construct a model zoo to serve as the training corpus for EnvAd-Diff, adhering to three principles: 1) creating powerful expert models tailored to specific environments; 2) ensuring efficient and rapid construction; and 3) guaranteeing the stability of EnvAd-Diff training.

Neural operators [24] are employed as expert models, with the minimum parameter count required for effective prediction. Prior to large-scale training of expert models for each environment, a global model is pretrained on trajectories from all available environments. This pretraining provides domain-adaptive initialization [14] for environment-specific expert training, significantly reducing the required number of training epochs. To encourage exploration of the weight landscape for different expert models within the same environment, we randomly select a layer of weights from the domain-adaptive initialization and introduce noise. We demonstrate that domain-adaptive initialization not only accelerates model zoo construction but also reduces the degrees of freedom of the weights while maintaining expert model accuracy. This significantly stabilizes the subsequent generative learning.
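The initialization step can be sketched as below. This is a minimal illustration, assuming weights are stored as a mapping from layer names to flat lists and that the perturbation is additive Gaussian noise; the layer names and the noise scale `noise_std` are hypothetical, as the text does not specify them.

```python
import copy
import random

def perturbed_initialization(global_weights, noise_std, rng):
    """Copy the pretrained global weights and add Gaussian noise to one random layer.

    global_weights maps layer names to flat lists of floats. noise_std is a
    hypothetical hyperparameter; the text does not state the noise scale.
    """
    init = copy.deepcopy(global_weights)
    layer = rng.choice(sorted(init))
    init[layer] = [w + rng.gauss(0.0, noise_std) for w in init[layer]]
    return init, layer

rng = random.Random(0)
global_weights = {  # stand-in for a pretrained global model
    "lift": [0.1, -0.2, 0.3],
    "spectral_1": [0.5, 0.4, -0.1, 0.2],
    "project": [0.05, -0.07],
}
# Each expert in the zoo would start from its own perturbed copy before training.
inits = [perturbed_initialization(global_weights, 0.01, rng) for _ in range(4)]
```

Since every expert starts from the same pretrained point and differs in only one layer, the trained experts are expected to stay close to one another in weight space, which is consistent with the reduced degrees of freedom noted above.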

4 Experiment

4.1 Experimental Setup

We assume unknown environmental conditions for all dynamical systems, training models solely on observed trajectories across diverse visible environments. At test time, models autoregressively predict future states given a single initial frame. Test environments are categorized as in-domain (seen during training, novel initial conditions) and out-domain (unseen environments). We evaluate prediction quality using root mean square error (RMSE) and structural similarity index (SSIM).

Baselines

We compare against two baseline categories: foundation models (One-for-All) and meta-learning approaches (Env-Adaptive). The foundation models are trained via empirical risk minimization [25] on trajectories from all visible environments. The meta-learning method learns environment-shared weights and an environment-specific weight-generating hypernetwork. Following existing work [9], we enable zero-shot prediction by conditioning the hypernetwork on environmental parameters, which assumes known ground-truth environmental conditions. Additionally, we assume all environments are visible and train a dedicated Fourier neural operator [26] (FNO) for each environment as a performance upper bound (One-per-Env). Unless otherwise specified, we use FNO as the expert small model for EnvAd-Diff and other meta-learning methods. Architectural and hyperparameter details are in Appendix C and F.

4.2 Dynamical Systems and Model Zoo

We validate the model’s effectiveness on four time-dependent PDE systems and one real-world dataset: 1) Cylinder Flow [27]; 2) Lambda-Omega [28]; 3) Kolmogorov Flow [29]; 4) Navier-Stokes Equations [7]; and 5) ERA5 Dataset [30]. For the PDE systems, we use equation coefficients or external forcing as environmental variables and simulate multiple trajectories under different environments for training and testing. We train 100 FNO weight sets for each seen environment across all systems to serve as EnvAd-Diff’s model zoo (size 100). Detailed descriptions and data generation procedures for each system are provided in Appendix A and B.

4.3 Main Results

PDE systems

We report the generalization performance on 4 PDE systems in Table 1, detailing the number of in/out-domain environments and the parameter size of models for each system during testing. Env-Adaptive methods, which adjust optimal weights for each environment, have significantly fewer parameters than the foundation models. Across nearly all systems, EnvAd-Diff achieves the best average performance, demonstrating its ability to model the conditional dependence of the predictive model on environments. Consequently, its small, environment-specific expert models outperform foundation models with hundreds of times more parameters. Furthermore, unlike other meta-learning approaches, EnvAd-Diff treats model weights holistically during generation, without forcing the retention of environment-shared components. This potentially expands EnvAd-Diff’s search space for improved generalization.

We also find that some models can outperform One-per-Env in specific environments. This is likely due to the stochasticity of initialization and training, as One-per-Env models do not always converge to the optimal point. We illustrate this result with Cylinder Flow (2 environmental variables) in Figure 3. Although the overall SSIM of One-per-Env is close to 1, it performs suboptimally in certain regions (green box in Figure 3). The FNO weights generated by EnvAd-Diff perform better than One-per-Env in some environments, even unseen ones. This suggests that EnvAd-Diff captures the manifold on which the joint distribution of weights and environments lies, whereas optimizer-based training can fail to converge onto this manifold, possibly because it gets stuck in local optima [31].

Table 1: Average RMSE (± std from 5 runs) in in- and out-domain environments (in:out split shown in the header row). Best in bold, second best underlined.

| Category | Method | Testing Params | Cylinder Flow (96:400) In-domain | Cylinder Flow (96:400) Out-domain | Lambda-Omega (12:39) In-domain | Lambda-Omega (12:39) Out-domain | Kolmogorov Flow (12:39) In-domain | Kolmogorov Flow (12:39) Out-domain | Navier-Stokes (24:121) In-domain | Navier-Stokes (24:121) Out-domain |
|---|---|---|---|---|---|---|---|---|---|---|
| One-for-All | FNO [5] | 500M+ | 0.082 ± 0.025 | 0.083 ± 0.023 | 0.352 ± 0.041 | 0.363 ± 0.040 | 0.080 ± 0.020 | 0.096 ± 0.016 | 0.066 ± 0.009 | 0.074 ± 0.015 |
| One-for-All | DPOT [11] | 500M+ | 0.091 ± 0.008 | 0.090 ± 0.007 | 0.324 ± 0.007 | 0.325 ± 0.007 | **0.079 ± 0.012** | 0.091 ± 0.020 | 0.087 ± 0.021 | 0.093 ± 0.020 |
| One-for-All | Poseidon [10] | 600M+ | 0.085 ± 0.014 | 0.083 ± 0.015 | 0.301 ± 0.013 | 0.318 ± 0.009 | 0.102 ± 0.006 | 0.103 ± 0.005 | 0.092 ± 0.017 | 0.095 ± 0.016 |
| One-for-All | MPP [12] | 550M+ | 0.102 ± 0.020 | 0.098 ± 0.019 | 0.311 ± 0.054 | 0.313 ± 0.055 | 0.098 ± 0.017 | 0.103 ± 0.022 | 0.095 ± 0.026 | 0.096 ± 0.028 |
| Env-Adaptive | DyAd [8] | 1M+ | 0.101 ± 0.021 | 0.104 ± 0.019 | 0.138 ± 0.078 | 0.137 ± 0.075 | 0.099 ± 0.006 | 0.098 ± 0.005 | 0.091 ± 0.018 | 0.096 ± 0.015 |
| Env-Adaptive | LEADS [32] | 1M+ | 0.117 ± 0.031 | 0.115 ± 0.036 | 0.132 ± 0.034 | 0.123 ± 0.032 | 0.107 ± 0.011 | 0.105 ± 0.010 | 0.091 ± 0.022 | 0.094 ± 0.020 |
| Env-Adaptive | CoDA [7] | 1M+ | 0.112 ± 0.032 | 0.110 ± 0.033 | 0.119 ± 0.034 | 0.116 ± 0.032 | 0.097 ± 0.019 | 0.098 ± 0.019 | 0.096 ± 0.016 | 0.098 ± 0.014 |
| Env-Adaptive | GEPS [29] | 1M+ | 0.086 ± 0.020 | 0.093 ± 0.021 | 0.094 ± 0.041 | 0.092 ± 0.039 | 0.089 ± 0.009 | 0.086 ± 0.008 | 0.098 ± 0.011 | 0.099 ± 0.010 |
| Env-Adaptive | CAMEL [9] | 1M+ | 0.109 ± 0.018 | 0.109 ± 0.015 | 0.104 ± 0.018 | 0.103 ± 0.018 | 0.096 ± 0.013 | 0.101 ± 0.016 | 0.106 ± 0.018 | 0.109 ± 0.015 |
| Env-Adaptive | EnvAd-Diff | 1M+ | **0.053 ± 0.025** | **0.065 ± 0.031** | **0.093 ± 0.026** | **0.089 ± 0.027** | 0.083 ± 0.008 | **0.084 ± 0.006** | **0.060 ± 0.007** | **0.064 ± 0.007** |
| One-per-Env | | 1M+ | 0.040 ± 0.040 | 0.038 ± 0.040 | 0.038 ± 0.032 | 0.035 ± 0.008 | 0.069 ± 0.021 | 0.071 ± 0.019 | 0.046 ± 0.007 | 0.047 ± 0.009 |

Figure 3: Prediction performance on Cylinder Flow. SSIM distribution of (a) One-per-Env and (b) EnvAd-Diff; (c) Ratio where [method] outperforms One-per-Env; (d) Differences between EnvAd-Diff and One-per-Env. Green circles mark environments seen during training.

Real-world dataset

We use the ERA5 reanalysis dataset, specifically the east-west and north-south wind speeds at a height of 100 meters. The spatial resolution is 0.25°, and the temporal resolution is 1 hour. We use January 2018 wind speeds as the training set and January 2019 as the test set. To define different environments, we divide the globe into a 6×12 grid of subregions at 30° intervals [8]. We randomly select 24 subregions as seen environments, with the remaining 48 as unseen environments. The experimental results are shown in Figure 4. EnvAd-Diff outperforms all foundation models and surpasses One-per-Env in some unseen subregions.
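The partition into environments described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the longitude convention (-180° to 180°), the random seed, and the helper names `make_subregions` and `split_seen_unseen` are hypothetical and not taken from the experimental code.

```python
import random

def make_subregions(step_deg=30):
    """Split the globe into (lat_min, lat_max, lon_min, lon_max) boxes of `step_deg` degrees."""
    boxes = []
    for lat0 in range(-90, 90, step_deg):
        for lon0 in range(-180, 180, step_deg):  # assumed longitude convention
            boxes.append((lat0, lat0 + step_deg, lon0, lon0 + step_deg))
    return boxes

def split_seen_unseen(boxes, n_seen, seed=0):
    """Randomly mark `n_seen` boxes as seen environments; the rest are unseen."""
    rng = random.Random(seed)
    seen_ids = set(rng.sample(range(len(boxes)), n_seen))
    seen = [b for i, b in enumerate(boxes) if i in seen_ids]
    unseen = [b for i, b in enumerate(boxes) if i not in seen_ids]
    return seen, unseen

boxes = make_subregions(30)
seen, unseen = split_seen_unseen(boxes, n_seen=24)
print(len(boxes), len(seen), len(unseen))  # 72 24 48
```

At a spatial resolution of 0.25°, each 30° subregion spans 120 grid cells per side.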

Figure 4: Prediction performance on ERA5 data. (a) One frame of ground-truth wind speed. (b) SSIM difference between EnvAd-Diff and One-per-Env; green boxes mark environments seen during training. (c) Average prediction RMSE of EnvAd-Diff and foundation models.

4.4 Explainability

Figure 5: Joint distribution of weights and environments on Cylinder Flow.

We visualize the joint distribution of weights and environments using the Cylinder Flow system as an example to aid qualitative analysis, where over 80% of the environments are unseen by EnvAd-Diff. In Figure 5, the x-axis represents the surrogate environment labels predicted by the prompter, and the y-axis represents the first principal component of the weights of a specific layer. The weight-environment landscape learned by EnvAd-Diff closely resembles that learned by One-per-Env through optimizer training. This indicates that EnvAd-Diff successfully models the joint distribution of weights and environments, which helps explain its superior performance in Table 1.
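A projection of this kind can be computed as in the sketch below. It is an illustration on synthetic data, not the analysis code: the surrogate labels, the 16-dimensional layer weights, and the assumption that the weights drift linearly with the label are all hypothetical stand-ins for a real model zoo.

```python
import numpy as np

def first_principal_component(weights):
    """Project each flattened weight vector onto the top principal direction."""
    centered = weights - weights.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

rng = np.random.default_rng(0)
env_labels = np.linspace(0.0, 1.0, 50)   # hypothetical surrogate environment labels
direction = rng.normal(size=16)          # hypothetical direction along which weights drift
layer_weights = env_labels[:, None] * direction + 0.01 * rng.normal(size=(50, 16))

pc1 = first_principal_component(layer_weights)  # y-axis values of the scatter plot
corr = np.corrcoef(env_labels, pc1)[0, 1]
print(abs(corr))
```

Plotting `pc1` against `env_labels` for both optimizer-trained and generated weights gives a landscape comparison of the sort shown in Figure 5. The sign of a principal component is arbitrary, so only the shape of the landscape is meaningful.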

Figure 6: Generative functions of EnvAd-Diff for the LV system.

To quantitatively analyze the weight-environment joint distribution fitted by EnvAd-Diff, we introduce a simple ODE system, the Lotka-Volterra (LV) equations [7], as a toy example. We use a symbolic regression algorithm [33] to distill the predictive model generated by EnvAd-Diff for specific environmental conditions ($\beta$ and $\delta$) into a closed-form equation, as shown in Figure 6. The equation distilled from the generated weights has the same form as the LV equations, and its coefficients are close to the true environmental coefficients. This quantitatively demonstrates that EnvAd-Diff can fit the generalizable function $f$ in Equation 1 rather than an environment-specific function. We detail the experimental setup in Appendix E.2.
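The distillation step can be illustrated with sequentially thresholded least squares, the sparse regression procedure of [33]. The sketch below is a simplified stand-in under stated assumptions: the derivative samples come from the true LV equations with small added noise rather than from a generated network, the coefficient values are hypothetical, and the candidate library and threshold are chosen only for illustration.

```python
import numpy as np

def lv_derivatives(states, alpha, beta, gamma, delta):
    """Lotka-Volterra right-hand side: dx/dt = alpha*x - beta*x*y, dy/dt = delta*x*y - gamma*y."""
    x, y = states[:, 0], states[:, 1]
    return np.stack([alpha * x - beta * x * y, delta * x * y - gamma * y], axis=1)

def candidate_library(states):
    """Candidate terms [1, x, y, x*y, x^2, y^2] evaluated at every state."""
    x, y = states[:, 0], states[:, 1]
    return np.stack([np.ones_like(x), x, y, x * y, x**2, y**2], axis=1)

def stlsq(theta, targets, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small coefficients, refit the rest."""
    coef = np.linalg.lstsq(theta, targets, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        for k in range(targets.shape[1]):
            keep = ~small[:, k]
            if keep.any():
                coef[keep, k] = np.linalg.lstsq(theta[:, keep], targets[:, k], rcond=None)[0]
    return coef

rng = np.random.default_rng(0)
states = rng.uniform(0.5, 2.0, size=(400, 2))
true_env = dict(alpha=0.5, beta=0.75, gamma=0.5, delta=1.0)  # hypothetical coefficients
# Stand-in for querying a generated predictor: exact derivatives plus small noise.
derivs = lv_derivatives(states, **true_env) + 1e-3 * rng.normal(size=(400, 2))

coef = stlsq(candidate_library(states), derivs)
terms = ["1", "x", "y", "x*y", "x^2", "y^2"]
for k, name in enumerate(["dx/dt", "dy/dt"]):
    active = [f"{coef[i, k]:.3f}*{t}" for i, t in enumerate(terms) if coef[i, k] != 0.0]
    print(name, "=", " + ".join(active))
```

Comparing the recovered coefficients of the `x*y` terms with the $\beta$ and $\delta$ used for generation is the kind of check that Figure 6 reports.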

4.5 Robustness

We assess EnvAd-Diff’s robustness to data variations, specifically the impact of the model zoo size and the number of seen environments. We first examine the effect of model zoo size on the Cylinder Flow and Lambda-Omega systems, as depicted in Figures 7a and 7b. The results indicate that EnvAd-Diff remains relatively stable down to a zoo size of 50. As the zoo size decreases further, performance begins to deteriorate, even in-domain. Subsequently, we test the influence of the number of seen environments on the Kolmogorov Flow and Navier-Stokes systems, where the number of seen environments ranges from approximately 5% to 20% of the total. Increasing the number of seen environments reduces prediction error, but the gains become marginal once it reaches around 20%. This suggests that EnvAd-Diff learns the underlying joint distribution of weights and environments from a small number of environments, rather than overfitting to trajectory samples within those environments.

Figure 7: Robustness experiments. Impact of model zoo size on EnvAd-Diff’s performance on (a) Cylinder Flow and (b) Lambda-Omega. Impact of the number of seen environments (e) on EnvAd-Diff’s performance on (c) Kolmogorov Flow and (d) Navier-Stokes.

4.6 Extensibility

Figure 8: EnvAd-Diff on the Cylinder Flow with different expert models.

The weight graph structure proposed in Section 3.1.1 can organize neural networks of arbitrary architectures. Here, we extend EnvAd-Diff to additional neural operators as expert models, including the Wavelet Neural Operator [34] (WNO) and the U-shaped Neural Operator [35] (UNO). Our experimental results on Cylinder Flow are presented in Figure 8. With each of these neural operators, EnvAd-Diff consistently achieves excellent generalization performance, with only minor variations depending on the specific operator architecture. This demonstrates that EnvAd-Diff is a model-agnostic framework capable of benefiting from the sophisticated architectural designs of its expert models. Detailed architectures of these neural operators are provided in Appendix F.

4.7 Ablation Study

We conduct an ablation study on the key components of EnvAd-Diff. Specifically, we verify the necessity of domain initialization when building the model zoo and of the function loss used during VAE training. Experimental results on the Kolmogorov Flow and Navier-Stokes systems are presented in Table 4. When the function loss is omitted, the VAE relies solely on MSE for reconstruction similarity, leading to suboptimal generated weights, particularly out-domain. The function loss relaxes the VAE encoding constraints and helps prevent overfitting by prioritizing functional consistency over exact reconstruction. Removing domain initialization results in a significant deterioration in the performance of the generated weights. We attribute this to the high complexity of the weight distribution in a randomly initialized model zoo, which significantly increases the diffusion model’s modeling difficulty. We conclude that for weight generation aimed at generalization, sample quality is far more critical than diversity.
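The difference between exact reconstruction and functional consistency can be seen on a tiny network. The sketch below assumes that a function loss compares the outputs of the original and reconstructed weights on probe inputs; the exact definition used for VAE training is given in the method section and is not reproduced here. The two-unit network and the helpers `mlp`, `weight_mse`, and `function_loss` are hypothetical.

```python
import math

def mlp(x, w1, b1, w2):
    """One-hidden-layer network: sum_j w2[j] * tanh(w1[j] * x + b1[j])."""
    return sum(w2j * math.tanh(w1j * x + b1j) for w1j, b1j, w2j in zip(w1, b1, w2))

def weight_mse(params_a, params_b):
    """Mean squared error between two flattened parameter sets."""
    flat_a = [v for group in params_a for v in group]
    flat_b = [v for group in params_b for v in group]
    return sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)

def function_loss(params_a, params_b, inputs):
    """Mean squared error between the two networks' outputs on probe inputs."""
    return sum((mlp(x, *params_a) - mlp(x, *params_b)) ** 2 for x in inputs) / len(inputs)

original = ([1.0, -2.0], [0.5, 0.0], [3.0, 1.0])
# Swapping the two hidden units changes every stored weight but not the function.
permuted = ([-2.0, 1.0], [0.0, 0.5], [1.0, 3.0])
inputs = [i / 10.0 for i in range(-20, 21)]

print(weight_mse(original, permuted))
print(function_loss(original, permuted, inputs))
```

A purely weight-space MSE penalizes the permuted network heavily even though it computes the same function, whereas the output-based loss treats the two as equivalent.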

We also compare real environmental conditions with surrogate labels in Appendix E.1.

4.8 Computational Cost

Figure 9: (a) Time cost and (b) GPU memory during testing on the Navier-Stokes system.
Time cost

We compare the time overhead of EnvAd-Diff and One-per-Env when adapting to new environments, as shown in Figure 9a. One-per-Env requires training weights for each new environment using observational data. EnvAd-Diff’s overhead includes building the model zoo (accelerated by domain initialization) and generating weights for new environments. Despite the upfront time cost of preparing the model zoo, EnvAd-Diff generates weights significantly faster than training a new prediction model and requires no data from the new environment.

GPU memory

We compare the GPU memory usage of EnvAd-Diff and other baselines during inference (Figure 9b). Thanks to the proposed weight graph structure, EnvAd-Diff’s attention computation unfolds along the node dimension, significantly reducing computational overhead and memory. In addition, we detail the storage overhead of the model zoo in Appendix B.

5 Related Work

5.1 Dynamics Prediction across Environments

Developing dynamics prediction models that generalize across environments is a crucial problem in scientific machine learning and has garnered significant research interest. We review the main approaches in this area. The first category trains large-parameter neural solvers as foundation models using extensive simulated data [36, 37, 38]. Subramanian et al. [5] explore the generalization performance of classical FNO architectures across different parameter sizes. Subsequently, models such as MPP [12], DPOT [11], and Poseidon [10] employ more advanced architectures to improve computational efficiency and approximation capabilities. The second category is meta-learning [39]. These methods, including DyAd [8], LEADS [32], CoDA [7], GEPS [29], CAMEL [9], and NCF [40], capture cross-environment invariants through environment-shared weights and fine-tune environment-specific weights on limited data from new environments for adaptation. Other approaches include in-context learning [14]. Yang et al. [13] frame differential equation forward and inverse problems as natural language statements, pre-train transformers, and provide solution examples for new environments as context to enhance model performance. In contrast to these works, we treat the complete model weights as the objects to be generated and explicitly model their joint distribution with the environment.

5.2 Diffusion for Network Weight Generation

Generating neural network weights is a relatively nascent research area [41]. An initial line of work trains MLPs to overfit implicit neural fields, distilling the fields into model weights, and subsequently generates these MLP weights as an alternative to directly generating the fields [42]. Another category proposes using generated weights to replace hand-crafted initialization, thereby accelerating and improving the neural network training process [43, 44, 45]. These efforts primarily focus on image modalities. More recent studies leverage diffusion models to address generalization in various domains. Yuan et al. [46] employ an urban knowledge graph as a prompt to guide diffusion in generating spatio-temporal prediction model weights for new cities. Zhang et al. [47] replace the inner-loop gradient updates of meta-learning with diffusion-generated weights. A recent work [48] explores extracting features from unseen datasets and controlling diffusion to generate adapted model weights for them. However, these methods exhibit limited zero-shot performance. This may stem from directly flattening the weights, which disrupts the neural network’s inherent topological connections and constrains the representational capacity of the generative model. In contrast, we organize weights in the form of a neural graph and introduce a function loss to guide their representation. The zero-shot prediction performance of our generated model surpasses that of pre-trained models with hundreds of times more weights.
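To make the contrast with flattening concrete, the sketch below converts a small MLP into a graph in which neurons are nodes carrying their biases and weights are edges. It follows the common neural-graph convention (in the spirit of [15]) and is an assumption-laden illustration, not the weight graph defined in Section 3.1.1; `mlp_to_graph` and the example layer sizes are hypothetical.

```python
def mlp_to_graph(layers):
    """Convert an MLP into (node_features, edges).

    layers: list of (weight_matrix, bias_vector), where weight_matrix[j][i]
    connects input neuron i to output neuron j of that layer.
    """
    n_inputs = len(layers[0][0][0])
    node_features = [0.0] * n_inputs  # input neurons carry no bias
    edges = []                        # (source_node, target_node, weight)
    prev_ids = list(range(n_inputs))
    for weights, biases in layers:
        start = len(node_features)
        cur_ids = list(range(start, start + len(biases)))
        node_features.extend(biases)  # each neuron's bias becomes its node feature
        for j, row in enumerate(weights):
            for i, w in enumerate(row):
                edges.append((prev_ids[i], cur_ids[j], w))
        prev_ids = cur_ids
    return node_features, edges

# A 2-3-1 network with hypothetical parameter values.
layers = [
    ([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0.1, 0.2, 0.3]),
    ([[7.0, 8.0, 9.0]], [0.4]),
]
nodes, edges = mlp_to_graph(layers)
print(len(nodes), len(edges))  # 6 9
```

In the graph form, every weight keeps an explicit record of which two neurons it connects, which is the connectivity information that a flat weight vector discards.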

6 Conclusion

In this work, we propose explicitly modeling the conditional dependence of the predictive model on environments to achieve effective cross-environment prediction. We introduce an environment-adaptive dynamics diffusion model, EnvAd-Diff, to generate expert models for specific environments. We organize the model weights as a weight graph to preserve their inherent connectivity, accommodating arbitrary model architectures. In addition, we design physics-informed surrogate labels for unknown physical environments and a high-quality model zoo for the training of EnvAd-Diff. Experiments on simulated and real-world systems demonstrate that 1M-parameter models generated by EnvAd-Diff for specific environments predict more accurately than 500M-parameter foundation models.

Limitations & future work

The performance of EnvAd-Diff is contingent on the quality of the model zoo, including the diversity of environments and the performance of expert models. This presents a challenge for the initial construction of the model zoo. Future work will primarily focus on developing more data-efficient and robust models for weight representation.

References

  • [1] Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023.
  • [2] Andreas Mardt, Luca Pasquali, Hao Wu, and Frank Noé. VAMPnets for deep learning of molecular kinetics. Nature Communications, 9(1):5, 2018.
  • [3] Dule Shu, Zijie Li, and Amir Barati Farimani. A physics-informed diffusion model for high-fidelity flow field reconstruction. Journal of Computational Physics, 478:111972, 2023.
  • [4] Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619(7970):533–538, 2023.
  • [5] Shashank Subramanian, Peter Harrington, Kurt Keutzer, Wahid Bhimji, Dmitriy Morozov, Michael W Mahoney, and Amir Gholami. Towards foundation models for scientific machine learning: Characterizing scaling and transfer behavior. Advances in Neural Information Processing Systems, 36:71242–71262, 2023.
  • [6] Somdatta Goswami, Katiana Kontolati, Michael D Shields, and George Em Karniadakis. Deep transfer operator learning for partial differential equations under conditional shift. Nature Machine Intelligence, 4(12):1155–1164, 2022.
  • [7] Matthieu Kirchmeyer, Yuan Yin, Jérémie Donà, Nicolas Baskiotis, Alain Rakotomamonjy, and Patrick Gallinari. Generalizing to new physical systems via context-informed dynamics model. In International Conference on Machine Learning, pages 11283–11301. PMLR, 2022.
  • [8] Rui Wang, Robin Walters, and Rose Yu. Meta-learning dynamics forecasting using task inference. Advances in Neural Information Processing Systems, 35:21640–21653, 2022.
  • [9] Matthieu Blanke and Marc Lelarge. Interpretable meta-learning of physical systems. In The Twelfth International Conference on Learning Representations, 2024.
  • [10] Maximilian Herde, Bogdan Raonic, Tobias Rohner, Roger Käppeli, Roberto Molinaro, Emmanuel de Bézenac, and Siddhartha Mishra. Poseidon: Efficient foundation models for PDEs. Advances in Neural Information Processing Systems, 37:72525–72624, 2024.
  • [11] Zhongkai Hao, Chang Su, Songming Liu, Julius Berner, Chengyang Ying, Hang Su, Anima Anandkumar, Jian Song, and Jun Zhu. DPOT: Auto-regressive denoising operator transformer for large-scale PDE pre-training. In International Conference on Machine Learning, pages 17616–17635. PMLR, 2024.
  • [12] Michael McCabe, Bruno Régaldo-Saint Blancard, Liam Parker, Ruben Ohana, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Siavash Golkar, Geraud Krawezik, Francois Lanusse, et al. Multiple physics pretraining for spatiotemporal surrogate models. Advances in Neural Information Processing Systems, 37:119301–119335, 2024.
  • [13] Liu Yang, Siting Liu, Tingwei Meng, and Stanley J Osher. In-context operator learning with data prompts for differential equation problems. Proceedings of the National Academy of Sciences, 120(39):e2310142120, 2023.
  • [14] Wuyang Chen, Jialin Song, Pu Ren, Shashank Subramanian, Dmitriy Morozov, and Michael W Mahoney. Data-efficient operator learning via unsupervised pretraining and in-context learning. Advances in Neural Information Processing Systems, 37:6213–6245, 2024.
  • [15] Miltiadis Kofinas, Boris Knyazev, Yan Zhang, Yunlu Chen, Gertjan J Burghouts, Efstratios Gavves, Cees GM Snoek, and David W Zhang. Graph neural networks for learning equivariant representations of neural networks. arXiv preprint arXiv:2403.12143, 2024.
  • [16] Maximilian Plattner, Arturs Berzins, and Johannes Brandstetter. Shape generation via weight space learning. arXiv preprint arXiv:2503.21830, 2025.
  • [17] Léo Meynent, Ivan Melev, Konstantin Schürholt, Göran Kauermann, and Damian Borth. Structure is not enough: Leveraging behavior for neural network weight reconstruction. arXiv preprint arXiv:2503.17138, 2025.
  • [18] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [19] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  • [20] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10850–10869, 2023.
  • [21] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
  • [22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • [23] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
  • [24] Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces with applications to PDEs. Journal of Machine Learning Research, 24(89):1–97, 2023.
  • [25] Ibrahim Ayed, Emmanuel de Bézenac, Arthur Pajot, Julien Brajard, and Patrick Gallinari. Learning dynamical systems from partial observations. arXiv preprint arXiv:1902.11136, 2019.
  • [26] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020.
  • [27] Ruikun Li, Jingwen Cheng, Huandong Wang, Qingmin Liao, and Yong Li. Predicting the dynamics of complex system via multiscale diffusion autoencoder. arXiv preprint arXiv:2505.02450, 2025.
  • [28] Kathleen Champion, Bethany Lusch, J Nathan Kutz, and Steven L Brunton. Data-driven discovery of coordinates and governing equations. Proceedings of the National Academy of Sciences, 116(45):22445–22451, 2019.
  • [29] Armand Kassaï Koupaï, Jorge Mifsut Benet, Yuan Yin, Jean-Noël Vittaut, and Patrick Gallinari. GEPS: Boosting generalization in parametric PDE neural solvers through adaptive conditioning. arXiv preprint arXiv:2410.23889, 2024.
  • [30] Zongwei Zhang, Lianlei Lin, Sheng Gao, Junkai Wang, Hanqing Zhao, and Hangyi Yu. A machine learning model for hub-height short-term wind speed prediction. Nature Communications, 16(1):3195, 2025.
  • [31] Antonio Sclocchi and Matthieu Wyart. On the different regimes of stochastic gradient descent. Proceedings of the National Academy of Sciences, 121(9):e2316301121, 2024.
  • [32] Yuan Yin, Ibrahim Ayed, Emmanuel de Bézenac, Nicolas Baskiotis, and Patrick Gallinari. LEADS: Learning dynamical systems that generalize across environments. Advances in Neural Information Processing Systems, 34:7561–7573, 2021.
  • [33] Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016.
  • [34] Tapas Tripura and Souvik Chakraborty. Wavelet neural operator for solving parametric partial differential equations in computational mechanics problems. Computer Methods in Applied Mechanics and Engineering, 404:115783, 2023.
  • [35] Md Ashiqur Rahman, Zachary E Ross, and Kamyar Azizzadenesheli. U-NO: U-shaped neural operators. arXiv preprint arXiv:2204.11127, 2022.
  • [36] Md Ashiqur Rahman, Robert Joseph George, Mogab Elleithy, Daniel Leibovici, Zongyi Li, Boris Bonev, Colin White, Julius Berner, Raymond A Yeh, Jean Kossaifi, et al. Pretraining codomain attention neural operators for solving multiphysics PDEs. Advances in Neural Information Processing Systems, 37:104035–104064, 2024.
  • [37] Benedikt Alkin, Andreas Fürst, Simon Schmid, Lukas Gruber, Markus Holzleitner, and Johannes Brandstetter. Universal physics transformers: A framework for efficiently scaling neural operators. Advances in Neural Information Processing Systems, 37:25152–25194, 2024.
  • [38] Tianyu Chen, Haoyi Zhou, Ying Li, Hao Wang, Chonghan Gao, Rongye Shi, Shanghang Zhang, and Jianxin Li. Building flexible machine learning models for scientific computing at scale. arXiv preprint arXiv:2402.16014, 2024.
  • [39] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
  • [40] Roussel Desmond Nzoyem, David AW Barton, and Tom Deakin. Neural context flows for meta-learning of dynamical systems. arXiv preprint arXiv:2405.02154, 2024.
  • [41] Kai Wang, Dongwen Tang, Boya Zeng, Yida Yin, Zhaopan Xu, Yukun Zhou, Zelin Zang, Trevor Darrell, Zhuang Liu, and Yang You. Neural network diffusion. arXiv preprint arXiv:2402.13144, 2024.
  • [42] Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. HyperDiffusion: Generating implicit neural fields with weight-space diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14300–14310, 2023.
  • [43] Yifan Gong, Zheng Zhan, Yanyu Li, Yerlan Idelbayev, Andrey Zharkov, Kfir Aberman, Sergey Tulyakov, Yanzhi Wang, and Jian Ren. Efficient training with denoised neural weights. In European Conference on Computer Vision, pages 18–34, 2024.
  • [44] Konstantin Schürholt, Boris Knyazev, Xavier Giró-i Nieto, and Damian Borth. Hyper-representations as generative models: Sampling unseen neural network weights. Advances in Neural Information Processing Systems, 35:27906–27920, 2022.
  • [45] Konstantin Schürholt, Michael W Mahoney, and Damian Borth. Towards scalable and versatile weight space learning. In Proceedings of the 41st International Conference on Machine Learning, pages 43947–43966, 2024.
  • [46] Yuan Yuan, Chenyang Shao, Jingtao Ding, Depeng Jin, and Yong Li. Spatio-temporal few-shot learning via diffusive neural network generation. arXiv preprint arXiv:2402.11922, 2024.
  • [47] Baoquan Zhang, Chuyao Luo, Demin Yu, Xutao Li, Huiwei Lin, Yunming Ye, and Bowen Zhang. Metadiff: Meta-learning with conditional diffusion for few-shot learning. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 16687–16695, 2024.
  • [48] Bedionita Soro, Bruno Andreis, Hayeon Lee, Wonyong Jeong, Song Chong, Frank Hutter, and Sung Ju Hwang. Diffusion-based neural network weights generation. arXiv preprint arXiv:2402.18153, 2024.
  • [49] Jacob Page, Peter Norgaard, Michael P Brenner, and Rich R Kerswell. Recurrent flow patterns as a basis for two-dimensional turbulence: Predicting statistics from structures. Proceedings of the National Academy of Sciences, 121(23):e2320007121, 2024.
  • [50] Makoto Takamoto, Timothy Praditia, Raphael Leiteritz, Daniel MacKinlay, Francesco Alesiani, Dirk Pflüger, and Mathias Niepert. Pdebench: An extensive benchmark for scientific machine learning. Advances in Neural Information Processing Systems, 35:1596–1611, 2022.
  • [51] Pantelis R Vlachas, Georgios Arampatzis, Caroline Uhler, and Petros Koumoutsakos. Multiscale simulations of complex systems by learning their effective dynamics. Nature Machine Intelligence, 4(4):359–366, 2022.
  • [52] Alan A. Kaptanoglu, Brian M. de Silva, Urban Fasel, Kadierdan Kaheman, Andy J. Goldschmidt, Jared Callaham, Charles B. Delahunt, Zachary G. Nicolaou, Kathleen Champion, Jean-Christophe Loiseau, J. Nathan Kutz, and Steven L. Brunton. Pysindy: A comprehensive python package for robust sparse system identification. Journal of Open Source Software, 7(69):3994, 2022.
  • [53] Brian de Silva, Kathleen Champion, Markus Quade, Jean-Christophe Loiseau, J. Kutz, and Steven Brunton. Pysindy: A python package for the sparse identification of nonlinear dynamical systems from data. Journal of Open Source Software, 5(49):2104, 2020.
  • [54] Jean Kossaifi, Nikola Kovachki, Zongyi Li, David Pitt, Miguel Liu-Schiaffini, Robert Joseph George, Boris Bonev, Kamyar Azizzadenesheli, Julius Berner, and Anima Anandkumar. A library for learning neural operators, 2024.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Appendix A Data Generation

Cylinder Flow system [27] is governed by:

$$
\left\{\begin{aligned}
\dot{u}_{t} &= -u\cdot\nabla u - \frac{1}{\alpha}\nabla p + \frac{\beta}{\alpha}\Delta u,\\
\dot{v}_{t} &= -v\cdot\nabla v + \frac{1}{\alpha}\nabla p - \frac{\beta}{\alpha}\Delta v.
\end{aligned}\right. \tag{7}
$$

In this system, we use the Reynolds number $Re$ and characteristic length $r$ as two environmental variables. The Reynolds number and characteristic length influence the lattice viscosity, which in turn affects the collision frequency, leading to different dynamic behaviors.

Lambda–Omega system [28] is governed by

$$
\left\{\begin{aligned}
\dot{u}_{t} &= \mu_{u}\Delta u + (1-u^{2}-v^{2})u + \beta(u^{2}+v^{2})v,\\
\dot{v}_{t} &= \mu_{v}\Delta v + (1-u^{2}-v^{2})v - \beta(u^{2}+v^{2})u,
\end{aligned}\right. \tag{8}
$$

where $\Delta$ is the Laplacian operator. For this system, we use $\beta$ as a 1-dimensional environmental variable. $\mu_{u}$ and $\mu_{v}$ are both set to 0.5.

Kolmogorov Flow system [49] is governed by

$$
\begin{aligned}
\partial_{t}\omega + (\mathbf{u}\cdot\nabla)\omega &= \frac{1}{Re}\Delta\omega - n\cos(ny),\\
\nabla^{2}\psi &= -\omega,\\
\mathbf{u} &= (u,v) = \left(\frac{\partial\psi}{\partial y}, -\frac{\partial\psi}{\partial x}\right),\\
\omega &= (\nabla\times\mathbf{u})\cdot\hat{\mathbf{z}} = \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y}.
\end{aligned}
$$

For this system, we use $Re$ as a 1-dimensional environmental variable. $n$ is set to 3.

Navier-Stokes system [50] is governed by

$$
\begin{aligned}
\frac{\partial\omega}{\partial t} + (\mathbf{u}\cdot\nabla)\omega &= \nu\Delta\omega + f,\\
\nabla^{2}\psi &= -\omega,\\
\mathbf{u} &= (u,v) = \left(\frac{\partial\psi}{\partial y}, -\frac{\partial\psi}{\partial x}\right),\\
\omega &= (\nabla\times\mathbf{u})\cdot\hat{\mathbf{z}} = \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y},
\end{aligned}
$$

where $f=A\left(\sin\left(2\pi(x+y+s)\right)+\cos\left(2\pi(x+y+s)\right)\right)$ is the driving force. We use the amplitude $A$ and phase $s$ as the two-dimensional environmental variables for this system, and the viscosity coefficient $\nu$ is set to 0.01.

The range of environmental values and simulation settings for each equation are listed in Table 2.

The Cylinder flow system is simulated using the lattice Boltzmann method (LBM) [51], with dynamics governed by the Navier-Stokes equations for turbulent flow around a cylindrical obstacle. The system is discretized using a lattice velocity grid, and the relaxation time is determined based on the kinematic viscosity and Reynolds number. Data collection begins once the turbulence has stabilized.

For the Lambda–Omega system, the reaction-diffusion equations are numerically integrated over time using an ODE solver.

For the Kolmogorov Flow and Navier-Stokes systems, we perform numerical simulations based on the vorticity form of the equations. At each step, the velocity field is recovered from the vorticity by solving a Poisson equation with Fourier transforms, spatial derivatives are evaluated numerically, and an ODE solver advances the vorticity in time.
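The Poisson step described above can be sketched as follows. This is a minimal illustration, not the authors' simulation code: it assumes a periodic square grid, and the function name `velocity_from_vorticity` and its `length` argument are hypothetical. It solves $\nabla^{2}\psi=-\omega$ in Fourier space and then applies $u=\partial\psi/\partial y$, $v=-\partial\psi/\partial x$.

```python
import numpy as np

def velocity_from_vorticity(omega, length=2 * np.pi):
    """Recover (u, v) from vorticity on a periodic square grid (illustrative).

    Solves laplacian(psi) = -omega spectrally, then uses
    u = d(psi)/dy and v = -d(psi)/dx. Axis 0 is x, axis 1 is y.
    """
    n = omega.shape[0]
    # Angular wavenumbers for a periodic domain of side `length`.
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0  # placeholder to avoid 0/0; the mean mode is zeroed below

    omega_hat = np.fft.fft2(omega)
    psi_hat = omega_hat / k2  # from -|k|^2 * psi_hat = -omega_hat
    psi_hat[0, 0] = 0.0       # psi is only defined up to a constant

    # Differentiation in Fourier space is multiplication by i*k.
    u = np.real(np.fft.ifft2(1j * ky * psi_hat))
    v = np.real(np.fft.ifft2(-1j * kx * psi_hat))
    return u, v
```

For the analytic stream function $\psi=\sin x\sin y$, the vorticity is $2\sin x\sin y$ and the routine should return $u=\sin x\cos y$ and $v=-\cos x\sin y$ up to floating-point error.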

Table 2: Simulation settings of each PDE system.
| | Cylinder Flow | Lambda–Omega | Kolmogorov Flow | Navier-Stokes |
|---|---|---|---|---|
| Spatial Domain | - | $[-10,10]^{2}$ | $[-\pi,\pi]^{2}$ | $[-32,32]^{2}$ |
| Grid Num | $128\times 64$ | $64\times 64$ | $64\times 64$ | $64\times 64$ |
| dt | 200 | 0.04 | 0.2 | 0.025 |
| T | 45,000 | 40.0 | 40.0 | 50.0 |
| Environments | $Re:[200,500,31]$, $r:[10,25,16]$ | $\beta:[1.0,1.5,51]$ | $Re:[50,250,51]$ | $A:[0.1,0.3,11]$, $s:[0.0,1.0,11]$ |

For each environment of each system, we generate 100 trajectories from different initial conditions for training and 20 trajectories for testing. During testing, autoregressive prediction runs for 100 steps on the Cylinder Flow and Lambda–Omega systems and for 50 steps on the Kolmogorov Flow and Navier-Stokes systems.

Appendix B Model Zoo

In our main experiments, the settings for all meta-learning methods (except for DyAd, which uses a UNet by default) and the basic model of EnvAd-Diff are shown in Table 3. Additionally, we report the storage overhead of the model zoo and the hyperparameter settings during generation. During training, we uniformly use the Adam optimizer with a learning rate of 1e-4, and other parameters are set to their default values.

Table 3: Detailed settings of the model zoo for each system.

| | Cylinder Flow | Lambda–Omega | Kolmogorov Flow | Navier-Stokes | ERA5 |
|---|---|---|---|---|---|
| Channel Num | 2 | 2 | 3 | 3 | 2 |
| N_modes | $[12,6]$ | $[8,8]$ | $[8,8]$ | $[8,8]$ | $[8,8]$ |
| N_layers | 4 | 4 | 8 | 8 | 4 |
| Hidden | 64 | 64 | 64 | 64 | 64 |
| Domain Pretraining (epochs) | 20 | 100 | 10 | 10 | 50 |
| Finetuning (epochs) | 50 | 50 | 50 | 50 | 20 |
| Storage Space (GB) | 18.0 | 3.5 | 3.5 | 7.1 | 3.4 |

Appendix C Model Architecture

The learnable parameters of EnvAd-Diff consist of a weight VAE and a weight latent transformer diffusion model. The VAE includes layer-wise linear projection layers at the start and end stages, and inter-node attention layers in between. The diffusion model includes a noise network with a transformer architecture, where conditions are injected through adaptive layer normalization.
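As an illustration of condition injection through adaptive layer normalization, the sketch below normalizes each token and then applies a scale and shift computed from the condition vector. It is a generic formulation and an assumption on our part: the text does not specify the exact parameterization, and the names `ada_layer_norm`, `w_scale`, and `w_shift` are hypothetical.

```python
import numpy as np

def ada_layer_norm(x, cond, w_scale, w_shift, eps=1e-5):
    """Adaptive layer norm (illustrative, not the authors' exact layer).

    x:       (tokens, dim) hidden states of the noise network
    cond:    (cond_dim,) embedding of the environment condition
    w_scale: (cond_dim, dim) linear map producing the per-channel scale
    w_shift: (cond_dim, dim) linear map producing the per-channel shift
    """
    # Standard layer normalization over the feature axis, without learned affine.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)

    # The condition, not a fixed parameter, decides the affine transform.
    scale = cond @ w_scale
    shift = cond @ w_shift
    return x_norm * (1.0 + scale) + shift
```

With zero projection weights the layer reduces to plain layer normalization, so the condition only modulates the features once the projections are trained.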

Appendix D Baseline Implementation

The same training settings were used for all models, including training for 100 epochs using the Adam optimizer with a learning rate of 1e-4. Regarding the selection of foundation model parameters, we uniformly adjusted the embedding dimension, number of layers, and number of heads based on the dimensions suggested in the original papers to ensure comparable parameter counts for all models. For environment-adaptive models, we primarily used the default hyperparameters.

Appendix E Additional Results

Table 4: Average RMSE of ablation study on domain initialization and function loss. ’w/o’ stands for ’without’.
| | Kolmogorov Flow (In-domain) | Kolmogorov Flow (Out-domain) | Navier-Stokes (In-domain) | Navier-Stokes (Out-domain) |
|---|---|---|---|---|
| w/o Domain Init | $0.156_{\pm 0.082}$ | $0.188_{\pm 0.102}$ | $0.197_{\pm 0.0102}$ | $0.201_{\pm 0.098}$ |
| w/o Function Loss | $0.098_{\pm 0.034}$ | $0.104_{\pm 0.038}$ | $0.104_{\pm 0.045}$ | $0.110_{\pm 0.046}$ |
| EnvAd-Diff | $\mathbf{0.083_{\pm 0.008}}$ | $\mathbf{0.084_{\pm 0.006}}$ | $\mathbf{0.060_{\pm 0.007}}$ | $\mathbf{0.064_{\pm 0.007}}$ |

E.1 Prompter

Figure 10: Prompter performance on the (a, b) Cylinder Flow and (c, d) Navier-Stokes systems. Red markers denote seen environments.
Table 5: Average RMSE of ablation study on prompter.
| | Cylinder Flow (In-domain) | Cylinder Flow (Out-domain) | Navier-Stokes (In-domain) | Navier-Stokes (Out-domain) |
|---|---|---|---|---|
| Prompter | $0.053_{\pm 0.025}$ | $0.065_{\pm 0.031}$ | $0.060_{\pm 0.007}$ | $\mathbf{0.064_{\pm 0.007}}$ |
| Environmental condition | $\mathbf{0.050_{\pm 0.014}}$ | $\mathbf{0.061_{\pm 0.027}}$ | $\mathbf{0.058_{\pm 0.007}}$ | $0.65_{\pm 0.006}$ |

Here we supplement the analysis of the prompter proposed in Section 3.2. We visualize the relationship between the estimated surrogate label $c$ and the real environmental variables $e$ on the Cylinder Flow and Navier-Stokes systems (Figure 10a and 10c). Results indicate that both exhibit clear distribution patterns, and the explained variance of PCA exceeds 50%. We verify the prediction accuracy of the prompter for the surrogate label $c$ on the test set, as shown in Figure 10b and 10d. The SVR-based prompter reliably infers the underlying physical label from the trajectory’s initial frame, which supports EnvAd-Diff’s excellent generalization performance. Furthermore, we show that using 1D PCA components as surrogate labels is sufficient in this context. EnvAd-Diff fully supports extension to more dimensions for more complex environments.

We also compare the performance of EnvAd-Diff when using real environmental conditions $e$ versus surrogate environmental labels $c$, as shown in Table 5. Experimental results indicate that there is little difference in EnvAd-Diff’s performance under the two settings. This demonstrates that the prompter effectively helps EnvAd-Diff distinguish different environments for generating suitable weights.

E.2 LV system

The Lotka-Volterra equations describe the interaction between a predator-prey pair in an ecosystem:

$$
\begin{aligned}
\frac{dx}{dt} &= \alpha x - \beta xy,\\
\frac{dy}{dt} &= \delta xy - \gamma y,
\end{aligned}
$$

where $x$ and $y$ respectively represent the quantity of the prey and the predator, and $\alpha$, $\beta$, $\delta$, $\gamma$ define the species interactions. We generate each trajectory with a time interval of 0.1 and a total duration of 10.0, with initial conditions randomly sampled between 1 and 3. We vary $\beta$ and $\delta$ as 2-dimensional environmental conditions, with both ranging from 0.5 to 1.0, sampled at 11 equally spaced points. Among a total of $11\times 11=121$ environments, we randomly select 24 as the training set and the rest as the test set. For all environments, $\alpha=0.5$ and $\gamma=0.5$. We generate 100 trajectories for each environment.
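The data-generation procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the integrator used for data generation is not specified in the text, so a fixed-step fourth-order Runge-Kutta scheme is used here, and the helper names `lv_rhs` and `rk4_trajectory`, the random seed, and the choice to store the initial state (giving 101 points per trajectory) are hypothetical.

```python
import numpy as np

def lv_rhs(state, alpha, beta, delta, gamma):
    """Right-hand side of the Lotka-Volterra equations."""
    x, y = state
    return np.array([alpha * x - beta * x * y, delta * x * y - gamma * y])

def rk4_trajectory(rhs, state0, dt, n_steps, **params):
    """Integrate with fixed-step RK4 (assumed solver); returns (n_steps + 1, dim)."""
    traj = np.empty((n_steps + 1, len(state0)))
    traj[0] = state0
    for i in range(n_steps):
        s = traj[i]
        k1 = rhs(s, **params)
        k2 = rhs(s + 0.5 * dt * k1, **params)
        k3 = rhs(s + 0.5 * dt * k2, **params)
        k4 = rhs(s + dt * k3, **params)
        traj[i + 1] = s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return traj

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# 11 equally spaced values in [0.5, 1.0] for beta and delta: 121 environments.
values = np.linspace(0.5, 1.0, 11)
envs = [(b, d) for b in values for d in values]

# Random split: 24 training environments, the remaining 97 for testing.
train_idx = rng.choice(len(envs), size=24, replace=False)
test_idx = np.setdiff1d(np.arange(len(envs)), train_idx)

# One trajectory: dt = 0.1 and duration 10.0 give 100 integration steps.
beta, delta = envs[train_idx[0]]
state0 = rng.uniform(1.0, 3.0, size=2)
traj = rk4_trajectory(lv_rhs, state0, dt=0.1, n_steps=100,
                      alpha=0.5, beta=beta, delta=delta, gamma=0.5)
```

Repeating the last three lines 100 times per environment, with fresh initial conditions, would give the per-environment dataset described in the text.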

For this system, we adopt a 2-layer MLP as the parameterized dynamic model, with a hidden layer dimension of 128 and Tanh as the activation function. We use a neural ordinary differential equation with the rk4 algorithm to model the dynamics. EnvAd-Diff generates the weights of the MLP and makes forward predictions through a numerical solver. When distilling the generated weights, we first perform autoregressive prediction on a given trajectory using the generated weights. Once the prediction is complete, we employ the pysindy library [52, 53] for symbolic regression. The candidate function library consists of polynomials up to second order, and other hyperparameters are set to their default values.
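The forward prediction with generated weights can be sketched as below. This is an illustration only: it assumes "2-layer MLP" means two linear layers with one hidden layer of width 128, that the weights arrive as a single flat vector, and that the parameter ordering is the one in `SHAPES`. All names are hypothetical, and random numbers stand in for weights that EnvAd-Diff would generate.

```python
import numpy as np

STATE_DIM, HIDDEN = 2, 128
# Assumed parameter layout of the flat weight vector: W1, b1, W2, b2.
SHAPES = [(STATE_DIM, HIDDEN), (HIDDEN,), (HIDDEN, STATE_DIM), (STATE_DIM,)]
N_PARAMS = sum(int(np.prod(s)) for s in SHAPES)

def unpack(flat):
    """Split a flat weight vector into the MLP's weight and bias arrays."""
    params, offset = [], 0
    for shape in SHAPES:
        size = int(np.prod(shape))
        params.append(flat[offset:offset + size].reshape(shape))
        offset += size
    return params

def mlp_field(state, flat):
    """Vector field d(state)/dt given by the Tanh MLP."""
    w1, b1, w2, b2 = unpack(flat)
    return np.tanh(state @ w1 + b1) @ w2 + b2

def rk4_step(state, flat, dt):
    """One classical fourth-order Runge-Kutta step of the neural ODE."""
    k1 = mlp_field(state, flat)
    k2 = mlp_field(state + 0.5 * dt * k1, flat)
    k3 = mlp_field(state + 0.5 * dt * k2, flat)
    k4 = mlp_field(state + dt * k3, flat)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def rollout(state0, flat, dt, n_steps):
    """Autoregressive prediction: each output is fed back as the next input."""
    traj = [np.asarray(state0, dtype=float)]
    for _ in range(n_steps):
        traj.append(rk4_step(traj[-1], flat, dt))
    return np.stack(traj)

rng = np.random.default_rng(0)
flat_weights = 0.1 * rng.standard_normal(N_PARAMS)  # stand-in for generated weights
pred = rollout(np.array([1.0, 2.0]), flat_weights, dt=0.1, n_steps=100)
```

The predicted trajectory `pred` is the kind of array that would then be passed to a symbolic regression tool for distillation.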

Appendix F Architectures of Expert Models

In our main experiments, we deploy three neural operators as expert models for EnvAd-Diff: FNO, UNO, and WNO. Here, taking the Cylinder Flow system as an example, we list the parameter composition and hyperparameter settings of these operators.

FNO

We adopt the code from the open-source repository [54] as the implementation for FNO.

UNO

We adopt the code from the open-source repository [54] as the implementation for UNO.

WNO

We adopt the code from the open-source repository [34] as the implementation for WNO.

Since the parameters of normalization layers are determined by the dataset and are not controlled by the environment, we disable normalization layers in all operators (they are also disabled by default in the original code).