Predicting Dynamical Systems across Environments via Diffusive Model Weight Generation
Abstract
Data-driven methods offer an effective equation-free solution for predicting physical dynamics. However, the same physical system can exhibit significantly different dynamic behaviors in various environments. This causes prediction functions trained for specific environments to fail when transferred to unseen environments. Therefore, cross-environment prediction requires modeling the dynamic functions of different environments. In this work, we propose a model weight generation method, EnvAd-Diff. EnvAd-Diff operates in the weight space of the dynamic function, generating suitable weights from scratch based on the environmental condition for zero-shot prediction. Specifically, we first train expert prediction functions on dynamic trajectories from a limited set of visible environments to create a model zoo, thereby constructing sample pairs of prediction function weights and their corresponding environments. Subsequently, we train a latent space diffusion model conditioned on the environment to model the joint distribution of weights and environments. Considering the lack of environmental prior knowledge in real-world scenarios, we propose a physics-informed surrogate label to distinguish different environments. Generalization experiments across multiple systems demonstrate that a 1M-parameter prediction function generated by EnvAd-Diff outperforms a pre-trained 500M-parameter foundation model.
1 Introduction

Data-driven approaches have emerged as a powerful, equation-free paradigm for predicting physical dynamics [1], achieving considerable success across a diverse range of disciplines, including molecular dynamics [2], fluid mechanics [3], and climate science [4]. In these disciplines, dynamical systems governed by the same underlying equations can exhibit vastly different evolutionary behaviors under varying environmental conditions $e$, which can be formally expressed as
$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = F\big(x(t), e\big) \tag{1}$$
For instance, fluid flows described by the Navier-Stokes equations can exhibit different vortex structures under different environmental conditions, such as the Reynolds number and external driving forces. Consequently, a predictive model $f_\theta$ trained on observed trajectories from a specific environmental condition $e$ struggles to generalize to unseen environmental conditions $e'$. Therefore, modeling the generalizable function $F$ beyond the specific environment remains a critical problem for scientific machine learning [5, 6].
Significant efforts have been undertaken to enable cross-environment prediction, including meta-learning, foundation models, and in-context learning. Meta-learning approaches facilitate adaptation to unseen environments by simultaneously learning both environment-shared and environment-specific weights [7, 8, 9]. When applied to a new environment, the environment-specific weights are updated by finetuning on new data to derive a tailored predictive model. Another strategy involves training environment-unified foundation models through well-designed architectures and large-scale parameterization [10, 11, 12]. These models, pretrained on massive datasets, can be further refined by finetuning on data specific to a target environment. Furthermore, in-context learning methods aim for generalization by leveraging illustrative predictive examples from the new environment [13, 14]. However, these approaches emphasize acquiring generalizability from finetuning or contexts, yet overlook the intrinsic connection between the predictive model and the physical environment. If the governing equation is known, a numerical solver can provide zero-shot predictions for any environmental condition $e$. This highlights a crucial insight: explicitly modeling the conditional dependence of the predictive model on environments is key to achieving effective cross-environment prediction.
Inspired by treating model weights as a data modality, this work focuses on generating environment-specific weights. Since weights parameterize environment-specific dynamics (Figure 1), generating them directly enables cross-environment prediction using advanced deep learning techniques. However, generating model weights for physical dynamics tailored to specific environmental conditions poses three challenges. First, model weights, interconnected by the network architecture, are inherently structured. Thus, naively flattening weights into sequences would lose crucial structural relationships [15]. Second, the high dimensionality of weights results in an exceptionally vast data space. Minor variations in the weights of even a single layer can be amplified into significant differences in predictive performance [16, 17]. Therefore, traditional metrics like MSE are inadequate for assessing weight similarity. Finally, practical applications typically lack explicit physical knowledge of the environment, resulting in a scarcity of supervisory signals to discern varying environments.
To address these challenges, we propose an Environment-Adaptive Dynamics Diffusion model, EnvAd-Diff. EnvAd-Diff represents predictive models as weight graphs, aggregating weights into node features to preserve their inherent connectivity and accommodate arbitrary model architectures (challenge 1). It employs a node-attention Variational Autoencoder (VAE) to learn latent representations for the diffusion model, and incorporates a functional loss for weight similarity awareness (challenge 2). We construct a high-quality model zoo via domain-adaptive initialization as EnvAd-Diff’s training data. For environments without prior knowledge, we design physics-informed surrogate labels to train a prompter that guides the weight generation of EnvAd-Diff (challenge 3).
Our contributions can be summarized as follows:
• We propose modeling the conditional dependence of model weights on environments for cross-environment prediction, thereby generating expert model weights for new environments without finetuning.
• We construct weight graphs based on model architecture to preserve connectivity and design a functional loss for weight similarity perception. This significantly enhances the representational model’s ability to model weight modalities.
• Extensive experiments on simulated and real-world systems demonstrate that 1M-parameter models generated by EnvAd-Diff for specific environments predict more accurately than 500M-parameter foundation models.
• We open-source the model zoo and checkpoints of four simulated PDE systems and one real-world system to foster further research in the dynamical system modeling community (code will be available after review).
2 Preliminary
2.1 Problem Definition
We consider time-varying dynamical systems, the general form of which is described by Equation 1. Given environmental conditions $e$, the system dynamics function $F$ is instantiated as $F_e$. The environment space $\mathcal{E}$ and the function space $\mathcal{F}$ are linked by the governing equations $F$, forming a joint set $\{(e, F_e)\}$. We employ a data-driven model $f_\theta$, parameterized by $\theta$, to learn $F_e$, thereby formalizing the function space $\mathcal{F}$ as the model’s weight space $\Theta$. The environment space is divided into an observed environment set $\mathcal{E}_{o}$ and an unseen environment set $\mathcal{E}_{u}$, and consequently, the weight space is also partitioned into corresponding subspaces $\Theta_{o}$ and $\Theta_{u}$. Treating model weights as the modeling object, we learn the inherent joint distribution $p(\theta, e)$ of environments and weights from the joint observation space $\mathcal{E}_{o} \times \Theta_{o}$. Once learning is complete, we generate a corresponding predictive function $f_{\theta}$ for a new environment $e \in \mathcal{E}_{u}$.
Notably, we posit that even when sharing the same governing equations, each environment possesses a unique dynamical function (though it may not be complex). We directly sample the complete model weights for a given environment without data for finetuning, which significantly differs from existing practices in dynamics prediction.
2.2 Conditional Diffusion
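Diffusion models [18, 19] learn a probabilistic transformation from a prior Gaussian distribution $\mathcal{N}(0, I)$ to a target distribution $p(x)$. They perturb data distributions by adding noise and learn to reverse this process through denoising, demonstrating strong fitting capabilities for data across modalities like images, language, and speech [20, 21]. We denote the original sample as $x_0$. The forward noising process in standard diffusion models is computed as $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, where $\epsilon$ and $\bar{\alpha}_t$ represent the Gaussian noise and noise schedule [22], respectively. The reverse process gradually denoises from Gaussian noise to sample data as
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\phi(x_t, t) \right) + \sigma_t z, \quad z \sim \mathcal{N}(0, I) \tag{2}$$
where $\alpha_t$ and $\sigma_t$ are step-dependent constants. The noise $\epsilon_\phi$ is computed by a parameterized neural network, typically implemented as a UNet or Transformer architecture. The network’s parameters $\phi$ are optimized through an objective function [22]
$$\mathcal{L}(\phi) = \mathbb{E}_{x_0, \epsilon, t} \left[ \big\lVert \epsilon - \epsilon_\phi(x_t, t) \big\rVert^2 \right] \tag{3}$$
to minimize the negative log-likelihood $\mathbb{E}[-\log p_\phi(x_0)]$. To model conditional distributions $p(x \mid c)$, state-of-the-art methods inject conditional information $c$ during noise prediction using techniques like adaptive layer normalization [23], as $\epsilon_\phi(x_t, t, c)$.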
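The forward noising, single reverse step, and noise-prediction objective described above can be sketched in a few lines of NumPy. This is a minimal illustration of standard DDPM-style sampling, not the authors' implementation: the linear beta schedule, the choice $\sigma_t = \sqrt{\beta_t}$ with $\beta_t = 1 - \alpha_t$, and all function names are assumptions made for the example, and the noise predictor is passed in as a plain array rather than a trained network.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative product: alpha-bar_t
    return betas, alphas, alpha_bars

def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t from x_0 in closed form; returns (x_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

def reverse_step(x_t, t, eps_pred, betas, alphas, alpha_bars, rng):
    """One denoising step x_t -> x_{t-1} given the predicted noise."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the final step
    sigma_t = np.sqrt(betas[t])  # assumed choice of sigma_t
    return mean + sigma_t * rng.standard_normal(x_t.shape)

def noise_prediction_loss(eps, eps_pred):
    """Mean squared error between true and predicted noise."""
    return float(np.mean((eps - eps_pred) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    betas, alphas, alpha_bars = make_schedule(num_steps=10)
    x0 = rng.standard_normal((5, 3))
    x_t, eps = forward_noise(x0, 0, alpha_bars, rng)
    # With an oracle noise prediction, the step at t = 0 recovers x0.
    x_rec = reverse_step(x_t, 0, eps, betas, alphas, alpha_bars, rng)
    print(np.allclose(x_rec, x0))
```

In a conditional model, `eps_pred` would come from a network that also receives the conditioning signal; the sampling arithmetic itself is unchanged.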
3 Methodology
In this section, we first introduce the method for modeling and sampling the joint distribution of model weights and environments on a given model zoo and environmental conditions. Subsequently, we present a physics-informed prompter that operates without the need for environmental prior knowledge. Finally, we detail the construction of a domain-adaptive model zoo.
3.1 Environment Adaptive Dynamics Diffusion model
EnvAd-Diff first organizes the expert small model weights into a weight graph. It then pretrains a weight VAE, yielding a high-quality latent space. Finally, an environment-conditioned diffusion model is trained within this latent space.
3.1.1 Weight Graph

Model weights constitute a novel data modality, inherently structured by the network architecture. A straightforward approach is to flatten weights layer by layer into fixed-length token sequences for representation using sequence models like transformers. However, here we consider the inherent connection structure of the neural network [15]. Specifically, we aggregate layer weights based on the forward data flow through the network topology.
We focus on designing the weight organization method for the basic computational units of common architectures: linear layers and CNN layers. For a linear layer, learnable parameters include weights $W \in \mathbb{R}^{d_{out} \times d_{in}}$ and bias $b \in \mathbb{R}^{d_{out}}$, where $d_{out}$ and $d_{in}$ are the output and input dimensions, respectively. A CNN layer similarly comprises weights $W \in \mathbb{R}^{c_{out} \times c_{in} \times k \times k}$ and bias $b \in \mathbb{R}^{c_{out}}$, where $c_{out}$ and $c_{in}$ are the numbers of output and input channels, respectively, and $k$ is the kernel size. We treat the output neurons of linear layers and output channels of CNN layers as nodes of the weight graph, constructing the graph structure shown in Figure 2a. Centering on the output nodes, we flatten and concatenate the weights (and corresponding bias) associated with connections leading to each output node within a layer, forming the feature vector for that output node. Thus, a linear layer’s weights are organized as $d_{out}$ nodes with $(d_{in} + 1)$-dimensional features, and a CNN layer’s weights are organized as $c_{out}$ nodes with $(c_{in} k^2 + 1)$-dimensional features.
Considering the prevalence of skip connections in modern deep learning, we incorporate their weights. Following the data flow, we concatenate the weights of the skip connection path as additional features to the feature vector of the node where it merges with the main path, as depicted in Figure 2b. Consequently, the entire model weights are structured as a weight graph with heterogeneous node features, where the total number of nodes equals the sum of the output neurons/channels across all layers. We normalize weights based on input-output node pairs and biases based on nodes.
The proposed weight graph aggregates weights to nodes. This not only captures inherent connection relationships but also significantly reduces computational overhead compared to maintaining dense edge features. This organization method is applicable to models of any architecture, as demonstrated in Section 4.6.
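The node-feature construction described in this subsection can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names, the `(kind, W, b)` layer tuples, and the assumption of square 2D convolution kernels are all introduced for the example, and the normalization step is omitted.

```python
import numpy as np

def linear_nodes(W, b):
    """W: (d_out, d_in), b: (d_out,) -> (d_out, d_in + 1) node features."""
    return np.concatenate([W, b[:, None]], axis=1)

def conv_nodes(W, b):
    """W: (c_out, c_in, k, k), b: (c_out,) -> (c_out, c_in * k * k + 1)."""
    c_out = W.shape[0]
    return np.concatenate([W.reshape(c_out, -1), b[:, None]], axis=1)

def attach_skip(nodes, W_skip):
    """Append skip-path weights (one row per output node) as extra features
    of the nodes where the skip path merges with the main path."""
    return np.concatenate([nodes, W_skip.reshape(nodes.shape[0], -1)], axis=1)

def weight_graph(layers):
    """layers: list of (kind, W, b) tuples with kind in {"linear", "conv"}.
    Returns one node-feature array per layer (heterogeneous widths)."""
    builders = {"linear": linear_nodes, "conv": conv_nodes}
    return [builders[kind](W, b) for kind, W, b in layers]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = [
        ("conv", rng.standard_normal((6, 3, 3, 3)), rng.standard_normal(6)),
        ("linear", rng.standard_normal((8, 4)), rng.standard_normal(8)),
    ]
    graph = weight_graph(layers)
    for (kind, _, _), nodes in zip(layers, graph):
        print(kind, nodes.shape)
    print("total nodes:", sum(nodes.shape[0] for nodes in graph))
```

Because each layer yields a different feature width, a downstream encoder needs a per-layer projection to a common dimension before nodes from different layers can attend to each other.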
3.1.2 Weight VAE
We now encode the heterogeneous graph of model weights. We train a node attention-based VAE with a loss function given by
$$\mathcal{L}_{\mathrm{VAE}} = -\mathbb{E}_{q(z \mid x)} \big[ \log p(x \mid z) \big] + D_{\mathrm{KL}} \big( q(z \mid x) \,\|\, p(z) \big) \tag{4}$$
where $x$ represents the heterogeneous node features of the weight graph, $z$ is the latent representation, and the KL divergence term is used to constrain the posterior distribution $q(z \mid x)$. The VAE architecture first employs a layer-wise linear mapping for each layer’s nodes to project them into a common dimension. Subsequently, we utilize a multi-head attention mechanism to model inter-node relationships, capturing interactions among neurons within and across original model layers. The resulting latent representation is then passed through another layer-wise linear mapping, projecting it back to the original dimensions for reconstruction.
We notice that prediction models exhibiting similar performance can possess distinct parameter values [17]. This observation motivates our approach to the reconstruction error term in the VAE objective. The similarity between model weights should be gauged by their functional consistency, rather than merely their identical absolute values. We introduce a function loss,
$$\mathcal{L}_{\mathrm{func}} = \big\lVert y - \hat{y} \big\rVert_2^2, \quad y = f_{\theta}(u), \quad \hat{y} = f_{\hat{\theta}}(u) \tag{5}$$
where $y$ and $\hat{y}$ are the output values of the original and reconstructed weights, respectively, when applied to an input sample $u$. Intuitively, the function loss allows the VAE to reconstruct weights that may not appear identical to the originals but perform similarly. It relaxes the encoder’s optimization constraints, promoting the learning of a latent space characterized by functional semantics.
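The gap between weight-space similarity and functional similarity can be made concrete with a small NumPy sketch. It builds a two-layer MLP, permutes its hidden neurons (which leaves the computed function unchanged), and compares a weight-space MSE with a function-space loss. The MLP, its sizes, the tanh activation, and the mean-squared form of both losses are illustrative assumptions, not the paper's architecture or exact loss.

```python
import numpy as np

def mlp(params, x):
    """Two-layer MLP: x -> tanh(x W1^T + b1) W2^T + b2."""
    W1, b1, W2, b2 = params
    return np.tanh(x @ W1.T + b1) @ W2.T + b2

def weight_mse(p, q):
    """Mean squared difference between two flattened weight sets."""
    a = np.concatenate([v.ravel() for v in p])
    b = np.concatenate([v.ravel() for v in q])
    return float(np.mean((a - b) ** 2))

def function_loss(p, q, x):
    """Mean squared difference between the two models' outputs on x."""
    return float(np.mean((mlp(p, x) - mlp(q, x)) ** 2))

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((16, 4)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((2, 16)), rng.standard_normal(2)
original = (W1, b1, W2, b2)

# Reorder the hidden neurons consistently in both layers.
perm = rng.permutation(16)
permuted = (W1[perm], b1[perm], W2[:, perm], b2)

x = rng.standard_normal((32, 4))

if __name__ == "__main__":
    print("weight MSE:   ", weight_mse(original, permuted))
    print("function loss:", function_loss(original, permuted, x))
```

The permuted model is far from the original in weight space yet functionally identical up to floating-point error, which is the situation a purely weight-space reconstruction term would penalize unnecessarily.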
3.1.3 Weight Latent Diffusion Model
In the latent space, we instantiate the noise network using a transformer architecture. Conditioned on environmental information $e$, we inject this information into the network using adaptive layer norm (adaLN) [23], forming $\epsilon_\phi(z_t, t, e)$. Compared to performing diffusive generation directly on the heterogeneous weight graph, the latent space offers significant dimensionality reduction, which alleviates the computationally intensive nature of the diffusion process and simplifies the generation of representations.
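As a rough illustration of adaLN-style conditioning, the sketch below normalizes token features without a learned affine transform and then applies a scale and shift regressed from a conditioning vector. The linear projections `W_scale` and `W_shift`, the `(1 + scale)` modulation form, and the idea that the conditioning vector combines timestep and environment embeddings are assumptions for the example; the actual network may differ.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each token over its feature dimension (no learned affine)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def ada_layer_norm(h, cond, W_scale, W_shift):
    """h: (tokens, d) features; cond: (c,) conditioning vector;
    W_scale, W_shift: (d, c) hypothetical projection matrices."""
    scale = W_scale @ cond  # per-feature scale from the condition
    shift = W_shift @ cond  # per-feature shift from the condition
    return layer_norm(h) * (1.0 + scale) + shift

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.standard_normal((10, 8))           # 10 latent tokens, width 8
    cond = rng.standard_normal(4)              # assumed condition embedding
    W_scale = 0.1 * rng.standard_normal((8, 4))
    W_shift = 0.1 * rng.standard_normal((8, 4))
    print(ada_layer_norm(h, cond, W_scale, W_shift).shape)
```

With zero projection matrices the layer reduces to a plain layer norm, so the condition only changes the output through the learned scale and shift.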
3.2 Physics-informed Prompter
In the previous section, we assumed that the environmental condition $e$ is given. However, real-world applications often lack prior knowledge of the physical environment generating the observation trajectories. Here, we propose using a physics-informed surrogate label to differentiate observation trajectories originating from unknown environments. Within the constructed model zoo, the parameterized prediction model $f_{\theta_i}$ captures the dynamical behavior of environment $e_i$, thus serving as a reliable physical proxy. Inspired by the concept of function distance in functional analysis, we quantify the differences between environments by computing the L2 distance between the responses of their corresponding physical proxies to the same observation samples $x$, calculated as
$$d_{ij} = \big\lVert f_{\theta_i}(x) - f_{\theta_j}(x) \big\rVert_2 \tag{6}$$
By stacking, in a fixed order, the distances between each environment’s physical proxy and those of the other environments, we obtain the feature vector for each environment, $v_i = [d_{i1}, d_{i2}, \dots, d_{iK}]$, where $K$ is the number of observed environments. We use these feature vectors to distinguish different physical environments.
To identify the environmental condition for new environments in the test set, we apply principal component analysis, extracting the 1-dimensional principal component along the direction of the feature vectors’ largest eigenvalue. This 1D principal component serves as the surrogate label $\hat{e}$, which is used to train a regression model, termed the "Prompter". The Prompter takes the initial frame of a trajectory as input and predicts the physical environment to which that trajectory belongs. Here we employ classic support vector regression for the Prompter to keep the model simple.
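The surrogate-label construction can be sketched with NumPy alone. The example below computes pairwise function distances between proxy models evaluated on shared samples, stacks them into per-environment feature vectors, and projects those onto their first principal component. The array layout, the averaging over samples, and the toy linear proxies are assumptions for illustration; fitting the Prompter itself (for example with a support vector regressor) is left out.

```python
import numpy as np

def pairwise_function_distance(outputs):
    """outputs: (K, N, D) responses of K proxy models to the same N samples.
    Returns a (K, K) matrix of L2 distances averaged over the samples."""
    diff = outputs[:, None] - outputs[None, :]          # (K, K, N, D)
    return np.linalg.norm(diff, axis=-1).mean(axis=-1)  # (K, K)

def surrogate_labels(dist):
    """Project each row of the distance matrix (one feature vector per
    environment) onto the first principal component."""
    centered = dist - dist.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]  # one scalar label per environment

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy proxies: f_k(x) = a_k * x with a hidden coefficient a_k.
    coeffs = np.linspace(0.5, 2.5, 5)
    x = rng.standard_normal((20, 3))
    outputs = coeffs[:, None, None] * x[None]
    labels = surrogate_labels(pairwise_function_distance(outputs))
    print(np.round(labels, 3))
```

In this toy setting the labels are ordered consistently with the hidden coefficients, up to the arbitrary sign of the principal direction, which is the property that makes a 1D surrogate label usable as a conditioning signal.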
3.3 Domain-adaptive Model Zoo
We construct a model zoo to serve as the training corpus for EnvAd-Diff, adhering to three principles: 1) creating powerful expert models tailored to specific environments; 2) ensuring efficient and rapid construction; and 3) guaranteeing the stability of EnvAd-Diff training.
Neural operators [24] are employed as expert models, with the minimum parameter count required for effective prediction. Prior to large-scale training of expert models for each environment, a global model is pretrained on trajectories from all available environments. This pretraining provides domain-adaptive initialization [14] for environment-specific expert training, significantly reducing the required number of training epochs. To encourage exploration of the weight landscape for different expert models within the same environment, we randomly select a layer of weights from the domain-adaptive initialization and introduce noise. We demonstrate that domain-adaptive initialization not only accelerates model zoo construction but also reduces the degrees of freedom of the weights while maintaining expert model accuracy. This significantly stabilizes the subsequent generative learning.
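The initialization step for zoo members can be sketched as below: every expert starts from the same pretrained weights, with Gaussian noise added to one randomly chosen layer. The dictionary layout, the noise scale `noise_std`, and the function names are hypothetical choices for the example; the subsequent per-environment training is not shown.

```python
import numpy as np

def perturb_random_layer(init_weights, noise_std, rng):
    """Copy the shared initialization and add Gaussian noise to one
    randomly selected layer. Returns (weights, name_of_perturbed_layer)."""
    names = sorted(init_weights)
    chosen = names[rng.integers(len(names))]
    weights = {name: value.copy() for name, value in init_weights.items()}
    noise = noise_std * rng.standard_normal(weights[chosen].shape)
    weights[chosen] = weights[chosen] + noise
    return weights, chosen

def build_zoo_inits(init_weights, zoo_size, noise_std=0.01, seed=0):
    """Create zoo_size perturbed starting points for expert training."""
    rng = np.random.default_rng(seed)
    return [perturb_random_layer(init_weights, noise_std, rng)[0]
            for _ in range(zoo_size)]

if __name__ == "__main__":
    # Stand-in for weights of a globally pretrained model.
    pretrained = {"layer1": np.zeros((4, 4)), "layer2": np.zeros((2, 4))}
    zoo = build_zoo_inits(pretrained, zoo_size=3, noise_std=0.05)
    for member in zoo:
        print({name: bool(np.any(value != 0)) for name, value in member.items()})
```

Keeping all but one layer at the shared initialization is what limits the spread of the resulting weights while still giving each zoo member a distinct starting point.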
4 Experiment
4.1 Experimental Setup
We assume unknown environmental conditions for all dynamical systems, training models solely on observed trajectories across diverse visible environments. At test time, models autoregressively predict future states given a single initial frame. Test environments are categorized as in-domain (seen during training, novel initial conditions) and out-domain (unseen environments). We evaluate prediction quality using root mean square error (RMSE) and structural similarity index (SSIM).
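The evaluation protocol above, autoregressive rollout from a single initial frame followed by an RMSE comparison, can be sketched as follows. The one-step predictor `step_fn` is a placeholder for a generated or trained model, and SSIM is omitted because it requires an image-quality library.

```python
import numpy as np

def rollout(step_fn, x0, num_steps):
    """Apply a one-step predictor autoregressively from the initial frame.
    Returns an array of shape (num_steps, *x0.shape) without the initial frame."""
    states = [x0]
    for _ in range(num_steps):
        states.append(step_fn(states[-1]))
    return np.stack(states[1:])

def rmse(pred, target):
    """Root mean square error over all time steps and grid points."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

if __name__ == "__main__":
    x0 = np.ones((4, 4))
    truth = rollout(lambda x: 0.5 * x, x0, num_steps=3)   # toy ground truth
    pred = rollout(lambda x: 0.6 * x, x0, num_steps=3)    # toy imperfect model
    print(rmse(pred, truth))
```

Because each prediction is fed back as the next input, one-step errors compound over the horizon, so rollout RMSE is a stricter test than one-step error.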
Baselines
We compare against two baseline categories: foundation models (One-for-All) and meta-learning approaches (Env-Adaptive). The foundation models are trained via empirical risk minimization [25] on trajectories from all visible environments. The meta-learning method learns environment-shared weights and an environment-specific weight-generating hypernetwork. Following existing work [9], we enable zero-shot prediction by conditioning the hypernetwork on environmental parameters, which assumes known ground-truth environmental conditions. Additionally, we assume all environments are visible and train a dedicated Fourier neural operator [26] (FNO) for each environment as a performance upper bound (One-per-Env). Unless otherwise specified, we use FNO as the expert small model for EnvAd-Diff and other meta-learning methods. Architectural and hyperparameter details are in Appendix C and F.
4.2 Dynamical Systems and Model Zoo
We validate the model’s effectiveness on four time-dependent PDE systems and one real-world dataset: 1) Cylinder Flow [27]; 2) Lambda-Omega [28]; 3) Kolmogorov Flow [29]; 4) Navier-Stokes Equations [7]; and 5) ERA5 Dataset [30]. For the PDE systems, we use equation coefficients or external forcing as environmental variables and simulate multiple trajectories under different environments for training and testing. We train 100 FNO weight sets for each seen environment across all systems to serve as EnvAd-Diff’s model zoo (size 100). Detailed descriptions and data generation procedures for each system are provided in Appendix A and B.
4.3 Main Results
PDE systems
We report the generalization performance on 4 PDE systems in Table 1, detailing the number of in/out-domain environments and the parameter size of models for each system during testing. Env-Adaptive methods, which adjust optimal weights for each environment, have significantly fewer parameters than the foundation models. Across nearly all systems, EnvAd-Diff achieves the best average performance, demonstrating its ability to model the conditional dependence of the predictive model on environments. Consequently, its small, environment-specific expert models outperform foundation models with hundreds of times more parameters. Furthermore, unlike other meta-learning approaches, EnvAd-Diff treats model weights holistically during generation, without forcing the retention of environment-shared components. This potentially expands EnvAd-Diff’s search space for improved generalization.
We also find that some models can outperform One-per-Env in specific environments. This is likely due to the stochasticity of initialization and the training process, as One-per-Env models do not always converge to the optimal point. We illustrate this result with Cylinder Flow (2 environmental variables), as shown in Figure 3. The overall SSIM of One-per-Env is close to 1; however, it exhibits suboptimal performance in certain regions (green box in Figure 3). The FNO weights generated by EnvAd-Diff perform better than One-per-Env in some environments, even unseen ones. This suggests that EnvAd-Diff captures the manifold where the joint distribution of weights and environments lies, whereas optimizer-based training can fail to converge onto this manifold, possibly due to getting stuck in local optima [31].
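Category | Method | Testing Params | Cylinder Flow (96:400) In-domain | Cylinder Flow (96:400) Out-domain | Lambda-Omega (12:39) In-domain | Lambda-Omega (12:39) Out-domain | Kolmogorov Flow (12:39) In-domain | Kolmogorov Flow (12:39) Out-domain | Navier-Stokes (24:121) In-domain | Navier-Stokes (24:121) Out-domain
One-for-All | FNO [5] | 500M+ | | | | | | | |
One-for-All | DPOT [11] | 500M+ | | | | | | | |
One-for-All | Poseidon [10] | 600M+ | | | | | | | |
One-for-All | MPP [12] | 550M+ | | | | | | | |
Env-Adaptive | DyAd [8] | 1M+ | | | | | | | |
Env-Adaptive | LEADS [32] | 1M+ | | | | | | | |
Env-Adaptive | CoDA [7] | 1M+ | | | | | | | |
Env-Adaptive | GEPS [29] | 1M+ | | | | | | | |
Env-Adaptive | CAMEL [9] | 1M+ | | | | | | | |
Env-Adaptive | EnvAd-Diff | 1M+ | | | | | | | |
One-per-Env | | 1M+ | | | | | | | |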

Real-world dataset
We utilize the ERA5 reanalysis dataset, including east-west and north-south wind speed data at a height of 100 meters. The spatial resolution is 0.25°, and the temporal resolution is 1 hour. We use January 2018 wind speeds as the training set and January 2019 as the test set. To define different environments, we divide the globe into 6×12 grid subregions at 30° intervals [8]. We randomly select 24 subregions as seen environments, with the remaining 48 as unseen environments. The experimental results are shown in Figure 4. EnvAd-Diff outperforms all foundation models and surpasses One-per-Env in some unseen subregions.
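The subregion arithmetic can be checked with a short sketch: 30° intervals give 180/30 = 6 latitude bands and 360/30 = 12 longitude bands (72 subregions), and at 0.25° resolution each subregion spans 30/0.25 = 120 grid cells per side. The row-major region numbering, the random seed, and the assumption of a 720×1440 latitude-longitude array are choices made for this illustration only.

```python
import numpy as np

def split_regions(n_lat=6, n_lon=12, n_seen=24, seed=0):
    """Randomly split the subregion indices into seen and unseen sets."""
    rng = np.random.default_rng(seed)
    ids = rng.permutation(n_lat * n_lon)
    return np.sort(ids[:n_seen]), np.sort(ids[n_seen:])

def region_slice(region_id, n_lon=12, cells_per_region=120):
    """Index slices of one subregion in a (lat, lon) grid, assuming
    row-major region numbering and 120 cells per 30-degree side."""
    row, col = divmod(int(region_id), n_lon)
    return (slice(row * cells_per_region, (row + 1) * cells_per_region),
            slice(col * cells_per_region, (col + 1) * cells_per_region))

if __name__ == "__main__":
    seen, unseen = split_regions()
    print(len(seen), len(unseen))
    field = np.zeros((720, 1440))  # assumed global grid at 0.25 degrees
    print(field[region_slice(seen[0])].shape)
```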

4.4 Explainability

We visualize the joint distribution of weights and environments using the Cylinder Flow system as an example to aid qualitative analysis, where over 80% of the environments are unseen by EnvAd-Diff. In Figure 5, the x-axis represents the surrogate environment labels predicted by the prompter, and the y-axis represents the first principal component of the weights of a specific layer. The weight-environment landscape learned by EnvAd-Diff closely resembles that learned by One-per-Env through optimizer training. This indicates that EnvAd-Diff successfully models the joint distribution of weights and environments, thereby explaining its superior performance in Table 1.

To quantitatively analyze the environment-weight joint distribution fitted by EnvAd-Diff, we introduce a simple ODE system, the Lotka-Volterra (LV) equations [7], as a toy example. We use a symbolic regression algorithm [33] to distill the predictive model generated by EnvAd-Diff for specific environmental conditions into an equation expression, as shown in Figure 6. The equivalent equation for the weights generated by EnvAd-Diff is consistent in form with the LV equations, and the environmental coefficients are close. This quantitatively demonstrates that EnvAd-Diff can fit the generalizable function in Equation 1 rather than an environment-specific function. We detail the experimental setup in Appendix E.2.
4.5 Robustness
We assess EnvAd-Diff’s robustness to data variations. Specifically, we investigate the impact of the number of seen environments and the model zoo size. We first examine the effect of model zoo size on Cylinder Flow and Lambda-Omega systems, as depicted in Figures 7a and b. The results indicate that EnvAd-Diff exhibits relatively stable performance with a zoo size of 50. As the zoo size decreases further, performance begins to deteriorate, even within the distribution. Subsequently, we test the influence of the number of seen environments on the Kolmogorov Flow and Navier-Stokes systems. The number of environments ranged from approximately 5% to 20% of the total. The findings reveal that increasing the number of seen environments reduces prediction error, but the gains become marginal after reaching around 20%. This suggests that EnvAd-Diff learns the underlying joint distribution of weights and environments from a small number of environments, rather than overfitting to trajectory samples within those environments.

4.6 Extensibility

The weight graph structure proposed in Section 3.1.1 is capable of organizing neural networks of arbitrary architectures. Here, we extend EnvAd-Diff to additional neural operators as expert models, including the Wavelet Neural Operator [34] (WNO) and the U-shaped Neural Operator [35] (UNO). Our experimental results on Cylinder Flow are presented in Figure 8. EnvAd-Diff, when using different neural operators, consistently achieves excellent generalization performance, with actual performance showing only minor variations depending on the specific operator architecture. This demonstrates that EnvAd-Diff is a model-agnostic framework capable of benefiting from the sophisticated architectural designs of its expert models. Detailed architectures of these neural operators are provided in Appendix F.
4.7 Ablation Study
We conduct an ablation study on the key components of EnvAd-Diff. Specifically, we verify the necessity of domain initialization when building the model zoo and the function loss used during VAE training. Experimental results on the Kolmogorov Flow and Navier-Stokes systems are presented in Table 4. When function loss is omitted, the VAE relies solely on MSE for reconstruction similarity, leading to suboptimal generated weights, particularly out-domain. Function loss relaxes VAE encoding constraints and helps prevent overfitting by prioritizing functional consistency over exact reconstruction. Removing domain initialization results in a significant deterioration in generated weight performance. This is attributed to the high complexity and distribution of a randomly initialized model zoo, which significantly increases the diffusion model’s modeling difficulty. We conclude that for weight generation aimed at generalization, sample quality is far more critical than diversity.
In addition, we also compare real environmental conditions with surrogate labels in Appendix E.1.
4.8 Computational Cost

Time cost
We compare the time overhead of EnvAd-Diff and One-per-Env when adapting to new environments, as shown in Figure 9a. One-per-Env requires training weights for each new environment using observational data. EnvAd-Diff’s overhead includes building the model zoo (accelerated by domain initialization) and generating weights for new environments. Despite the upfront time cost of preparing the model zoo, EnvAd-Diff generates weights significantly faster than training a new prediction model and requires no data.
GPU memory
We compare the GPU memory usage of EnvAd-Diff and other baselines during inference (Figure 9b). Thanks to the proposed weight graph structure, EnvAd-Diff’s attention computation unfolds along the node dimension, significantly reducing computational overhead and memory. In addition, we detail the storage overhead of the model zoo in Appendix B.
5 Related Work
5.1 Dynamics Prediction across Environments
Developing dynamic prediction models with cross-environment generalization is a crucial problem in scientific machine learning and has garnered significant research interest. We review the main approaches and related work in this area. The first category trains large-parameter neural solvers as foundation models using extensive simulated data [36, 37, 38]. Subramanian et al. [5] explore the generalization performance of classical FNO architectures across different parameter sizes. Subsequently, models such as MPP [12], DPOT [11], and Poseidon [10] employed more advanced architectures to improve computational efficiency and approximation capabilities. The second approach is meta-learning [39]. These methods capture cross-environment invariants through environment-shared weights and fine-tune environment-specific weights on limited data from new environments for adaptation, including DyAd [8], LEADS [32], CoDA [7], GEPS [29], CAMEL [9], and NCF [40]. Additionally, other methods exist, like in-context learning [14]. Yang et al. [13] frame differential equation forward and inverse problems as natural language statements, pre-train transformers, and provide solution examples for new environments as context to enhance model performance. Compared to these works, we innovatively treat the complete model weights as generated objects and explicitly model their joint distribution with the environment.
5.2 Diffusion for Network Weight Generation
Generating neural network weights is a relatively nascent research area [41]. An initial line of work involved training MLPs to overfit implicit neural fields, distilling them into model weights, and subsequently generating these MLP weights as an alternative to directly generating the fields [42]. Another category proposes using generated weights to replace hand-crafted initialization, thereby accelerating and improving the neural network training process [43, 44, 45]. These efforts primarily focus on image modalities. More recent studies leverage diffusion models to address generalization in various domains. Yuan et al. [46] employ urban knowledge graphs as prompts to guide diffusion for generating spatio-temporal prediction model weights for new cities. Zhang et al. [47] replace the inner-loop gradient updates of meta-learning with diffusion-generated weights. A recent work [48] explores extracting features from unseen datasets and controlling diffusion to generate adapted model weights for them. However, these methods exhibit limited zero-shot performance. This may stem from directly flattening the weights, which disrupts the neural network’s inherent topological connections and constrains the representational capacity of the generative model. In contrast, we organize weights in the form of a neural graph and introduce a function loss to guide their representation. The zero-shot prediction performance of our generated model surpasses pre-trained models with hundreds of times more weights.
6 Conclusion
In this work, we suggest explicitly modeling the conditional dependence of the predictive model on environments to achieve effective cross-environment prediction. We propose an Environment-Adaptive Dynamics Diffusion model, EnvAd-Diff, to generate expert models for specific environments. We organize the model weights as a weight graph to preserve their inherent connectivity, accommodating arbitrary model architectures. In addition, we design physics-informed surrogate labels for unknown physical environments and a high-quality model zoo for the training of EnvAd-Diff. Experiments on simulated and real-world systems demonstrate that 1M-parameter models generated by EnvAd-Diff for specific environments predict more accurately than 500M-parameter foundation models.
Limitations & future work
The performance of EnvAd-Diff is contingent on the quality of the model zoo, including the diversity of environments and the performance of expert models. This presents a challenge for the initial construction of the model zoo. Future work will primarily focus on developing more data-efficient and robust models for weight representation.
References
- [1] Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023.
- [2] Andreas Mardt, Luca Pasquali, Hao Wu, and Frank Noé. VAMPnets for deep learning of molecular kinetics. Nature Communications, 9(1):5, 2018.
- [3] Dule Shu, Zijie Li, and Amir Barati Farimani. A physics-informed diffusion model for high-fidelity flow field reconstruction. Journal of Computational Physics, 478:111972, 2023.
- [4] Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619(7970):533–538, 2023.
- [5] Shashank Subramanian, Peter Harrington, Kurt Keutzer, Wahid Bhimji, Dmitriy Morozov, Michael W Mahoney, and Amir Gholami. Towards foundation models for scientific machine learning: Characterizing scaling and transfer behavior. Advances in Neural Information Processing Systems, 36:71242–71262, 2023.
- [6] Somdatta Goswami, Katiana Kontolati, Michael D Shields, and George Em Karniadakis. Deep transfer operator learning for partial differential equations under conditional shift. Nature Machine Intelligence, 4(12):1155–1164, 2022.
- [7] Matthieu Kirchmeyer, Yuan Yin, Jérémie Donà, Nicolas Baskiotis, Alain Rakotomamonjy, and Patrick Gallinari. Generalizing to new physical systems via context-informed dynamics model. In International Conference on Machine Learning, pages 11283–11301. PMLR, 2022.
- [8] Rui Wang, Robin Walters, and Rose Yu. Meta-learning dynamics forecasting using task inference. Advances in Neural Information Processing Systems, 35:21640–21653, 2022.
- [9] Matthieu Blanke and Marc Lelarge. Interpretable meta-learning of physical systems. In The Twelfth International Conference on Learning Representations.
- [10] Maximilian Herde, Bogdan Raonic, Tobias Rohner, Roger Käppeli, Roberto Molinaro, Emmanuel de Bézenac, and Siddhartha Mishra. Poseidon: Efficient foundation models for pdes. Advances in Neural Information Processing Systems, 37:72525–72624, 2024.
- [11] Zhongkai Hao, Chang Su, Songming Liu, Julius Berner, Chengyang Ying, Hang Su, Anima Anandkumar, Jian Song, and Jun Zhu. Dpot: Auto-regressive denoising operator transformer for large-scale pde pre-training. In International Conference on Machine Learning, pages 17616–17635. PMLR, 2024.
- [12] Michael McCabe, Bruno Régaldo-Saint Blancard, Liam Parker, Ruben Ohana, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Siavash Golkar, Geraud Krawezik, Francois Lanusse, et al. Multiple physics pretraining for spatiotemporal surrogate models. Advances in Neural Information Processing Systems, 37:119301–119335, 2024.
- [13] Liu Yang, Siting Liu, Tingwei Meng, and Stanley J Osher. In-context operator learning with data prompts for differential equation problems. Proceedings of the National Academy of Sciences, 120(39):e2310142120, 2023.
- [14] Wuyang Chen, Jialin Song, Pu Ren, Shashank Subramanian, Dmitriy Morozov, and Michael W Mahoney. Data-efficient operator learning via unsupervised pretraining and in-context learning. Advances in Neural Information Processing Systems, 37:6213–6245, 2024.
- [15] Miltiadis Kofinas, Boris Knyazev, Yan Zhang, Yunlu Chen, Gertjan J Burghouts, Efstratios Gavves, Cees GM Snoek, and David W Zhang. Graph neural networks for learning equivariant representations of neural networks. arXiv preprint arXiv:2403.12143, 2024.
- [16] Maximilian Plattner, Arturs Berzins, and Johannes Brandstetter. Shape generation via weight space learning. arXiv preprint arXiv:2503.21830, 2025.
- [17] Léo Meynent, Ivan Melev, Konstantin Schürholt, Göran Kauermann, and Damian Borth. Structure is not enough: Leveraging behavior for neural network weight reconstruction. arXiv preprint arXiv:2503.17138, 2025.
- [18] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- [19] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
- [20] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10850–10869, 2023.
- [21] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
- [22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- [23] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
- [24] Nikola Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Learning maps between function spaces with applications to pdes. Journal of Machine Learning Research, 24(89):1–97, 2023.
- [25] Ibrahim Ayed, Emmanuel de Bézenac, Arthur Pajot, Julien Brajard, and Patrick Gallinari. Learning dynamical systems from partial observations. arXiv preprint arXiv:1902.11136, 2019.
- [26] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895, 2020.
- [27] Ruikun Li, Jingwen Cheng, Huandong Wang, Qingmin Liao, and Yong Li. Predicting the dynamics of complex system via multiscale diffusion autoencoder. arXiv preprint arXiv:2505.02450, 2025.
- [28] Kathleen Champion, Bethany Lusch, J Nathan Kutz, and Steven L Brunton. Data-driven discovery of coordinates and governing equations. Proceedings of the National Academy of Sciences, 116(45):22445–22451, 2019.
- [29] Armand Kassaï Koupaï, Jorge Mifsut Benet, Yuan Yin, Jean-Noël Vittaut, and Patrick Gallinari. Geps: Boosting generalization in parametric pde neural solvers through adaptive conditioning. arXiv preprint arXiv:2410.23889, 2024.
- [30] Zongwei Zhang, Lianlei Lin, Sheng Gao, Junkai Wang, Hanqing Zhao, and Hangyi Yu. A machine learning model for hub-height short-term wind speed prediction. Nature Communications, 16(1):3195, 2025.
- [31] Antonio Sclocchi and Matthieu Wyart. On the different regimes of stochastic gradient descent. Proceedings of the National Academy of Sciences, 121(9):e2316301121, 2024.
- [32] Yuan Yin, Ibrahim Ayed, Emmanuel de Bézenac, Nicolas Baskiotis, and Patrick Gallinari. Leads: Learning dynamical systems that generalize across environments. Advances in Neural Information Processing Systems, 34:7561–7573, 2021.
- [33] Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the national academy of sciences, 113(15):3932–3937, 2016.
- [34] Tapas Tripura and Souvik Chakraborty. Wavelet neural operator for solving parametric partial differential equations in computational mechanics problems. Computer Methods in Applied Mechanics and Engineering, 404:115783, 2023.
- [35] Md Ashiqur Rahman, Zachary E Ross, and Kamyar Azizzadenesheli. U-no: U-shaped neural operators. arXiv preprint arXiv:2204.11127, 2022.
- [36] Md Ashiqur Rahman, Robert Joseph George, Mogab Elleithy, Daniel Leibovici, Zongyi Li, Boris Bonev, Colin White, Julius Berner, Raymond A Yeh, Jean Kossaifi, et al. Pretraining codomain attention neural operators for solving multiphysics pdes. Advances in Neural Information Processing Systems, 37:104035–104064, 2024.
- [37] Benedikt Alkin, Andreas Fürst, Simon Schmid, Lukas Gruber, Markus Holzleitner, and Johannes Brandstetter. Universal physics transformers: A framework for efficiently scaling neural operators. Advances in Neural Information Processing Systems, 37:25152–25194, 2024.
- [38] Tianyu Chen, Haoyi Zhou, Ying Li, Hao Wang, Chonghan Gao, Rongye Shi, Shanghang Zhang, and Jianxin Li. Building flexible machine learning models for scientific computing at scale. arXiv preprint arXiv:2402.16014, 2024.
- [39] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
- [40] Roussel Desmond Nzoyem, David AW Barton, and Tom Deakin. Neural context flows for meta-learning of dynamical systems. arXiv preprint arXiv:2405.02154, 2024.
- [41] Kai Wang, Dongwen Tang, Boya Zeng, Yida Yin, Zhaopan Xu, Yukun Zhou, Zelin Zang, Trevor Darrell, Zhuang Liu, and Yang You. Neural network diffusion. arXiv preprint arXiv:2402.13144, 2024.
- [42] Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Hyperdiffusion: Generating implicit neural fields with weight-space diffusion. In Proceedings of the IEEE/CVF international conference on computer vision, pages 14300–14310, 2023.
- [43] Yifan Gong, Zheng Zhan, Yanyu Li, Yerlan Idelbayev, Andrey Zharkov, Kfir Aberman, Sergey Tulyakov, Yanzhi Wang, and Jian Ren. Efficient training with denoised neural weights. In European Conference on Computer Vision, pages 18–34, 2024.
- [44] Konstantin Schürholt, Boris Knyazev, Xavier Giró-i Nieto, and Damian Borth. Hyper-representations as generative models: Sampling unseen neural network weights. Advances in Neural Information Processing Systems, 35:27906–27920, 2022.
- [45] Konstantin Schürholt, Michael W Mahoney, and Damian Borth. Towards scalable and versatile weight space learning. In Proceedings of the 41st International Conference on Machine Learning, pages 43947–43966, 2024.
- [46] Yuan Yuan, Chenyang Shao, Jingtao Ding, Depeng Jin, and Yong Li. Spatio-temporal few-shot learning via diffusive neural network generation. arXiv preprint arXiv:2402.11922, 2024.
- [47] Baoquan Zhang, Chuyao Luo, Demin Yu, Xutao Li, Huiwei Lin, Yunming Ye, and Bowen Zhang. Metadiff: Meta-learning with conditional diffusion for few-shot learning. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 16687–16695, 2024.
- [48] Bedionita Soro, Bruno Andreis, Hayeon Lee, Wonyong Jeong, Song Chong, Frank Hutter, and Sung Ju Hwang. Diffusion-based neural network weights generation. arXiv preprint arXiv:2402.18153, 2024.
- [49] Jacob Page, Peter Norgaard, Michael P Brenner, and Rich R Kerswell. Recurrent flow patterns as a basis for two-dimensional turbulence: Predicting statistics from structures. Proceedings of the National Academy of Sciences, 121(23):e2320007121, 2024.
- [50] Makoto Takamoto, Timothy Praditia, Raphael Leiteritz, Daniel MacKinlay, Francesco Alesiani, Dirk Pflüger, and Mathias Niepert. Pdebench: An extensive benchmark for scientific machine learning. Advances in Neural Information Processing Systems, 35:1596–1611, 2022.
- [51] Pantelis R Vlachas, Georgios Arampatzis, Caroline Uhler, and Petros Koumoutsakos. Multiscale simulations of complex systems by learning their effective dynamics. Nature Machine Intelligence, 4(4):359–366, 2022.
- [52] Alan A. Kaptanoglu, Brian M. de Silva, Urban Fasel, Kadierdan Kaheman, Andy J. Goldschmidt, Jared Callaham, Charles B. Delahunt, Zachary G. Nicolaou, Kathleen Champion, Jean-Christophe Loiseau, J. Nathan Kutz, and Steven L. Brunton. Pysindy: A comprehensive python package for robust sparse system identification. Journal of Open Source Software, 7(69):3994, 2022.
- [53] Brian de Silva, Kathleen Champion, Markus Quade, Jean-Christophe Loiseau, J. Kutz, and Steven Brunton. Pysindy: A python package for the sparse identification of nonlinear dynamical systems from data. Journal of Open Source Software, 5(49):2104, 2020.
- [54] Jean Kossaifi, Nikola Kovachki, Zongyi Li, David Pitt, Miguel Liu-Schiaffini, Robert Joseph George, Boris Bonev, Kamyar Azizzadenesheli, Julius Berner, and Anima Anandkumar. A library for learning neural operators, 2024.
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.
Appendix A Data Generation
Cylinder Flow system [27] is governed by:
(7) |
In this system, we use the Reynolds number and characteristic length as two environmental variables. The Reynolds number and characteristic length influence the lattice viscosity, which in turn affects the collision frequency, leading to different dynamic behaviors.
Lambda–Omega system [28] is governed by
(8) |
where is the Laplacian operator. For this system, we use as a 1-dimensional environmental variable. and are both set to 0.5.
Kolmogorov Flow system [49] is governed by
For this system, we use as a 1-dimensional environmental variable. is set to 3.
Navier-Stokes system [50] is governed by
where is the driving force. We use amplitude and phase as the two-dimensional environmental variables for this system, and the viscosity coefficient is set to 0.01.
The range of environmental values and simulation settings for each equation are listed in Table 2.
The Cylinder flow system is simulated using the lattice Boltzmann method (LBM) [51], with dynamics governed by the Navier-Stokes equations for turbulent flow around a cylindrical obstacle. The system is discretized using a lattice velocity grid, and the relaxation time is determined based on the kinematic viscosity and Reynolds number. Data collection begins once the turbulence has stabilized.
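The dependence of the relaxation time on viscosity and Reynolds number can be sketched as follows. This is a minimal illustration that assumes a BGK collision operator on a lattice with squared sound speed 1/3 (for example D2Q9) and unit lattice spacing and time step; the text does not state the lattice or collision model, and the function names and numerical values are hypothetical.

```python
def lattice_viscosity(u_lattice: float, char_length: float, reynolds: float) -> float:
    """Kinematic viscosity in lattice units, from Re = U * L / nu."""
    return u_lattice * char_length / reynolds


def relaxation_time(nu_lattice: float) -> float:
    """BGK relaxation time, assuming c_s^2 = 1/3 and dx = dt = 1."""
    return 3.0 * nu_lattice + 0.5


def collision_frequency(tau: float) -> float:
    """Collision frequency is the inverse of the relaxation time."""
    return 1.0 / tau


# Hypothetical environment: inflow speed 0.1, cylinder diameter 20 cells, Re = 100.
nu = lattice_viscosity(u_lattice=0.1, char_length=20.0, reynolds=100.0)
tau = relaxation_time(nu)
omega = collision_frequency(tau)
```

Under these assumptions, increasing the Reynolds number or decreasing the characteristic length lowers the lattice viscosity, which moves the relaxation time toward 0.5 and raises the collision frequency.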
For the Lambda–Omega system, the reaction-diffusion equations are numerically integrated over time using an ODE solver.
For Kolmogorov Flow and Navier-Stokes systems, we perform numerical simulations based on the vorticity form equations. The process includes calculating the velocity field from vorticity by solving a Poisson equation using Fourier transforms, employing numerical methods to handle spatial derivatives, and subsequently using an ODE solver for time integration to simulate the evolution of vorticity over time.
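The velocity recovery step can be illustrated with a short NumPy sketch. It assumes a square periodic domain, a stream function ψ with u = ∂ψ/∂y and v = −∂ψ/∂x, and hence the Poisson equation ∇²ψ = −ω; these conventions, the function name, and the default domain length are assumptions made for illustration and are not taken from the simulation code.

```python
import numpy as np


def velocity_from_vorticity(omega: np.ndarray, length: float = 2 * np.pi):
    """Recover (u, v) from vorticity on a periodic n x n grid via FFT.

    Axis 0 is x and axis 1 is y. Solves lap(psi) = -omega in Fourier space,
    then differentiates psi spectrally.
    """
    n = omega.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)  # angular wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0  # placeholder to avoid division by zero at the mean mode

    omega_hat = np.fft.fft2(omega)
    psi_hat = omega_hat / k2
    psi_hat[0, 0] = 0.0  # the mean of psi is arbitrary; fix it to zero

    u = np.real(np.fft.ifft2(1j * ky * psi_hat))   # u = d(psi)/dy
    v = np.real(np.fft.ifft2(-1j * kx * psi_hat))  # v = -d(psi)/dx
    return u, v


# Check on an analytic field: psi = cos(x) cos(y) gives omega = 2 cos(x) cos(y).
n = 32
x = 2 * np.pi * np.arange(n) / n
X, Y = np.meshgrid(x, x, indexing="ij")
u, v = velocity_from_vorticity(2 * np.cos(X) * np.cos(Y))
```

For this analytic test field the recovered velocity should match u = −cos(x) sin(y) and v = sin(x) cos(y) to machine precision, since the field is band-limited.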
Cylinder flow | Lambda–Omega | Kolmogorov Flow | Navier-Stokes | |
Spatial Domain | —— | |||
Grid Num | ||||
dt | 200 | 0.04 | 0.2 | 0.025 |
T | 45,000 | 40.0 | 40.0 | 50.0 |
Environments | , | , |
For each environment of each system, we simulate 100 trajectories from different initial conditions for training and 20 trajectories for testing. During testing, autoregressive prediction is performed for 100 steps on the Cylinder Flow and Lambda–Omega systems and for 50 steps on the Kolmogorov Flow and Navier-Stokes systems.
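The autoregressive test protocol can be sketched as a simple rollout loop, in which each predicted state is fed back as the next input. The function name `rollout` and the toy one-step map below are hypothetical stand-ins; in the experiments the one-step map would be a generated expert model.

```python
from typing import Callable, List, Sequence

State = List[float]


def rollout(step_fn: Callable[[State], State],
            initial_state: Sequence[float],
            n_steps: int) -> List[State]:
    """Apply a one-step predictor repeatedly, feeding each output back as input."""
    states: List[State] = []
    state = list(initial_state)
    for _ in range(n_steps):
        state = step_fn(state)
        states.append(state)
    return states


def toy_step(state: State) -> State:
    """Toy one-step map (hypothetical): every component decays by half."""
    return [0.5 * x for x in state]


# 100-step horizon, as used for the Cylinder Flow and Lambda-Omega systems.
predictions = rollout(toy_step, [1.0, 2.0], n_steps=100)
```

Because the model only ever sees its own previous output after the first step, errors can accumulate over the horizon, which is why the rollout length is reported alongside the results.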
Appendix B Model Zoo
In our main experiments, the settings for all meta-learning methods (except for DyAd, which uses a UNet by default) and the basic model of EnvAd-Diff are shown in Table 3. Additionally, we report the storage overhead of the model zoo and the hyperparameter settings during generation. During training, we uniformly use the Adam optimizer with a learning rate of , and other parameters are set to their default values.
Cylinder flow | Lambda–Omega | Kolmogorov Flow | Navier-Stokes | ERA5 | |
Channel Num | 2 | 2 | 3 | 3 | 2 |
N_modes | |||||
N_layers | 4 | 4 | 8 | 8 | 4 |
Hidden | 64 | 64 | 64 | 64 | 64 |
Domain Pretraining (epochs) | 20 | 100 | 10 | 10 | 50 |
Finetuning (epochs) | 50 | 50 | 50 | 50 | 20 |
Storage Space (GB) | 18.0 | 3.5 | 3.5 | 7.1 | 3.4 |
Appendix C Model Architecture
The learnable parameters of EnvAd-Diff consist of a weight VAE and a weight latent transformer diffusion model. The VAE includes layer-wise linear projection layers at the start and end stages, and inter-node attention layers in between. The diffusion model includes a noise network with a transformer architecture, where conditions are injected through adaptive layer normalization.
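Condition injection through adaptive layer normalization can be sketched as follows. This is a simplified illustration in the spirit of the transformer diffusion architecture of [23]: a linear map of the condition embedding produces a per-channel shift and scale that modulate the normalized tokens. The shapes, names, and the omission of a gating term are assumptions for illustration, not a description of the actual implementation.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each token over its feature dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def ada_layer_norm(x: np.ndarray, cond: np.ndarray,
                   w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Adaptive layer norm.

    x:    (tokens, d) latent weight tokens
    cond: (c,)        condition embedding (e.g. environment and timestep)
    w, b: (c, 2d), (2d,) linear map producing shift and scale
    """
    shift, scale = np.split(cond @ w + b, 2, axis=-1)
    return layer_norm(x) * (1.0 + scale) + shift


rng = np.random.default_rng(0)
tokens, d, c = 4, 8, 3
x = rng.normal(size=(tokens, d))
cond = rng.normal(size=(c,))

# With a zero-initialized map, the layer reduces to plain layer normalization.
w0, b0 = np.zeros((c, 2 * d)), np.zeros(2 * d)
out_identity = ada_layer_norm(x, cond, w0, b0)

# A bias that sets shift = 1 and scale = 0 translates every normalized token by 1.
b1 = np.concatenate([np.ones(d), np.zeros(d)])
out_shifted = ada_layer_norm(x, cond, w0, b1)
```

Zero-initializing the modulation map, as in the example, makes the conditioned block start as an unconditioned one, which is a common choice for stabilizing early training.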
Appendix D Baseline Implementation
The same training settings were used for all models, including training for 100 epochs using the Adam optimizer with a learning rate of 1e-4. Regarding the selection of foundation model parameters, we uniformly adjusted the embedding dimension, number of layers, and number of heads based on the dimensions suggested in the original papers to ensure comparable parameter counts for all models. For environment-adaptive models, we primarily used the default hyperparameters.
Appendix E Additional Results
Kolmogorov Flow | Navier-Stokes | |||
In-domain | Out-domain | In-domain | Out-domain | |
w/o Domain Init | ||||
w/o Function Loss | ||||
EnvAd-Diff |
E.1 Prompter

Cylinder Flow | Navier-Stokes | |||
In-domain | Out-domain | In-domain | Out-domain | |
Prompter | ||||
Environmental condition |
Here we supplement the analysis of the prompter proposed in Section 3.2. We visualize the relationship between the estimated surrogate label and the real environmental variables on the Cylinder Flow and Navier-Stokes systems (Figure 10a and c). The results indicate that both exhibit clear distribution patterns, and the explained variance of PCA exceeds 50%. We then verify the prediction accuracy of the prompter for the surrogate label on the test set, as shown in Figure 10b and d. The SVR-based prompter reliably infers the underlying physical label from the trajectory’s initial frame, which supports EnvAd-Diff’s strong generalization performance. Furthermore, we show that using 1D PCA components as surrogate labels is sufficient in this context. EnvAd-Diff can be extended to higher-dimensional labels for more complex environments.
We also compare the performance of EnvAd-Diff when conditioned on the real environmental conditions versus the surrogate environmental labels, as shown in Table 5. The results show little difference in EnvAd-Diff’s performance between the two settings, indicating that the prompter effectively helps EnvAd-Diff distinguish different environments and generate suitable weights.
E.2 LV system
The Lotka-Volterra equations describe the interaction between a predator-prey pair in an ecosystem:
where and respectively represent the quantity of the prey and the predator, and , , , define the species interactions. We generate each trajectory with a time interval of 0.1 and a total duration of 10.0, with initial conditions randomly sampled between 1 and 3. We change and as 2-dimensional environmental conditions, with both ranging from 0.5 to 1.0, sampled at 11 equally spaced points. Among a total of 11 × 11 = 121 environments, we randomly select 24 as the training set and use the rest as the test set. For all environments, and . We generate 100 trajectories for each environment.
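The data generation procedure can be sketched in plain Python. The sketch assumes the standard Lotka-Volterra form dx/dt = αx − βxy, dy/dt = δxy − γy, treats β and δ as the two varying environmental parameters, fixes α = γ = 0.5 as placeholder values, and integrates with a fourth-order Runge-Kutta step; all of these choices, as well as the function names, are illustrative assumptions rather than the exact configuration used here.

```python
import random


def lotka_volterra(state, alpha, beta, gamma, delta):
    """Standard predator-prey vector field (assumed form)."""
    x, y = state
    return (alpha * x - beta * x * y, delta * x * y - gamma * y)


def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def simulate(env, initial_state, dt=0.1, duration=10.0, alpha=0.5, gamma=0.5):
    """Integrate one trajectory; alpha and gamma are placeholder constants."""
    beta, delta = env
    f = lambda s: lotka_volterra(s, alpha, beta, gamma, delta)
    trajectory = [tuple(initial_state)]
    for _ in range(round(duration / dt)):
        trajectory.append(rk4_step(f, trajectory[-1], dt))
    return trajectory


# 11 equally spaced values in [0.5, 1.0] per parameter: 11 x 11 = 121 environments.
grid = [0.5 + 0.05 * i for i in range(11)]
environments = [(b, d) for b in grid for d in grid]

rng = random.Random(0)
rng.shuffle(environments)
train_envs, test_envs = environments[:24], environments[24:]

x0 = (rng.uniform(1.0, 3.0), rng.uniform(1.0, 3.0))
trajectory = simulate(train_envs[0], x0)
```

With a time interval of 0.1 and a duration of 10.0, each trajectory contains the initial state plus 100 integration steps, and the random split leaves 97 unseen environments for testing.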
For this system, we adopt a 2-layer MLP as the parameterized dynamic model, with a hidden layer dimension of 128 and Tanh as the activation function. We model the dynamics as a neural ordinary differential equation integrated with the fourth-order Runge-Kutta (rk4) solver. EnvAd-Diff generates the weights of the MLP, and forward predictions are made through the numerical solver. When distilling the generated weights, we first perform autoregressive prediction on a given trajectory using the generated weights. Once the prediction is complete, we employ the PySINDy library [52, 53] for symbolic regression. The candidate function library consists of polynomials up to second order, and other hyperparameters are set to their default values.
Appendix F Architectures of Expert Models
In our main experiments, we deploy three neural operators as expert models for EnvAd-Diff: FNO, UNO, and WNO. Here, taking the Cylinder Flow system as an example, we list the parameter composition and hyperparameter settings of these operators.
FNO
We adopt the code from the open-source repository [54] as the implementation for FNO.
UNO
We adopt the code from the open-source repository [54] as the implementation for UNO.
WNO
We adopt the code from the open-source repository [34] as the implementation for WNO.
Since the parameters of normalization layers are determined by the dataset and are not controlled by the environment, we disable normalization layers in all operators (they are also disabled by default in the original code).