Towards diffusion models for large-scale sea-ice modelling

Tobias Sebastian Finn Charlotte Durand Alban Farchi Marc Bocquet Julien Brajard

Abstract

We make the first steps towards diffusion models for unconditional generation of multivariate and Arctic-wide sea-ice states. While targeting to reduce the computational costs by diffusion in latent space, latent diffusion models also offer the possibility to integrate physical knowledge into the generation process. We tailor latent diffusion models to sea-ice physics with a censored Gaussian distribution in data space to generate data that follows the physical bounds of the modelled variables. Our latent diffusion models reach similar scores as the diffusion model trained in data space, but they smooth the generated fields as caused by the latent mapping. While enforcing physical bounds cannot reduce the smoothing, it improves the representation of the marginal ice zone. Therefore, for large-scale Earth system modelling, latent diffusion models can have many advantages compared to diffusion in data space if the significant barrier of smoothing can be resolved.

Generative diffusion, Earth system modelling, sea-ice simulations

1 Introduction

Generative diffusion (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) has revolutionised data generation in computer vision (Dhariwal & Nichol, 2021; Saharia et al., 2022a) and other domains (Bar-Tal et al., 2024; Huang et al., 2023). Initial applications to geophysical systems show promise for prediction (Leinonen et al., 2023; Li et al., 2023; Price et al., 2024; Finn et al., 2024) and downscaling (Mardani et al., 2023; Wan et al., 2023; Brajard et al., 2024). To cut costs, data can be mapped into a latent space using a pre-trained autoencoder, where diffusion models can be learned (Sinha et al., 2021; Vahdat et al., 2021; Rombach et al., 2022). However, the mapping can reduce the model’s ability to generate fine-grained data (Dai et al., 2023).

Here, we examine how well latent diffusion models (LDMs) can generate geophysical data compared to diffusion directly applied in data space. As a proof-of-concept, we unconditionally generate Arctic-wide sea-ice states as simulated with the state-of-the-art sea-ice model neXtSIM (Rampal et al., 2016; Ólason et al., 2022) at a $\frac{1}{4}^{\circ}$ resolution (Boutin et al., 2023). Sea ice, with its discrete elements, anisotropy, and multifractality, poses a challenging problem for generative models and serves as ideal candidate for evaluating LDMs. We show that while the general structure remains the same, LDMs produces smoother solutions than diffusion models in data space, as presented in Fig. 1.

Refer to caption — Figure 1: Samples generated with a diffusion model in data space (a–d) and a latent diffusion model with censoring (e–h). The thickness (a, e) and concentration (b, f) are directly generated, while the speed (c, g) and deformation (d, h) are derived from the velocities. Note, for visualisation purpose, only the central Arctic is shown while the whole Arctic is modelled. The remaining noise might be caused by using only 20 integration steps with a Heun sampler.

Although generative diffusion is fundamentally defined by its Gaussian diffusion process, we can integrate prior physical information into the encoder and decoder of an LDM, as similarly done in Shmakov et al. (2023). Hence, we tailor the approach of LDMs to Earth system modelling by incorporating physical bounds into the autoencoder. To bound the variables, we replace the common Gaussian reconstruction loss (mean-squared error) by the log-likelihood of censored Gaussian distributions, which combines regression and classification, enabling us to explicitly account for the bounds of the output during reconstruction. Such censored distributions improve the representation of the sea-ice extent and thereby enhance the physical consistency of the output, as demonstrated in this study.

2 Methods

In the following, we introduce a brief overview on LDMs. Specifically, we focus on how to make use of prior physical knowledge into such models. For a more detailed treatment, we refer to Appendix A.

Our goal is to train neural networks (NNs) to generate state samples ${\widehat{\mathbf{x}}\in\mathbb{R}^{5\times 512\times 512}}$ that should look like samples drawn from the here-used sea-ice simulation $\mathbf{x}\sim\mathcal{D}$ ; modelled are the thickness (SIT), concentration (SIC), damage (SID), and the velocities in $x$ - (SIU) and $y$ -direction (SIV). For the generation, we employ a diffusion model that works in a lower-dimensional latent space ${\mathbf{z}_{x}\in\mathbb{R}^{16\times 64\times 64}}$ as spanned by an autoencoder. The encoder $\mathrm{enc}(\mathbf{x})$ maps from data space into latent space, while the decoder $\mathrm{dec}(\mathbf{z}_{x})$ maps back into data space. The encoder and decoder are implemented as fully convolutional NNs (CNNs) with their weights and biases $\bm{\theta}$ and further described in Appendix B.2.

The autoencoder should reduce the data dimensionality without compressing the semantic (Rombach et al., 2022). We nevertheless want to achieve a continuous latent space, from which the decoder can map similar values to similar data points. Consequently, we apply a variational autoencoder (Kingma & Welling, 2013), where we downweight the Kullback-Leibler divergence by $\beta=0.001$ (Higgins et al., 2016). Given a data sample $\mathbf{x}$ , the loss function for the autoencoder then reads,


$\displaystyle\mathcal{L}(\mathbf{x},\bm{\theta})=$	$\displaystyle-\mathbb{E}_{q(\mathbf{z}_{x}\mid\mathbf{x},\bm{\theta})}\Big{[}% \log\big{(}p(\mathbf{x}\mid\mathbf{z}_{x},\bm{\theta})\big{)}\Big{]}$	(1a)
	$\displaystyle+\beta D_{KL}\Big{(}q(\mathbf{z}_{x}\mid\mathbf{x},\bm{\theta})\\|% p(\mathbf{z}_{x})\Big{)},$	(1b)

which is minimised by averaging the loss across a mini-batch of samples and using a variant of gradient descent.

The encoder maps a data sample into the mean and standard deviation of an assumed Gaussian distribution in latent space, $q(\mathbf{z}_{x}\mid\mathbf{x},\bm{\theta})=\mathcal{N}\left(\mu_{\mathsf{enc}% ,\bm{\theta}}(\mathbf{x}),(\sigma_{\mathsf{enc},\bm{\theta}}(\mathbf{x}))^{2}% \mathbf{I}\right)$ , while we assume a prior Gaussian distribution with mean $\bm{0}$ and a diagonal covariance $\mathbf{I}$ , $p(\mathbf{z}_{x})=\mathcal{N}\left(\bm{0},\mathbf{I}\right)$ .

The reconstruction loss in Eq. (1a) corresponds to the negative log-likelihood (NLL) of the data sample given an assumed distribution and includes the decoder, which maps from latent space to a Gaussian distribution, from where we can sample the decoder output $\widehat{\mathbf{y}}$ ,

\widehat{\mathbf{y}}\sim\mathcal{N}({dec}_{\bm{\theta}}(\mathbf{z}_{x}),% \mathbf{s}^{2}\mathbf{I}).

(2)

The standard deviation $\mathbf{s}$ is spatially shared with one value per output variable and learned alongside the autoencoder (Cipolla et al., 2018; Rybkin et al., 2020; Finn et al., 2023). From such a sample $\widehat{\mathbf{y}}$ , we can recover the reconstructed data sample $\widehat{\mathbf{x}}$ by applying a deterministic (possibly non-invertible) transformation function $\widehat{\mathbf{x}}=f(\mathbf{\widehat{\mathbf{y}}})$ . Depending on the transformation function, we obtain different reconstruction losses:

•

If we apply an identity function $f(\mathbf{\widehat{\mathbf{y}}})=\mathbf{\widehat{\mathbf{y}}}$ , the reconstruction loss is the NLL of a Gaussian distribution and includes a mean-squared error weighted by $\mathbf{s}$ .
•

If we clip the output into physical bounds, e.g. $f(\mathbf{\widehat{\mathbf{y}}})=\mathrm{min}(\mathrm{max}(\mathbf{\widehat{% \mathbf{y}}},\mathbf{x}_{\mathsf{L}}),\mathbf{x}_{\mathsf{U}})$ by specifying a lower $\mathbf{x}_{\mathsf{L}}$ and/or upper bound $\mathbf{x}_{\mathsf{U}}$ , the reconstruction loss is the NLL of a censored Gaussian distribution. The censored distribution utilises the decoder output, as described in Eq. (2), to regress the reconstruction and to determine whether it will be clipped into the bounds. Hence, this NLL combines a regression with a classification task and can be seen as NLL of a type-I Tobit model (Tobin, 1958; Amemiya, 1984).

For a detailed treatment of the reconstruction loss, we refer to Appendix A.2.

After training the autoencoder with Eq. (1), we use the mean prediction of the encoder, $\mu_{\mathsf{enc},\bm{\theta}}(\mathbf{x})$ , as deterministic mapping from data space into latent space and train a diffusion model in this space. The instantiated diffusion model (Sohl-Dickstein et al., 2015; Ho et al., 2020) can be seen as denoiser $D_{\bm{\phi}}(\mathbf{z}_{\tau},\tau)$ with its weights and biases $\bm{\phi}$ (Karras et al., 2022), which takes a noised latent state $\mathbf{z}_{\tau}$ at a pseudo-time $\tau\in[0,1]$ and should output the cleaned latent state $\mathbf{z}_{x}$ . During training, the noised latent state is produced by a variance-preserving diffusion process (Kingma et al., 2021; Song et al., 2021), under the assumption that the latent sample $\mathbf{z}_{x}$ has been normalised to mean $0$ and standard deviation $1$ ,

\displaystyle\mathbf{z}_{\tau}

\displaystyle=\alpha_{\tau}\mathbf{z}_{x}+\sigma_{\tau}\bm{\epsilon},

\displaystyle\text{with}~{}\bm{\epsilon}\sim\mathcal{N}(\bm{0},\mathbf{I}),

(3)

where the signal $\alpha_{\tau}$ and noise $\sigma_{\tau}$ amplitudes are given in terms of logarithmic signal-to-noise ratio $\gamma(\tau)=\log\Big{(}\frac{\alpha_{\tau}^{2}}{\sigma_{\tau}^{2}}\Big{)}$ . The training noise scheduler specifies the dependency of the ratio on the pseudo-time and is adapted during the training process (Kingma & Gao, 2023) as further described in Appendix A.3.

Given a latent sample $\mathbf{z}_{x}$ and pseudo-time $\tau$ , the loss function to train the diffusion model reads

\displaystyle\mathcal{L}(\mathbf{z}_{x},\tau,\bm{\phi})=w(\gamma)\Big{(}-\frac% {d\gamma}{d\tau}\Big{)}e^{\gamma}\lVert\mathbf{z}_{x}-D_{\bm{\phi}}(\mathbf{z}% _{\tau},\tau)\rVert_{2}^{2},

(4)

with $w(\gamma)=\exp{(-\frac{\gamma}{2})}$ as external weighting function (Kingma & Gao, 2023). The diffusion model is parameterised with a $v$ -prediction (Salimans & Ho, 2022) and implemented as UViT architecture (Hoogeboom et al., 2023) combining a U-Net (Ronneberger et al., 2015) with a vision transformer (Dosovitskiy et al., 2021). The minimum value of the noise scheduler is set to $\gamma_{\mathsf{min}}=-15$ and the maximum value to $\gamma_{\mathsf{max}}=15$ . We further describe the diffusion model in Appendix A.1 and its architecture in Appendix B.3.

To generate new data samples with the trained models, we draw a sample $\widehat{\mathbf{z}}_{1}\sim\mathcal{N}(\bm{0},\mathbf{I})$ , and integrate an ordinary differential equation (Song et al., 2021) with our trained diffusion model $D_{\bm{\phi}}(\widehat{\mathbf{z}}_{\tau},\tau)$ in $20$ integration steps. We use the second-order Heun integrator and sampling noise scheduler from Karras et al. (2022) to obtain a latent sample $\widehat{\mathbf{z}}_{x}$ . We then apply the decoder to reconstruct a sample in data space $\widehat{\mathbf{x}}$ . If the decoder is trained with censoring, we clip the $\text{SIT}\in[0,\infty)$ , $\text{SIC}\in[0,1]$ , and $\text{SID}\in[0,1]$ to their physical bounds.

3 Results

In our experiments, we evaluate how LDMs perform compared to generative diffusion trained in data space and how the censoring of the decoder improves the physical consistency of the generated samples. We train the models on simulations from the state-of-the-art Lagrangian sea-ice model neXtSIM (Rampal et al., 2016; Ólason et al., 2022) which has been coupled to the ocean component of the NEMO framework (Madec, 2008). The simulation data contains more than 20 simulation years (Boutin et al., 2023) with a six-hourly output on a $\frac{1}{4}^{\circ}$ ( $\simeq 12\,\text{km}$ in the Arctic) curvilinear grid. Omitting the first five years as spin-up, we train the models on data from 2000–2014, validate on 2015, and estimate the here-presented test scores on 2016–2018. All models are trained with ADAM (Kingma & Ba, 2017), a batch size of $16$ , and a learning rate of $2\times 10^{-4}$ for $10^{5}$ iterations. For further experimental details, we refer to Appendix C.

Table 1: Evaluation of the autoencoders for reconstructing the testing dataset with the normalised root-mean-squared error (RMSE), the structural similarity index measure (SSIM), and the accuracy in the sea-ice extent (

\text{ACC}_{\text{SIE}}

\beta

is the regularisation factor.

$\beta$	RMSE $\downarrow$	SSIM $\uparrow$	$\text{ACC}_{\text{SIE}}$ $\uparrow$
$1$	0.123	0.878	0.964
$10^{-3}$	0.076	0.939	0.986
$10^{-3}$ (censored)	0.078	0.941	0.991

The autoencoders compress the input data from $5\times 512\times 512$ to $16\times 64\times 64$ . We ablate design choices and evaluate the autoencoder performance of the deterministic mapping as used for diffusion in terms of reconstruction quality with the mean squared error (MSE), the structural similarity index measure (SSIM), and the accuracy in the sea-ice extent ( $\text{ACC}_{\text{SIE}}$ ). The definition of the metrics and their calculations are explained in Appendix D. The results are shown in Table 1.

Since its regularisation is much lower, the VAE trained with $\beta=10^{-3}$ reconstitutes into a better reconstruction than the VAE with $\beta=1$ , resulting into lower errors and higher similarities. Replacing the Gaussian log-likelihood in the reconstruction loss by a censored log-likelihood has a neutral impact on the RMSE and SSIM. However, the VAE with censoring can better reconstruct the marginal ice zone, which leads to a higher accuracy in the sea-ice extent. While higher accuracies therein might have only a small impact for generating new data, they might have a higher impact for other tasks where the output would be reused and were small errors can amplify, e.g., in surrogate modelling.

On top of the pre-trained autoencoder, we train LDMs as described in Section 2. We compare these LDMs to a diffusion model in data space, trained with Eq. (4) by setting $\mathbf{z}_{x}=\mathbf{x}$ . As shown in Fig. 1, LDMs smooth the generated fields compared to diffusion in data space. In the following, we establish how and why this smoothing happens by showing estimated spectra for the sea-ice thickness in Fig. 2.

LDMs lose small-scale information compared to the testing dataset, while diffusion models in data space retain information across all scales. This loss of small-scale information results into a smoothing in the generated samples as shown in Fig. 1. Since we find the same for the autoencoder, this smoothing is a result of the pre-training as variational autoencoder and the data compression in latent space.

The two diffusion models additionally experience a shift towards larger energies than the test dataset. This shift might be caused by a discrepancy between training (2000–2014) and testing (2016–2018) or by the approximations during sampling.

Table 2: Evaluation of the statistics for samples generated by the diffusion models compared to the testing dataset with the same size. The Fréchet distance (

\mathrm{FAED}

) estimates how well the structures align while the RMSE in the sea-ice extent (

\text{RMSE}_{\text{SIE}}

) shows how well ice edges are represented, both are multiplied for visualisation by

10^{2}

. Diffusion corresponds to diffusion in data space, while LDM are the latent diffusion models. The validation dataset has only 1/3 of the testing dataset size.

Model	$\mathrm{FAED}$ $\downarrow$	$\text{RMSE}_{\text{SIE}}$ $\downarrow$	Speed ( $\unit{s}$ ) $\downarrow$
Validation	2.38	7.02	–
Diffusion	1.73	9.66	1.764
LDM	1.65	11.56	0.069
LDM (censored)	1.79	7.32	0.070

Inspired by the Fréchet inception distance (Heusel et al., 2017), we evaluate the diffusion models in comparison to the testing dataset with the Fréchet distance in the latent space of an independently trained VAE ( $\mathrm{FAED}$ ). We additionally compare the probability that a grid point is covered by sea ice with the root-mean-squared error $\text{RMSE}_{\text{SIE}}$ , with results shown in Table 2. Whereas we evaluate on $4383$ generated samples only, the results are robust to the dataset size as demonstrated in Appendix E.2. Additionally, we measure the speed (latency) of the diffusion models in seconds for a single generation.

The diffusion models consistently outperform the validation dataset for the $\mathrm{FAED}$ . Although the validation $\mathrm{FAED}$ is limited by its dataset size, this nevertheless indicates that diffusion models effectively capture large-scale structures and correlations, as observed in the testing dataset. The differences in the $\mathrm{FAED}$ between the diffusion models might be caused by the volatility during training.

Diffusion models are trained with a Gaussian diffusion process, and the data space model struggles to accurately represent the sea-ice extent, leading to a higher RMSE than for the validation dataset. The smoothing in the LDM exacerbates the inaccuracies in representing sea-ice extent, resulting in even higher RMSE values. However, as for the reconstruction, censoring improves the sea-ice extent representation, bringing the RMSE to the realm of the validation dataset. In addition to an improved representation, the latent diffusion models are $25\times$ faster than the diffusion model in data space.

4 Conclusions

In this manuscript, we make the first step towards using latent diffusion models (LDMs) for generating multivariate, Arctic-wide sea-ice data. We compare LDMs with diffusion models in data space and find that LDMs lose small-scale information compared to the target fields, though their scores are similar to those of diffusion models in data space. This loss of information is due to the latent mapping process and its pre-training as variational autoencoder. To mitigate this, we could increase the number of channels in latent space (see Appendix E.1). Another approach would be to pre-train the autoencoder with an additional adversarial (Rombach et al., 2022) or spectral loss (Kochkov et al., 2023). These methods to try to mitigate the smoothing could complicate (e.g., Rombach et al., 2022) and notoriously destabilise (e.g., Karras et al., 2018) the training and tuning process.

Despite the loss of small-scale information, LDMs offer a significant advantage over diffusion models in data space: the ability to enforce physical bounds during data generation and their speed. By replacing the Gaussian data log-likelihood by a censored log-likelihood, we explicitly incorporate the clipping of the decoder output into the training process. The censoring has a neutral impact on the sample quality but leads to a more accurate representation of the sea-ice extent, indicating an increased physical consistency. This increased consistency could benefit other tasks as surrogate modelling or downscaling.

While we have tested LDMs only for sea-ice data, we anticipate similar smoothing issues for other Earth system components. Therefore, the smoothing issue remains a significant barrier to the broader application of the otherwise promising LDMs for large-scale Earth system modelling.

Acknowledgments

This manuscript is a contribution to the DeepGeneSIS project as financed by INSU/CNRS in the PNTS program and the SASIP project, supported under Grant no. 353 by Schmidt Science, LLC – a philanthropic initiative that seeks to improve societal outcomes through the development of emerging science and technologies. TSF, CD, AF, MB additionally acknowledge financial support by INSU/CNRS for the project GenD²M in the LEFE-MANU program. This work was granted access to the HPC resources of IDRIS under the allocations 2023-AD011013069R2 made by GENCI. We would like to thank Guillaume Boutin for providing access to the data and Laurent Bertino for helping to obtain the funding for the DeepGeneSIS project. With their reviews, four anonymous referees have improved the manuscript. CEREA is a member of the Institut Pierre-Simon Laplace (IPSL).

Impact Statement

We present here work whose goal is to advance the field of Earth system modelling and machine learning. While there are many potential societal consequences of our work, we will briefly discuss the impact on the energy consumption only.

Training neural networks consumes a lot of energy and in the future, the energy consumption of machine learning will rather increase than decrease, especially with the rise of generative models like diffusion models discussed here. Finding methods to reduce the computational costs of those models can be key to also reduce their energy consumption. While latent diffusion models might have the limitation of smoothing compared to diffusion models in data space, they offer an opportunity to decrease the costs and, thus, also the energy consumption. This might be especially helpful if we further strive towards larger and larger deep learning architectures, in the end possibly for coupled Earth system models.

References

Amemiya (1984) Amemiya, T. Tobit models: A survey. Journal of Econometrics, 24(1):3–61, January 1984. ISSN 0304-4076. doi: 10.1016/0304-4076(84)90074-5.
Anderson (1982) Anderson, B. D. O. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, May 1982. ISSN 0304-4149. doi: 10.1016/0304-4149(82)90051-5.
Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer Normalization. arXiv:1607.06450 [cs, stat], July 2016. arXiv: 1607.06450 tex.ids= ba2016a.
Bar-Tal et al. (2024) Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., Li, Y., Rubinstein, M., Michaeli, T., Wang, O., Sun, D., Dekel, T., and Mosseri, I. Lumiere: A Space-Time Diffusion Model for Video Generation, February 2024. arXiv:2401.12945 [cs].
Bouchat et al. (2022) Bouchat, A., Hutter, N., Chanut, J., Dupont, F., Dukhovskoy, D., Garric, G., Lee, Y. J., Lemieux, J.-F., Lique, C., Losch, M., Maslowski, W., Myers, P. G., Ólason, E., Rampal, P., Rasmussen, T., Talandier, C., Tremblay, B., and Wang, Q. Sea Ice Rheology Experiment (SIREx): 1. Scaling and Statistical Properties of Sea-Ice Deformation Fields. Journal of Geophysical Research: Oceans, 127(4):e2021JC017667, 2022. ISSN 2169-9291. doi: 10.1029/2021JC017667. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1029/2021JC017667.
Boutin et al. (2023) Boutin, G., Ólason, E., Rampal, P., Regan, H., Lique, C., Talandier, C., Brodeau, L., and Ricker, R. Arctic sea ice mass balance in a new coupled ice–ocean model using a brittle rheology framework. The Cryosphere, 17(2):617–638, February 2023. ISSN 1994-0416. doi: 10.5194/tc-17-617-2023. Publisher: Copernicus GmbH.
Brajard et al. (2024) Brajard, J., Korosov, A., Davy, R., and Wang, Y. Super-resolution of satellite observations of sea ice thickness using diffusion models and physical modeling. Technical Report EGU24-3826, Copernicus Meetings, March 2024. Conference Name: EGU24.
Cipolla et al. (2018) Cipolla, R., Gal, Y., and Kendall, A. Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7482–7491, Salt Lake City, UT, USA, June 2018. IEEE. ISBN 978-1-5386-6420-9. doi: 10.1109/CVPR.2018.00781.
Dai et al. (2023) Dai, X., Hou, J., Ma, C.-Y., Tsai, S., Wang, J., Wang, R., Zhang, P., Vandenhende, S., Wang, X., Dubey, A., Yu, M., Kadian, A., Radenovic, F., Mahajan, D., Li, K., Zhao, Y., Petrovic, V., Singh, M. K., Motwani, S., Wen, Y., Song, Y., Sumbaly, R., Ramanathan, V., He, Z., Vajda, P., and Parikh, D. Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack, September 2023. arXiv:2309.15807 [cs].
Dansereau et al. (2016) Dansereau, V., Weiss, J., Saramito, P., and Lattes, P. A Maxwell elasto-brittle rheology for sea ice modelling. The Cryosphere, 10(3):1339–1359, July 2016. ISSN 1994-0416. doi: 10.5194/tc-10-1339-2016. publisher: Copernicus GmbH.
Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems, volume 34, pp. 8780–8794. Curran Associates, Inc., 2021.
Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, June 2021. arXiv:2010.11929 [cs].
Durand et al. (2024) Durand, C., Finn, T. S., Farchi, A., Bocquet, M., Boutin, G., and Ólason, E. Data-driven surrogate modeling of high-resolution sea-ice thickness in the Arctic. The Cryosphere, 18(4):1791–1815, April 2024. ISSN 1994-0416. doi: 10.5194/tc-18-1791-2024. Publisher: Copernicus GmbH.
Falcon et al. (2020) Falcon, W., Borovec, J., Wälchli, A., Eggert, N., Schock, J., Jordan, J., Skafte, N., Ir1dXD, Bereznyuk, V., Harris, E., Murrell, T., Yu, P., Præsius, S., Addair, T., Zhong, J., Lipin, D., Uchida, S., Bapat, S., Schröter, H., Dayma, B., Karnachev, A., Kulkarni, A., Komatsu, S., Martin.B, SCHIRATTI, J.-B., Mary, H., Byrne, D., Eyzaguirre, C., cinjon, and Bakhtin, A. PyTorchLightning: 0.7.6 release, May 2020.
Finn et al. (2023) Finn, T. S., Durand, C., Farchi, A., Bocquet, M., Chen, Y., Carrassi, A., and Dansereau, V. Deep learning subgrid-scale parametrisations for short-term forecasting of sea-ice dynamics with a Maxwell elasto-brittle rheology. The Cryosphere, 17(7):2965–2991, July 2023. ISSN 1994-0416. doi: 10.5194/tc-17-2965-2023. Publisher: Copernicus GmbH.
Finn et al. (2024) Finn, T. S., Durand, C., Farchi, A., Bocquet, M., Rampal, P., and Carrassi, A. Generative diffusion for regional surrogate models from sea-ice simulations. April 2024. doi: 10.22541/au.171386536.64344222/v1.
Fishman et al. (2023) Fishman, N., Klarner, L., Bortoli, V. D., Mathieu, E., and Hutchinson, M. J. Diffusion models for constrained domains. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
Girard et al. (2011) Girard, L., Bouillon, S., Weiss, J., Amitrano, D., Fichefet, T., and Legat, V. A new modeling framework for sea-ice mechanics based on elasto-brittle rheology. Annals of Glaciology, 52(57):123–132, January 2011. ISSN 0260-3055, 1727-5644. doi: 10.3189/172756411795931499.
Goessling et al. (2016) Goessling, H. F., Tietsche, S., Day, J. J., Hawkins, E., and Jung, T. Predictability of the Arctic sea ice edge. Geophysical Research Letters, 43(4):1642–1650, 2016. ISSN 1944-8007. doi: 10.1002/2015GL067232. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/2015GL067232.
Hendrycks & Gimpel (2016) Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus), 2016.
Hersbach et al. (2020) Hersbach, H., Bell, B., Berrisford, P., Hirahara, S., Horányi, A., Muñoz-Sabater, J., Nicolas, J., Peubey, C., Radu, R., Schepers, D., Simmons, A., Soci, C., Abdalla, S., Abellan, X., Balsamo, G., Bechtold, P., Biavati, G., Bidlot, J., Bonavita, M., Chiara, G. D., Dahlgren, P., Dee, D., Diamantakis, M., Dragani, R., Flemming, J., Forbes, R., Fuentes, M., Geer, A., Haimberger, L., Healy, S., Hogan, R. J., Hólm, E., Janisková, M., Keeley, S., Laloyaux, P., Lopez, P., Lupu, C., Radnoti, G., Rosnay, P. d., Rozum, I., Vamborg, F., Villaume, S., and Thépaut, J.-N. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730):1999–2049, 2020. ISSN 1477-870X. doi: https://doi.org/10.1002/qj.3803. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/qj.3803.
Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv:1706.08500 [cs, stat], June 2017. arXiv: 1706.08500.
Higgins et al. (2016) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. November 2016.
Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising Diffusion Probabilistic Models, December 2020. tex.ids= ho2020, ho2020a, ho2020b arXiv: 2006.11239 [cs, stat] number: arXiv:2006.11239.
Hoogeboom et al. (2023) Hoogeboom, E., Heek, J., and Salimans, T. simple diffusion: End-to-end diffusion for high resolution images, January 2023. arXiv:2301.11093 [cs, stat].
Huang et al. (2023) Huang, Q., Park, D. S., Wang, T., Denk, T. I., Ly, A., Chen, N., Zhang, Z., Zhang, Z., Yu, J., Frank, C., Engel, J., Le, Q. V., Chan, W., Chen, Z., and Han, W. Noise2Music: Text-conditioned Music Generation with Diffusion Models, March 2023. arXiv:2302.03917 [cs, eess].
Kadow et al. (2020) Kadow, C., Hall, D. M., and Ulbrich, U. Artificial intelligence reconstructs missing climate information. Nature Geoscience, 13(6):408–413, June 2020. ISSN 1752-0908. doi: 10.1038/s41561-020-0582-5. Number: 6 Publisher: Nature Publishing Group.
Karras et al. (2018) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. February 2018.
Karras et al. (2022) Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the Design Space of Diffusion-Based Generative Models, October 2022. arXiv:2206.00364 [cs, stat].
Kingma et al. (2021) Kingma, D., Salimans, T., Poole, B., and Ho, J. Variational Diffusion Models. In Advances in Neural Information Processing Systems, volume 34, pp. 21696–21707. Curran Associates, Inc., 2021. tex.ids= kingma_variational_2021-1.
Kingma & Ba (2017) Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], January 2017. arXiv: 1412.6980.
Kingma & Gao (2023) Kingma, D. P. and Gao, R. Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation, September 2023. arXiv:2303.00848 [cs, stat].
Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat], December 2013. arXiv: 1312.6114.
Kochkov et al. (2023) Kochkov, D., Yuval, J., Langmore, I., Norgaard, P., Smith, J., Mooers, G., Lottes, J., Rasp, S., Düben, P., Klöwer, M., Hatfield, S., Battaglia, P., Sanchez-Gonzalez, A., Willson, M., Brenner, M. P., and Hoyer, S. Neural General Circulation Models, November 2023. arXiv:2311.07222 [physics].
Leinonen et al. (2023) Leinonen, J., Hamann, U., Nerini, D., Germann, U., and Franch, G. Latent diffusion models for generative precipitation nowcasting with accurate uncertainty quantification, April 2023. arXiv:2304.12891 [physics].
Li et al. (2023) Li, L., Carver, R., Lopez-Gomez, I., Sha, F., and Anderson, J. SEEDS: Emulation of Weather Forecast Ensembles with Diffusion Models, October 2023. arXiv:2306.14066 [physics].
Liu et al. (2018) Liu, G., Reda, F. A., Shih, K. J., Wang, T.-C., Tao, A., and Catanzaro, B. Image Inpainting for Irregular Holes Using Partial Convolutions. pp. 85–100, 2018.
Liu et al. (2022) Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. A ConvNet for the 2020s, March 2022.
Lou & Ermon (2023) Lou, A. and Ermon, S. Reflected Diffusion Models. In Proceedings of the 40th International Conference on Machine Learning, pp. 22675–22701. PMLR, July 2023. ISSN: 2640-3498.
Madec (2008) Madec, G. NEMO ocean engine. Project report 1288-1619, Institut Pierre-Simon Laplace (IPSL), 2008. Series: 27.
Mardani et al. (2023) Mardani, M., Brenowitz, N., Cohen, Y., Pathak, J., Chen, C.-Y., Liu, C.-C., Vahdat, A., Kashinath, K., Kautz, J., and Pritchard, M. Generative Residual Diffusion Modeling for Km-scale Atmospheric Downscaling, September 2023. arXiv:2309.15214 [physics].
Odena et al. (2016) Odena, A., Dumoulin, V., and Olah, C. Deconvolution and Checkerboard Artifacts. Distill, 1(10):e3, October 2016. ISSN 2476-0757. doi: 10.23915/distill.00003.
Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
Perez et al. (2017) Perez, E., Strub, F., de Vries, H., Dumoulin, V., and Courville, A. FiLM: Visual Reasoning with a General Conditioning Layer, December 2017. arXiv:1709.07871 [cs, stat].
Price et al. (2024) Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T. R., El-Kadi, A., Masters, D., Ewalds, T., Stott, J., Mohamed, S., Battaglia, P., Lam, R., and Willson, M. GenCast: Diffusion-based ensemble forecasting for medium-range weather, May 2024. arXiv:2312.15796 [physics] version: 2.
Rampal et al. (2016) Rampal, P., Bouillon, S., Ólason, E., and Morlighem, M. neXtSIM: a new Lagrangian sea ice model. The Cryosphere, 10(3):1055–1073, May 2016. ISSN 1994-0416. doi: 10.5194/tc-10-1055-2016. publisher: Copernicus GmbH.
Rampal et al. (2019) Rampal, P., Dansereau, V., Olason, E., Bouillon, S., Williams, T., Korosov, A., and Samaké, A. On the multi-fractal scaling properties of sea ice deformation. The Cryosphere, 13(9):2457–2474, September 2019. ISSN 1994-0416. doi: 10.5194/tc-13-2457-2019. Publisher: Copernicus GmbH.
Rombach et al. (2022) Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-Resolution Image Synthesis With Latent Diffusion Models. pp. 10684–10695, 2022.
Ronneberger et al. (2015) Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597 [cs], May 2015. arXiv: 1505.04597.
Rybkin et al. (2020) Rybkin, O., Daniilidis, K., and Levine, S. Simple and Effective VAE Training with Calibrated Decoders. arXiv:2006.13202 [cs, eess, stat], June 2020. arXiv: 2006.13202.
Saharia et al. (2022a) Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Advances in Neural Information Processing Systems, 35:36479–36494, December 2022a.
Saharia et al. (2022b) Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image Super-Resolution Via Iterative Refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–14, 2022b. ISSN 1939-3539. doi: 10.1109/TPAMI.2022.3204461. Conference Name: IEEE Transactions on Pattern Analysis and Machine Intelligence.
Salimans & Ho (2022) Salimans, T. and Ho, J. Progressive Distillation for Fast Sampling of Diffusion Models, June 2022. arXiv:2202.00512 [cs, stat].
Shmakov et al. (2023) Shmakov, A., Greif, K., Fenton, M., Ghosh, A., Baldi, P., and Whiteson, D. End-To-End Latent Variational Diffusion Models for Inverse Problems in High Energy Physics, May 2023. arXiv:2305.10399 [hep-ex].
Sinha et al. (2021) Sinha, A., Song, J., Meng, C., and Ermon, S. D2C: Diffusion-Decoding Models for Few-Shot Conditional Generation. In Advances in Neural Information Processing Systems, volume 34, pp. 12533–12548. Curran Associates, Inc., 2021.
Sohl-Dickstein et al. (2015) Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv:1503.03585 [cond-mat, q-bio, stat], November 2015. arXiv: 1503.03585.
Song & Ermon (2020) Song, Y. and Ermon, S. Improved Techniques for Training Score-Based Generative Models. In Advances in Neural Information Processing Systems, volume 33, pp. 12438–12448. Curran Associates, Inc., 2020.
Song et al. (2021) Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum Likelihood Training of Score-Based Diffusion Models, October 2021. arXiv:2101.09258 [cs, stat].
Talandier & Lique (2021) Talandier, C. and Lique, C. CREG025.L75-NEMO_r3.6.0, December 2021.
Tobin (1958) Tobin, J. Estimation of Relationships for Limited Dependent Variables. Econometrica, 26(1):24–36, 1958. ISSN 0012-9682. doi: 10.2307/1907382. Publisher: [Wiley, Econometric Society].
Vahdat et al. (2021) Vahdat, A., Kreis, K., and Kautz, J. Score-based Generative Modeling in Latent Space, December 2021. arXiv:2106.05931 [cs, stat].
Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention Is All You Need. arXiv:1706.03762 [cs], December 2017. tex.ids: vaswani_attention_2017-1 arXiv: 1706.03762.
Wan et al. (2023) Wan, Z. Y., Baptista, R., Chen, Y.-f., Anderson, J., Boral, A., Sha, F., and Zepeda-Núñez, L. Debias Coarsely, Sample Conditionally: Statistical Downscaling through Optimal Transport and Probabilistic Diffusion Models, October 2023. arXiv:2305.15618 [physics].
Xiong et al. (2020) Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. On Layer Normalization in the Transformer Architecture. In International Conference on Machine Learning, pp. 10524–10533. PMLR, November 2020. ISSN: 2640-3498.
Yadan (2019) Yadan, O. Hydra - A framework for elegantly configuring complex applications, 2019. tex.howpublished: Github.
Ólason et al. (2021) Ólason, E., Rampal, P., and Dansereau, V. On the statistical properties of sea-ice lead fraction and heat fluxes in the Arctic. The Cryosphere, 15(2):1053–1064, February 2021. ISSN 1994-0416. doi: 10.5194/tc-15-1053-2021. Publisher: Copernicus GmbH.
Ólason et al. (2022) Ólason, E., Boutin, G., Korosov, A., Rampal, P., Williams, T., Kimmritz, M., Dansereau, V., and Samaké, A. A New Brittle Rheology and Numerical Framework for Large-Scale Sea-Ice Models. Journal of Advances in Modeling Earth Systems, 14(8):e2021MS002685, 2022. ISSN 1942-2466. doi: 10.1029/2021MS002685. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1029/2021MS002685.

Appendix A Additional methods

In the main manuscript, we shortly explained the methods around our main message: we can train latent diffusion models for geophysics end-to-end and incorporate physical bounds into the latent diffusion model. In the following, we explain the methods more in detail. We elucidate on our variational diffusion formulation in Appendix A.1. We introduce censored Gaussian distributions for the data log-likelihood in Appendix A.2, and we discuss our noise scheduler in Appendix A.3.

A.1 Latent diffusion formulation

In our formulation of generative diffusion, we rely on variational diffusion models as introduced in Kingma et al. (2021). These models allow us to naturally make use of encoder and decoder structures to reduce the dimensionality of the system.

Given data drawn from a dataset $\mathbf{x}\sim\mathcal{D}$ , the encoder produces the corresponding latent state $\mathbf{z}_{x}=\mathrm{enc}(\mathbf{x})$ . This latent state is noised by a variance-preserving diffusion process,

\displaystyle\mathbf{z}_{\tau}

\displaystyle=\alpha_{\tau}\mathbf{z}_{x}+\sigma_{\tau}\bm{\epsilon},

\displaystyle\text{with}~{}\bm{\epsilon}\sim\mathcal{N}(\bm{0},\mathbf{I}),

(3)

where the pseudo-time $\tau\in[0,1]$ determines the temporal position of the process. The signal amplitude $\alpha_{\tau}$ specifies how much signal is contained in the noised sample, and the noise amplitude $\sigma_{\tau}=\sqrt{1-\alpha_{\tau}^{2}}$ how much signal has been replaced by noise. The dependency of the (noise) scheduling of $\alpha_{\tau}$ and $\sigma_{\tau}$ on the pseudo-time is formulated as logarithmic signal-to-noise ratio $\gamma(\tau)=\log(\frac{\alpha_{\tau}^{2}}{\sigma_{\tau}^{2}})$ such that we can recover $\alpha_{\tau}^{2}=\frac{1}{1+\exp(-\gamma(\tau))}$ and $\sigma_{\tau}^{2}=\frac{1}{1+\exp(\gamma(\tau))}$ .

As in the variational autoencoder framework (Kingma & Welling, 2013), the data log-likelihood $\log(p(\mathbf{x}))$ is lower bounded by the evidence lower bound (ELBO). In equivalence, we can write an upper bound on the negative data log-likelihood (Kingma et al., 2021),

-\log(p(\mathbf{x}))\leq-\mathbb{E}_{\mathbf{z}_{0}\sim q(\mathbf{z}_{0}\mid% \mathbf{x})}\Big{[}\log p(\mathbf{x}\mid\mathbf{z}_{0})\Big{]}+D_{KL}\bigg{(}q% (\mathbf{z}_{1}\mid\mathbf{x})\|p(\mathbf{z}_{1})\bigg{)}+\mathbb{E}_{\mathbf{% z}_{x}\sim q(\mathbf{z}_{x}\mid\mathbf{x}),\tau\sim[0,1]}\mathcal{L}(\phi,% \mathbf{z}_{x},\tau).

(5)

The first term is the negative log-likelihood of the data given a sample $\mathbf{z}_{0}$ drawn in latent space, and also called reconstruction loss, measuring how well we can reconstruct the data from the latent space. The likelihood of the data $p(\mathbf{x}\mid\mathbf{z}_{0})$ maps back from latent space into data space and, hence, contains the decoder. The posterior distribution $q(\mathbf{z}_{0}\mid\mathbf{x})$ specifies the first noised state given a data sample and includes the encoder, mapping from data space to latent space. We will further discuss different options for the data log-likelihood in Appendix A.2.

The second term is the Kullback-Leibler divergence between the distribution at $\tau=1$ , the end of the diffusion process, and a prior distribution for the diffusion process. Given the variance-preserving diffusion process from Eq. (3) and assuming a Gaussian prior distribution with mean $\bm{0}$ and covariance $\mathbf{I}$ , $p(\mathbf{z}_{1})=\mathcal{N}(\bm{0},\mathbf{I})$ , we obtain for the second term,

D_{KL}\bigg{(}q(\mathbf{z}_{1}\mid\mathbf{x})\|p(\mathbf{z}_{1})\bigg{)}=\frac% {1}{2}\sum^{N_{\mathsf{latent}}}_{n=1}\Big{[}\sigma^{2}_{1}+(\alpha_{1}z_{x,n}% )^{2}-1-\log(\sigma^{2}_{1})\Big{]},

(6)

over the $N_{\mathsf{latent}}$ dimensions of the latent space.

The last term is the diffusion loss, the only term that depends on the denoising NN $D_{\phi}(\mathbf{z}_{\tau},\tau)$ . Given a sample in latent space $\mathbf{z}_{x}$ and a drawn pseudo-time $\tau$ , we can calculate a noised sample $\mathbf{z}_{\tau}$ and the loss reads

\mathcal{L}(\bm{\phi},\mathbf{z}_{x},\tau)=\Big{(}-\frac{d\gamma}{d\tau}\Big{)% }\cdot e^{\gamma}\lVert\mathbf{z}_{x}-D_{\bm{\phi}}(\mathbf{z}_{\tau},\tau)% \rVert_{2}^{2}.

(7)

Our loss function used to train the diffusion models, Eq. 4, corresponds to this loss term by setting $w(\gamma)=1$ . As shown in Kingma & Gao (2023), using a monotonic weighting, like $w(\gamma)=\exp(-\frac{\gamma}{2})$ , corresponds to the ELBO with data augmentation. The only variable in Eq. (7) are the weights and biases of the diffusion model and the noise scheduler that specifies $\gamma(\tau)$ . Using the observation that the noise scheduling can be seen as importance sampling, we use an adaptive noise scheduler as explained in Appendix A.3.

In our experiments with diffusion models, we fix the end points of the noise scheduler to $\gamma(0)=\gamma_{min}=-15$ and $\gamma(1)=\gamma_{max}=15$ , and use a fixed or pre-trained encoder and decoder. The first two terms of Eq. (5) are then constant with respect to the trained diffusion model and can be neglected in the optimisation. We obtain Eq. (7) as the only loss to optimise the diffusion model.

Instead of directly predicting the state $\mathbf{z}_{x}$ , we predict with our diffusion model a surrogate target $\mathbf{v}_{\tau}\coloneqq\alpha_{\tau}\bm{\epsilon}-\sigma_{\tau}\mathbf{z}_{x}$ , as this tends to improve the convergence and the stability of the training (Salimans & Ho, 2022). We can recover the denoised state by setting $D_{\bm{\phi}}(\mathbf{z}_{\tau},\tau)=\alpha_{\tau}\mathbf{z}_{x}-\sigma_{\tau% }v_{\bm{\phi}}(\mathbf{z}_{\tau},\tau)$ . Our loss function reads then

\mathcal{L}(\bm{\phi},\mathbf{z}_{\tau},\tau)=w(\gamma)\cdot\Big{(}-\frac{d% \gamma}{d\tau}\Big{)}\cdot(e^{-\gamma}+1)^{-1}\lVert\mathbf{v}_{\tau}-v_{\bm{% \phi}}(\mathbf{z}_{\tau},\tau)\rVert^{2}_{2},

(8)

which corresponds to the ELBO with data augmentation and $\mathbf{v}_{\tau}$ -prediction.

To optimise the encoder and decoder together with the diffusion in an end-to-end training scheme (Vahdat et al., 2021; Shmakov et al., 2023), we can make use of Eq. (5). However, pre-training the encoder and decoder simplifies and stratifies the training process. To pre-train the encoder and decoder, we apply a variational autoencoder loss function,


$\displaystyle\mathcal{L}(\bm{\theta})=$	$\displaystyle-\mathbb{E}_{q(\mathbf{z}_{x}\mid\mathbf{x})}\Big{[}\log\big{(}p(% \mathbf{x}\mid\mathbf{z}_{x})\big{)}\Big{]}$	(1a)
	$\displaystyle+\beta D_{KL}\Big{(}q(\mathbf{z}_{x}\mid\mathbf{x})\\|p(\mathbf{z}% _{x})\Big{)}.$	(1b)

Comparing Eq. (5) with this pre-training loss, we can see that the reconstruction loss appears in both formulations, while the loss terms in latent/noised space are combined into a single loss, and the Eq. (1b) should prepare the latent space for its use in diffusion models. The reconstruction loss is discussed more in detail in Appendix A.2.

We train two different version of diffusion models:

•

The encoder and decoder are pre-trained with Eq. (1) as $\mathbf{z}_{x}=\mathrm{enc}(\mathbf{x})$ and $\hat{\mathbf{x}}=\mathrm{dec}(\mathbf{z}_{x})$ , which results into a latent diffusion model.
•

The encoder and decoder are fixed to the identity functions $\mathbf{z}_{x}=\mathbf{x}$ and $\hat{\mathbf{x}}=\mathbf{z}_{x}$ , which leads to diffusion models in data space.

During their training, we draw a mini-batch of data samples $\mathbf{x}\sim\mathcal{D}$ , converted into latent space $\mathbf{z}_{x}$ , and pseudo-times $\tau\in[0,1]$ . We reduce the variance of the sampling in the pseudo-times by using stratified sampling as proposed in Kingma et al. (2021). Afterwards, we estimate the diffusion loss, Eq. (4), for the current parameters $\phi$ of the diffusion model. This loss is then used in Adam (Kingma & Ba, 2017) to make a gradient descent step.

The diffusion process can be seen as stochastic differential equation (SDE) that is integrated in pseudo-time (Song et al., 2021). The reversion of this process is again a SDE (Anderson, 1982) and we can find an ordinary differential equation (ODE) that has the same marginal distribution as the reverse SDE (Song et al., 2021),

d\widetilde{\mathbf{z}}_{\tau}=\Big{[}\mathbf{f}(\widetilde{\mathbf{z}}_{\tau}% ,\tau)-\frac{1}{2}g(\tau)^{2}\mathbf{s}_{\bm{\phi}}(\widetilde{\mathbf{z}}_{% \tau},\tau)\Big{]}d\tau.

(10)

For the variance-preserving diffusion process as defined in Eq. (3), we obtain for the drift and diffusion term

	$\displaystyle\mathbf{f}(\widetilde{\mathbf{z}}_{\tau},\tau)$	$\displaystyle=-\frac{1}{2}\Big{(}\frac{d}{d\tau}\log(1+e^{-\gamma(\tau)})\Big{% )}\widetilde{\mathbf{z}}_{\tau}$		(11)
	$\displaystyle g(\tau)^{2}$	$\displaystyle=\frac{d}{d\tau}\log(1+e^{-\gamma(\tau)}).$		(12)

Consequently, the only variable in the ODE, Eq. (10), is the score function $\mathbf{s}_{\bm{\phi}}(\widetilde{\mathbf{z}}_{\tau},\tau)$ which we approximate with our trained NN,

\mathbf{s}_{\bm{\phi}}(\widetilde{\mathbf{z}}_{\tau},\tau)=-\Big{(}\widetilde{% \mathbf{z}}_{\tau}+\frac{\alpha_{\tau}}{\sigma_{\tau}}v_{\bm{\phi}}(\widetilde% {\mathbf{z}}_{\tau},\tau)\Big{)}.

(13)

Given Eq. (11)–(13), we can integrate (10) from $\tau=1$ to $\tau=0$ to find a solution and generate new samples; the initial conditions for the ODE are given by $\widetilde{\mathbf{z}}_{1}\sim\mathcal{N}(\bm{0},\mathbf{I})$ . For the integration of the ODE, we use a second-order Heun sampler (Karras et al., 2022) with 20 integration steps. The time steps for the integration are given by the inference noise scheduler as proposed by Karras et al. (2022). We extend the noise scheduler to a wider range of signal-to-noise ratio by setting $\gamma_{min}=-15$ and $\gamma_{max}=15$ with truncation (Kingma & Gao, 2023). Additionally, we adapt the sampling procedure: we apply the Heun procedure for all 20 integration steps, while we use a single Euler step to denoise the output from $\widetilde{\mathbf{z}}_{0}$ to $\widetilde{\mathbf{z}}_{x}$ . The integration hence includes 41 iterations with the trained neural network. After obtaining $\widetilde{\mathbf{z}}_{x}$ , we apply the decoder to map back into data space $\widetilde{\mathbf{x}}=\mathrm{dec}(\widetilde{\mathbf{z}}_{x})$ which gives us then the generated sample.

A.2 Data log-likelihood with censored distributions

The autoencoder should map from latent space to data space by taking physical bounds into account, e.g. the non-negativity of the sea-ice thickness. To achieve such bounds, we can clip the decoder output and set values below (over) the lower (upper) bound explicitly to the bound; similar to what the rectified linear unit (relu) activation function does for the negative bound. Hence, all values below (above) the bound are projected to the bound. In this clipping case, the output of the decoder is no longer the physical quantity itself but another latent variable, which is converted into the physical quantity in a post-processing step. To differentiate between physical quantities and additional latent variable, we denote the decoder output as predicted latent variable $\widehat{\mathbf{y}}\in\mathbb{R}^{5\times 512\times 512}$ in the following. We assume that this predicted latent variable is Gaussian distributed where the decoder predicts the mean $\bm{\mu}=\mathrm{dec}(\mathbf{z}_{x})$ and the standard deviation $\bm{s}$ is spatially shared with one parameter per variable,

\widehat{\mathbf{y}}\sim\mathcal{N}\big{(}\widehat{\bm{\mu}},\bm{s}^{2}\mathbf% {I}\big{)}.

(14)

Caused by the diagonal covariance, we make out of the multivariate data log-likelihood a univariate estimation, which we can simply sum over the variables and grid points. Given this Gaussian assumption, the data log-likelihood would read for $K$ variables and $L$ grid points

-\log p(\mathbf{x}\mid\mathbf{z}_{x})=\frac{1}{2}\sum^{K}_{k=1}\sum^{L}_{l=1}% \frac{(x_{k,l}-\widehat{\mu}_{k,l})^{2}}{(s_{k})^{2}}+\log\big{(}(s_{k})^{2}% \big{)}+\log(2\cdot\pi).

(15)

The log-likelihood is proportional to a mean-squared error (MSE) weighted by the scale parameter. Although the MSE is often employed to train an autoencoder, we make the assumption that the decoder output is the physical quantity itself and we neglect the clipping in the training of the autoencoder. Consequently, if we clip the output into physical bounds, we make a wrong assumption and introduce a bias into the decoder and, thus, the latent space. This bias depends on the number of cases that the lower (upper) bound is reached and is difficult to quantify. To explicitly bake the clipping into the cost function and treating the decoder output as latent variable, we censor the assumed distribution.

To optimise the autoencoder, we derive the data log-likelihood where variables are lower and upper bounded with $\bm{x}_{\mathsf{L}}$ as lower and $\bm{x}_{\mathsf{U}}$ as upper bound, e.g., $x_{\mathsf{L}}=0$ as lower and $x_{\mathsf{U}}=1$ as upper bound for the sea-ice concentration. For variables with only one bound, the case of the missing bound can be simply omitted.

The predicted latent variable for the $k$ -th variable and the $l$ -th grid point is clipped into the physical bounds by

\widehat{\mathbf{x}}_{k,l}=\begin{cases}x_{\mathsf{L},k},&\text{if }\widehat{% \mu}_{k,l}\leq x_{\mathsf{L},k}\\ x_{\mathsf{U},k},&\text{if }\widehat{\mu}_{k,l}\geq x_{\mathsf{U},k}\\ \widehat{\mu}_{k,l},&\text{otherwise}.\end{cases}

(16)

This clipping operation is non-invertible such that we cannot recover the true latent variable $y_{k,l}$ from observing the physical quantity $x_{k,l}$ . However, we know that if $x_{k,l}=x_{\mathsf{L},k}$ or $x_{k,l}=x_{\mathsf{U},k}$ , the true latent variable was clipped. In such cases, rationally, the data log-likelihood hence should increase the probability that the decoder output is clipped.

In the following, we denote the cumulative distribution function (CDF) of the normal Gaussian distribution as $\Phi$ and its probability density function (PDF) as $\varphi$ . Given the assumed Gaussian distribution of the latent variables, Eq. (14), the physical quantity is presumably distributed by a censored Gaussian distribution with the following PDF for the $k$ -th variable and $l$ -th grid point,

p(x_{k,l}\mid\mathbf{z}_{x})=\begin{cases}\Phi(\frac{x_{\mathsf{L},k}-\widehat% {\mu}_{k,l}}{s_{k}}),&\text{if }x_{k,l}=x_{\mathsf{L},k},\\ \Phi(\frac{\widehat{\mu}_{k,l}-x_{\mathsf{U},k}}{s_{k}}),&\text{if }x_{k,l}=x_% {\mathsf{U},k},\\ \frac{1}{s_{k}}\varphi(\frac{x_{k,l}-\widehat{\mu}_{k,l}}{s_{k}}),&\text{if }x% _{\mathsf{L},k}<x_{k,l}<x_{\mathsf{U},k},\\ 0,&\text{otherwise}.\end{cases}

(17)

Different from a truncated Gaussian distribution, Eq. (17) gives values at the bounds a probability larger than zero, and all the density of the latent variable exceeding the bounds is folded to the bounds.

Using the PDF of the censored distribution from Eq. (17) for the data log-likelihood, we obtain

\begin{split}-\log p(\mathbf{x}\mid\mathbf{z}_{x})=\sum^{K}_{k=1}\sum^{L}_{l=1% }\phantom{+}&I(x_{k,l}=x_{\mathsf{L},k})\,\log\bigg{(}\Phi\bigg{(}\frac{x_{% \mathsf{L},k}-\widehat{\mu}_{k,l}}{s_{k}}\bigg{)}\bigg{)}\\ +&I(x_{k,l}=x_{\mathsf{U},k})\,\log\bigg{(}\Phi\bigg{(}\frac{\widehat{\mu}_{k,% l}-x_{\mathsf{U},k}}{s_{k}}\bigg{)}\bigg{)}\\ +&I(x_{\mathsf{L},k}<x_{k,l}<x_{\mathsf{U},k})\,\log\bigg{(}\frac{1}{s_{k}}% \varphi\bigg{(}\frac{x_{k,l}-\widehat{\mu}_{k,l}}{s_{k}}\bigg{)}\bigg{)}\end{split}

(18)

as cost function, where $I(\cdot)$ is the indicator function. The first two parts of the cost function are the bounded cases and correspond to a Gaussian classification model, using as probability the log-CDFs at the bounds. The last part is the log-likelihood of a Gaussian distribution as in Eq. (15) and includes the weighted MSE. The here derived log-likelihood thus combines a regression with a classification and is the negative log-likelihood of a type I Tobit model (Tobin, 1958). Whereas the optimisation of Eq. (18) seems more complicated than the optimisation of a log-likelihood from a Gaussian distribution, we have experienced no optimisation issues with variants of gradients descent.

Caused by the regression-classification mixture, the decoder output specifies not only the variable of interest but combined with the scale also the probability that the output will be clipped. This censored distribution approach works because we neglect correlations between variables and grid points and convert the multivariate into a univariate prediction problem. Nevertheless, we are free to chose any neural network architecture to predict the mean of the distribution in Eq. (14). Consequently, by e.g. using a fully convolutional decoder, we can still represent correlations in the decoder output.

A.3 Adaptive noise scheduler

The noise scheduler maps a given pseudo-time $\tau\in[0,1]$ to a log-signal-to-noise ratio $\gamma(\tau)$ and determines how much time is spent at a given ratio. In the manuscript, we use different schedulers during training and generation. While we generate the data with a fixed scheduler as proposed by Karras et al. (2022), we make use of an adaptive scheduler (Kingma & Gao, 2023) during training. In the following, we briefly introduce the principles of the adaptive scheduler, and we refer to Kingma & Gao (2023) for a longer treatment.

As shortly shown in Appendix A.1, the loss function of the diffusion model optimises the evidence lower bound (ELBO). Consequently, we can extend the variational parameters by the parameters of the training noise scheduler to further tighten the bound (Kingma et al., 2021) which gives us the optimal scheduler in the ELBO sense. Extending on this idea, the distribution given after transforming a drawn pseudo-time into $\gamma$ can be seen as importance sampling distribution $p(\gamma)$ (Kingma & Gao, 2023), and we can write

-\frac{d\gamma}{d\tau}=\frac{1}{p(\gamma)}.

(19)

Interpreting the output of the noise scheduler as a drawn from an importance sampling distribution allows us to formulate the optimal noise scheduler which should fulfil

p(\gamma)\propto\mathbb{E}_{\mathbf{x}\sim\mathcal{D},\epsilon\sim\mathcal{N}(% \bm{0},\mathbf{I})}\Big{[}w(\gamma)\cdot(e^{-\gamma}+1)^{-1}\lVert\mathbf{v}_{% \gamma}-v_{\bm{\phi}}(\mathbf{z}_{\gamma},\gamma)\rVert^{2}_{2}\Big{]},

(20)

where we made a change-of-variables from $\tau$ to $\gamma$ to signify the dependency on $\gamma$ .

As Eq. (20) is infeasible, we approximate the optimal importance sampling distribution. Using 100 equal-distant bins between $\gamma_{\mathrm{min}}=-15$ and $\gamma_{\mathrm{max}}=15$ , we keep track of an exponential moving average of Eq. (20) for each bin. The density is constant within each bin, which allows us to estimate an empirical cumulative distribution function from $\gamma_{\mathrm{max}}$ to $\gamma_{\mathrm{min}}$ . The training noise scheduler is then given as empirical quantile function mapping from $[0,1]$ to $[\gamma_{\mathrm{max}},\gamma_{\mathrm{min}}]$ . In practice, this adaptive noise scheduler reduces the number of needed hyperparameters for the diffusion model and improves its convergence (Kingma & Gao, 2023). It was recently similarly used in Finn et al. (2024).

Appendix B Architectures

In the following, we will describe our neural network architectures. The autoencoder is purely based on a convolutional architecture, while the diffusion models are based on a mixture between a convolutional U-Net (Ronneberger et al., 2015) and a vision transformer (ViT, Dosovitskiy et al. (2021)), termed UVit (Hoogeboom et al., 2023).

B.1 Masked convolutions

As visible in Fig. 1, sea ice only exists on grid points covered by ocean while grid points with land are omitted. To take the masking of land pixels into account, we apply convolutional operations as inspired by partial convolutions (Liu et al., 2018; Kadow et al., 2020; Durand et al., 2024).

Before each convolutional operation, the data is multiplied by the mask $\mathbf{m}\in[0,1]$ , where valid grid points are $1$ . The masking sets all grid points with land to zero such that they act like zero padding and interactions with land grid points are deactivated. Consequently, grid points near land exchange less information with the surrounding grid points which in fact imitates how land masking is implemented in numerical models.

B.2 Autoencoder

The autoencoder maps from $\mathbf{x}\in\mathbb{R}^{5\times 512\times 512}$ to $\mathbf{z}_{z}\in\mathbb{R}^{16\times 64\times 64}$ and vice-versa. Its encoder includes a series of downsampling operations with ConvNeXt blocks, while the decoder combines upsampling operations with ConvNeXt blocks.

The main computing block are ConvNeXt blocks (Liu et al., 2022): a wide masked convolution with a $7\times 7$ kernel extracts channel-wise spatial information. This spatial information is normalised with layer normalisation (Ba et al., 2016). Afterwards, a multi-layered perceptron (MLP) with a Gaussian error linear unit (Gelu, Hendrycks & Gimpel (2016)) activation mixes up the normalised information, before the output gets added again to the input of the block in a residual connection. Throughout the ConvNeXt block, the number of channels remain the same.

The downsampling in the encoder combines layer normalisation with a masked convolution and a $2\times 2$ kernel and a stride of $2$ , which also doubles the number of channels. To downsample the mask, we apply max pooling, i.e. if there is an ocean grid point in a $2\times 2$ window, the output grid point is set to $1$ . This increases the number of ocean grid points compared to a strided downsampling.

In the upsampling of the decoder, a layer normalisation is followed by a nearest neighbour interpolation, which doubles the number of grid points in $y$ - and $x$ -direction. Afterwards, a masked convolution with a $3\times 3$ kernel smooths the interpolated fields and halves the number of channels. This combination of interpolation with convolution results into less checkerboard effects compared to a transposed convolution (Odena et al., 2016).

The architectures for the encoder and decoder read (the number in brackets represents the number of channels):

•

Encoder: Input(5) $\rightarrow$ Point-wise convolution (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Downsampling (128) $\rightarrow$ ConvNeXt (128) $\rightarrow$ ConvNeXt (128) $\rightarrow$ Downsampling (256) $\rightarrow$ ConvNeXt (256) $\rightarrow$ ConvNeXt (256) $\rightarrow$ Downsampling (512) $\rightarrow$ ConvNeXt (512) $\rightarrow$ ConvNeXt (512) $\rightarrow$ Layer normalisation (512) $\rightarrow$ rectified linear unit (relu, 512) $\rightarrow$ Point-wise convolution (32)
•

Decoder: Input(16) $\rightarrow$ Point-wise convolution (512) $\rightarrow$ ConvNeXt (512) $\rightarrow$ ConvNeXt (512) $\rightarrow$ Upsampling (256) $\rightarrow$ ConvNeXt (256) $\rightarrow$ ConvNeXt (256) $\rightarrow$ Upsampling (128) $\rightarrow$ ConvNeXt (128) $\rightarrow$ ConvNeXt (128) $\rightarrow$ rightarrow (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Layer normalisation (64) $\rightarrow$ relu (64) $\rightarrow$ Point-wise convolution (5).

Note, the encoder outputs the mean and standard deviation in latent space, while the decoder gets only a latent sample where the mean and standard deviation are combined. The relu activation before the last point-wise convolution helps to improve the representation of continuous-discrete sea-ice processes.

In total, the encoder has $2.2\times 10^{6}$ parameters and the decoder has $3.1\times 10^{6}$ parameters.

B.3 Diffusion architecture

The diffusion models modify the UViT architectures as introduced in Hoogeboom et al. (2023). The encoding and decoding part of the architecture are implemented with convolutional operations, similar to how we implemented the autoencoder. In the bottleneck, the architecture uses a vision transformer (Dosovitskiy et al., 2021).

In addition to the latent sample $\mathbf{z}_{\tau}$ , the diffusion models get the pseudo-time $\tau$ as input. This pseudo-time is handled as conditioning information and embedded with a sinusoidal embedding (Vaswani et al., 2017) to extract 512 Fourier features. The embedding is followed by a MLP which reduces the number of features to 256 and where we apply a Gelu for feature activation. These extracted features are fed into the adaptive layer normalisation layers (Perez et al., 2017) with conditioned affine parameters.

The main computing blocks of the encoding and decoding part are again ConvNeXt blocks (see also Appendix B.2), with an additional conditioning of the residual connections on the pseudo-time features. The downsampling in the encoding part remains the same as for the encoder. For both blocks, we simply replace layer normalisation by adaptive layer normalisation conditioned on the pseudo-time.

The upsampling is similar to the upsampling for the decoder (see also Appendix B.2). However, we have also skip connections, transferring information from before the downsampling to the upsampling at the same level. These features are concatenated with the interpolated features before the convolution of the upsampling is applied. Consequently, the convolution has $1.5\times$ more input channels than the convolution in the upsampling of the decoder. Additionally, we again replace layer normalisation by adaptive layer normalisation conditioned on the pseudo-time.

The transformer blocks (Vaswani et al., 2017) closely resemble the default implementation of ViT transformer blocks (Dosovitskiy et al., 2021). A self-attention block with 8 heads is followed by a MLP. Throughout the transformer block, the number of channels remain the same, even in the multi-layered perceptron. We apply adaptive layer normalisation at the beginning of each residual layer in the self-attention block and MLP (Xiong et al., 2020) and additionally conditioned the residual connection on the pseudo-time. We remove land grid points before applying the transformer and add them afterwards again.

The architectures for the diffusion model in data space and in latent space read then (the number in brackets represents the number of channels):

•

Data space: Input(5) $\rightarrow$ Point-wise convolution (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Downsampling (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Downsampling (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Downsampling (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Downsampling (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Downsampling (128) $\rightarrow$ ConvNeXt (128) $\rightarrow$ ConvNeXt (128) $\rightarrow$ Downsampling (256) $\rightarrow$ Transformer (256) $\rightarrow$ Transformer (256) $\rightarrow$ Transformer (256) $\rightarrow$ Transformer (256) $\rightarrow$ Transformer (256) $\rightarrow$ Transformer (256) $\rightarrow$ Transformer (256) $\rightarrow$ Transformer (256) $\rightarrow$ Upsampling (128) $\rightarrow$ ConvNeXt (128) $\rightarrow$ ConvNeXt (128) $\rightarrow$ Upsampling (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Upsampling (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Upsampling (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Upsampling (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Upsampling (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Layer normalisation (64) $\rightarrow$ relu (64) $\rightarrow$ Point-wise convolution (5)
•

Latent space: Input(16) $\rightarrow$ Point-wise convolution (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Downsampling (128) $\rightarrow$ ConvNeXt (128) $\rightarrow$ ConvNeXt (128) $\rightarrow$ Downsampling (256) $\rightarrow$ ConvNeXt (256) $\rightarrow$ ConvNeXt (256) $\rightarrow$ Downsampling (512) $\rightarrow$ Transformer (512) $\rightarrow$ Transformer (512) $\rightarrow$ Transformer (512) $\rightarrow$ Transformer (512) $\rightarrow$ Transformer (512) $\rightarrow$ Transformer (512) $\rightarrow$ Transformer (512) $\rightarrow$ Transformer (512) $\rightarrow$ Upsampling (256) $\rightarrow$ ConvNeXt (256) $\rightarrow$ ConvNeXt (256) $\rightarrow$ Upsampling (128) $\rightarrow$ ConvNeXt (128) $\rightarrow$ ConvNeXt (128) $\rightarrow$ Upsampling (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ ConvNeXt (64) $\rightarrow$ Layer normalisation (64) $\rightarrow$ relu (64) $\rightarrow$ Point-wise convolution (5).

Note, caused by the high dimensionality of the data, the diffusion model in data space has a high memory consumption, and we kept $64$ channels for longer than in the latent diffusion model. The diffusion model has $15.3\times 10^{6}$ parameters and the latent diffusion model $13.4\times 10^{6}$ parameters.

Appendix C Experiments

In our experiments, we train autoencoders and diffusion models with the architectures as described in Appendix B. As we train different types of models, we designed our experiments with a unified setup to achieve a fair comparison.

We use sea-ice simulation data from the state-of-the-art sea-ice model neXtSIM (Rampal et al., 2016; Ólason et al., 2022). This sea-ice model uses an advanced brittle rheology (Girard et al., 2011; Dansereau et al., 2016; Ólason et al., 2022) which introduces a damage variable to represent subgrid-scale dynamics. This way, the model can represent sea-ice dynamics at the mesoscale resolution $\simeq 10\,\text{km}$ as they are observed with satellites and buoys (Rampal et al., 2019; Ólason et al., 2021; Bouchat et al., 2022). In the here used simulations (Boutin et al., 2023), neXtSIM has been coupled to the ocean component OPA of the NEMO modelling framework (Madec, 2008). NeXtSIM runs on a Lagrangian mesh with adaptive remeshing, while OPA uses a curvilinear mesh, here in the regional CREG025 configuration (Talandier & Lique, 2021). Both models are run at a similar resolution of $\frac{1}{4}^{\circ}\simeq 12\,\text{km}$ . The output of neXtSIM is conservatively interpolated to the curvilinear OPA mesh. NeXtSIM is driven by ERA5 atmospheric forcings (Hersbach et al., 2020). Since we use unconditional generation of fields, we neglect model output from OPA and the atmospheric forcing for the training of the neural networks.

We target to generate samples for five different variables, the sea-ice thickness, the sea-ice concentration, the two sea-ice velocities in $x$ - and $y$ -direction, and the sea-ice damage, a special variable introduced for brittle rheologies which represents the subgrid-scale fragmentation of sea ice. The model output for all variables is a six-hourly average, while the damage corresponds to a six-hourly instantaneous output. In correspondence with Boutin et al. (2023), we use data from 2000–2018, omitting the first five years (1995-1999) as spin-up. The first 14 years (2000-2014, 21916 samples) are used as training dataset, 2015 as validation dataset (1460 samples), and 2016–2018 for testing (4383 samples). All datasets are normalised by a per-variable global mean and standard deviation estimated over the training dataset.

Given these datasets, we optimise all models with Adam (Kingma & Ba, 2017) with a batch size of 16. The learning rate is linearly increased from $1\times 10^{-6}$ to $2\times 10^{-4}$ within $5\times 10^{3}$ iterations. Afterwards a cosine scheduler is used to decrease the learning rate again up to $1\times 10^{-6}$ after $10^{5}$ iterations, where the training is stopped. All models are trained without weight decay and other regularisation techniques like dropout, and we select the models that achieve the lowest validation loss. We train the models with mixed precision in bfloat16 and evaluate in single precision. All models are trained on either a NVIDIA RTX5000 GPU (24 GB) or a NVIDIA RTX6000 GPU (48 GB); to train on the RTX5000, we use a batch size of 8 and accumulate the gradient of two iterations. All models are implemented in PyTorch (Paszke et al., 2019) with PyTorch lightning (Falcon et al., 2020) and Hydra (Yadan, 2019). The code is available under https://github.com/cerea-daml/ldm_nextsim.

To train the latent diffusion models (LDMs), we use the mean prediction of the pre-trained encoder as deterministic mapping from pixel into latent space. Before training of the LDMs, we normalise the latent space by a per-channel global mean and standard deviation estimated over the training dataset. Although the LDM needs much less memory than the autoencoder and pixel diffusion model during training, we stick to a batch size of $16$ for a fair comparison. All diffusion models use the same fixed log-signal-to-ratio bounds $\gamma_{\mathrm{min}}=-15$ and $\gamma_{\mathrm{max}}=15$ during training and sampling. We perform the sampling with a second-order Heun sampler in 20 integration steps and the EDM sampling noise scheduler (Karras et al., 2022), extended to our bounds by truncation (Kingma & Gao, 2023).

Both, the reconstructions and the generated data are compared to the testing dataset. For the reconstructions, we perform a one-to-one comparison, while we evaluate the statistics of our generated data. In total, we hence generate $4383$ samples with the diffusion models. Although a smaller number than the $50000$ samples normally used in validating image diffusion models, it results into a fair comparison to the testing dataset. In Appendix E.2, we demonstrate that the dataset size is big enough to see differences in the models.

Appendix D Metrics

In the following we introduce the metrics to evaluate our neural networks. To evaluate the autoencoders, we apply point-wise measures, while we quantify the quality of the generated fields in a transformed space. The reconstructed samples $\widetilde{\mathbf{x}}\in\mathbb{R}^{N\times K\times L}$ , and generated samples $\widehat{\mathbf{x}}\in\mathbb{R}^{N\times K\times L}$ are compared to the samples in the testing dataset $\mathbf{x}\in\mathbb{R}^{N\times K\times L}$ for $N=4383$ samples, $K=5$ variables, and $L=127152$ grid points without land.

D.1 Root-mean-squared error

To estimate one averaged root-mean-squared error (RMSE), we normalise the RMSE by the per-variable standard deviation $\sigma_{k}$ , globally calculated for the $k$ -th variable over the full training dataset. The normalised RMSE is then calculated as

\mathrm{RMSE}(\mathbf{x},\widetilde{\mathbf{x}})=\sqrt{\frac{1}{N\cdot K\cdot L% }\sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{l=1}^{L}\frac{(x_{n,k,l}-\widetilde{x}_{n,k% ,l})^{2}}{(\sigma_{k})^{2}}}.

(21)

The lowest (best) possible RMSE is $0$ and its maximum value is unbounded.

D.2 Structural similarity index measure

The structural similarity index measure ( $\mathrm{SSIM}$ ) takes information from a window into account and is hence a more fuzzy verification metric than the MAE. The SSIM for the $n$ -th sample, the $k$ -th variable, and the $l$ -th is given as

\mathrm{SSIM}(x_{n,k,l},y_{n,k,l})=\frac{\Big{(}2\mu^{(x)}_{n,k,l}\mu^{(y)}_{n% ,k,l}\Big{)}\Big{(}2\sigma^{(xy)}_{n,k,l}+c_{2,k}\Big{)}}{\Big{(}(\mu^{(x)}_{n% ,k,l})^{2}+(\mu^{(y)}_{n,k,l})^{2}+c_{1,k}\Big{)}\Big{(}(\sigma^{(x)}_{n,k,l})% ^{2}+(\sigma^{(y)}_{n,k,l})^{2}+c_{2,k}\Big{)}}.

(22)

The means $\mu^{(x)}_{n,k,l}$ and $\mu^{(y)}_{n,k,l}$ , variances $\sigma^{(x)}_{n,k,l}$ and $\sigma^{(y)}_{n,k,l}$ , and covariance $\sigma^{(xy)}_{n,k,l}$ between $x$ and $y$ , respectively are estimated in the given $7\times 7$ window for each sample, variable, and grid point. The stabilisation values $c_{1,k}=(0.01\cdot r_{k})^{2}$ and $c_{2}=(0.03\cdot r_{k})^{2}$ depend on the dynamic range for the $k$ variable with $r_{k}=max(\mathbf{x}_{k})-min(\mathbf{x}_{k})$ as the difference between the maximum and minimum in the training dataset.

The global score for the SSIM is the average of Eq. (22) across all samples, variables, and grid points

\mathrm{SSIM}(\mathbf{x},\widetilde{\mathbf{x}})=\frac{1}{N\cdot K\cdot L}\sum% _{n=1}^{N}\sum_{k=1}^{K}\sum_{l=1}^{L}\mathrm{SSIM}(x_{n,k,l},\widetilde{x}_{n% ,k,l}).

(23)

The highest possible SSIM is $1$ and the lowest $0$ .

D.3 Sea-ice extent accuracy

By applying diffusion models in a latent space, we can incorporate prior knowledge into the autoencoder which maps to and from the latent space. To determine the need of incorporating this knowledge into the autoencoder, we evaluate the accuracy of reconstructing where the sea ice is. While the accuracy might be unneeded for data generation, it can be a necessity for different tasks like surrogate modelling.

In neXtSIM, sea-ice processes are activated or deactivated based on the sea-ice thickness, and we estimate the sea-ice extent based on the sea-ice thickness. Choosing a small threshold value of $SIT_{\mathsf{min}}=0.01\,\text{m}$ , we classify values above this threshold as sea ice and below as no ice (similar results can be achieved by using the sea-ice concentration). Based on this classification, we estimate the average accuracy across all samples and grid points. The accuracy is the sum of the true positive and true negative divided by the total number of cases. This accuracy is the same as $1-\mathrm{IIEE}$ , where $\mathrm{IIEE}$ is the integrated ice edge error (Goessling et al., 2016) estimated based on the sea-ice thickness instead of the sea-ice concentration. The highest possible accuracy is $1$ and the lowest $0$ .

D.4 Fréchet distance in latent space

If we generate data, we have no one-to-one correspondence between samples from the testing dataset and generated samples. To evaluate the generated statistics, we employ a latent space spanned by an independently trained variational autoencoder. The variational autoencoder follows the general structure of our trained autoencoders with $\beta=1$ . As the evidence lower bound is maximised, we can expect that this variational autoencoder encodes spatial and semantic information into the latent space. Additionally, the KL-divergence $D_{KL}\Big{(}q(\mathbf{z}_{x}\mid\mathbf{x})\|\mathcal{N}\big{(}\mathbf{0},% \mathbf{I}\big{)}\Big{)}$ regularises the latent space towards an isotropic Gaussian distribution. Hence, we assume that the latent space is distributed with an isotropic Gaussian.

To emulate how image generators are validated with the Fréchet Inception distance (FID, Heusel et al. (2017)), we employ the Fréchet distance in latent space and call this Fréchet autoencoder distance ( $\mathrm{FAED}$ ). Specifically, we use the mean prediction of the encoding part $\mathrm{enc}(\mathbf{x})$ as deterministic mapping into latent space. The Fréchet autoencoder distance reads then

\mathrm{FAED}(\mathbf{x},\widetilde{\mathbf{x}})=\lVert\mu(\mathrm{enc}(% \mathbf{x}))-\mu(\mathrm{enc}(\widetilde{\mathbf{x}}))\rVert_{2}^{2}+\mathrm{% Tr}\Bigg{(}\Sigma(\mathrm{enc}(\mathbf{x}))+\Sigma(\mathrm{enc}(\widetilde{% \mathbf{x}}))-2\Big{(}\Sigma(\mathrm{enc}(\mathbf{x}))\cdot\Sigma(\mathrm{enc}% (\widetilde{\mathbf{x}}))\Big{)}^{\frac{1}{2}}\Bigg{)},

(24)

where $\mu(\cdot)$ is the point-wise sample mean, $\Sigma(\cdot)$ the sample covariance, $\mathrm{Tr}(\cdot)$ the trace of a matrix, and $(\cdot)^{\frac{1}{2}}$ the matrix square-root. The isotropic Gaussian assumption in latent space allows us to further simplify the estimated covariances matrix to variances, and the Fréchet autoencoder distance reduces to

\mathrm{FAED}(\mathbf{x},\widetilde{\mathbf{x}})=\lVert\mu(\mathrm{enc}(% \mathbf{x}))-\mu(\mathrm{enc}(\widetilde{\mathbf{x}}))\rVert_{2}^{2}+\lVert% \sigma(\mathrm{enc}(\mathbf{x}))-\sigma(\mathrm{enc}(\widetilde{\mathbf{x}}))% \rVert_{2}^{2},

(25)

with $\sigma(\cdot)$ as point-wise standard deviation. The minimum (best) value of the Fréchet autoencoder distance is $0$ while its maximum value is unbounded.

D.5 Root-mean-squared error of the sea-ice extent

To evaluate the climatological consistency of the generated samples for the sea-ice extent, we can estimate the probability $\widetilde{p}_{l}$ that the $l$ -th grid point is covered by sea ice. To classify for the $n$ -th sample if a grid point is covered, we again apply the threshold of $\mathrm{SIT}_{\mathrm{min}}=0.01\,\text{m}$ . The probability is given as

\widetilde{p}_{l}=\frac{1}{N}\sum^{N}_{n=1}\mathrm{I}(\widetilde{\mathrm{SIT}}% _{n,l}\geq\mathrm{SIT}_{\mathrm{min}})

(26)

with $\mathrm{I}(\cdot)$ as indicator function and $\widetilde{\mathrm{SIT}}_{n,l}$ the sea-ice thickness of the $n$ -th generated sample for the $l$ -th grid point.

The root-mean-squared error (RMSE) in the sea-ice extent reads then

\mathrm{RMSE}_{\mathrm{SIE}}(\mathbf{x},\widetilde{\mathbf{x}})=\sqrt{\frac{1}% {L}\sum^{L}_{l=1}(p_{l}-\widetilde{p}_{l})^{2}}

(27)

with $p_{l}$ as probability in the testing dataset. The minimum (best) RMSE is $0$ , while the worst possible RMSE is $1$ .

Appendix E Additional results

In the Appendix, we present additional results that signify the results we have found. We will show how the hyperparameters influence the autoencoder reconstruction, how censoring helps to improve the sea-ice extent representation, and how the diffusion model generalises across experiments.

E.1 Ablation for autoencoder

In our formulation for the pre-training of the autoencoder, the autoencoder has several hyperparameters. One of the them is the number of channels $N_{\text{latent}}$ in latent space. The more channels, the more information can be stored in the latent space. However, more channels can make the diffusion model more difficult to train. Another hyperparameter is the factor $\beta$ which determines the strength of the regularisation in latent space. The larger the factor, the smoother the latent space, but also the higher the semantic compression therein. Throughout the main manuscript, we have made the decision of $N_{\text{latent}}=16$ and $\beta=10^{-3}$ to strike the right balance. In Tab. 3, we show an ablation study on these factors. Different from the experiments in the main manuscript, these experiments were performed with NVIDIA A100 GPUs on the Jean-Zay supercomputing facilities, provided by GENCI.

Table 3: Evaluation of the autoencoders for reconstructing the testing dataset with the normalised root-mean-squared error (RMSE), the structural similarity index measure (SSIM), and the accuracy in the sea-ice extent (

\text{ACC}_{\text{SIE}}

N_{\text{latent}}

is the number of channels in the latent space and

\beta

is the regularisation factor. The two rows marked in bold are used in the main manuscript. Note, the numbers can be different than in Tab. 1 as these autoencoders are trained independently on another computing system.

$N_{\text{latent}}$	$\beta$		RMSE	SSIM	$\text{ACC}_{\text{SIE}}$
8	$10^{-3}$	–	0.094	0.916	0.982
16	$\mathbf{10^{-3}}$	–	0.076	0.937	0.981
32	$10^{-3}$	–	0.061	0.953	0.984
16	$10^{-2}$	–	0.077	0.935	0.971
16	$10^{-4}$	–	0.075	0.939	0.984
16	$\mathbf{10^{-3}}$	censored	0.078	0.942	0.992

With our chosen hyperparameters, the autoencoder results into a $16.4$ -fold compression, which is lower than the $20$ -fold compression from data dimensions only as the dimensionality of the masking is also reduced. Altering this compression rate by changing the number of channels in latent space has a larger impact on the RMSE and the SSIM than the regularisation factor.

In Fig. 3, we additionally show the energy spectrum for the sea-ice thickness if we increase the number of channels or lower the regularisation factor. Once again, changing the regularisation has almost no impact on the averaged spectrum, while a larger latent space can seemingly retain more small-scale features. Hence, if the regularisation is small enough, the compression rate determines how much small-scale features are retained through the latent space. To reduce the smoothing, we can reduce the compression rate. Since the smoothing is a result of double penalty effects caused by the point-wise comparison in the reconstruction loss, we however need other tools to completely remedy the smoothing. Furthermore, by having more channels in latent space, the latent diffusion model can become more difficult to optimise (Rombach et al., 2022).

E.2 Influence of the generated dataset size on the diffusion scores

To evaluate our diffusion models, we use $4383$ samples, while the recommendation to evaluate generative models for images is $50000$ samples. In Table 4, we compare the performance of the same latent diffusion model without censoring if we alter the random seed for the generation of the dataset.

Table 4: Evaluation of the statistics for samples generated by the latent diffusion model without censoring compared to the testing dataset with the same size.

Seed	$\mathrm{FAED}$ $\downarrow$	$\text{RMSE}_{\text{SIE}}$ $\downarrow$
0	1.65	11.56
10	1.66	11.79
11	1.68	11.39
12	1.66	11.76
13	1.64	11.36

While there are some small differences between different seeds, these differences are smaller than the differences shown in Tab. 2. Based on this analysis, we can conclude that the dataset size is large enough to evaluate differences between diffusion models with the proposed scores.

Diffusion models tend to have a high volatility in the scores during training (Song & Ermon, 2020). Normally used to stabilise the diffusion model, we restrain from applying exponential moving averages to estimate the scores of the diffusion model. On the one hand, this allows us to establish the performance of diffusion models without such tricks. On the other hand, the differences between the models for the $\mathrm{FAED}$ in Tab. 2 might be due to the high volatility.

E.3 Clipping of the generative diffusion model without censored distributions

To avoid unphysical samples during generation, the values from the denoiser are commonly clipped into either fixed (e.g., Ho et al., 2020) or dynamical bounds (Saharia et al., 2022b). However, as shown in Appendix A.1, generative diffusion is built around the assumption of unbounded Gaussian diffusion, and the clipping would violate the Gaussian assumption, which has been applied during training. This could introduce biases into the generation.

In Table 5, we show the results of such a clipped diffusion model in addition to the results of the diffusion models as presented in the main manuscript. For this clipped experiment, we use the pre-trained data-space diffusion model and clip the output of its denoiser into the physical bounds ( $\text{SIT}\in[0,\infty)$ , $\text{SIC}\in[0,1]$ , and $\text{SID}\in[0,1]$ ). Additionally, we show what happens if we clip the output of the latent diffusion model (LDM) trained without the censored distribution.

Table 5: Comparison of the diffusion models tested in Sec. 3 with diffusion model where the output of the denoiser is clipped into physical bounds or a latent diffusion model where the output of the autoencoder is clipped without a censored distribution.

Model	$\mathrm{FAED}$ $\downarrow$	$\text{RMSE}_{\text{SIE}}$ $\downarrow$
Validation	2.38	7.02
Diffusion	1.73	9.66
Diffusion (denoiser clipped)	1.76	12.82
LDM	1.65	11.56
LDM (censored)	1.79	7.32
LDM (output clipped)	1.81	11.04

Including a clipping operator into the denoiser that defines the diffusion model in data space has a negative impact on the physical representation of the sea ice, the RMSE in the sea-ice extent is increased compared to the unclipped diffusion model. While this technique is efficient in image generation, it introduces a bias into the denoiser, which seemingly hurts its performance in our sea-ice generator. Because of biasing the diffusion model, we suspect that clipping introduces a degradation in the cross-correlation between variables, which can have a higher impact on geophysical applications than on image generation.

Clipping the output of a LDM trained with a Gaussian distribution has a neutral impact on the physical representation. Since the model is trained without accounting for clipping during training, the model still produces spurious values for target values where the bounds are reached, only values exceeding the bounds are clipped. For the targets that lay on the bounds, the censored distribution, introduced in Appendix A.2, pushes the output of the LDM towards an extreme exceedance of the bounds, increasing the probability that the values is clipped into the bounds. Hence, censored distributions can unlock an effective incorporation of physical bounds into the training of deep learning algorithms.

This results suggests that we need special methods to properly treat physical bounds in diffusion models that work in data space. We can define a log-barrier that pushes the diffusion process away from the bounds (Fishman et al., 2023). Additionally, we can define a reflected diffusion process, where the Brownian motion is reflected at the bounds (Lou & Ermon, 2023). However, with both methods, we would never generate values that exactly on the bounds. If the result from the diffusion model is further used in downstream applications, e.g., by cycling in a surrogate model (Price et al., 2024; Finn et al., 2024), such small errors can amplify and make the pipeline unstable. Contrastingly, censoring of the autoencoder provides a clean and systematic way to incorporate bounds so that the generated data can lie exactly on the bounds.

E.4 Censoring

In Table 2, we show that censoring helps to improve the representation of the sea-ice extent and reduces the RMSE of the sea-ice extent. Here, we disentangle a bit how we obtain this improvement.

In Fig. 4, we show the binarized probability that a grid point is covered by sea ice in the generated dataset dependent on the probability in the testing dataset. The optimal probability would be on the one-to-one line, reaching for each grid point the same probability in the generated dataset.

In the validation dataset, the probabilities are slightly overestimated. The difference is likely caused by the different dataset size (one year in validation and three years in testing). As for the validation dataset, the diffusion model in data space and the latent diffusion model both overestimate the probabilities. Since this overestimation is larger than for the validation dataset, it is likely because of the failure to represent the sea-ice extent correctly. In contrast to the other diffusion models, the diffusion model with censoring tends to underestimate the probabilities. The RMSE in Tab. 2 is reduced as the underestimation for the censored model is smaller than the overestimation for the other diffusion models. Consequently, we have established here that censoring helps to alleviate the overestimation bias of the diffusion models. Its further use for other downstream tasks remains to be seen.

E.5 Uncurated data samples

In Fig. 5, we show uncurated samples from our diffusion models and reference samples from neXtSIM. As discussed in the main manuscript and shown in Fig. 1, the latent diffusion models result into smoothed fields compared to the diffusion model in data space and neXtSIM.

While the derived deformation field of the diffusion model (d) shows small-scale structures as they can be seen in neXtSIM (p), the thickness and concentration remain too noisy. One possible explanation could be that the velocities are easier to generate as they are continuous, whereas the sea-ice thickness can be rather represented by discrete-continuous behaviour. Since we used a rather low number of integration steps (20) without much tuning, we could expect that such fine-scale structure can be better represented with a better tuned sampling and/or with a diffusion model trained on more data.

Appendix F Changes to the Version 1

•

We added the comparison in speed to Table 2 indicating that latent diffusion models are indeed much faster than diffusion models in data space.
•

We strengthened the proof-of-concept character of the study, indicating that we see this work as a first step towards the training of generative diffusion for e.g., Arctic-wide surrogate models.
•

We added Appendix E.3 with additional results if we add clipping to the denoiser in data space. While in image generation, this clipping is very important, we find a negative impact on the physical representation in the diffusion model. We additionally provide insides why a censored distribution helps to improve the physical representation.
•

We added the GitHub link to the code.