Towards diffusion models for large-scale sea-ice modelling

Tobias Sebastian Finn    Charlotte Durand    Alban Farchi    Marc Bocquet    Julien Brajard
Abstract

We make the first steps towards diffusion models for unconditional generation of multivariate and Arctic-wide sea-ice states. While targeting to reduce the computational costs by diffusion in latent space, latent diffusion models also offer the possibility to integrate physical knowledge into the generation process. We tailor latent diffusion models to sea-ice physics with a censored Gaussian distribution in data space to generate data that follows the physical bounds of the modelled variables. Our latent diffusion models reach similar scores as the diffusion model trained in data space, but they smooth the generated fields as caused by the latent mapping. While enforcing physical bounds cannot reduce the smoothing, it improves the representation of the marginal ice zone. Therefore, for large-scale Earth system modelling, latent diffusion models can have many advantages compared to diffusion in data space if the significant barrier of smoothing can be resolved.

Generative diffusion, Earth system modelling, sea-ice simulations

1 Introduction

Generative diffusion (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) has revolutionised data generation in computer vision (Dhariwal & Nichol, 2021; Saharia et al., 2022a) and other domains (Bar-Tal et al., 2024; Huang et al., 2023). Initial applications to geophysical systems show promise for prediction (Leinonen et al., 2023; Li et al., 2023; Price et al., 2024; Finn et al., 2024) and downscaling (Mardani et al., 2023; Wan et al., 2023; Brajard et al., 2024). To cut costs, data can be mapped into a latent space using a pre-trained autoencoder, where diffusion models can be learned (Sinha et al., 2021; Vahdat et al., 2021; Rombach et al., 2022). However, the mapping can reduce the model’s ability to generate fine-grained data (Dai et al., 2023).

Here, we examine how well latent diffusion models (LDMs) can generate geophysical data compared to diffusion directly applied in data space. As a proof-of-concept, we unconditionally generate Arctic-wide sea-ice states as simulated with the state-of-the-art sea-ice model neXtSIM (Rampal et al., 2016; Ólason et al., 2022) at a 14superscript14\frac{1}{4}^{\circ}divide start_ARG 1 end_ARG start_ARG 4 end_ARG start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT resolution (Boutin et al., 2023). Sea ice, with its discrete elements, anisotropy, and multifractality, poses a challenging problem for generative models and serves as ideal candidate for evaluating LDMs. We show that while the general structure remains the same, LDMs produces smoother solutions than diffusion models in data space, as presented in Fig. 1.

Refer to caption


Figure 1: Samples generated with a diffusion model in data space (a–d) and a latent diffusion model with censoring (e–h). The thickness (a, e) and concentration (b, f) are directly generated, while the speed (c, g) and deformation (d, h) are derived from the velocities. Note, for visualisation purpose, only the central Arctic is shown while the whole Arctic is modelled. The remaining noise might be caused by using only 20 integration steps with a Heun sampler.

Although generative diffusion is fundamentally defined by its Gaussian diffusion process, we can integrate prior physical information into the encoder and decoder of an LDM, as similarly done in Shmakov et al. (2023). Hence, we tailor the approach of LDMs to Earth system modelling by incorporating physical bounds into the autoencoder. To bound the variables, we replace the common Gaussian reconstruction loss (mean-squared error) by the log-likelihood of censored Gaussian distributions, which combines regression and classification, enabling us to explicitly account for the bounds of the output during reconstruction. Such censored distributions improve the representation of the sea-ice extent and thereby enhance the physical consistency of the output, as demonstrated in this study.

2 Methods

In the following, we introduce a brief overview on LDMs. Specifically, we focus on how to make use of prior physical knowledge into such models. For a more detailed treatment, we refer to Appendix A.

Our goal is to train neural networks (NNs) to generate state samples 𝐱^5×512×512^𝐱superscript5512512{\widehat{\mathbf{x}}\in\mathbb{R}^{5\times 512\times 512}}over^ start_ARG bold_x end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT 5 × 512 × 512 end_POSTSUPERSCRIPT that should look like samples drawn from the here-used sea-ice simulation 𝐱𝒟similar-to𝐱𝒟\mathbf{x}\sim\mathcal{D}bold_x ∼ caligraphic_D; modelled are the thickness (SIT), concentration (SIC), damage (SID), and the velocities in x𝑥xitalic_x- (SIU) and y𝑦yitalic_y-direction (SIV). For the generation, we employ a diffusion model that works in a lower-dimensional latent space 𝐳x16×64×64subscript𝐳𝑥superscript166464{\mathbf{z}_{x}\in\mathbb{R}^{16\times 64\times 64}}bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 16 × 64 × 64 end_POSTSUPERSCRIPT as spanned by an autoencoder. The encoder enc(𝐱)enc𝐱\mathrm{enc}(\mathbf{x})roman_enc ( bold_x ) maps from data space into latent space, while the decoder dec(𝐳x)decsubscript𝐳𝑥\mathrm{dec}(\mathbf{z}_{x})roman_dec ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) maps back into data space. The encoder and decoder are implemented as fully convolutional NNs (CNNs) with their weights and biases 𝜽𝜽\bm{\theta}bold_italic_θ and further described in Appendix B.2.

The autoencoder should reduce the data dimensionality without compressing the semantic (Rombach et al., 2022). We nevertheless want to achieve a continuous latent space, from which the decoder can map similar values to similar data points. Consequently, we apply a variational autoencoder (Kingma & Welling, 2013), where we downweight the Kullback-Leibler divergence by β=0.001𝛽0.001\beta=0.001italic_β = 0.001 (Higgins et al., 2016). Given a data sample 𝐱𝐱\mathbf{x}bold_x, the loss function for the autoencoder then reads,

(𝐱,𝜽)=𝐱𝜽absent\displaystyle\mathcal{L}(\mathbf{x},\bm{\theta})=caligraphic_L ( bold_x , bold_italic_θ ) = 𝔼q(𝐳x𝐱,𝜽)[log(p(𝐱𝐳x,𝜽))]subscript𝔼𝑞conditionalsubscript𝐳𝑥𝐱𝜽delimited-[]𝑝conditional𝐱subscript𝐳𝑥𝜽\displaystyle-\mathbb{E}_{q(\mathbf{z}_{x}\mid\mathbf{x},\bm{\theta})}\Big{[}% \log\big{(}p(\mathbf{x}\mid\mathbf{z}_{x},\bm{\theta})\big{)}\Big{]}- blackboard_E start_POSTSUBSCRIPT italic_q ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ∣ bold_x , bold_italic_θ ) end_POSTSUBSCRIPT [ roman_log ( italic_p ( bold_x ∣ bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , bold_italic_θ ) ) ] (1a)
+βDKL(q(𝐳x𝐱,𝜽)p(𝐳x)),𝛽subscript𝐷𝐾𝐿conditional𝑞conditionalsubscript𝐳𝑥𝐱𝜽𝑝subscript𝐳𝑥\displaystyle+\beta D_{KL}\Big{(}q(\mathbf{z}_{x}\mid\mathbf{x},\bm{\theta})\|% p(\mathbf{z}_{x})\Big{)},+ italic_β italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_q ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ∣ bold_x , bold_italic_θ ) ∥ italic_p ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) ) , (1b)

which is minimised by averaging the loss across a mini-batch of samples and using a variant of gradient descent.

The encoder maps a data sample into the mean and standard deviation of an assumed Gaussian distribution in latent space, q(𝐳x𝐱,𝜽)=𝒩(μ𝖾𝗇𝖼,𝜽(𝐱),(σ𝖾𝗇𝖼,𝜽(𝐱))2𝐈)𝑞conditionalsubscript𝐳𝑥𝐱𝜽𝒩subscript𝜇𝖾𝗇𝖼𝜽𝐱superscriptsubscript𝜎𝖾𝗇𝖼𝜽𝐱2𝐈q(\mathbf{z}_{x}\mid\mathbf{x},\bm{\theta})=\mathcal{N}\left(\mu_{\mathsf{enc}% ,\bm{\theta}}(\mathbf{x}),(\sigma_{\mathsf{enc},\bm{\theta}}(\mathbf{x}))^{2}% \mathbf{I}\right)italic_q ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ∣ bold_x , bold_italic_θ ) = caligraphic_N ( italic_μ start_POSTSUBSCRIPT sansserif_enc , bold_italic_θ end_POSTSUBSCRIPT ( bold_x ) , ( italic_σ start_POSTSUBSCRIPT sansserif_enc , bold_italic_θ end_POSTSUBSCRIPT ( bold_x ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I ), while we assume a prior Gaussian distribution with mean 𝟎0\bm{0}bold_0 and a diagonal covariance 𝐈𝐈\mathbf{I}bold_I, p(𝐳x)=𝒩(𝟎,𝐈)𝑝subscript𝐳𝑥𝒩0𝐈p(\mathbf{z}_{x})=\mathcal{N}\left(\bm{0},\mathbf{I}\right)italic_p ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) = caligraphic_N ( bold_0 , bold_I ).

The reconstruction loss in Eq. (1a) corresponds to the negative log-likelihood (NLL) of the data sample given an assumed distribution and includes the decoder, which maps from latent space to a Gaussian distribution, from where we can sample the decoder output 𝐲^^𝐲\widehat{\mathbf{y}}over^ start_ARG bold_y end_ARG,

𝐲^𝒩(dec𝜽(𝐳x),𝐬2𝐈).similar-to^𝐲𝒩𝑑𝑒subscript𝑐𝜽subscript𝐳𝑥superscript𝐬2𝐈\widehat{\mathbf{y}}\sim\mathcal{N}({dec}_{\bm{\theta}}(\mathbf{z}_{x}),% \mathbf{s}^{2}\mathbf{I}).over^ start_ARG bold_y end_ARG ∼ caligraphic_N ( italic_d italic_e italic_c start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) , bold_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I ) . (2)

The standard deviation 𝐬𝐬\mathbf{s}bold_s is spatially shared with one value per output variable and learned alongside the autoencoder (Cipolla et al., 2018; Rybkin et al., 2020; Finn et al., 2023). From such a sample 𝐲^^𝐲\widehat{\mathbf{y}}over^ start_ARG bold_y end_ARG, we can recover the reconstructed data sample 𝐱^^𝐱\widehat{\mathbf{x}}over^ start_ARG bold_x end_ARG by applying a deterministic (possibly non-invertible) transformation function 𝐱^=f(𝐲^)^𝐱𝑓^𝐲\widehat{\mathbf{x}}=f(\mathbf{\widehat{\mathbf{y}}})over^ start_ARG bold_x end_ARG = italic_f ( over^ start_ARG bold_y end_ARG ). Depending on the transformation function, we obtain different reconstruction losses:

  • If we apply an identity function f(𝐲^)=𝐲^𝑓^𝐲^𝐲f(\mathbf{\widehat{\mathbf{y}}})=\mathbf{\widehat{\mathbf{y}}}italic_f ( over^ start_ARG bold_y end_ARG ) = over^ start_ARG bold_y end_ARG, the reconstruction loss is the NLL of a Gaussian distribution and includes a mean-squared error weighted by 𝐬𝐬\mathbf{s}bold_s.

  • If we clip the output into physical bounds, e.g. f(𝐲^)=min(max(𝐲^,𝐱𝖫),𝐱𝖴)𝑓^𝐲minmax^𝐲subscript𝐱𝖫subscript𝐱𝖴f(\mathbf{\widehat{\mathbf{y}}})=\mathrm{min}(\mathrm{max}(\mathbf{\widehat{% \mathbf{y}}},\mathbf{x}_{\mathsf{L}}),\mathbf{x}_{\mathsf{U}})italic_f ( over^ start_ARG bold_y end_ARG ) = roman_min ( roman_max ( over^ start_ARG bold_y end_ARG , bold_x start_POSTSUBSCRIPT sansserif_L end_POSTSUBSCRIPT ) , bold_x start_POSTSUBSCRIPT sansserif_U end_POSTSUBSCRIPT ) by specifying a lower 𝐱𝖫subscript𝐱𝖫\mathbf{x}_{\mathsf{L}}bold_x start_POSTSUBSCRIPT sansserif_L end_POSTSUBSCRIPT and/or upper bound 𝐱𝖴subscript𝐱𝖴\mathbf{x}_{\mathsf{U}}bold_x start_POSTSUBSCRIPT sansserif_U end_POSTSUBSCRIPT, the reconstruction loss is the NLL of a censored Gaussian distribution. The censored distribution utilises the decoder output, as described in Eq. (2), to regress the reconstruction and to determine whether it will be clipped into the bounds. Hence, this NLL combines a regression with a classification task and can be seen as NLL of a type-I Tobit model (Tobin, 1958; Amemiya, 1984).

For a detailed treatment of the reconstruction loss, we refer to Appendix A.2.

After training the autoencoder with Eq. (1), we use the mean prediction of the encoder, μ𝖾𝗇𝖼,𝜽(𝐱)subscript𝜇𝖾𝗇𝖼𝜽𝐱\mu_{\mathsf{enc},\bm{\theta}}(\mathbf{x})italic_μ start_POSTSUBSCRIPT sansserif_enc , bold_italic_θ end_POSTSUBSCRIPT ( bold_x ), as deterministic mapping from data space into latent space and train a diffusion model in this space. The instantiated diffusion model (Sohl-Dickstein et al., 2015; Ho et al., 2020) can be seen as denoiser Dϕ(𝐳τ,τ)subscript𝐷bold-italic-ϕsubscript𝐳𝜏𝜏D_{\bm{\phi}}(\mathbf{z}_{\tau},\tau)italic_D start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) with its weights and biases ϕbold-italic-ϕ\bm{\phi}bold_italic_ϕ (Karras et al., 2022), which takes a noised latent state 𝐳τsubscript𝐳𝜏\mathbf{z}_{\tau}bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT at a pseudo-time τ[0,1]𝜏01\tau\in[0,1]italic_τ ∈ [ 0 , 1 ] and should output the cleaned latent state 𝐳xsubscript𝐳𝑥\mathbf{z}_{x}bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT. During training, the noised latent state is produced by a variance-preserving diffusion process (Kingma et al., 2021; Song et al., 2021), under the assumption that the latent sample 𝐳xsubscript𝐳𝑥\mathbf{z}_{x}bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT has been normalised to mean 00 and standard deviation 1111,

𝐳τsubscript𝐳𝜏\displaystyle\mathbf{z}_{\tau}bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT =ατ𝐳x+στϵ,absentsubscript𝛼𝜏subscript𝐳𝑥subscript𝜎𝜏bold-italic-ϵ\displaystyle=\alpha_{\tau}\mathbf{z}_{x}+\sigma_{\tau}\bm{\epsilon},= italic_α start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT + italic_σ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT bold_italic_ϵ , withϵ𝒩(𝟎,𝐈),similar-towithbold-italic-ϵ𝒩0𝐈\displaystyle\text{with}~{}\bm{\epsilon}\sim\mathcal{N}(\bm{0},\mathbf{I}),with bold_italic_ϵ ∼ caligraphic_N ( bold_0 , bold_I ) , (3)

where the signal ατsubscript𝛼𝜏\alpha_{\tau}italic_α start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT and noise στsubscript𝜎𝜏\sigma_{\tau}italic_σ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT amplitudes are given in terms of logarithmic signal-to-noise ratio γ(τ)=log(ατ2στ2)𝛾𝜏superscriptsubscript𝛼𝜏2superscriptsubscript𝜎𝜏2\gamma(\tau)=\log\Big{(}\frac{\alpha_{\tau}^{2}}{\sigma_{\tau}^{2}}\Big{)}italic_γ ( italic_τ ) = roman_log ( divide start_ARG italic_α start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ). The training noise scheduler specifies the dependency of the ratio on the pseudo-time and is adapted during the training process (Kingma & Gao, 2023) as further described in Appendix A.3.

Given a latent sample 𝐳xsubscript𝐳𝑥\mathbf{z}_{x}bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT and pseudo-time τ𝜏\tauitalic_τ, the loss function to train the diffusion model reads

(𝐳x,τ,ϕ)=w(γ)(dγdτ)eγ𝐳xDϕ(𝐳τ,τ)22,subscript𝐳𝑥𝜏bold-italic-ϕ𝑤𝛾𝑑𝛾𝑑𝜏superscript𝑒𝛾superscriptsubscriptdelimited-∥∥subscript𝐳𝑥subscript𝐷bold-italic-ϕsubscript𝐳𝜏𝜏22\displaystyle\mathcal{L}(\mathbf{z}_{x},\tau,\bm{\phi})=w(\gamma)\Big{(}-\frac% {d\gamma}{d\tau}\Big{)}e^{\gamma}\lVert\mathbf{z}_{x}-D_{\bm{\phi}}(\mathbf{z}% _{\tau},\tau)\rVert_{2}^{2},caligraphic_L ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_τ , bold_italic_ϕ ) = italic_w ( italic_γ ) ( - divide start_ARG italic_d italic_γ end_ARG start_ARG italic_d italic_τ end_ARG ) italic_e start_POSTSUPERSCRIPT italic_γ end_POSTSUPERSCRIPT ∥ bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - italic_D start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (4)

with w(γ)=exp(γ2)𝑤𝛾𝛾2w(\gamma)=\exp{(-\frac{\gamma}{2})}italic_w ( italic_γ ) = roman_exp ( - divide start_ARG italic_γ end_ARG start_ARG 2 end_ARG ) as external weighting function (Kingma & Gao, 2023). The diffusion model is parameterised with a v𝑣vitalic_v-prediction (Salimans & Ho, 2022) and implemented as UViT architecture (Hoogeboom et al., 2023) combining a U-Net (Ronneberger et al., 2015) with a vision transformer (Dosovitskiy et al., 2021). The minimum value of the noise scheduler is set to γ𝗆𝗂𝗇=15subscript𝛾𝗆𝗂𝗇15\gamma_{\mathsf{min}}=-15italic_γ start_POSTSUBSCRIPT sansserif_min end_POSTSUBSCRIPT = - 15 and the maximum value to γ𝗆𝖺𝗑=15subscript𝛾𝗆𝖺𝗑15\gamma_{\mathsf{max}}=15italic_γ start_POSTSUBSCRIPT sansserif_max end_POSTSUBSCRIPT = 15. We further describe the diffusion model in Appendix A.1 and its architecture in Appendix B.3.

To generate new data samples with the trained models, we draw a sample 𝐳^1𝒩(𝟎,𝐈)similar-tosubscript^𝐳1𝒩0𝐈\widehat{\mathbf{z}}_{1}\sim\mathcal{N}(\bm{0},\mathbf{I})over^ start_ARG bold_z end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_0 , bold_I ), and integrate an ordinary differential equation (Song et al., 2021) with our trained diffusion model Dϕ(𝐳^τ,τ)subscript𝐷bold-italic-ϕsubscript^𝐳𝜏𝜏D_{\bm{\phi}}(\widehat{\mathbf{z}}_{\tau},\tau)italic_D start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( over^ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) in 20202020 integration steps. We use the second-order Heun integrator and sampling noise scheduler from Karras et al. (2022) to obtain a latent sample 𝐳^xsubscript^𝐳𝑥\widehat{\mathbf{z}}_{x}over^ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT. We then apply the decoder to reconstruct a sample in data space 𝐱^^𝐱\widehat{\mathbf{x}}over^ start_ARG bold_x end_ARG. If the decoder is trained with censoring, we clip the SIT[0,)SIT0\text{SIT}\in[0,\infty)SIT ∈ [ 0 , ∞ ), SIC[0,1]SIC01\text{SIC}\in[0,1]SIC ∈ [ 0 , 1 ], and SID[0,1]SID01\text{SID}\in[0,1]SID ∈ [ 0 , 1 ] to their physical bounds.

3 Results

In our experiments, we evaluate how LDMs perform compared to generative diffusion trained in data space and how the censoring of the decoder improves the physical consistency of the generated samples. We train the models on simulations from the state-of-the-art Lagrangian sea-ice model neXtSIM (Rampal et al., 2016; Ólason et al., 2022) which has been coupled to the ocean component of the NEMO framework (Madec, 2008). The simulation data contains more than 20 simulation years (Boutin et al., 2023) with a six-hourly output on a 14superscript14\frac{1}{4}^{\circ}divide start_ARG 1 end_ARG start_ARG 4 end_ARG start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT (12kmsimilar-to-or-equalsabsent12km\simeq 12\,\text{km}≃ 12 km in the Arctic) curvilinear grid. Omitting the first five years as spin-up, we train the models on data from 2000–2014, validate on 2015, and estimate the here-presented test scores on 2016–2018. All models are trained with ADAM (Kingma & Ba, 2017), a batch size of 16161616, and a learning rate of 2×1042superscript1042\times 10^{-4}2 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT for 105superscript10510^{5}10 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT iterations. For further experimental details, we refer to Appendix C.

Table 1: Evaluation of the autoencoders for reconstructing the testing dataset with the normalised root-mean-squared error (RMSE), the structural similarity index measure (SSIM), and the accuracy in the sea-ice extent (ACCSIEsubscriptACCSIE\text{ACC}_{\text{SIE}}ACC start_POSTSUBSCRIPT SIE end_POSTSUBSCRIPT). β𝛽\betaitalic_β is the regularisation factor.
β𝛽\betaitalic_β RMSE \downarrow SSIM \uparrow ACCSIEsubscriptACCSIE\text{ACC}_{\text{SIE}}ACC start_POSTSUBSCRIPT SIE end_POSTSUBSCRIPT \uparrow
1111 0.123 0.878 0.964
103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT 0.076 0.939 0.986
103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT (censored) 0.078 0.941 0.991

The autoencoders compress the input data from 5×512×51255125125\times 512\times 5125 × 512 × 512 to 16×64×6416646416\times 64\times 6416 × 64 × 64. We ablate design choices and evaluate the autoencoder performance of the deterministic mapping as used for diffusion in terms of reconstruction quality with the mean squared error (MSE), the structural similarity index measure (SSIM), and the accuracy in the sea-ice extent (ACCSIEsubscriptACCSIE\text{ACC}_{\text{SIE}}ACC start_POSTSUBSCRIPT SIE end_POSTSUBSCRIPT). The definition of the metrics and their calculations are explained in Appendix D. The results are shown in Table 1.

Since its regularisation is much lower, the VAE trained with β=103𝛽superscript103\beta=10^{-3}italic_β = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT reconstitutes into a better reconstruction than the VAE with β=1𝛽1\beta=1italic_β = 1, resulting into lower errors and higher similarities. Replacing the Gaussian log-likelihood in the reconstruction loss by a censored log-likelihood has a neutral impact on the RMSE and SSIM. However, the VAE with censoring can better reconstruct the marginal ice zone, which leads to a higher accuracy in the sea-ice extent. While higher accuracies therein might have only a small impact for generating new data, they might have a higher impact for other tasks where the output would be reused and were small errors can amplify, e.g., in surrogate modelling.

Refer to caption

Figure 2: Estimated spectral density of the sea-ice thickness in the central Arctic for the testing dataset (black), an autoencoder (β=103𝛽superscript103\beta=10^{-3}italic_β = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT, no censoring, light blue), the LDM learned in the latent space of the autoencoder (dark blue), and the diffusion model directly learned in data space (red).

On top of the pre-trained autoencoder, we train LDMs as described in Section 2. We compare these LDMs to a diffusion model in data space, trained with Eq. (4) by setting 𝐳x=𝐱subscript𝐳𝑥𝐱\mathbf{z}_{x}=\mathbf{x}bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = bold_x. As shown in Fig. 1, LDMs smooth the generated fields compared to diffusion in data space. In the following, we establish how and why this smoothing happens by showing estimated spectra for the sea-ice thickness in Fig. 2.

LDMs lose small-scale information compared to the testing dataset, while diffusion models in data space retain information across all scales. This loss of small-scale information results into a smoothing in the generated samples as shown in Fig. 1. Since we find the same for the autoencoder, this smoothing is a result of the pre-training as variational autoencoder and the data compression in latent space.

The two diffusion models additionally experience a shift towards larger energies than the test dataset. This shift might be caused by a discrepancy between training (2000–2014) and testing (2016–2018) or by the approximations during sampling.

Table 2: Evaluation of the statistics for samples generated by the diffusion models compared to the testing dataset with the same size. The Fréchet distance (FAEDFAED\mathrm{FAED}roman_FAED) estimates how well the structures align while the RMSE in the sea-ice extent (RMSESIEsubscriptRMSESIE\text{RMSE}_{\text{SIE}}RMSE start_POSTSUBSCRIPT SIE end_POSTSUBSCRIPT) shows how well ice edges are represented, both are multiplied for visualisation by 102superscript10210^{2}10 start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Diffusion corresponds to diffusion in data space, while LDM are the latent diffusion models. The validation dataset has only 1/3 of the testing dataset size.
Model FAEDFAED\mathrm{FAED}roman_FAED \downarrow RMSESIEsubscriptRMSESIE\text{RMSE}_{\text{SIE}}RMSE start_POSTSUBSCRIPT SIE end_POSTSUBSCRIPT \downarrow Speed (\units\unit𝑠\unit{s}italic_s) \downarrow
Validation 2.38 7.02
Diffusion 1.73 9.66 1.764
LDM 1.65 11.56 0.069
LDM (censored) 1.79 7.32 0.070

Inspired by the Fréchet inception distance (Heusel et al., 2017), we evaluate the diffusion models in comparison to the testing dataset with the Fréchet distance in the latent space of an independently trained VAE (FAEDFAED\mathrm{FAED}roman_FAED). We additionally compare the probability that a grid point is covered by sea ice with the root-mean-squared error RMSESIEsubscriptRMSESIE\text{RMSE}_{\text{SIE}}RMSE start_POSTSUBSCRIPT SIE end_POSTSUBSCRIPT, with results shown in Table 2. Whereas we evaluate on 4383438343834383 generated samples only, the results are robust to the dataset size as demonstrated in Appendix E.2. Additionally, we measure the speed (latency) of the diffusion models in seconds for a single generation.

The diffusion models consistently outperform the validation dataset for the FAEDFAED\mathrm{FAED}roman_FAED. Although the validation FAEDFAED\mathrm{FAED}roman_FAED is limited by its dataset size, this nevertheless indicates that diffusion models effectively capture large-scale structures and correlations, as observed in the testing dataset. The differences in the FAEDFAED\mathrm{FAED}roman_FAED between the diffusion models might be caused by the volatility during training.

Diffusion models are trained with a Gaussian diffusion process, and the data space model struggles to accurately represent the sea-ice extent, leading to a higher RMSE than for the validation dataset. The smoothing in the LDM exacerbates the inaccuracies in representing sea-ice extent, resulting in even higher RMSE values. However, as for the reconstruction, censoring improves the sea-ice extent representation, bringing the RMSE to the realm of the validation dataset. In addition to an improved representation, the latent diffusion models are 25×25\times25 × faster than the diffusion model in data space.

4 Conclusions

In this manuscript, we make the first step towards using latent diffusion models (LDMs) for generating multivariate, Arctic-wide sea-ice data. We compare LDMs with diffusion models in data space and find that LDMs lose small-scale information compared to the target fields, though their scores are similar to those of diffusion models in data space. This loss of information is due to the latent mapping process and its pre-training as variational autoencoder. To mitigate this, we could increase the number of channels in latent space (see Appendix E.1). Another approach would be to pre-train the autoencoder with an additional adversarial (Rombach et al., 2022) or spectral loss (Kochkov et al., 2023). These methods to try to mitigate the smoothing could complicate (e.g., Rombach et al., 2022) and notoriously destabilise (e.g., Karras et al., 2018) the training and tuning process.

Despite the loss of small-scale information, LDMs offer a significant advantage over diffusion models in data space: the ability to enforce physical bounds during data generation and their speed. By replacing the Gaussian data log-likelihood by a censored log-likelihood, we explicitly incorporate the clipping of the decoder output into the training process. The censoring has a neutral impact on the sample quality but leads to a more accurate representation of the sea-ice extent, indicating an increased physical consistency. This increased consistency could benefit other tasks as surrogate modelling or downscaling.

While we have tested LDMs only for sea-ice data, we anticipate similar smoothing issues for other Earth system components. Therefore, the smoothing issue remains a significant barrier to the broader application of the otherwise promising LDMs for large-scale Earth system modelling.

Acknowledgments

This manuscript is a contribution to the DeepGeneSIS project as financed by INSU/CNRS in the PNTS program and the SASIP project, supported under Grant no. 353 by Schmidt Science, LLC – a philanthropic initiative that seeks to improve societal outcomes through the development of emerging science and technologies. TSF, CD, AF, MB additionally acknowledge financial support by INSU/CNRS for the project GenD2M in the LEFE-MANU program. This work was granted access to the HPC resources of IDRIS under the allocations 2023-AD011013069R2 made by GENCI. We would like to thank Guillaume Boutin for providing access to the data and Laurent Bertino for helping to obtain the funding for the DeepGeneSIS project. With their reviews, four anonymous referees have improved the manuscript. CEREA is a member of the Institut Pierre-Simon Laplace (IPSL).

Impact Statement

We present here work whose goal is to advance the field of Earth system modelling and machine learning. While there are many potential societal consequences of our work, we will briefly discuss the impact on the energy consumption only.

Training neural networks consumes a lot of energy and in the future, the energy consumption of machine learning will rather increase than decrease, especially with the rise of generative models like diffusion models discussed here. Finding methods to reduce the computational costs of those models can be key to also reduce their energy consumption. While latent diffusion models might have the limitation of smoothing compared to diffusion models in data space, they offer an opportunity to decrease the costs and, thus, also the energy consumption. This might be especially helpful if we further strive towards larger and larger deep learning architectures, in the end possibly for coupled Earth system models.

References

  • Amemiya (1984) Amemiya, T. Tobit models: A survey. Journal of Econometrics, 24(1):3–61, January 1984. ISSN 0304-4076. doi: 10.1016/0304-4076(84)90074-5.
  • Anderson (1982) Anderson, B. D. O. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, May 1982. ISSN 0304-4149. doi: 10.1016/0304-4149(82)90051-5.
  • Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer Normalization. arXiv:1607.06450 [cs, stat], July 2016. arXiv: 1607.06450 tex.ids= ba2016a.
  • Bar-Tal et al. (2024) Bar-Tal, O., Chefer, H., Tov, O., Herrmann, C., Paiss, R., Zada, S., Ephrat, A., Hur, J., Liu, G., Raj, A., Li, Y., Rubinstein, M., Michaeli, T., Wang, O., Sun, D., Dekel, T., and Mosseri, I. Lumiere: A Space-Time Diffusion Model for Video Generation, February 2024. arXiv:2401.12945 [cs].
  • Bouchat et al. (2022) Bouchat, A., Hutter, N., Chanut, J., Dupont, F., Dukhovskoy, D., Garric, G., Lee, Y. J., Lemieux, J.-F., Lique, C., Losch, M., Maslowski, W., Myers, P. G., Ólason, E., Rampal, P., Rasmussen, T., Talandier, C., Tremblay, B., and Wang, Q. Sea Ice Rheology Experiment (SIREx): 1. Scaling and Statistical Properties of Sea-Ice Deformation Fields. Journal of Geophysical Research: Oceans, 127(4):e2021JC017667, 2022. ISSN 2169-9291. doi: 10.1029/2021JC017667. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1029/2021JC017667.
  • Boutin et al. (2023) Boutin, G., Ólason, E., Rampal, P., Regan, H., Lique, C., Talandier, C., Brodeau, L., and Ricker, R. Arctic sea ice mass balance in a new coupled ice–ocean model using a brittle rheology framework. The Cryosphere, 17(2):617–638, February 2023. ISSN 1994-0416. doi: 10.5194/tc-17-617-2023. Publisher: Copernicus GmbH.
  • Brajard et al. (2024) Brajard, J., Korosov, A., Davy, R., and Wang, Y. Super-resolution of satellite observations of sea ice thickness using diffusion models and physical modeling. Technical Report EGU24-3826, Copernicus Meetings, March 2024. Conference Name: EGU24.
  • Cipolla et al. (2018) Cipolla, R., Gal, Y., and Kendall, A. Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  7482–7491, Salt Lake City, UT, USA, June 2018. IEEE. ISBN 978-1-5386-6420-9. doi: 10.1109/CVPR.2018.00781.
  • Dai et al. (2023) Dai, X., Hou, J., Ma, C.-Y., Tsai, S., Wang, J., Wang, R., Zhang, P., Vandenhende, S., Wang, X., Dubey, A., Yu, M., Kadian, A., Radenovic, F., Mahajan, D., Li, K., Zhao, Y., Petrovic, V., Singh, M. K., Motwani, S., Wen, Y., Song, Y., Sumbaly, R., Ramanathan, V., He, Z., Vajda, P., and Parikh, D. Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack, September 2023. arXiv:2309.15807 [cs].
  • Dansereau et al. (2016) Dansereau, V., Weiss, J., Saramito, P., and Lattes, P. A Maxwell elasto-brittle rheology for sea ice modelling. The Cryosphere, 10(3):1339–1359, July 2016. ISSN 1994-0416. doi: 10.5194/tc-10-1339-2016. publisher: Copernicus GmbH.
  • Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems, volume 34, pp.  8780–8794. Curran Associates, Inc., 2021.
  • Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, June 2021. arXiv:2010.11929 [cs].
  • Durand et al. (2024) Durand, C., Finn, T. S., Farchi, A., Bocquet, M., Boutin, G., and Ólason, E. Data-driven surrogate modeling of high-resolution sea-ice thickness in the Arctic. The Cryosphere, 18(4):1791–1815, April 2024. ISSN 1994-0416. doi: 10.5194/tc-18-1791-2024. Publisher: Copernicus GmbH.
  • Falcon et al. (2020) Falcon, W., Borovec, J., Wälchli, A., Eggert, N., Schock, J., Jordan, J., Skafte, N., Ir1dXD, Bereznyuk, V., Harris, E., Murrell, T., Yu, P., Præsius, S., Addair, T., Zhong, J., Lipin, D., Uchida, S., Bapat, S., Schröter, H., Dayma, B., Karnachev, A., Kulkarni, A., Komatsu, S., Martin.B, SCHIRATTI, J.-B., Mary, H., Byrne, D., Eyzaguirre, C., cinjon, and Bakhtin, A. PyTorchLightning: 0.7.6 release, May 2020.
  • Finn et al. (2023) Finn, T. S., Durand, C., Farchi, A., Bocquet, M., Chen, Y., Carrassi, A., and Dansereau, V. Deep learning subgrid-scale parametrisations for short-term forecasting of sea-ice dynamics with a Maxwell elasto-brittle rheology. The Cryosphere, 17(7):2965–2991, July 2023. ISSN 1994-0416. doi: 10.5194/tc-17-2965-2023. Publisher: Copernicus GmbH.
  • Finn et al. (2024) Finn, T. S., Durand, C., Farchi, A., Bocquet, M., Rampal, P., and Carrassi, A. Generative diffusion for regional surrogate models from sea-ice simulations. April 2024. doi: 10.22541/au.171386536.64344222/v1.
  • Fishman et al. (2023) Fishman, N., Klarner, L., Bortoli, V. D., Mathieu, E., and Hutchinson, M. J. Diffusion models for constrained domains. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
  • Girard et al. (2011) Girard, L., Bouillon, S., Weiss, J., Amitrano, D., Fichefet, T., and Legat, V. A new modeling framework for sea-ice mechanics based on elasto-brittle rheology. Annals of Glaciology, 52(57):123–132, January 2011. ISSN 0260-3055, 1727-5644. doi: 10.3189/172756411795931499.
  • Goessling et al. (2016) Goessling, H. F., Tietsche, S., Day, J. J., Hawkins, E., and Jung, T. Predictability of the Arctic sea ice edge. Geophysical Research Letters, 43(4):1642–1650, 2016. ISSN 1944-8007. doi: 10.1002/2015GL067232. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/2015GL067232.
  • Hendrycks & Gimpel (2016) Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus), 2016.
  • Hersbach et al. (2020) Hersbach, H., Bell, B., Berrisford, P., Hirahara, S., Horányi, A., Muñoz-Sabater, J., Nicolas, J., Peubey, C., Radu, R., Schepers, D., Simmons, A., Soci, C., Abdalla, S., Abellan, X., Balsamo, G., Bechtold, P., Biavati, G., Bidlot, J., Bonavita, M., Chiara, G. D., Dahlgren, P., Dee, D., Diamantakis, M., Dragani, R., Flemming, J., Forbes, R., Fuentes, M., Geer, A., Haimberger, L., Healy, S., Hogan, R. J., Hólm, E., Janisková, M., Keeley, S., Laloyaux, P., Lopez, P., Lupu, C., Radnoti, G., Rosnay, P. d., Rozum, I., Vamborg, F., Villaume, S., and Thépaut, J.-N. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730):1999–2049, 2020. ISSN 1477-870X. doi: https://doi.org/10.1002/qj.3803. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/qj.3803.
  • Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv:1706.08500 [cs, stat], June 2017. arXiv: 1706.08500.
  • Higgins et al. (2016) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. November 2016.
  • Ho et al. (2020) Ho, J., Jain, A., and Abbeel, P. Denoising Diffusion Probabilistic Models, December 2020. tex.ids= ho2020, ho2020a, ho2020b arXiv: 2006.11239 [cs, stat] number: arXiv:2006.11239.
  • Hoogeboom et al. (2023) Hoogeboom, E., Heek, J., and Salimans, T. simple diffusion: End-to-end diffusion for high resolution images, January 2023. arXiv:2301.11093 [cs, stat].
  • Huang et al. (2023) Huang, Q., Park, D. S., Wang, T., Denk, T. I., Ly, A., Chen, N., Zhang, Z., Zhang, Z., Yu, J., Frank, C., Engel, J., Le, Q. V., Chan, W., Chen, Z., and Han, W. Noise2Music: Text-conditioned Music Generation with Diffusion Models, March 2023. arXiv:2302.03917 [cs, eess].
  • Kadow et al. (2020) Kadow, C., Hall, D. M., and Ulbrich, U. Artificial intelligence reconstructs missing climate information. Nature Geoscience, 13(6):408–413, June 2020. ISSN 1752-0908. doi: 10.1038/s41561-020-0582-5. Number: 6 Publisher: Nature Publishing Group.
  • Karras et al. (2018) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. February 2018.
  • Karras et al. (2022) Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the Design Space of Diffusion-Based Generative Models, October 2022. arXiv:2206.00364 [cs, stat].
  • Kingma et al. (2021) Kingma, D., Salimans, T., Poole, B., and Ho, J. Variational Diffusion Models. In Advances in Neural Information Processing Systems, volume 34, pp.  21696–21707. Curran Associates, Inc., 2021. tex.ids= kingma_variational_2021-1.
  • Kingma & Ba (2017) Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], January 2017. arXiv: 1412.6980.
  • Kingma & Gao (2023) Kingma, D. P. and Gao, R. Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation, September 2023. arXiv:2303.00848 [cs, stat].
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat], December 2013. arXiv: 1312.6114.
  • Kochkov et al. (2023) Kochkov, D., Yuval, J., Langmore, I., Norgaard, P., Smith, J., Mooers, G., Lottes, J., Rasp, S., Düben, P., Klöwer, M., Hatfield, S., Battaglia, P., Sanchez-Gonzalez, A., Willson, M., Brenner, M. P., and Hoyer, S. Neural General Circulation Models, November 2023. arXiv:2311.07222 [physics].
  • Leinonen et al. (2023) Leinonen, J., Hamann, U., Nerini, D., Germann, U., and Franch, G. Latent diffusion models for generative precipitation nowcasting with accurate uncertainty quantification, April 2023. arXiv:2304.12891 [physics].
  • Li et al. (2023) Li, L., Carver, R., Lopez-Gomez, I., Sha, F., and Anderson, J. SEEDS: Emulation of Weather Forecast Ensembles with Diffusion Models, October 2023. arXiv:2306.14066 [physics].
  • Liu et al. (2018) Liu, G., Reda, F. A., Shih, K. J., Wang, T.-C., Tao, A., and Catanzaro, B. Image Inpainting for Irregular Holes Using Partial Convolutions. pp.  85–100, 2018.
  • Liu et al. (2022) Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. A ConvNet for the 2020s, March 2022.
  • Lou & Ermon (2023) Lou, A. and Ermon, S. Reflected Diffusion Models. In Proceedings of the 40th International Conference on Machine Learning, pp.  22675–22701. PMLR, July 2023. ISSN: 2640-3498.
  • Madec (2008) Madec, G. NEMO ocean engine. Project report 1288-1619, Institut Pierre-Simon Laplace (IPSL), 2008. Series: 27.
  • Mardani et al. (2023) Mardani, M., Brenowitz, N., Cohen, Y., Pathak, J., Chen, C.-Y., Liu, C.-C., Vahdat, A., Kashinath, K., Kautz, J., and Pritchard, M. Generative Residual Diffusion Modeling for Km-scale Atmospheric Downscaling, September 2023. arXiv:2309.15214 [physics].
  • Odena et al. (2016) Odena, A., Dumoulin, V., and Olah, C. Deconvolution and Checkerboard Artifacts. Distill, 1(10):e3, October 2016. ISSN 2476-0757. doi: 10.23915/distill.00003.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp.  8024–8035. Curran Associates, Inc., 2019.
  • Perez et al. (2017) Perez, E., Strub, F., de Vries, H., Dumoulin, V., and Courville, A. FiLM: Visual Reasoning with a General Conditioning Layer, December 2017. arXiv:1709.07871 [cs, stat].
  • Price et al. (2024) Price, I., Sanchez-Gonzalez, A., Alet, F., Andersson, T. R., El-Kadi, A., Masters, D., Ewalds, T., Stott, J., Mohamed, S., Battaglia, P., Lam, R., and Willson, M. GenCast: Diffusion-based ensemble forecasting for medium-range weather, May 2024. arXiv:2312.15796 [physics] version: 2.
  • Rampal et al. (2016) Rampal, P., Bouillon, S., Ólason, E., and Morlighem, M. neXtSIM: a new Lagrangian sea ice model. The Cryosphere, 10(3):1055–1073, May 2016. ISSN 1994-0416. doi: 10.5194/tc-10-1055-2016. publisher: Copernicus GmbH.
  • Rampal et al. (2019) Rampal, P., Dansereau, V., Olason, E., Bouillon, S., Williams, T., Korosov, A., and Samaké, A. On the multi-fractal scaling properties of sea ice deformation. The Cryosphere, 13(9):2457–2474, September 2019. ISSN 1994-0416. doi: 10.5194/tc-13-2457-2019. Publisher: Copernicus GmbH.
  • Rombach et al. (2022) Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-Resolution Image Synthesis With Latent Diffusion Models. pp.  10684–10695, 2022.
  • Ronneberger et al. (2015) Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597 [cs], May 2015. arXiv: 1505.04597.
  • Rybkin et al. (2020) Rybkin, O., Daniilidis, K., and Levine, S. Simple and Effective VAE Training with Calibrated Decoders. arXiv:2006.13202 [cs, eess, stat], June 2020. arXiv: 2006.13202.
  • Saharia et al. (2022a) Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Advances in Neural Information Processing Systems, 35:36479–36494, December 2022a.
  • Saharia et al. (2022b) Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image Super-Resolution Via Iterative Refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp.  1–14, 2022b. ISSN 1939-3539. doi: 10.1109/TPAMI.2022.3204461. Conference Name: IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Salimans & Ho (2022) Salimans, T. and Ho, J. Progressive Distillation for Fast Sampling of Diffusion Models, June 2022. arXiv:2202.00512 [cs, stat].
  • Shmakov et al. (2023) Shmakov, A., Greif, K., Fenton, M., Ghosh, A., Baldi, P., and Whiteson, D. End-To-End Latent Variational Diffusion Models for Inverse Problems in High Energy Physics, May 2023. arXiv:2305.10399 [hep-ex].
  • Sinha et al. (2021) Sinha, A., Song, J., Meng, C., and Ermon, S. D2C: Diffusion-Decoding Models for Few-Shot Conditional Generation. In Advances in Neural Information Processing Systems, volume 34, pp.  12533–12548. Curran Associates, Inc., 2021.
  • Sohl-Dickstein et al. (2015) Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv:1503.03585 [cond-mat, q-bio, stat], November 2015. arXiv: 1503.03585.
  • Song & Ermon (2020) Song, Y. and Ermon, S. Improved Techniques for Training Score-Based Generative Models. In Advances in Neural Information Processing Systems, volume 33, pp.  12438–12448. Curran Associates, Inc., 2020.
  • Song et al. (2021) Song, Y., Durkan, C., Murray, I., and Ermon, S. Maximum Likelihood Training of Score-Based Diffusion Models, October 2021. arXiv:2101.09258 [cs, stat].
  • Talandier & Lique (2021) Talandier, C. and Lique, C. CREG025.L75-NEMO_r3.6.0, December 2021.
  • Tobin (1958) Tobin, J. Estimation of Relationships for Limited Dependent Variables. Econometrica, 26(1):24–36, 1958. ISSN 0012-9682. doi: 10.2307/1907382. Publisher: [Wiley, Econometric Society].
  • Vahdat et al. (2021) Vahdat, A., Kreis, K., and Kautz, J. Score-based Generative Modeling in Latent Space, December 2021. arXiv:2106.05931 [cs, stat].
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention Is All You Need. arXiv:1706.03762 [cs], December 2017. tex.ids: vaswani_attention_2017-1 arXiv: 1706.03762.
  • Wan et al. (2023) Wan, Z. Y., Baptista, R., Chen, Y.-f., Anderson, J., Boral, A., Sha, F., and Zepeda-Núñez, L. Debias Coarsely, Sample Conditionally: Statistical Downscaling through Optimal Transport and Probabilistic Diffusion Models, October 2023. arXiv:2305.15618 [physics].
  • Xiong et al. (2020) Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. On Layer Normalization in the Transformer Architecture. In International Conference on Machine Learning, pp.  10524–10533. PMLR, November 2020. ISSN: 2640-3498.
  • Yadan (2019) Yadan, O. Hydra - A framework for elegantly configuring complex applications, 2019. tex.howpublished: Github.
  • Ólason et al. (2021) Ólason, E., Rampal, P., and Dansereau, V. On the statistical properties of sea-ice lead fraction and heat fluxes in the Arctic. The Cryosphere, 15(2):1053–1064, February 2021. ISSN 1994-0416. doi: 10.5194/tc-15-1053-2021. Publisher: Copernicus GmbH.
  • Ólason et al. (2022) Ólason, E., Boutin, G., Korosov, A., Rampal, P., Williams, T., Kimmritz, M., Dansereau, V., and Samaké, A. A New Brittle Rheology and Numerical Framework for Large-Scale Sea-Ice Models. Journal of Advances in Modeling Earth Systems, 14(8):e2021MS002685, 2022. ISSN 1942-2466. doi: 10.1029/2021MS002685. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1029/2021MS002685.

Appendix A Additional methods

In the main manuscript, we shortly explained the methods around our main message: we can train latent diffusion models for geophysics end-to-end and incorporate physical bounds into the latent diffusion model. In the following, we explain the methods more in detail. We elucidate on our variational diffusion formulation in Appendix A.1. We introduce censored Gaussian distributions for the data log-likelihood in Appendix A.2, and we discuss our noise scheduler in Appendix A.3.

A.1 Latent diffusion formulation

In our formulation of generative diffusion, we rely on variational diffusion models as introduced in Kingma et al. (2021). These models allow us to naturally make use of encoder and decoder structures to reduce the dimensionality of the system.

Given data drawn from a dataset 𝐱𝒟similar-to𝐱𝒟\mathbf{x}\sim\mathcal{D}bold_x ∼ caligraphic_D, the encoder produces the corresponding latent state 𝐳x=enc(𝐱)subscript𝐳𝑥enc𝐱\mathbf{z}_{x}=\mathrm{enc}(\mathbf{x})bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = roman_enc ( bold_x ). This latent state is noised by a variance-preserving diffusion process,

𝐳τsubscript𝐳𝜏\displaystyle\mathbf{z}_{\tau}bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT =ατ𝐳x+στϵ,absentsubscript𝛼𝜏subscript𝐳𝑥subscript𝜎𝜏bold-italic-ϵ\displaystyle=\alpha_{\tau}\mathbf{z}_{x}+\sigma_{\tau}\bm{\epsilon},= italic_α start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT + italic_σ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT bold_italic_ϵ , withϵ𝒩(𝟎,𝐈),similar-towithbold-italic-ϵ𝒩0𝐈\displaystyle\text{with}~{}\bm{\epsilon}\sim\mathcal{N}(\bm{0},\mathbf{I}),with bold_italic_ϵ ∼ caligraphic_N ( bold_0 , bold_I ) , (3)

where the pseudo-time τ[0,1]𝜏01\tau\in[0,1]italic_τ ∈ [ 0 , 1 ] determines the temporal position of the process. The signal amplitude ατsubscript𝛼𝜏\alpha_{\tau}italic_α start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT specifies how much signal is contained in the noised sample, and the noise amplitude στ=1ατ2subscript𝜎𝜏1superscriptsubscript𝛼𝜏2\sigma_{\tau}=\sqrt{1-\alpha_{\tau}^{2}}italic_σ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT = square-root start_ARG 1 - italic_α start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG how much signal has been replaced by noise. The dependency of the (noise) scheduling of ατsubscript𝛼𝜏\alpha_{\tau}italic_α start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT and στsubscript𝜎𝜏\sigma_{\tau}italic_σ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT on the pseudo-time is formulated as logarithmic signal-to-noise ratio γ(τ)=log(ατ2στ2)𝛾𝜏superscriptsubscript𝛼𝜏2superscriptsubscript𝜎𝜏2\gamma(\tau)=\log(\frac{\alpha_{\tau}^{2}}{\sigma_{\tau}^{2}})italic_γ ( italic_τ ) = roman_log ( divide start_ARG italic_α start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) such that we can recover ατ2=11+exp(γ(τ))superscriptsubscript𝛼𝜏211𝛾𝜏\alpha_{\tau}^{2}=\frac{1}{1+\exp(-\gamma(\tau))}italic_α start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG 1 + roman_exp ( - italic_γ ( italic_τ ) ) end_ARG and στ2=11+exp(γ(τ))superscriptsubscript𝜎𝜏211𝛾𝜏\sigma_{\tau}^{2}=\frac{1}{1+\exp(\gamma(\tau))}italic_σ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG 1 + roman_exp ( italic_γ ( italic_τ ) ) end_ARG.

As in the variational autoencoder framework (Kingma & Welling, 2013), the data log-likelihood log(p(𝐱))𝑝𝐱\log(p(\mathbf{x}))roman_log ( italic_p ( bold_x ) ) is lower bounded by the evidence lower bound (ELBO). In equivalence, we can write an upper bound on the negative data log-likelihood (Kingma et al., 2021),

log(p(𝐱))𝔼𝐳0q(𝐳0𝐱)[logp(𝐱𝐳0)]+DKL(q(𝐳1𝐱)p(𝐳1))+𝔼𝐳xq(𝐳x𝐱),τ[0,1](ϕ,𝐳x,τ).𝑝𝐱subscript𝔼similar-tosubscript𝐳0𝑞conditionalsubscript𝐳0𝐱delimited-[]𝑝conditional𝐱subscript𝐳0subscript𝐷𝐾𝐿conditional𝑞conditionalsubscript𝐳1𝐱𝑝subscript𝐳1subscript𝔼formulae-sequencesimilar-tosubscript𝐳𝑥𝑞conditionalsubscript𝐳𝑥𝐱similar-to𝜏01italic-ϕsubscript𝐳𝑥𝜏-\log(p(\mathbf{x}))\leq-\mathbb{E}_{\mathbf{z}_{0}\sim q(\mathbf{z}_{0}\mid% \mathbf{x})}\Big{[}\log p(\mathbf{x}\mid\mathbf{z}_{0})\Big{]}+D_{KL}\bigg{(}q% (\mathbf{z}_{1}\mid\mathbf{x})\|p(\mathbf{z}_{1})\bigg{)}+\mathbb{E}_{\mathbf{% z}_{x}\sim q(\mathbf{z}_{x}\mid\mathbf{x}),\tau\sim[0,1]}\mathcal{L}(\phi,% \mathbf{z}_{x},\tau).- roman_log ( italic_p ( bold_x ) ) ≤ - blackboard_E start_POSTSUBSCRIPT bold_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∼ italic_q ( bold_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∣ bold_x ) end_POSTSUBSCRIPT [ roman_log italic_p ( bold_x ∣ bold_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ] + italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_q ( bold_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∣ bold_x ) ∥ italic_p ( bold_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ) + blackboard_E start_POSTSUBSCRIPT bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ∼ italic_q ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ∣ bold_x ) , italic_τ ∼ [ 0 , 1 ] end_POSTSUBSCRIPT caligraphic_L ( italic_ϕ , bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_τ ) . (5)

The first term is the negative log-likelihood of the data given a sample 𝐳0subscript𝐳0\mathbf{z}_{0}bold_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT drawn in latent space, and also called reconstruction loss, measuring how well we can reconstruct the data from the latent space. The likelihood of the data p(𝐱𝐳0)𝑝conditional𝐱subscript𝐳0p(\mathbf{x}\mid\mathbf{z}_{0})italic_p ( bold_x ∣ bold_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) maps back from latent space into data space and, hence, contains the decoder. The posterior distribution q(𝐳0𝐱)𝑞conditionalsubscript𝐳0𝐱q(\mathbf{z}_{0}\mid\mathbf{x})italic_q ( bold_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∣ bold_x ) specifies the first noised state given a data sample and includes the encoder, mapping from data space to latent space. We will further discuss different options for the data log-likelihood in Appendix A.2.

The second term is the Kullback-Leibler divergence between the distribution at τ=1𝜏1\tau=1italic_τ = 1, the end of the diffusion process, and a prior distribution for the diffusion process. Given the variance-preserving diffusion process from Eq. (3) and assuming a Gaussian prior distribution with mean 𝟎0\bm{0}bold_0 and covariance 𝐈𝐈\mathbf{I}bold_I, p(𝐳1)=𝒩(𝟎,𝐈)𝑝subscript𝐳1𝒩0𝐈p(\mathbf{z}_{1})=\mathcal{N}(\bm{0},\mathbf{I})italic_p ( bold_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) = caligraphic_N ( bold_0 , bold_I ), we obtain for the second term,

DKL(q(𝐳1𝐱)p(𝐳1))=12n=1N𝗅𝖺𝗍𝖾𝗇𝗍[σ12+(α1zx,n)21log(σ12)],subscript𝐷𝐾𝐿conditional𝑞conditionalsubscript𝐳1𝐱𝑝subscript𝐳112subscriptsuperscriptsubscript𝑁𝗅𝖺𝗍𝖾𝗇𝗍𝑛1delimited-[]subscriptsuperscript𝜎21superscriptsubscript𝛼1subscript𝑧𝑥𝑛21subscriptsuperscript𝜎21D_{KL}\bigg{(}q(\mathbf{z}_{1}\mid\mathbf{x})\|p(\mathbf{z}_{1})\bigg{)}=\frac% {1}{2}\sum^{N_{\mathsf{latent}}}_{n=1}\Big{[}\sigma^{2}_{1}+(\alpha_{1}z_{x,n}% )^{2}-1-\log(\sigma^{2}_{1})\Big{]},italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_q ( bold_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∣ bold_x ) ∥ italic_p ( bold_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT sansserif_latent end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT [ italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + ( italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_z start_POSTSUBSCRIPT italic_x , italic_n end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 1 - roman_log ( italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ] , (6)

over the N𝗅𝖺𝗍𝖾𝗇𝗍subscript𝑁𝗅𝖺𝗍𝖾𝗇𝗍N_{\mathsf{latent}}italic_N start_POSTSUBSCRIPT sansserif_latent end_POSTSUBSCRIPT dimensions of the latent space.

The last term is the diffusion loss, the only term that depends on the denoising NN Dϕ(𝐳τ,τ)subscript𝐷italic-ϕsubscript𝐳𝜏𝜏D_{\phi}(\mathbf{z}_{\tau},\tau)italic_D start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ). Given a sample in latent space 𝐳xsubscript𝐳𝑥\mathbf{z}_{x}bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT and a drawn pseudo-time τ𝜏\tauitalic_τ, we can calculate a noised sample 𝐳τsubscript𝐳𝜏\mathbf{z}_{\tau}bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT and the loss reads

(ϕ,𝐳x,τ)=(dγdτ)eγ𝐳xDϕ(𝐳τ,τ)22.bold-italic-ϕsubscript𝐳𝑥𝜏𝑑𝛾𝑑𝜏superscript𝑒𝛾superscriptsubscriptdelimited-∥∥subscript𝐳𝑥subscript𝐷bold-italic-ϕsubscript𝐳𝜏𝜏22\mathcal{L}(\bm{\phi},\mathbf{z}_{x},\tau)=\Big{(}-\frac{d\gamma}{d\tau}\Big{)% }\cdot e^{\gamma}\lVert\mathbf{z}_{x}-D_{\bm{\phi}}(\mathbf{z}_{\tau},\tau)% \rVert_{2}^{2}.caligraphic_L ( bold_italic_ϕ , bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT , italic_τ ) = ( - divide start_ARG italic_d italic_γ end_ARG start_ARG italic_d italic_τ end_ARG ) ⋅ italic_e start_POSTSUPERSCRIPT italic_γ end_POSTSUPERSCRIPT ∥ bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - italic_D start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT . (7)

Our loss function used to train the diffusion models, Eq. 4, corresponds to this loss term by setting w(γ)=1𝑤𝛾1w(\gamma)=1italic_w ( italic_γ ) = 1. As shown in Kingma & Gao (2023), using a monotonic weighting, like w(γ)=exp(γ2)𝑤𝛾𝛾2w(\gamma)=\exp(-\frac{\gamma}{2})italic_w ( italic_γ ) = roman_exp ( - divide start_ARG italic_γ end_ARG start_ARG 2 end_ARG ), corresponds to the ELBO with data augmentation. The only variable in Eq. (7) are the weights and biases of the diffusion model and the noise scheduler that specifies γ(τ)𝛾𝜏\gamma(\tau)italic_γ ( italic_τ ). Using the observation that the noise scheduling can be seen as importance sampling, we use an adaptive noise scheduler as explained in Appendix A.3.

In our experiments with diffusion models, we fix the end points of the noise scheduler to γ(0)=γmin=15𝛾0subscript𝛾𝑚𝑖𝑛15\gamma(0)=\gamma_{min}=-15italic_γ ( 0 ) = italic_γ start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT = - 15 and γ(1)=γmax=15𝛾1subscript𝛾𝑚𝑎𝑥15\gamma(1)=\gamma_{max}=15italic_γ ( 1 ) = italic_γ start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT = 15, and use a fixed or pre-trained encoder and decoder. The first two terms of Eq. (5) are then constant with respect to the trained diffusion model and can be neglected in the optimisation. We obtain Eq. (7) as the only loss to optimise the diffusion model.

Instead of directly predicting the state 𝐳xsubscript𝐳𝑥\mathbf{z}_{x}bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, we predict with our diffusion model a surrogate target 𝐯τατϵστ𝐳xsubscript𝐯𝜏subscript𝛼𝜏bold-italic-ϵsubscript𝜎𝜏subscript𝐳𝑥\mathbf{v}_{\tau}\coloneqq\alpha_{\tau}\bm{\epsilon}-\sigma_{\tau}\mathbf{z}_{x}bold_v start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ≔ italic_α start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT bold_italic_ϵ - italic_σ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, as this tends to improve the convergence and the stability of the training (Salimans & Ho, 2022). We can recover the denoised state by setting Dϕ(𝐳τ,τ)=ατ𝐳xστvϕ(𝐳τ,τ)subscript𝐷bold-italic-ϕsubscript𝐳𝜏𝜏subscript𝛼𝜏subscript𝐳𝑥subscript𝜎𝜏subscript𝑣bold-italic-ϕsubscript𝐳𝜏𝜏D_{\bm{\phi}}(\mathbf{z}_{\tau},\tau)=\alpha_{\tau}\mathbf{z}_{x}-\sigma_{\tau% }v_{\bm{\phi}}(\mathbf{z}_{\tau},\tau)italic_D start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) = italic_α start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT - italic_σ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT italic_v start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ). Our loss function reads then

(ϕ,𝐳τ,τ)=w(γ)(dγdτ)(eγ+1)1𝐯τvϕ(𝐳τ,τ)22,bold-italic-ϕsubscript𝐳𝜏𝜏𝑤𝛾𝑑𝛾𝑑𝜏superscriptsuperscript𝑒𝛾11subscriptsuperscriptdelimited-∥∥subscript𝐯𝜏subscript𝑣bold-italic-ϕsubscript𝐳𝜏𝜏22\mathcal{L}(\bm{\phi},\mathbf{z}_{\tau},\tau)=w(\gamma)\cdot\Big{(}-\frac{d% \gamma}{d\tau}\Big{)}\cdot(e^{-\gamma}+1)^{-1}\lVert\mathbf{v}_{\tau}-v_{\bm{% \phi}}(\mathbf{z}_{\tau},\tau)\rVert^{2}_{2},caligraphic_L ( bold_italic_ϕ , bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) = italic_w ( italic_γ ) ⋅ ( - divide start_ARG italic_d italic_γ end_ARG start_ARG italic_d italic_τ end_ARG ) ⋅ ( italic_e start_POSTSUPERSCRIPT - italic_γ end_POSTSUPERSCRIPT + 1 ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∥ bold_v start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT - italic_v start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , (8)

which corresponds to the ELBO with data augmentation and 𝐯τsubscript𝐯𝜏\mathbf{v}_{\tau}bold_v start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT-prediction.

To optimise the encoder and decoder together with the diffusion in an end-to-end training scheme (Vahdat et al., 2021; Shmakov et al., 2023), we can make use of Eq. (5). However, pre-training the encoder and decoder simplifies and stratifies the training process. To pre-train the encoder and decoder, we apply a variational autoencoder loss function,

(𝜽)=𝜽absent\displaystyle\mathcal{L}(\bm{\theta})=caligraphic_L ( bold_italic_θ ) = 𝔼q(𝐳x𝐱)[log(p(𝐱𝐳x))]subscript𝔼𝑞conditionalsubscript𝐳𝑥𝐱delimited-[]𝑝conditional𝐱subscript𝐳𝑥\displaystyle-\mathbb{E}_{q(\mathbf{z}_{x}\mid\mathbf{x})}\Big{[}\log\big{(}p(% \mathbf{x}\mid\mathbf{z}_{x})\big{)}\Big{]}- blackboard_E start_POSTSUBSCRIPT italic_q ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ∣ bold_x ) end_POSTSUBSCRIPT [ roman_log ( italic_p ( bold_x ∣ bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) ) ] (1a)
+βDKL(q(𝐳x𝐱)p(𝐳x)).𝛽subscript𝐷𝐾𝐿conditional𝑞conditionalsubscript𝐳𝑥𝐱𝑝subscript𝐳𝑥\displaystyle+\beta D_{KL}\Big{(}q(\mathbf{z}_{x}\mid\mathbf{x})\|p(\mathbf{z}% _{x})\Big{)}.+ italic_β italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_q ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ∣ bold_x ) ∥ italic_p ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) ) . (1b)

Comparing Eq. (5) with this pre-training loss, we can see that the reconstruction loss appears in both formulations, while the loss terms in latent/noised space are combined into a single loss, and the Eq. (1b) should prepare the latent space for its use in diffusion models. The reconstruction loss is discussed more in detail in Appendix A.2.

We train two different version of diffusion models:

  • The encoder and decoder are pre-trained with Eq. (1) as 𝐳x=enc(𝐱)subscript𝐳𝑥enc𝐱\mathbf{z}_{x}=\mathrm{enc}(\mathbf{x})bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = roman_enc ( bold_x ) and 𝐱^=dec(𝐳x)^𝐱decsubscript𝐳𝑥\hat{\mathbf{x}}=\mathrm{dec}(\mathbf{z}_{x})over^ start_ARG bold_x end_ARG = roman_dec ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ), which results into a latent diffusion model.

  • The encoder and decoder are fixed to the identity functions 𝐳x=𝐱subscript𝐳𝑥𝐱\mathbf{z}_{x}=\mathbf{x}bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT = bold_x and 𝐱^=𝐳x^𝐱subscript𝐳𝑥\hat{\mathbf{x}}=\mathbf{z}_{x}over^ start_ARG bold_x end_ARG = bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, which leads to diffusion models in data space.

During their training, we draw a mini-batch of data samples 𝐱𝒟similar-to𝐱𝒟\mathbf{x}\sim\mathcal{D}bold_x ∼ caligraphic_D, converted into latent space 𝐳xsubscript𝐳𝑥\mathbf{z}_{x}bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, and pseudo-times τ[0,1]𝜏01\tau\in[0,1]italic_τ ∈ [ 0 , 1 ]. We reduce the variance of the sampling in the pseudo-times by using stratified sampling as proposed in Kingma et al. (2021). Afterwards, we estimate the diffusion loss, Eq. (4), for the current parameters ϕitalic-ϕ\phiitalic_ϕ of the diffusion model. This loss is then used in Adam (Kingma & Ba, 2017) to make a gradient descent step.

The diffusion process can be seen as stochastic differential equation (SDE) that is integrated in pseudo-time (Song et al., 2021). The reversion of this process is again a SDE (Anderson, 1982) and we can find an ordinary differential equation (ODE) that has the same marginal distribution as the reverse SDE (Song et al., 2021),

d𝐳~τ=[𝐟(𝐳~τ,τ)12g(τ)2𝐬ϕ(𝐳~τ,τ)]dτ.𝑑subscript~𝐳𝜏delimited-[]𝐟subscript~𝐳𝜏𝜏12𝑔superscript𝜏2subscript𝐬bold-italic-ϕsubscript~𝐳𝜏𝜏𝑑𝜏d\widetilde{\mathbf{z}}_{\tau}=\Big{[}\mathbf{f}(\widetilde{\mathbf{z}}_{\tau}% ,\tau)-\frac{1}{2}g(\tau)^{2}\mathbf{s}_{\bm{\phi}}(\widetilde{\mathbf{z}}_{% \tau},\tau)\Big{]}d\tau.italic_d over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT = [ bold_f ( over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) - divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_g ( italic_τ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_s start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) ] italic_d italic_τ . (10)

For the variance-preserving diffusion process as defined in Eq. (3), we obtain for the drift and diffusion term

𝐟(𝐳~τ,τ)𝐟subscript~𝐳𝜏𝜏\displaystyle\mathbf{f}(\widetilde{\mathbf{z}}_{\tau},\tau)bold_f ( over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) =12(ddτlog(1+eγ(τ)))𝐳~τabsent12𝑑𝑑𝜏1superscript𝑒𝛾𝜏subscript~𝐳𝜏\displaystyle=-\frac{1}{2}\Big{(}\frac{d}{d\tau}\log(1+e^{-\gamma(\tau)})\Big{% )}\widetilde{\mathbf{z}}_{\tau}= - divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( divide start_ARG italic_d end_ARG start_ARG italic_d italic_τ end_ARG roman_log ( 1 + italic_e start_POSTSUPERSCRIPT - italic_γ ( italic_τ ) end_POSTSUPERSCRIPT ) ) over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT (11)
g(τ)2𝑔superscript𝜏2\displaystyle g(\tau)^{2}italic_g ( italic_τ ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT =ddτlog(1+eγ(τ)).absent𝑑𝑑𝜏1superscript𝑒𝛾𝜏\displaystyle=\frac{d}{d\tau}\log(1+e^{-\gamma(\tau)}).= divide start_ARG italic_d end_ARG start_ARG italic_d italic_τ end_ARG roman_log ( 1 + italic_e start_POSTSUPERSCRIPT - italic_γ ( italic_τ ) end_POSTSUPERSCRIPT ) . (12)

Consequently, the only variable in the ODE, Eq. (10), is the score function 𝐬ϕ(𝐳~τ,τ)subscript𝐬bold-italic-ϕsubscript~𝐳𝜏𝜏\mathbf{s}_{\bm{\phi}}(\widetilde{\mathbf{z}}_{\tau},\tau)bold_s start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) which we approximate with our trained NN,

𝐬ϕ(𝐳~τ,τ)=(𝐳~τ+ατστvϕ(𝐳~τ,τ)).subscript𝐬bold-italic-ϕsubscript~𝐳𝜏𝜏subscript~𝐳𝜏subscript𝛼𝜏subscript𝜎𝜏subscript𝑣bold-italic-ϕsubscript~𝐳𝜏𝜏\mathbf{s}_{\bm{\phi}}(\widetilde{\mathbf{z}}_{\tau},\tau)=-\Big{(}\widetilde{% \mathbf{z}}_{\tau}+\frac{\alpha_{\tau}}{\sigma_{\tau}}v_{\bm{\phi}}(\widetilde% {\mathbf{z}}_{\tau},\tau)\Big{)}.bold_s start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) = - ( over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT + divide start_ARG italic_α start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT end_ARG italic_v start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT , italic_τ ) ) . (13)

Given Eq. (11)–(13), we can integrate (10) from τ=1𝜏1\tau=1italic_τ = 1 to τ=0𝜏0\tau=0italic_τ = 0 to find a solution and generate new samples; the initial conditions for the ODE are given by 𝐳~1𝒩(𝟎,𝐈)similar-tosubscript~𝐳1𝒩0𝐈\widetilde{\mathbf{z}}_{1}\sim\mathcal{N}(\bm{0},\mathbf{I})over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_0 , bold_I ). For the integration of the ODE, we use a second-order Heun sampler (Karras et al., 2022) with 20 integration steps. The time steps for the integration are given by the inference noise scheduler as proposed by Karras et al. (2022). We extend the noise scheduler to a wider range of signal-to-noise ratio by setting γmin=15subscript𝛾𝑚𝑖𝑛15\gamma_{min}=-15italic_γ start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT = - 15 and γmax=15subscript𝛾𝑚𝑎𝑥15\gamma_{max}=15italic_γ start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT = 15 with truncation (Kingma & Gao, 2023). Additionally, we adapt the sampling procedure: we apply the Heun procedure for all 20 integration steps, while we use a single Euler step to denoise the output from 𝐳~0subscript~𝐳0\widetilde{\mathbf{z}}_{0}over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT to 𝐳~xsubscript~𝐳𝑥\widetilde{\mathbf{z}}_{x}over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT. The integration hence includes 41 iterations with the trained neural network. After obtaining 𝐳~xsubscript~𝐳𝑥\widetilde{\mathbf{z}}_{x}over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT, we apply the decoder to map back into data space 𝐱~=dec(𝐳~x)~𝐱decsubscript~𝐳𝑥\widetilde{\mathbf{x}}=\mathrm{dec}(\widetilde{\mathbf{z}}_{x})over~ start_ARG bold_x end_ARG = roman_dec ( over~ start_ARG bold_z end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) which gives us then the generated sample.

A.2 Data log-likelihood with censored distributions

The autoencoder should map from latent space to data space by taking physical bounds into account, e.g. the non-negativity of the sea-ice thickness. To achieve such bounds, we can clip the decoder output and set values below (over) the lower (upper) bound explicitly to the bound; similar to what the rectified linear unit (relu) activation function does for the negative bound. Hence, all values below (above) the bound are projected to the bound. In this clipping case, the output of the decoder is no longer the physical quantity itself but another latent variable, which is converted into the physical quantity in a post-processing step. To differentiate between physical quantities and additional latent variable, we denote the decoder output as predicted latent variable 𝐲^5×512×512^𝐲superscript5512512\widehat{\mathbf{y}}\in\mathbb{R}^{5\times 512\times 512}over^ start_ARG bold_y end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT 5 × 512 × 512 end_POSTSUPERSCRIPT in the following. We assume that this predicted latent variable is Gaussian distributed where the decoder predicts the mean 𝝁=dec(𝐳x)𝝁decsubscript𝐳𝑥\bm{\mu}=\mathrm{dec}(\mathbf{z}_{x})bold_italic_μ = roman_dec ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) and the standard deviation 𝒔𝒔\bm{s}bold_italic_s is spatially shared with one parameter per variable,

𝐲^𝒩(𝝁^,𝒔2𝐈).similar-to^𝐲𝒩^𝝁superscript𝒔2𝐈\widehat{\mathbf{y}}\sim\mathcal{N}\big{(}\widehat{\bm{\mu}},\bm{s}^{2}\mathbf% {I}\big{)}.over^ start_ARG bold_y end_ARG ∼ caligraphic_N ( over^ start_ARG bold_italic_μ end_ARG , bold_italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_I ) . (14)

Caused by the diagonal covariance, we make out of the multivariate data log-likelihood a univariate estimation, which we can simply sum over the variables and grid points. Given this Gaussian assumption, the data log-likelihood would read for K𝐾Kitalic_K variables and L𝐿Litalic_L grid points

logp(𝐱𝐳x)=12k=1Kl=1L(xk,lμ^k,l)2(sk)2+log((sk)2)+log(2π).𝑝conditional𝐱subscript𝐳𝑥12subscriptsuperscript𝐾𝑘1subscriptsuperscript𝐿𝑙1superscriptsubscript𝑥𝑘𝑙subscript^𝜇𝑘𝑙2superscriptsubscript𝑠𝑘2superscriptsubscript𝑠𝑘22𝜋-\log p(\mathbf{x}\mid\mathbf{z}_{x})=\frac{1}{2}\sum^{K}_{k=1}\sum^{L}_{l=1}% \frac{(x_{k,l}-\widehat{\mu}_{k,l})^{2}}{(s_{k})^{2}}+\log\big{(}(s_{k})^{2}% \big{)}+\log(2\cdot\pi).- roman_log italic_p ( bold_x ∣ bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT ∑ start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT divide start_ARG ( italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT - over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ( italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG + roman_log ( ( italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) + roman_log ( 2 ⋅ italic_π ) . (15)

The log-likelihood is proportional to a mean-squared error (MSE) weighted by the scale parameter. Although the MSE is often employed to train an autoencoder, we make the assumption that the decoder output is the physical quantity itself and we neglect the clipping in the training of the autoencoder. Consequently, if we clip the output into physical bounds, we make a wrong assumption and introduce a bias into the decoder and, thus, the latent space. This bias depends on the number of cases that the lower (upper) bound is reached and is difficult to quantify. To explicitly bake the clipping into the cost function and treating the decoder output as latent variable, we censor the assumed distribution.

To optimise the autoencoder, we derive the data log-likelihood where variables are lower and upper bounded with 𝒙𝖫subscript𝒙𝖫\bm{x}_{\mathsf{L}}bold_italic_x start_POSTSUBSCRIPT sansserif_L end_POSTSUBSCRIPT as lower and 𝒙𝖴subscript𝒙𝖴\bm{x}_{\mathsf{U}}bold_italic_x start_POSTSUBSCRIPT sansserif_U end_POSTSUBSCRIPT as upper bound, e.g., x𝖫=0subscript𝑥𝖫0x_{\mathsf{L}}=0italic_x start_POSTSUBSCRIPT sansserif_L end_POSTSUBSCRIPT = 0 as lower and x𝖴=1subscript𝑥𝖴1x_{\mathsf{U}}=1italic_x start_POSTSUBSCRIPT sansserif_U end_POSTSUBSCRIPT = 1 as upper bound for the sea-ice concentration. For variables with only one bound, the case of the missing bound can be simply omitted.

The predicted latent variable for the k𝑘kitalic_k-th variable and the l𝑙litalic_l-th grid point is clipped into the physical bounds by

𝐱^k,l={x𝖫,k,if μ^k,lx𝖫,kx𝖴,k,if μ^k,lx𝖴,kμ^k,l,otherwise.subscript^𝐱𝑘𝑙casessubscript𝑥𝖫𝑘if subscript^𝜇𝑘𝑙subscript𝑥𝖫𝑘subscript𝑥𝖴𝑘if subscript^𝜇𝑘𝑙subscript𝑥𝖴𝑘subscript^𝜇𝑘𝑙otherwise\widehat{\mathbf{x}}_{k,l}=\begin{cases}x_{\mathsf{L},k},&\text{if }\widehat{% \mu}_{k,l}\leq x_{\mathsf{L},k}\\ x_{\mathsf{U},k},&\text{if }\widehat{\mu}_{k,l}\geq x_{\mathsf{U},k}\\ \widehat{\mu}_{k,l},&\text{otherwise}.\end{cases}over^ start_ARG bold_x end_ARG start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT = { start_ROW start_CELL italic_x start_POSTSUBSCRIPT sansserif_L , italic_k end_POSTSUBSCRIPT , end_CELL start_CELL if over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT ≤ italic_x start_POSTSUBSCRIPT sansserif_L , italic_k end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT sansserif_U , italic_k end_POSTSUBSCRIPT , end_CELL start_CELL if over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT ≥ italic_x start_POSTSUBSCRIPT sansserif_U , italic_k end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT , end_CELL start_CELL otherwise . end_CELL end_ROW (16)

This clipping operation is non-invertible such that we cannot recover the true latent variable yk,lsubscript𝑦𝑘𝑙y_{k,l}italic_y start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT from observing the physical quantity xk,lsubscript𝑥𝑘𝑙x_{k,l}italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT. However, we know that if xk,l=x𝖫,ksubscript𝑥𝑘𝑙subscript𝑥𝖫𝑘x_{k,l}=x_{\mathsf{L},k}italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT sansserif_L , italic_k end_POSTSUBSCRIPT or xk,l=x𝖴,ksubscript𝑥𝑘𝑙subscript𝑥𝖴𝑘x_{k,l}=x_{\mathsf{U},k}italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT sansserif_U , italic_k end_POSTSUBSCRIPT, the true latent variable was clipped. In such cases, rationally, the data log-likelihood hence should increase the probability that the decoder output is clipped.

In the following, we denote the cumulative distribution function (CDF) of the normal Gaussian distribution as ΦΦ\Phiroman_Φ and its probability density function (PDF) as φ𝜑\varphiitalic_φ. Given the assumed Gaussian distribution of the latent variables, Eq. (14), the physical quantity is presumably distributed by a censored Gaussian distribution with the following PDF for the k𝑘kitalic_k-th variable and l𝑙litalic_l-th grid point,

p(xk,l𝐳x)={Φ(x𝖫,kμ^k,lsk),if xk,l=x𝖫,k,Φ(μ^k,lx𝖴,ksk),if xk,l=x𝖴,k,1skφ(xk,lμ^k,lsk),if x𝖫,k<xk,l<x𝖴,k,0,otherwise.𝑝conditionalsubscript𝑥𝑘𝑙subscript𝐳𝑥casesΦsubscript𝑥𝖫𝑘subscript^𝜇𝑘𝑙subscript𝑠𝑘if subscript𝑥𝑘𝑙subscript𝑥𝖫𝑘Φsubscript^𝜇𝑘𝑙subscript𝑥𝖴𝑘subscript𝑠𝑘if subscript𝑥𝑘𝑙subscript𝑥𝖴𝑘1subscript𝑠𝑘𝜑subscript𝑥𝑘𝑙subscript^𝜇𝑘𝑙subscript𝑠𝑘if subscript𝑥𝖫𝑘subscript𝑥𝑘𝑙subscript𝑥𝖴𝑘0otherwisep(x_{k,l}\mid\mathbf{z}_{x})=\begin{cases}\Phi(\frac{x_{\mathsf{L},k}-\widehat% {\mu}_{k,l}}{s_{k}}),&\text{if }x_{k,l}=x_{\mathsf{L},k},\\ \Phi(\frac{\widehat{\mu}_{k,l}-x_{\mathsf{U},k}}{s_{k}}),&\text{if }x_{k,l}=x_% {\mathsf{U},k},\\ \frac{1}{s_{k}}\varphi(\frac{x_{k,l}-\widehat{\mu}_{k,l}}{s_{k}}),&\text{if }x% _{\mathsf{L},k}<x_{k,l}<x_{\mathsf{U},k},\\ 0,&\text{otherwise}.\end{cases}italic_p ( italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT ∣ bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) = { start_ROW start_CELL roman_Φ ( divide start_ARG italic_x start_POSTSUBSCRIPT sansserif_L , italic_k end_POSTSUBSCRIPT - over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT end_ARG start_ARG italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ) , end_CELL start_CELL if italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT sansserif_L , italic_k end_POSTSUBSCRIPT , end_CELL end_ROW start_ROW start_CELL roman_Φ ( divide start_ARG over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT sansserif_U , italic_k end_POSTSUBSCRIPT end_ARG start_ARG italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ) , end_CELL start_CELL if italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT sansserif_U , italic_k end_POSTSUBSCRIPT , end_CELL end_ROW start_ROW start_CELL divide start_ARG 1 end_ARG start_ARG italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG italic_φ ( divide start_ARG italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT - over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT end_ARG start_ARG italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ) , end_CELL start_CELL if italic_x start_POSTSUBSCRIPT sansserif_L , italic_k end_POSTSUBSCRIPT < italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT < italic_x start_POSTSUBSCRIPT sansserif_U , italic_k end_POSTSUBSCRIPT , end_CELL end_ROW start_ROW start_CELL 0 , end_CELL start_CELL otherwise . end_CELL end_ROW (17)

Different from a truncated Gaussian distribution, Eq. (17) gives values at the bounds a probability larger than zero, and all the density of the latent variable exceeding the bounds is folded to the bounds.

Using the PDF of the censored distribution from Eq. (17) for the data log-likelihood, we obtain

logp(𝐱𝐳x)=k=1Kl=1LI(xk,l=x𝖫,k)log(Φ(x𝖫,kμ^k,lsk))+I(xk,l=x𝖴,k)log(Φ(μ^k,lx𝖴,ksk))+I(x𝖫,k<xk,l<x𝖴,k)log(1skφ(xk,lμ^k,lsk))𝑝conditional𝐱subscript𝐳𝑥subscriptsuperscript𝐾𝑘1subscriptsuperscript𝐿𝑙1𝐼subscript𝑥𝑘𝑙subscript𝑥𝖫𝑘Φsubscript𝑥𝖫𝑘subscript^𝜇𝑘𝑙subscript𝑠𝑘𝐼subscript𝑥𝑘𝑙subscript𝑥𝖴𝑘Φsubscript^𝜇𝑘𝑙subscript𝑥𝖴𝑘subscript𝑠𝑘𝐼subscript𝑥𝖫𝑘subscript𝑥𝑘𝑙subscript𝑥𝖴𝑘1subscript𝑠𝑘𝜑subscript𝑥𝑘𝑙subscript^𝜇𝑘𝑙subscript𝑠𝑘\begin{split}-\log p(\mathbf{x}\mid\mathbf{z}_{x})=\sum^{K}_{k=1}\sum^{L}_{l=1% }\phantom{+}&I(x_{k,l}=x_{\mathsf{L},k})\,\log\bigg{(}\Phi\bigg{(}\frac{x_{% \mathsf{L},k}-\widehat{\mu}_{k,l}}{s_{k}}\bigg{)}\bigg{)}\\ +&I(x_{k,l}=x_{\mathsf{U},k})\,\log\bigg{(}\Phi\bigg{(}\frac{\widehat{\mu}_{k,% l}-x_{\mathsf{U},k}}{s_{k}}\bigg{)}\bigg{)}\\ +&I(x_{\mathsf{L},k}<x_{k,l}<x_{\mathsf{U},k})\,\log\bigg{(}\frac{1}{s_{k}}% \varphi\bigg{(}\frac{x_{k,l}-\widehat{\mu}_{k,l}}{s_{k}}\bigg{)}\bigg{)}\end{split}start_ROW start_CELL - roman_log italic_p ( bold_x ∣ bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ) = ∑ start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT ∑ start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT end_CELL start_CELL italic_I ( italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT sansserif_L , italic_k end_POSTSUBSCRIPT ) roman_log ( roman_Φ ( divide start_ARG italic_x start_POSTSUBSCRIPT sansserif_L , italic_k end_POSTSUBSCRIPT - over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT end_ARG start_ARG italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ) ) end_CELL end_ROW start_ROW start_CELL + end_CELL start_CELL italic_I ( italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT sansserif_U , italic_k end_POSTSUBSCRIPT ) roman_log ( roman_Φ ( divide start_ARG over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT sansserif_U , italic_k end_POSTSUBSCRIPT end_ARG start_ARG italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ) ) end_CELL end_ROW start_ROW start_CELL + end_CELL start_CELL italic_I ( italic_x start_POSTSUBSCRIPT sansserif_L , italic_k end_POSTSUBSCRIPT < italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT < italic_x start_POSTSUBSCRIPT sansserif_U , italic_k end_POSTSUBSCRIPT ) roman_log ( divide start_ARG 1 end_ARG start_ARG italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG italic_φ ( divide start_ARG italic_x start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT - over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_k , italic_l end_POSTSUBSCRIPT end_ARG start_ARG italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG ) ) end_CELL end_ROW (18)

as cost function, where I()𝐼I(\cdot)italic_I ( ⋅ ) is the indicator function. The first two parts of the cost function are the bounded cases and correspond to a Gaussian classification model, using as probability the log-CDFs at the bounds. The last part is the log-likelihood of a Gaussian distribution as in Eq. (15) and includes the weighted MSE. The here derived log-likelihood thus combines a regression with a classification and is the negative log-likelihood of a type I Tobit model (Tobin, 1958). Whereas the optimisation of Eq. (18) seems more complicated than the optimisation of a log-likelihood from a Gaussian distribution, we have experienced no optimisation issues with variants of gradients descent.

Caused by the regression-classification mixture, the decoder output specifies not only the variable of interest but combined with the scale also the probability that the output will be clipped. This censored distribution approach works because we neglect correlations between variables and grid points and convert the multivariate into a univariate prediction problem. Nevertheless, we are free to chose any neural network architecture to predict the mean of the distribution in Eq. (14). Consequently, by e.g. using a fully convolutional decoder, we can still represent correlations in the decoder output.

A.3 Adaptive noise scheduler

The noise scheduler maps a given pseudo-time τ[0,1]𝜏01\tau\in[0,1]italic_τ ∈ [ 0 , 1 ] to a log-signal-to-noise ratio γ(τ)𝛾𝜏\gamma(\tau)italic_γ ( italic_τ ) and determines how much time is spent at a given ratio. In the manuscript, we use different schedulers during training and generation. While we generate the data with a fixed scheduler as proposed by Karras et al. (2022), we make use of an adaptive scheduler (Kingma & Gao, 2023) during training. In the following, we briefly introduce the principles of the adaptive scheduler, and we refer to Kingma & Gao (2023) for a longer treatment.

As shortly shown in Appendix A.1, the loss function of the diffusion model optimises the evidence lower bound (ELBO). Consequently, we can extend the variational parameters by the parameters of the training noise scheduler to further tighten the bound (Kingma et al., 2021) which gives us the optimal scheduler in the ELBO sense. Extending on this idea, the distribution given after transforming a drawn pseudo-time into γ𝛾\gammaitalic_γ can be seen as importance sampling distribution p(γ)𝑝𝛾p(\gamma)italic_p ( italic_γ ) (Kingma & Gao, 2023), and we can write

dγdτ=1p(γ).𝑑𝛾𝑑𝜏1𝑝𝛾-\frac{d\gamma}{d\tau}=\frac{1}{p(\gamma)}.- divide start_ARG italic_d italic_γ end_ARG start_ARG italic_d italic_τ end_ARG = divide start_ARG 1 end_ARG start_ARG italic_p ( italic_γ ) end_ARG . (19)

Interpreting the output of the noise scheduler as a drawn from an importance sampling distribution allows us to formulate the optimal noise scheduler which should fulfil

p(γ)𝔼𝐱𝒟,ϵ𝒩(𝟎,𝐈)[w(γ)(eγ+1)1𝐯γvϕ(𝐳γ,γ)22],proportional-to𝑝𝛾subscript𝔼formulae-sequencesimilar-to𝐱𝒟similar-toitalic-ϵ𝒩0𝐈delimited-[]𝑤𝛾superscriptsuperscript𝑒𝛾11subscriptsuperscriptdelimited-∥∥subscript𝐯𝛾subscript𝑣bold-italic-ϕsubscript𝐳𝛾𝛾22p(\gamma)\propto\mathbb{E}_{\mathbf{x}\sim\mathcal{D},\epsilon\sim\mathcal{N}(% \bm{0},\mathbf{I})}\Big{[}w(\gamma)\cdot(e^{-\gamma}+1)^{-1}\lVert\mathbf{v}_{% \gamma}-v_{\bm{\phi}}(\mathbf{z}_{\gamma},\gamma)\rVert^{2}_{2}\Big{]},italic_p ( italic_γ ) ∝ blackboard_E start_POSTSUBSCRIPT bold_x ∼ caligraphic_D , italic_ϵ ∼ caligraphic_N ( bold_0 , bold_I ) end_POSTSUBSCRIPT [ italic_w ( italic_γ ) ⋅ ( italic_e start_POSTSUPERSCRIPT - italic_γ end_POSTSUPERSCRIPT + 1 ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ∥ bold_v start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT - italic_v start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_z start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT , italic_γ ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ] , (20)

where we made a change-of-variables from τ𝜏\tauitalic_τ to γ𝛾\gammaitalic_γ to signify the dependency on γ𝛾\gammaitalic_γ.

As Eq. (20) is infeasible, we approximate the optimal importance sampling distribution. Using 100 equal-distant bins between γmin=15subscript𝛾min15\gamma_{\mathrm{min}}=-15italic_γ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT = - 15 and γmax=15subscript𝛾max15\gamma_{\mathrm{max}}=15italic_γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT = 15, we keep track of an exponential moving average of Eq. (20) for each bin. The density is constant within each bin, which allows us to estimate an empirical cumulative distribution function from γmaxsubscript𝛾max\gamma_{\mathrm{max}}italic_γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT to γminsubscript𝛾min\gamma_{\mathrm{min}}italic_γ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT. The training noise scheduler is then given as empirical quantile function mapping from [0,1]01[0,1][ 0 , 1 ] to [γmax,γmin]subscript𝛾maxsubscript𝛾min[\gamma_{\mathrm{max}},\gamma_{\mathrm{min}}][ italic_γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT , italic_γ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT ]. In practice, this adaptive noise scheduler reduces the number of needed hyperparameters for the diffusion model and improves its convergence (Kingma & Gao, 2023). It was recently similarly used in Finn et al. (2024).

Appendix B Architectures

In the following, we will describe our neural network architectures. The autoencoder is purely based on a convolutional architecture, while the diffusion models are based on a mixture between a convolutional U-Net (Ronneberger et al., 2015) and a vision transformer (ViT, Dosovitskiy et al. (2021)), termed UVit (Hoogeboom et al., 2023).

B.1 Masked convolutions

As visible in Fig. 1, sea ice only exists on grid points covered by ocean while grid points with land are omitted. To take the masking of land pixels into account, we apply convolutional operations as inspired by partial convolutions (Liu et al., 2018; Kadow et al., 2020; Durand et al., 2024).

Before each convolutional operation, the data is multiplied by the mask 𝐦[0,1]𝐦01\mathbf{m}\in[0,1]bold_m ∈ [ 0 , 1 ], where valid grid points are 1111. The masking sets all grid points with land to zero such that they act like zero padding and interactions with land grid points are deactivated. Consequently, grid points near land exchange less information with the surrounding grid points which in fact imitates how land masking is implemented in numerical models.

B.2 Autoencoder

The autoencoder maps from 𝐱5×512×512𝐱superscript5512512\mathbf{x}\in\mathbb{R}^{5\times 512\times 512}bold_x ∈ blackboard_R start_POSTSUPERSCRIPT 5 × 512 × 512 end_POSTSUPERSCRIPT to 𝐳z16×64×64subscript𝐳𝑧superscript166464\mathbf{z}_{z}\in\mathbb{R}^{16\times 64\times 64}bold_z start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 16 × 64 × 64 end_POSTSUPERSCRIPT and vice-versa. Its encoder includes a series of downsampling operations with ConvNeXt blocks, while the decoder combines upsampling operations with ConvNeXt blocks.

The main computing block are ConvNeXt blocks (Liu et al., 2022): a wide masked convolution with a 7×7777\times 77 × 7 kernel extracts channel-wise spatial information. This spatial information is normalised with layer normalisation (Ba et al., 2016). Afterwards, a multi-layered perceptron (MLP) with a Gaussian error linear unit (Gelu, Hendrycks & Gimpel (2016)) activation mixes up the normalised information, before the output gets added again to the input of the block in a residual connection. Throughout the ConvNeXt block, the number of channels remain the same.

The downsampling in the encoder combines layer normalisation with a masked convolution and a 2×2222\times 22 × 2 kernel and a stride of 2222, which also doubles the number of channels. To downsample the mask, we apply max pooling, i.e. if there is an ocean grid point in a 2×2222\times 22 × 2 window, the output grid point is set to 1111. This increases the number of ocean grid points compared to a strided downsampling.

In the upsampling of the decoder, a layer normalisation is followed by a nearest neighbour interpolation, which doubles the number of grid points in y𝑦yitalic_y- and x𝑥xitalic_x-direction. Afterwards, a masked convolution with a 3×3333\times 33 × 3 kernel smooths the interpolated fields and halves the number of channels. This combination of interpolation with convolution results into less checkerboard effects compared to a transposed convolution (Odena et al., 2016).

The architectures for the encoder and decoder read (the number in brackets represents the number of channels):

  • Encoder: Input(5) \rightarrow Point-wise convolution (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Downsampling (128) \rightarrow ConvNeXt (128) \rightarrow ConvNeXt (128) \rightarrow Downsampling (256) \rightarrow ConvNeXt (256) \rightarrow ConvNeXt (256) \rightarrow Downsampling (512) \rightarrow ConvNeXt (512) \rightarrow ConvNeXt (512) \rightarrow Layer normalisation (512) \rightarrow rectified linear unit (relu, 512) \rightarrow Point-wise convolution (32)

  • Decoder: Input(16) \rightarrow Point-wise convolution (512) \rightarrow ConvNeXt (512) \rightarrow ConvNeXt (512) \rightarrow Upsampling (256) \rightarrow ConvNeXt (256) \rightarrow ConvNeXt (256) \rightarrow Upsampling (128) \rightarrow ConvNeXt (128) \rightarrow ConvNeXt (128) \rightarrow rightarrow (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Layer normalisation (64) \rightarrow relu (64) \rightarrow Point-wise convolution (5).

Note, the encoder outputs the mean and standard deviation in latent space, while the decoder gets only a latent sample where the mean and standard deviation are combined. The relu activation before the last point-wise convolution helps to improve the representation of continuous-discrete sea-ice processes.

In total, the encoder has 2.2×1062.2superscript1062.2\times 10^{6}2.2 × 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT parameters and the decoder has 3.1×1063.1superscript1063.1\times 10^{6}3.1 × 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT parameters.

B.3 Diffusion architecture

The diffusion models modify the UViT architectures as introduced in Hoogeboom et al. (2023). The encoding and decoding part of the architecture are implemented with convolutional operations, similar to how we implemented the autoencoder. In the bottleneck, the architecture uses a vision transformer (Dosovitskiy et al., 2021).

In addition to the latent sample 𝐳τsubscript𝐳𝜏\mathbf{z}_{\tau}bold_z start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT, the diffusion models get the pseudo-time τ𝜏\tauitalic_τ as input. This pseudo-time is handled as conditioning information and embedded with a sinusoidal embedding (Vaswani et al., 2017) to extract 512 Fourier features. The embedding is followed by a MLP which reduces the number of features to 256 and where we apply a Gelu for feature activation. These extracted features are fed into the adaptive layer normalisation layers (Perez et al., 2017) with conditioned affine parameters.

The main computing blocks of the encoding and decoding part are again ConvNeXt blocks (see also Appendix B.2), with an additional conditioning of the residual connections on the pseudo-time features. The downsampling in the encoding part remains the same as for the encoder. For both blocks, we simply replace layer normalisation by adaptive layer normalisation conditioned on the pseudo-time.

The upsampling is similar to the upsampling for the decoder (see also Appendix B.2). However, we have also skip connections, transferring information from before the downsampling to the upsampling at the same level. These features are concatenated with the interpolated features before the convolution of the upsampling is applied. Consequently, the convolution has 1.5×1.5\times1.5 × more input channels than the convolution in the upsampling of the decoder. Additionally, we again replace layer normalisation by adaptive layer normalisation conditioned on the pseudo-time.

The transformer blocks (Vaswani et al., 2017) closely resemble the default implementation of ViT transformer blocks (Dosovitskiy et al., 2021). A self-attention block with 8 heads is followed by a MLP. Throughout the transformer block, the number of channels remain the same, even in the multi-layered perceptron. We apply adaptive layer normalisation at the beginning of each residual layer in the self-attention block and MLP (Xiong et al., 2020) and additionally conditioned the residual connection on the pseudo-time. We remove land grid points before applying the transformer and add them afterwards again.

The architectures for the diffusion model in data space and in latent space read then (the number in brackets represents the number of channels):

  • Data space: Input(5) \rightarrow Point-wise convolution (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Downsampling (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Downsampling (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Downsampling (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Downsampling (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Downsampling (128) \rightarrow ConvNeXt (128) \rightarrow ConvNeXt (128) \rightarrow Downsampling (256) \rightarrow Transformer (256) \rightarrow Transformer (256) \rightarrow Transformer (256) \rightarrow Transformer (256) \rightarrow Transformer (256) \rightarrow Transformer (256) \rightarrow Transformer (256) \rightarrow Transformer (256) \rightarrow Upsampling (128) \rightarrow ConvNeXt (128) \rightarrow ConvNeXt (128) \rightarrow Upsampling (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Upsampling (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Upsampling (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Upsampling (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Upsampling (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Layer normalisation (64) \rightarrow relu (64) \rightarrow Point-wise convolution (5)

  • Latent space: Input(16) \rightarrow Point-wise convolution (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Downsampling (128) \rightarrow ConvNeXt (128) \rightarrow ConvNeXt (128) \rightarrow Downsampling (256) \rightarrow ConvNeXt (256) \rightarrow ConvNeXt (256) \rightarrow Downsampling (512) \rightarrow Transformer (512) \rightarrow Transformer (512) \rightarrow Transformer (512) \rightarrow Transformer (512) \rightarrow Transformer (512) \rightarrow Transformer (512) \rightarrow Transformer (512) \rightarrow Transformer (512) \rightarrow Upsampling (256) \rightarrow ConvNeXt (256) \rightarrow ConvNeXt (256) \rightarrow Upsampling (128) \rightarrow ConvNeXt (128) \rightarrow ConvNeXt (128) \rightarrow Upsampling (64) \rightarrow ConvNeXt (64) \rightarrow ConvNeXt (64) \rightarrow Layer normalisation (64) \rightarrow relu (64) \rightarrow Point-wise convolution (5).

Note, caused by the high dimensionality of the data, the diffusion model in data space has a high memory consumption, and we kept 64646464 channels for longer than in the latent diffusion model. The diffusion model has 15.3×10615.3superscript10615.3\times 10^{6}15.3 × 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT parameters and the latent diffusion model 13.4×10613.4superscript10613.4\times 10^{6}13.4 × 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT parameters.

Appendix C Experiments

In our experiments, we train autoencoders and diffusion models with the architectures as described in Appendix B. As we train different types of models, we designed our experiments with a unified setup to achieve a fair comparison.

We use sea-ice simulation data from the state-of-the-art sea-ice model neXtSIM (Rampal et al., 2016; Ólason et al., 2022). This sea-ice model uses an advanced brittle rheology (Girard et al., 2011; Dansereau et al., 2016; Ólason et al., 2022) which introduces a damage variable to represent subgrid-scale dynamics. This way, the model can represent sea-ice dynamics at the mesoscale resolution 10kmsimilar-to-or-equalsabsent10km\simeq 10\,\text{km}≃ 10 km as they are observed with satellites and buoys (Rampal et al., 2019; Ólason et al., 2021; Bouchat et al., 2022). In the here used simulations (Boutin et al., 2023), neXtSIM has been coupled to the ocean component OPA of the NEMO modelling framework (Madec, 2008). NeXtSIM runs on a Lagrangian mesh with adaptive remeshing, while OPA uses a curvilinear mesh, here in the regional CREG025 configuration (Talandier & Lique, 2021). Both models are run at a similar resolution of 1412kmsimilar-to-or-equalssuperscript1412km\frac{1}{4}^{\circ}\simeq 12\,\text{km}divide start_ARG 1 end_ARG start_ARG 4 end_ARG start_POSTSUPERSCRIPT ∘ end_POSTSUPERSCRIPT ≃ 12 km. The output of neXtSIM is conservatively interpolated to the curvilinear OPA mesh. NeXtSIM is driven by ERA5 atmospheric forcings (Hersbach et al., 2020). Since we use unconditional generation of fields, we neglect model output from OPA and the atmospheric forcing for the training of the neural networks.

We target to generate samples for five different variables, the sea-ice thickness, the sea-ice concentration, the two sea-ice velocities in x𝑥xitalic_x- and y𝑦yitalic_y-direction, and the sea-ice damage, a special variable introduced for brittle rheologies which represents the subgrid-scale fragmentation of sea ice. The model output for all variables is a six-hourly average, while the damage corresponds to a six-hourly instantaneous output. In correspondence with Boutin et al. (2023), we use data from 2000–2018, omitting the first five years (1995-1999) as spin-up. The first 14 years (2000-2014, 21916 samples) are used as training dataset, 2015 as validation dataset (1460 samples), and 2016–2018 for testing (4383 samples). All datasets are normalised by a per-variable global mean and standard deviation estimated over the training dataset.

Given these datasets, we optimise all models with Adam (Kingma & Ba, 2017) with a batch size of 16. The learning rate is linearly increased from 1×1061superscript1061\times 10^{-6}1 × 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT to 2×1042superscript1042\times 10^{-4}2 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT within 5×1035superscript1035\times 10^{3}5 × 10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT iterations. Afterwards a cosine scheduler is used to decrease the learning rate again up to 1×1061superscript1061\times 10^{-6}1 × 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT after 105superscript10510^{5}10 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT iterations, where the training is stopped. All models are trained without weight decay and other regularisation techniques like dropout, and we select the models that achieve the lowest validation loss. We train the models with mixed precision in bfloat16 and evaluate in single precision. All models are trained on either a NVIDIA RTX5000 GPU (24 GB) or a NVIDIA RTX6000 GPU (48 GB); to train on the RTX5000, we use a batch size of 8 and accumulate the gradient of two iterations. All models are implemented in PyTorch (Paszke et al., 2019) with PyTorch lightning (Falcon et al., 2020) and Hydra (Yadan, 2019). The code is available under https://github.com/cerea-daml/ldm_nextsim.

To train the latent diffusion models (LDMs), we use the mean prediction of the pre-trained encoder as deterministic mapping from pixel into latent space. Before training of the LDMs, we normalise the latent space by a per-channel global mean and standard deviation estimated over the training dataset. Although the LDM needs much less memory than the autoencoder and pixel diffusion model during training, we stick to a batch size of 16161616 for a fair comparison. All diffusion models use the same fixed log-signal-to-ratio bounds γmin=15subscript𝛾min15\gamma_{\mathrm{min}}=-15italic_γ start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT = - 15 and γmax=15subscript𝛾max15\gamma_{\mathrm{max}}=15italic_γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT = 15 during training and sampling. We perform the sampling with a second-order Heun sampler in 20 integration steps and the EDM sampling noise scheduler (Karras et al., 2022), extended to our bounds by truncation (Kingma & Gao, 2023).

Both, the reconstructions and the generated data are compared to the testing dataset. For the reconstructions, we perform a one-to-one comparison, while we evaluate the statistics of our generated data. In total, we hence generate 4383438343834383 samples with the diffusion models. Although a smaller number than the 50000500005000050000 samples normally used in validating image diffusion models, it results into a fair comparison to the testing dataset. In Appendix E.2, we demonstrate that the dataset size is big enough to see differences in the models.

Appendix D Metrics

In the following we introduce the metrics to evaluate our neural networks. To evaluate the autoencoders, we apply point-wise measures, while we quantify the quality of the generated fields in a transformed space. The reconstructed samples 𝐱~N×K×L~𝐱superscript𝑁𝐾𝐿\widetilde{\mathbf{x}}\in\mathbb{R}^{N\times K\times L}over~ start_ARG bold_x end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_K × italic_L end_POSTSUPERSCRIPT, and generated samples 𝐱^N×K×L^𝐱superscript𝑁𝐾𝐿\widehat{\mathbf{x}}\in\mathbb{R}^{N\times K\times L}over^ start_ARG bold_x end_ARG ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_K × italic_L end_POSTSUPERSCRIPT are compared to the samples in the testing dataset 𝐱N×K×L𝐱superscript𝑁𝐾𝐿\mathbf{x}\in\mathbb{R}^{N\times K\times L}bold_x ∈ blackboard_R start_POSTSUPERSCRIPT italic_N × italic_K × italic_L end_POSTSUPERSCRIPT for N=4383𝑁4383N=4383italic_N = 4383 samples, K=5𝐾5K=5italic_K = 5 variables, and L=127152𝐿127152L=127152italic_L = 127152 grid points without land.

D.1 Root-mean-squared error

To estimate one averaged root-mean-squared error (RMSE), we normalise the RMSE by the per-variable standard deviation σksubscript𝜎𝑘\sigma_{k}italic_σ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, globally calculated for the k𝑘kitalic_k-th variable over the full training dataset. The normalised RMSE is then calculated as

RMSE(𝐱,𝐱~)=1NKLn=1Nk=1Kl=1L(xn,k,lx~n,k,l)2(σk)2.RMSE𝐱~𝐱1𝑁𝐾𝐿superscriptsubscript𝑛1𝑁superscriptsubscript𝑘1𝐾superscriptsubscript𝑙1𝐿superscriptsubscript𝑥𝑛𝑘𝑙subscript~𝑥𝑛𝑘𝑙2superscriptsubscript𝜎𝑘2\mathrm{RMSE}(\mathbf{x},\widetilde{\mathbf{x}})=\sqrt{\frac{1}{N\cdot K\cdot L% }\sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{l=1}^{L}\frac{(x_{n,k,l}-\widetilde{x}_{n,k% ,l})^{2}}{(\sigma_{k})^{2}}}.roman_RMSE ( bold_x , over~ start_ARG bold_x end_ARG ) = square-root start_ARG divide start_ARG 1 end_ARG start_ARG italic_N ⋅ italic_K ⋅ italic_L end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT divide start_ARG ( italic_x start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT - over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ( italic_σ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_ARG . (21)

The lowest (best) possible RMSE is 00 and its maximum value is unbounded.

D.2 Structural similarity index measure

The structural similarity index measure (SSIMSSIM\mathrm{SSIM}roman_SSIM) takes information from a window into account and is hence a more fuzzy verification metric than the MAE. The SSIM for the n𝑛nitalic_n-th sample, the k𝑘kitalic_k-th variable, and the l𝑙litalic_l-th is given as

SSIM(xn,k,l,yn,k,l)=(2μn,k,l(x)μn,k,l(y))(2σn,k,l(xy)+c2,k)((μn,k,l(x))2+(μn,k,l(y))2+c1,k)((σn,k,l(x))2+(σn,k,l(y))2+c2,k).SSIMsubscript𝑥𝑛𝑘𝑙subscript𝑦𝑛𝑘𝑙2subscriptsuperscript𝜇𝑥𝑛𝑘𝑙subscriptsuperscript𝜇𝑦𝑛𝑘𝑙2subscriptsuperscript𝜎𝑥𝑦𝑛𝑘𝑙subscript𝑐2𝑘superscriptsubscriptsuperscript𝜇𝑥𝑛𝑘𝑙2superscriptsubscriptsuperscript𝜇𝑦𝑛𝑘𝑙2subscript𝑐1𝑘superscriptsubscriptsuperscript𝜎𝑥𝑛𝑘𝑙2superscriptsubscriptsuperscript𝜎𝑦𝑛𝑘𝑙2subscript𝑐2𝑘\mathrm{SSIM}(x_{n,k,l},y_{n,k,l})=\frac{\Big{(}2\mu^{(x)}_{n,k,l}\mu^{(y)}_{n% ,k,l}\Big{)}\Big{(}2\sigma^{(xy)}_{n,k,l}+c_{2,k}\Big{)}}{\Big{(}(\mu^{(x)}_{n% ,k,l})^{2}+(\mu^{(y)}_{n,k,l})^{2}+c_{1,k}\Big{)}\Big{(}(\sigma^{(x)}_{n,k,l})% ^{2}+(\sigma^{(y)}_{n,k,l})^{2}+c_{2,k}\Big{)}}.roman_SSIM ( italic_x start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT ) = divide start_ARG ( 2 italic_μ start_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT italic_μ start_POSTSUPERSCRIPT ( italic_y ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT ) ( 2 italic_σ start_POSTSUPERSCRIPT ( italic_x italic_y ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT + italic_c start_POSTSUBSCRIPT 2 , italic_k end_POSTSUBSCRIPT ) end_ARG start_ARG ( ( italic_μ start_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ( italic_μ start_POSTSUPERSCRIPT ( italic_y ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_c start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT ) ( ( italic_σ start_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ( italic_σ start_POSTSUPERSCRIPT ( italic_y ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_c start_POSTSUBSCRIPT 2 , italic_k end_POSTSUBSCRIPT ) end_ARG . (22)

The means μn,k,l(x)subscriptsuperscript𝜇𝑥𝑛𝑘𝑙\mu^{(x)}_{n,k,l}italic_μ start_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT and μn,k,l(y)subscriptsuperscript𝜇𝑦𝑛𝑘𝑙\mu^{(y)}_{n,k,l}italic_μ start_POSTSUPERSCRIPT ( italic_y ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT, variances σn,k,l(x)subscriptsuperscript𝜎𝑥𝑛𝑘𝑙\sigma^{(x)}_{n,k,l}italic_σ start_POSTSUPERSCRIPT ( italic_x ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT and σn,k,l(y)subscriptsuperscript𝜎𝑦𝑛𝑘𝑙\sigma^{(y)}_{n,k,l}italic_σ start_POSTSUPERSCRIPT ( italic_y ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT, and covariance σn,k,l(xy)subscriptsuperscript𝜎𝑥𝑦𝑛𝑘𝑙\sigma^{(xy)}_{n,k,l}italic_σ start_POSTSUPERSCRIPT ( italic_x italic_y ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT between x𝑥xitalic_x and y𝑦yitalic_y, respectively are estimated in the given 7×7777\times 77 × 7 window for each sample, variable, and grid point. The stabilisation values c1,k=(0.01rk)2subscript𝑐1𝑘superscript0.01subscript𝑟𝑘2c_{1,k}=(0.01\cdot r_{k})^{2}italic_c start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT = ( 0.01 ⋅ italic_r start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT and c2=(0.03rk)2subscript𝑐2superscript0.03subscript𝑟𝑘2c_{2}=(0.03\cdot r_{k})^{2}italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = ( 0.03 ⋅ italic_r start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT depend on the dynamic range for the k𝑘kitalic_k variable with rk=max(𝐱k)min(𝐱k)subscript𝑟𝑘𝑚𝑎𝑥subscript𝐱𝑘𝑚𝑖𝑛subscript𝐱𝑘r_{k}=max(\mathbf{x}_{k})-min(\mathbf{x}_{k})italic_r start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_m italic_a italic_x ( bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - italic_m italic_i italic_n ( bold_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) as the difference between the maximum and minimum in the training dataset.

The global score for the SSIM is the average of Eq. (22) across all samples, variables, and grid points

SSIM(𝐱,𝐱~)=1NKLn=1Nk=1Kl=1LSSIM(xn,k,l,x~n,k,l).SSIM𝐱~𝐱1𝑁𝐾𝐿superscriptsubscript𝑛1𝑁superscriptsubscript𝑘1𝐾superscriptsubscript𝑙1𝐿SSIMsubscript𝑥𝑛𝑘𝑙subscript~𝑥𝑛𝑘𝑙\mathrm{SSIM}(\mathbf{x},\widetilde{\mathbf{x}})=\frac{1}{N\cdot K\cdot L}\sum% _{n=1}^{N}\sum_{k=1}^{K}\sum_{l=1}^{L}\mathrm{SSIM}(x_{n,k,l},\widetilde{x}_{n% ,k,l}).roman_SSIM ( bold_x , over~ start_ARG bold_x end_ARG ) = divide start_ARG 1 end_ARG start_ARG italic_N ⋅ italic_K ⋅ italic_L end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT roman_SSIM ( italic_x start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT , over~ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_n , italic_k , italic_l end_POSTSUBSCRIPT ) . (23)

The highest possible SSIM is 1111 and the lowest 00.

D.3 Sea-ice extent accuracy

By applying diffusion models in a latent space, we can incorporate prior knowledge into the autoencoder which maps to and from the latent space. To determine the need of incorporating this knowledge into the autoencoder, we evaluate the accuracy of reconstructing where the sea ice is. While the accuracy might be unneeded for data generation, it can be a necessity for different tasks like surrogate modelling.

In neXtSIM, sea-ice processes are activated or deactivated based on the sea-ice thickness, and we estimate the sea-ice extent based on the sea-ice thickness. Choosing a small threshold value of SIT𝗆𝗂𝗇=0.01m𝑆𝐼subscript𝑇𝗆𝗂𝗇0.01mSIT_{\mathsf{min}}=0.01\,\text{m}italic_S italic_I italic_T start_POSTSUBSCRIPT sansserif_min end_POSTSUBSCRIPT = 0.01 m, we classify values above this threshold as sea ice and below as no ice (similar results can be achieved by using the sea-ice concentration). Based on this classification, we estimate the average accuracy across all samples and grid points. The accuracy is the sum of the true positive and true negative divided by the total number of cases. This accuracy is the same as 1IIEE1IIEE1-\mathrm{IIEE}1 - roman_IIEE, where IIEEIIEE\mathrm{IIEE}roman_IIEE is the integrated ice edge error (Goessling et al., 2016) estimated based on the sea-ice thickness instead of the sea-ice concentration. The highest possible accuracy is 1111 and the lowest 00.

D.4 Fréchet distance in latent space

If we generate data, we have no one-to-one correspondence between samples from the testing dataset and generated samples. To evaluate the generated statistics, we employ a latent space spanned by an independently trained variational autoencoder. The variational autoencoder follows the general structure of our trained autoencoders with β=1𝛽1\beta=1italic_β = 1. As the evidence lower bound is maximised, we can expect that this variational autoencoder encodes spatial and semantic information into the latent space. Additionally, the KL-divergence DKL(q(𝐳x𝐱)𝒩(𝟎,𝐈))subscript𝐷𝐾𝐿conditional𝑞conditionalsubscript𝐳𝑥𝐱𝒩0𝐈D_{KL}\Big{(}q(\mathbf{z}_{x}\mid\mathbf{x})\|\mathcal{N}\big{(}\mathbf{0},% \mathbf{I}\big{)}\Big{)}italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_q ( bold_z start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT ∣ bold_x ) ∥ caligraphic_N ( bold_0 , bold_I ) ) regularises the latent space towards an isotropic Gaussian distribution. Hence, we assume that the latent space is distributed with an isotropic Gaussian.

To emulate how image generators are validated with the Fréchet Inception distance (FID, Heusel et al. (2017)), we employ the Fréchet distance in latent space and call this Fréchet autoencoder distance (FAEDFAED\mathrm{FAED}roman_FAED). Specifically, we use the mean prediction of the encoding part enc(𝐱)enc𝐱\mathrm{enc}(\mathbf{x})roman_enc ( bold_x ) as deterministic mapping into latent space. The Fréchet autoencoder distance reads then

FAED(𝐱,𝐱~)=μ(enc(𝐱))μ(enc(𝐱~))22+Tr(Σ(enc(𝐱))+Σ(enc(𝐱~))2(Σ(enc(𝐱))Σ(enc(𝐱~)))12),FAED𝐱~𝐱superscriptsubscriptdelimited-∥∥𝜇enc𝐱𝜇enc~𝐱22TrΣenc𝐱Σenc~𝐱2superscriptΣenc𝐱Σenc~𝐱12\mathrm{FAED}(\mathbf{x},\widetilde{\mathbf{x}})=\lVert\mu(\mathrm{enc}(% \mathbf{x}))-\mu(\mathrm{enc}(\widetilde{\mathbf{x}}))\rVert_{2}^{2}+\mathrm{% Tr}\Bigg{(}\Sigma(\mathrm{enc}(\mathbf{x}))+\Sigma(\mathrm{enc}(\widetilde{% \mathbf{x}}))-2\Big{(}\Sigma(\mathrm{enc}(\mathbf{x}))\cdot\Sigma(\mathrm{enc}% (\widetilde{\mathbf{x}}))\Big{)}^{\frac{1}{2}}\Bigg{)},roman_FAED ( bold_x , over~ start_ARG bold_x end_ARG ) = ∥ italic_μ ( roman_enc ( bold_x ) ) - italic_μ ( roman_enc ( over~ start_ARG bold_x end_ARG ) ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + roman_Tr ( roman_Σ ( roman_enc ( bold_x ) ) + roman_Σ ( roman_enc ( over~ start_ARG bold_x end_ARG ) ) - 2 ( roman_Σ ( roman_enc ( bold_x ) ) ⋅ roman_Σ ( roman_enc ( over~ start_ARG bold_x end_ARG ) ) ) start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT ) , (24)

where μ()𝜇\mu(\cdot)italic_μ ( ⋅ ) is the point-wise sample mean, Σ()Σ\Sigma(\cdot)roman_Σ ( ⋅ ) the sample covariance, Tr()Tr\mathrm{Tr}(\cdot)roman_Tr ( ⋅ ) the trace of a matrix, and ()12superscript12(\cdot)^{\frac{1}{2}}( ⋅ ) start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT the matrix square-root. The isotropic Gaussian assumption in latent space allows us to further simplify the estimated covariances matrix to variances, and the Fréchet autoencoder distance reduces to

FAED(𝐱,𝐱~)=μ(enc(𝐱))μ(enc(𝐱~))22+σ(enc(𝐱))σ(enc(𝐱~))22,FAED𝐱~𝐱superscriptsubscriptdelimited-∥∥𝜇enc𝐱𝜇enc~𝐱22superscriptsubscriptdelimited-∥∥𝜎enc𝐱𝜎enc~𝐱22\mathrm{FAED}(\mathbf{x},\widetilde{\mathbf{x}})=\lVert\mu(\mathrm{enc}(% \mathbf{x}))-\mu(\mathrm{enc}(\widetilde{\mathbf{x}}))\rVert_{2}^{2}+\lVert% \sigma(\mathrm{enc}(\mathbf{x}))-\sigma(\mathrm{enc}(\widetilde{\mathbf{x}}))% \rVert_{2}^{2},roman_FAED ( bold_x , over~ start_ARG bold_x end_ARG ) = ∥ italic_μ ( roman_enc ( bold_x ) ) - italic_μ ( roman_enc ( over~ start_ARG bold_x end_ARG ) ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + ∥ italic_σ ( roman_enc ( bold_x ) ) - italic_σ ( roman_enc ( over~ start_ARG bold_x end_ARG ) ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (25)

with σ()𝜎\sigma(\cdot)italic_σ ( ⋅ ) as point-wise standard deviation. The minimum (best) value of the Fréchet autoencoder distance is 00 while its maximum value is unbounded.

D.5 Root-mean-squared error of the sea-ice extent

To evaluate the climatological consistency of the generated samples for the sea-ice extent, we can estimate the probability p~lsubscript~𝑝𝑙\widetilde{p}_{l}over~ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT that the l𝑙litalic_l-th grid point is covered by sea ice. To classify for the n𝑛nitalic_n-th sample if a grid point is covered, we again apply the threshold of SITmin=0.01msubscriptSITmin0.01m\mathrm{SIT}_{\mathrm{min}}=0.01\,\text{m}roman_SIT start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT = 0.01 m. The probability is given as

p~l=1Nn=1NI(SIT~n,lSITmin)subscript~𝑝𝑙1𝑁subscriptsuperscript𝑁𝑛1Isubscript~SIT𝑛𝑙subscriptSITmin\widetilde{p}_{l}=\frac{1}{N}\sum^{N}_{n=1}\mathrm{I}(\widetilde{\mathrm{SIT}}% _{n,l}\geq\mathrm{SIT}_{\mathrm{min}})over~ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT roman_I ( over~ start_ARG roman_SIT end_ARG start_POSTSUBSCRIPT italic_n , italic_l end_POSTSUBSCRIPT ≥ roman_SIT start_POSTSUBSCRIPT roman_min end_POSTSUBSCRIPT ) (26)

with I()I\mathrm{I}(\cdot)roman_I ( ⋅ ) as indicator function and SIT~n,lsubscript~SIT𝑛𝑙\widetilde{\mathrm{SIT}}_{n,l}over~ start_ARG roman_SIT end_ARG start_POSTSUBSCRIPT italic_n , italic_l end_POSTSUBSCRIPT the sea-ice thickness of the n𝑛nitalic_n-th generated sample for the l𝑙litalic_l-th grid point.

The root-mean-squared error (RMSE) in the sea-ice extent reads then

RMSESIE(𝐱,𝐱~)=1Ll=1L(plp~l)2subscriptRMSESIE𝐱~𝐱1𝐿subscriptsuperscript𝐿𝑙1superscriptsubscript𝑝𝑙subscript~𝑝𝑙2\mathrm{RMSE}_{\mathrm{SIE}}(\mathbf{x},\widetilde{\mathbf{x}})=\sqrt{\frac{1}% {L}\sum^{L}_{l=1}(p_{l}-\widetilde{p}_{l})^{2}}roman_RMSE start_POSTSUBSCRIPT roman_SIE end_POSTSUBSCRIPT ( bold_x , over~ start_ARG bold_x end_ARG ) = square-root start_ARG divide start_ARG 1 end_ARG start_ARG italic_L end_ARG ∑ start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT ( italic_p start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT - over~ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG (27)

with plsubscript𝑝𝑙p_{l}italic_p start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT as probability in the testing dataset. The minimum (best) RMSE is 00, while the worst possible RMSE is 1111.

Appendix E Additional results

In the Appendix, we present additional results that signify the results we have found. We will show how the hyperparameters influence the autoencoder reconstruction, how censoring helps to improve the sea-ice extent representation, and how the diffusion model generalises across experiments.

E.1 Ablation for autoencoder

In our formulation for the pre-training of the autoencoder, the autoencoder has several hyperparameters. One of the them is the number of channels Nlatentsubscript𝑁latentN_{\text{latent}}italic_N start_POSTSUBSCRIPT latent end_POSTSUBSCRIPT in latent space. The more channels, the more information can be stored in the latent space. However, more channels can make the diffusion model more difficult to train. Another hyperparameter is the factor β𝛽\betaitalic_β which determines the strength of the regularisation in latent space. The larger the factor, the smoother the latent space, but also the higher the semantic compression therein. Throughout the main manuscript, we have made the decision of Nlatent=16subscript𝑁latent16N_{\text{latent}}=16italic_N start_POSTSUBSCRIPT latent end_POSTSUBSCRIPT = 16 and β=103𝛽superscript103\beta=10^{-3}italic_β = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT to strike the right balance. In Tab. 3, we show an ablation study on these factors. Different from the experiments in the main manuscript, these experiments were performed with NVIDIA A100 GPUs on the Jean-Zay supercomputing facilities, provided by GENCI.

Table 3: Evaluation of the autoencoders for reconstructing the testing dataset with the normalised root-mean-squared error (RMSE), the structural similarity index measure (SSIM), and the accuracy in the sea-ice extent (ACCSIEsubscriptACCSIE\text{ACC}_{\text{SIE}}ACC start_POSTSUBSCRIPT SIE end_POSTSUBSCRIPT). Nlatentsubscript𝑁latentN_{\text{latent}}italic_N start_POSTSUBSCRIPT latent end_POSTSUBSCRIPT is the number of channels in the latent space and β𝛽\betaitalic_β is the regularisation factor. The two rows marked in bold are used in the main manuscript. Note, the numbers can be different than in Tab. 1 as these autoencoders are trained independently on another computing system.
Nlatentsubscript𝑁latentN_{\text{latent}}italic_N start_POSTSUBSCRIPT latent end_POSTSUBSCRIPT β𝛽\betaitalic_β RMSE SSIM ACCSIEsubscriptACCSIE\text{ACC}_{\text{SIE}}ACC start_POSTSUBSCRIPT SIE end_POSTSUBSCRIPT
8 103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT 0.094 0.916 0.982
16 𝟏𝟎𝟑superscript103\mathbf{10^{-3}}bold_10 start_POSTSUPERSCRIPT - bold_3 end_POSTSUPERSCRIPT 0.076 0.937 0.981
32 103superscript10310^{-3}10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT 0.061 0.953 0.984
16 102superscript10210^{-2}10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT 0.077 0.935 0.971
16 104superscript10410^{-4}10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT 0.075 0.939 0.984
16 𝟏𝟎𝟑superscript103\mathbf{10^{-3}}bold_10 start_POSTSUPERSCRIPT - bold_3 end_POSTSUPERSCRIPT censored 0.078 0.942 0.992

With our chosen hyperparameters, the autoencoder results into a 16.416.416.416.4-fold compression, which is lower than the 20202020-fold compression from data dimensions only as the dimensionality of the masking is also reduced. Altering this compression rate by changing the number of channels in latent space has a larger impact on the RMSE and the SSIM than the regularisation factor.

Refer to caption

Figure 3: Estimated spectral density of the sea-ice thickness in the central Arctic for the testing dataset (black), and three different hyperparameter configurations for the autoencoder.

In Fig. 3, we additionally show the energy spectrum for the sea-ice thickness if we increase the number of channels or lower the regularisation factor. Once again, changing the regularisation has almost no impact on the averaged spectrum, while a larger latent space can seemingly retain more small-scale features. Hence, if the regularisation is small enough, the compression rate determines how much small-scale features are retained through the latent space. To reduce the smoothing, we can reduce the compression rate. Since the smoothing is a result of double penalty effects caused by the point-wise comparison in the reconstruction loss, we however need other tools to completely remedy the smoothing. Furthermore, by having more channels in latent space, the latent diffusion model can become more difficult to optimise (Rombach et al., 2022).

E.2 Influence of the generated dataset size on the diffusion scores

To evaluate our diffusion models, we use 4383438343834383 samples, while the recommendation to evaluate generative models for images is 50000500005000050000 samples. In Table 4, we compare the performance of the same latent diffusion model without censoring if we alter the random seed for the generation of the dataset.

Table 4: Evaluation of the statistics for samples generated by the latent diffusion model without censoring compared to the testing dataset with the same size.
Seed FAEDFAED\mathrm{FAED}roman_FAED \downarrow RMSESIEsubscriptRMSESIE\text{RMSE}_{\text{SIE}}RMSE start_POSTSUBSCRIPT SIE end_POSTSUBSCRIPT \downarrow
0 1.65 11.56
10 1.66 11.79
11 1.68 11.39
12 1.66 11.76
13 1.64 11.36

While there are some small differences between different seeds, these differences are smaller than the differences shown in Tab. 2. Based on this analysis, we can conclude that the dataset size is large enough to evaluate differences between diffusion models with the proposed scores.

Diffusion models tend to have a high volatility in the scores during training (Song & Ermon, 2020). Normally used to stabilise the diffusion model, we restrain from applying exponential moving averages to estimate the scores of the diffusion model. On the one hand, this allows us to establish the performance of diffusion models without such tricks. On the other hand, the differences between the models for the FAEDFAED\mathrm{FAED}roman_FAED in Tab. 2 might be due to the high volatility.

E.3 Clipping of the generative diffusion model without censored distributions

To avoid unphysical samples during generation, the values from the denoiser are commonly clipped into either fixed (e.g., Ho et al., 2020) or dynamical bounds (Saharia et al., 2022b). However, as shown in Appendix A.1, generative diffusion is built around the assumption of unbounded Gaussian diffusion, and the clipping would violate the Gaussian assumption, which has been applied during training. This could introduce biases into the generation.

In Table 5, we show the results of such a clipped diffusion model in addition to the results of the diffusion models as presented in the main manuscript. For this clipped experiment, we use the pre-trained data-space diffusion model and clip the output of its denoiser into the physical bounds (SIT[0,)SIT0\text{SIT}\in[0,\infty)SIT ∈ [ 0 , ∞ ), SIC[0,1]SIC01\text{SIC}\in[0,1]SIC ∈ [ 0 , 1 ], and SID[0,1]SID01\text{SID}\in[0,1]SID ∈ [ 0 , 1 ]). Additionally, we show what happens if we clip the output of the latent diffusion model (LDM) trained without the censored distribution.

Table 5: Comparison of the diffusion models tested in Sec. 3 with diffusion model where the output of the denoiser is clipped into physical bounds or a latent diffusion model where the output of the autoencoder is clipped without a censored distribution.
Model FAEDFAED\mathrm{FAED}roman_FAED \downarrow RMSESIEsubscriptRMSESIE\text{RMSE}_{\text{SIE}}RMSE start_POSTSUBSCRIPT SIE end_POSTSUBSCRIPT \downarrow
Validation 2.38 7.02
Diffusion 1.73 9.66
Diffusion (denoiser clipped) 1.76 12.82
LDM 1.65 11.56
LDM (censored) 1.79 7.32
LDM (output clipped) 1.81 11.04

Including a clipping operator into the denoiser that defines the diffusion model in data space has a negative impact on the physical representation of the sea ice, the RMSE in the sea-ice extent is increased compared to the unclipped diffusion model. While this technique is efficient in image generation, it introduces a bias into the denoiser, which seemingly hurts its performance in our sea-ice generator. Because of biasing the diffusion model, we suspect that clipping introduces a degradation in the cross-correlation between variables, which can have a higher impact on geophysical applications than on image generation.

Clipping the output of a LDM trained with a Gaussian distribution has a neutral impact on the physical representation. Since the model is trained without accounting for clipping during training, the model still produces spurious values for target values where the bounds are reached, only values exceeding the bounds are clipped. For the targets that lay on the bounds, the censored distribution, introduced in Appendix A.2, pushes the output of the LDM towards an extreme exceedance of the bounds, increasing the probability that the values is clipped into the bounds. Hence, censored distributions can unlock an effective incorporation of physical bounds into the training of deep learning algorithms.

This results suggests that we need special methods to properly treat physical bounds in diffusion models that work in data space. We can define a log-barrier that pushes the diffusion process away from the bounds (Fishman et al., 2023). Additionally, we can define a reflected diffusion process, where the Brownian motion is reflected at the bounds (Lou & Ermon, 2023). However, with both methods, we would never generate values that exactly on the bounds. If the result from the diffusion model is further used in downstream applications, e.g., by cycling in a surrogate model (Price et al., 2024; Finn et al., 2024), such small errors can amplify and make the pipeline unstable. Contrastingly, censoring of the autoencoder provides a clean and systematic way to incorporate bounds so that the generated data can lie exactly on the bounds.

E.4 Censoring

In Table 2, we show that censoring helps to improve the representation of the sea-ice extent and reduces the RMSE of the sea-ice extent. Here, we disentangle a bit how we obtain this improvement.

In Fig. 4, we show the binarized probability that a grid point is covered by sea ice in the generated dataset dependent on the probability in the testing dataset. The optimal probability would be on the one-to-one line, reaching for each grid point the same probability in the generated dataset.

Refer to caption

Figure 4: Comparison of the probabilities that a grid point is covered by sea-ice as seen in the testing dataset or as predicted by the validation dataset (gray), the diffusion model in data space (red, Diff), the latent diffusion model (blue, LDM), and the latent diffusion model with censoring (yellow). The shown probabilities represent the averaged predicted probabilities, grouped into 5%percent55\,\%5 % intervals based on the test probabilities.

In the validation dataset, the probabilities are slightly overestimated. The difference is likely caused by the different dataset size (one year in validation and three years in testing). As for the validation dataset, the diffusion model in data space and the latent diffusion model both overestimate the probabilities. Since this overestimation is larger than for the validation dataset, it is likely because of the failure to represent the sea-ice extent correctly. In contrast to the other diffusion models, the diffusion model with censoring tends to underestimate the probabilities. The RMSE in Tab. 2 is reduced as the underestimation for the censored model is smaller than the overestimation for the other diffusion models. Consequently, we have established here that censoring helps to alleviate the overestimation bias of the diffusion models. Its further use for other downstream tasks remains to be seen.

E.5 Uncurated data samples

In Fig. 5, we show uncurated samples from our diffusion models and reference samples from neXtSIM. As discussed in the main manuscript and shown in Fig. 1, the latent diffusion models result into smoothed fields compared to the diffusion model in data space and neXtSIM.

Refer to caption

Figure 5: Uncurated samples from the diffusion model in data space (a–d), the latent diffusion model without censoring (e–h), and the latent diffusion model with censoring (i–l), and the neXtSIM simulations. As in Fig. 1, the thickness and concentration are directly generated, while the speed and deformation are derived from the generated velocities. Best to view in digital and colour.

While the derived deformation field of the diffusion model (d) shows small-scale structures as they can be seen in neXtSIM (p), the thickness and concentration remain too noisy. One possible explanation could be that the velocities are easier to generate as they are continuous, whereas the sea-ice thickness can be rather represented by discrete-continuous behaviour. Since we used a rather low number of integration steps (20) without much tuning, we could expect that such fine-scale structure can be better represented with a better tuned sampling and/or with a diffusion model trained on more data.

Appendix F Changes to the Version 1

  • We added the comparison in speed to Table 2 indicating that latent diffusion models are indeed much faster than diffusion models in data space.

  • We strengthened the proof-of-concept character of the study, indicating that we see this work as a first step towards the training of generative diffusion for e.g., Arctic-wide surrogate models.

  • We added Appendix E.3 with additional results if we add clipping to the denoiser in data space. While in image generation, this clipping is very important, we find a negative impact on the physical representation in the diffusion model. We additionally provide insides why a censored distribution helps to improve the physical representation.

  • We added the GitHub link to the code.