Score-Based Generative Modeling through Stochastic Differential Equations (Appendices E–I)

E REVERSE DIFFUSION SAMPLING

Given a forward SDE

$$\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + \mathbf{G}(t)\,\mathrm{d}\mathbf{w},$$

and suppose the following iteration rule is a discretization of it:

$$\mathbf{x}_{i+1} = \mathbf{x}_i + \mathbf{f}_i(\mathbf{x}_i) + \mathbf{G}_i \mathbf{z}_i, \quad i = 0, 1, \cdots, N-1, \tag{45}$$

where $\mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Here we assume the discretization schedule of time is fixed beforehand, and thus we can absorb it into the notations of $\mathbf{f}_i$ and $\mathbf{G}_i$.

Based on Eq. (45), we propose to discretize the reverse-time SDE

$$\mathrm{d}\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - \mathbf{G}(t)\mathbf{G}(t)^{\top}\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]\mathrm{d}t + \mathbf{G}(t)\,\mathrm{d}\bar{\mathbf{w}},$$

with a similar functional form, which gives the following iteration rule for $i \in \{0, 1, \cdots, N-1\}$:

$$\mathbf{x}_i = \mathbf{x}_{i+1} - \mathbf{f}_{i+1}(\mathbf{x}_{i+1}) + \mathbf{G}_{i+1}\mathbf{G}_{i+1}^{\top}\mathbf{s}_{\theta^*}(\mathbf{x}_{i+1}, i+1) + \mathbf{G}_{i+1}\mathbf{z}_{i+1}, \tag{46}$$

where our trained score-based model $\mathbf{s}_{\theta^*}(\mathbf{x}_i, i)$ is conditioned on the iteration number $i$.

When applying Eq. (46) to Eqs. (10) and (20), we obtain a new set of numerical solvers for the reverse-time VE and VP SDEs, resulting in sampling algorithms as shown in the “predictor” part of Algorithms 2 and 3. We name these sampling methods (that are based on the discretization strategy in Eq. (46)) reverse diffusion samplers.
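
As a concrete illustration, the following is a minimal NumPy sketch of the reverse diffusion predictor of Eq. (46) for the VE and VP discretizations. The `score_fn` argument stands in for the trained model $\mathbf{s}_{\theta^*}$ and is an assumed interface, not code from the released implementation.

```python
import numpy as np

def reverse_diffusion_predictor_ve(x, i, sigmas, score_fn):
    """One VE predictor step of Eq. (46): f_i = 0, G_i = sqrt(sigma_{i+1}^2 - sigma_i^2) I."""
    var_diff = sigmas[i + 1] ** 2 - sigmas[i] ** 2          # G_{i+1} G_{i+1}^T (a scalar here)
    z = np.random.randn(*x.shape)
    return x + var_diff * score_fn(x, i + 1) + np.sqrt(var_diff) * z

def reverse_diffusion_predictor_vp(x, i, betas, score_fn):
    """One VP predictor step of Eq. (46): f_i(x) = (sqrt(1 - beta_{i+1}) - 1) x, G_i = sqrt(beta_{i+1}) I."""
    b = betas[i + 1]
    z = np.random.randn(*x.shape)
    x_mean = (2.0 - np.sqrt(1.0 - b)) * x + b * score_fn(x, i + 1)
    return x_mean + np.sqrt(b) * z
```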

As expected, the ancestral sampling of DDPM (Ho et al., 2020) (Eq. (4)) matches its reverse diffusion counterpart when $\beta_i \rightarrow 0$ for all $i$ (which happens when $\Delta t \rightarrow 0$, since $\beta_i = \tilde{\beta}_i \Delta t$; see Appendix B),

because

$$\begin{aligned}
\mathbf{x}_i &= \frac{1}{\sqrt{1 - \beta_{i+1}}}\big(\mathbf{x}_{i+1} + \beta_{i+1}\mathbf{s}_{\theta^*}(\mathbf{x}_{i+1}, i+1)\big) + \sqrt{\beta_{i+1}}\,\mathbf{z}_{i+1} \\
&= \Big(1 + \tfrac{1}{2}\beta_{i+1} + o(\beta_{i+1})\Big)\big(\mathbf{x}_{i+1} + \beta_{i+1}\mathbf{s}_{\theta^*}(\mathbf{x}_{i+1}, i+1)\big) + \sqrt{\beta_{i+1}}\,\mathbf{z}_{i+1} \\
&\approx \Big(1 + \tfrac{1}{2}\beta_{i+1}\Big)\big(\mathbf{x}_{i+1} + \beta_{i+1}\mathbf{s}_{\theta^*}(\mathbf{x}_{i+1}, i+1)\big) + \sqrt{\beta_{i+1}}\,\mathbf{z}_{i+1} \\
&= \Big(1 + \tfrac{1}{2}\beta_{i+1}\Big)\mathbf{x}_{i+1} + \beta_{i+1}\mathbf{s}_{\theta^*}(\mathbf{x}_{i+1}, i+1) + \tfrac{1}{2}\beta_{i+1}^2\mathbf{s}_{\theta^*}(\mathbf{x}_{i+1}, i+1) + \sqrt{\beta_{i+1}}\,\mathbf{z}_{i+1} \\
&\approx \Big(1 + \tfrac{1}{2}\beta_{i+1}\Big)\mathbf{x}_{i+1} + \beta_{i+1}\mathbf{s}_{\theta^*}(\mathbf{x}_{i+1}, i+1) + \sqrt{\beta_{i+1}}\,\mathbf{z}_{i+1} \\
&= \Big[2 - \Big(1 - \tfrac{1}{2}\beta_{i+1}\Big)\Big]\mathbf{x}_{i+1} + \beta_{i+1}\mathbf{s}_{\theta^*}(\mathbf{x}_{i+1}, i+1) + \sqrt{\beta_{i+1}}\,\mathbf{z}_{i+1} \\
&\approx \Big[2 - \Big(1 - \tfrac{1}{2}\beta_{i+1} + o(\beta_{i+1})\Big)\Big]\mathbf{x}_{i+1} + \beta_{i+1}\mathbf{s}_{\theta^*}(\mathbf{x}_{i+1}, i+1) + \sqrt{\beta_{i+1}}\,\mathbf{z}_{i+1} \\
&= \big(2 - \sqrt{1 - \beta_{i+1}}\big)\mathbf{x}_{i+1} + \beta_{i+1}\mathbf{s}_{\theta^*}(\mathbf{x}_{i+1}, i+1) + \sqrt{\beta_{i+1}}\,\mathbf{z}_{i+1}.
\end{aligned}$$

Therefore, the original ancestral sampler of Eq. (4) is essentially a different discretization of the same reverse-time SDE. This unifies the sampling method in Ho et al. (2020) as a numerical solver of the reverse-time VP SDE in our continuous framework.
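
This agreement can also be checked numerically: with the same input, score value, and noise, the ancestral update and the reverse diffusion update differ only by terms of order $\beta^2$. A self-contained sketch (with made-up vectors standing in for $\mathbf{x}_{i+1}$, the score, and $\mathbf{z}_{i+1}$):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)      # stand-in for x_{i+1}
s = rng.standard_normal(4)      # stand-in for s_theta*(x_{i+1}, i+1)
z = rng.standard_normal(4)      # shared Gaussian noise for both updates

for beta in [1e-1, 1e-2, 1e-3]:
    ancestral = (x + beta * s) / np.sqrt(1.0 - beta) + np.sqrt(beta) * z
    reverse_diff = (2.0 - np.sqrt(1.0 - beta)) * x + beta * s + np.sqrt(beta) * z
    gap = np.max(np.abs(ancestral - reverse_diff))
    print(f"beta={beta:g}: max difference {gap:.2e} (shrinks roughly as beta^2)")
```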

F ANCESTRAL SAMPLING FOR SMLD MODELS

The ancestral sampling method for DDPM models can also be adapted to SMLD models. Consider a sequence of noise scales $\sigma_1 < \sigma_2 < \cdots < \sigma_N$ as in SMLD. By perturbing a data point $\mathbf{x}_0$ with these noise scales sequentially, we obtain a Markov chain $\mathbf{x}_0 \rightarrow \mathbf{x}_1 \rightarrow \cdots \rightarrow \mathbf{x}_N$, where

$$p(\mathbf{x}_i \mid \mathbf{x}_{i-1}) = \mathcal{N}\big(\mathbf{x}_i; \mathbf{x}_{i-1}, (\sigma_i^2 - \sigma_{i-1}^2)\mathbf{I}\big), \quad i = 1, 2, \cdots, N.$$

Here we assume $\sigma_0 = 0$ to simplify notation. Following Ho et al. (2020), we can compute

$$q(\mathbf{x}_{i-1} \mid \mathbf{x}_i, \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_{i-1}; \frac{\sigma_{i-1}^2}{\sigma_i^2}\mathbf{x}_i + \Big(1 - \frac{\sigma_{i-1}^2}{\sigma_i^2}\Big)\mathbf{x}_0, \frac{\sigma_{i-1}^2(\sigma_i^2 - \sigma_{i-1}^2)}{\sigma_i^2}\mathbf{I}\right).$$

If we parameterize the reverse transition kernel as $p_\theta(\mathbf{x}_{i-1} \mid \mathbf{x}_i) = \mathcal{N}\big(\mathbf{x}_{i-1}; \mu_\theta(\mathbf{x}_i, i), \tau_i^2\mathbf{I}\big)$, then

$$\begin{aligned}
L_{t-1} &= \mathbb{E}_q\big[D_{\mathrm{KL}}\big(q(\mathbf{x}_{i-1} \mid \mathbf{x}_i, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{i-1} \mid \mathbf{x}_i)\big)\big] \\
&= \mathbb{E}_q\left[\frac{1}{2\tau_i^2}\left\|\frac{\sigma_{i-1}^2}{\sigma_i^2}\mathbf{x}_i + \Big(1 - \frac{\sigma_{i-1}^2}{\sigma_i^2}\Big)\mathbf{x}_0 - \mu_\theta(\mathbf{x}_i, i)\right\|_2^2\right] + C \\
&= \mathbb{E}_{\mathbf{x}_0, \mathbf{z}}\left[\frac{1}{2\tau_i^2}\left\|\mathbf{x}_i(\mathbf{x}_0, \mathbf{z}) - \frac{\sigma_i^2 - \sigma_{i-1}^2}{\sigma_i}\mathbf{z} - \mu_\theta(\mathbf{x}_i(\mathbf{x}_0, \mathbf{z}), i)\right\|_2^2\right] + C,
\end{aligned}$$

where $L_{t-1}$ is one representative term in the ELBO objective (see Eq. (8) in Ho et al. (2020)), $C$ is a constant that does not depend on $\boldsymbol{\theta}$, $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and $\mathbf{x}_i(\mathbf{x}_0, \mathbf{z}) = \mathbf{x}_0 + \sigma_i\mathbf{z}$. We can therefore parameterize $\mu_\theta(\mathbf{x}_i, i)$ via

$$\mu_\theta(\mathbf{x}_i, i) = \mathbf{x}_i + (\sigma_i^2 - \sigma_{i-1}^2)\,\mathbf{s}_\theta(\mathbf{x}_i, i),$$

where $\mathbf{s}_\theta(\mathbf{x}_i, i)$ estimates $-\mathbf{z}/\sigma_i$. As in Ho et al. (2020), we set $\tau_i = \sqrt{\frac{\sigma_{i-1}^2(\sigma_i^2 - \sigma_{i-1}^2)}{\sigma_i^2}}$. Through ancestral sampling on $\prod_{i=1}^N p_\theta(\mathbf{x}_{i-1} \mid \mathbf{x}_i)$, we obtain the following iteration rule

$$\mathbf{x}_{i-1} = \mathbf{x}_i + (\sigma_i^2 - \sigma_{i-1}^2)\,\mathbf{s}_{\theta^*}(\mathbf{x}_i, i) + \sqrt{\frac{\sigma_{i-1}^2(\sigma_i^2 - \sigma_{i-1}^2)}{\sigma_i^2}}\,\mathbf{z}_i, \quad i = 1, 2, \cdots, N, \tag{47}$$

where $\mathbf{x}_N \sim \mathcal{N}(\mathbf{0}, \sigma_N^2\mathbf{I})$, $\boldsymbol{\theta}^*$ denotes the optimal parameters of $\mathbf{s}_\theta$, and $\mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. We call Eq. (47) the ancestral sampling method for SMLD models.
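
A minimal sketch of the iteration rule in Eq. (47), again with `score_fn` as an assumed stand-in for $\mathbf{s}_{\theta^*}$:

```python
import numpy as np

def smld_ancestral_step(x, i, sigmas, score_fn):
    """One step of Eq. (47); sigmas[0] = 0 by the convention sigma_0 = 0."""
    var_diff = sigmas[i] ** 2 - sigmas[i - 1] ** 2
    tau = np.sqrt(var_diff * sigmas[i - 1] ** 2 / sigmas[i] ** 2)
    z = np.random.randn(*x.shape)
    return x + var_diff * score_fn(x, i) + tau * z

def smld_ancestral_sampler(shape, sigmas, score_fn):
    """Run x_N ~ N(0, sigma_N^2 I) down to x_0 (without the final denoising step)."""
    N = len(sigmas) - 1
    x = sigmas[N] * np.random.randn(*shape)
    for i in range(N, 0, -1):
        x = smld_ancestral_step(x, i, sigmas, score_fn)
    return x
```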

Algorithm 1 Predictor-Corrector (PC) sampling

Require:

  • $N$: number of discretization steps for the reverse-time SDE
  • $M$: number of corrector steps

1: Initialize $\mathbf{x}_N \sim p_T(\mathbf{x})$
2: for $i = N-1$ to $0$ do
3:   $\mathbf{x}_i \leftarrow \text{Predictor}(\mathbf{x}_{i+1})$
4:   for $j = 1$ to $M$ do
5:     $\mathbf{x}_i \leftarrow \text{Corrector}(\mathbf{x}_i)$
6: return $\mathbf{x}_0$

G PREDICTOR-CORRECTOR SAMPLERS

Predictor-Corrector (PC) sampling

The predictor can be any numerical solver for the reverse-time SDE with a fixed discretization strategy. The corrector can be any score-based MCMC approach. In PC sampling, we alternate between the predictor and corrector, as described in Algorithm 1. For example, when using the reverse diffusion SDE solver (Appendix E) as the predictor, and annealed Langevin dynamics (Song & Ermon, 2019) as the corrector, we have Algorithms 2 and 3 for VE and VP SDEs respectively, where $\{\epsilon_i\}_{i=0}^{N-1}$ are step sizes for Langevin dynamics, as specified below.

The corrector algorithms

We take the schedule of annealed Langevin dynamics in Song & Ermon (2019), but re-frame it with slight modifications for better interpretability and empirical performance. We provide the corrector algorithms in Algorithms 4 and 5, where we call $r$ the "signal-to-noise" ratio. We determine the step size $\epsilon$ using the norm of the Gaussian noise $\|\mathbf{z}\|_2$, the norm of the score-based model $\|\mathbf{s}_{\theta^*}\|_2$, and the signal-to-noise ratio $r$. When sampling a large batch of samples together, we replace the norm $\|\cdot\|_2$ with the average norm across the mini-batch. When the batch size is small, we suggest replacing $\|\mathbf{z}\|_2$ with $\sqrt{d}$, where $d$ is the dimensionality of $\mathbf{z}$.
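
The step-size rule above can be written compactly. Below is a sketch of one corrector update following Algorithm 4 (the VP corrector in Algorithm 5 additionally scales $\epsilon$ by $\alpha_i$); it assumes a batched input and the same `score_fn` stand-in as before, and for small batches the batch-averaged $\|\mathbf{z}\|_2$ would be replaced by $\sqrt{d}$ as described.

```python
import numpy as np

def langevin_corrector_step(x, cond, score_fn, r):
    """One corrector update for a batch x of shape (B, ...):
    eps = 2 (r ||z||_2 / ||g||_2)^2, then x <- x + eps * g + sqrt(2 eps) * z."""
    z = np.random.randn(*x.shape)
    g = score_fn(x, cond)                                              # cond: sigma_i (VE) or index i (VP)
    z_norm = np.linalg.norm(z.reshape(len(z), -1), axis=-1).mean()     # batch-averaged ||z||_2
    g_norm = np.linalg.norm(g.reshape(len(g), -1), axis=-1).mean()     # batch-averaged ||s_theta*||_2
    eps = 2.0 * (r * z_norm / g_norm) ** 2                             # "signal-to-noise" step size
    return x + eps * g + np.sqrt(2.0 * eps) * z
```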

Algorithm 2 PC sampling (VE SDE)
1: $\mathbf{x}_N \sim \mathcal{N}(\mathbf{0}, \sigma_{\max}^2\mathbf{I})$
2: for $i = N-1$ to $0$ do
3:   $\mathbf{x}_i' \leftarrow \mathbf{x}_{i+1} + (\sigma_{i+1}^2 - \sigma_i^2)\,\mathbf{s}_{\theta^*}(\mathbf{x}_{i+1}, \sigma_{i+1})$   (Predictor)
4:   $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
5:   $\mathbf{x}_i \leftarrow \mathbf{x}_i' + \sqrt{\sigma_{i+1}^2 - \sigma_i^2}\,\mathbf{z}$
6:   for $j = 1$ to $M$ do   (Corrector)
7:     $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
8:     $\mathbf{x}_i \leftarrow \mathbf{x}_i + \epsilon_i\,\mathbf{s}_{\theta^*}(\mathbf{x}_i, \sigma_i) + \sqrt{2\epsilon_i}\,\mathbf{z}$
9: return $\mathbf{x}_0$

Algorithm 3 PC sampling (VP SDE)
1: $\mathbf{x}_N \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
2: for $i = N-1$ to $0$ do
3:   $\mathbf{x}_i' \leftarrow (2 - \sqrt{1 - \beta_{i+1}})\,\mathbf{x}_{i+1} + \beta_{i+1}\,\mathbf{s}_{\theta^*}(\mathbf{x}_{i+1}, i+1)$   (Predictor)
4:   $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
5:   $\mathbf{x}_i \leftarrow \mathbf{x}_i' + \sqrt{\beta_{i+1}}\,\mathbf{z}$
6:   for $j = 1$ to $M$ do   (Corrector)
7:     $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
8:     $\mathbf{x}_i \leftarrow \mathbf{x}_i + \epsilon_i\,\mathbf{s}_{\theta^*}(\mathbf{x}_i, i) + \sqrt{2\epsilon_i}\,\mathbf{z}$
9: return $\mathbf{x}_0$

Algorithm 4 Corrector algorithm (VE SDE)
Require: $\{\sigma_i\}_{i=1}^N, r, N, M$
1: $\mathbf{x}_N^0 \sim \mathcal{N}(\mathbf{0}, \sigma_{\max}^2\mathbf{I})$
2: for $i = N$ to $1$ do
3:   for $j = 1$ to $M$ do
4:     $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
5:     $\mathbf{g} \leftarrow \mathbf{s}_{\theta^*}(\mathbf{x}_i^{j-1}, \sigma_i)$
6:     $\epsilon \leftarrow 2(r\|\mathbf{z}\|_2 / \|\mathbf{g}\|_2)^2$
7:     $\mathbf{x}_i^j \leftarrow \mathbf{x}_i^{j-1} + \epsilon\,\mathbf{g} + \sqrt{2\epsilon}\,\mathbf{z}$
8:   $\mathbf{x}_{i-1}^0 \leftarrow \mathbf{x}_i^M$
9: return $\mathbf{x}_0^0$

Algorithm 5 Corrector algorithm (VP SDE)
Require: $\{\beta_i\}_{i=1}^N, \{\alpha_i\}_{i=1}^N, r, N, M$
1: $\mathbf{x}_N^0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
2: for $i = N$ to $1$ do
3:   for $j = 1$ to $M$ do
4:     $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
5:     $\mathbf{g} \leftarrow \mathbf{s}_{\theta^*}(\mathbf{x}_i^{j-1}, i)$
6:     $\epsilon \leftarrow 2\alpha_i(r\|\mathbf{z}\|_2 / \|\mathbf{g}\|_2)^2$
7:     $\mathbf{x}_i^j \leftarrow \mathbf{x}_i^{j-1} + \epsilon\,\mathbf{g} + \sqrt{2\epsilon}\,\mathbf{z}$
8:   $\mathbf{x}_{i-1}^0 \leftarrow \mathbf{x}_i^M$
9: return $\mathbf{x}_0^0$

Denoising

For both SMLD and DDPM models, the generated samples typically contain a small amount of noise that is hard for humans to detect. As noted by Jolicoeur-Martineau et al. (2020), FIDs can be significantly worse if this noise is not removed. This sensitivity to noise is also part of the reason why NCSN models trained with SMLD have performed worse than DDPM models in terms of FID: the former do not use a denoising step at the end of sampling, while the latter do. In all experiments of this paper we ensure there is a single denoising step at the end of sampling, using Tweedie's formula (Efron, 2011).
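
For VE perturbations, $p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0)) = \mathcal{N}(\mathbf{x}(0), \sigma_t^2\mathbf{I})$, so Tweedie's formula gives $\mathbb{E}[\mathbf{x}(0) \mid \mathbf{x}(t)] = \mathbf{x}(t) + \sigma_t^2\nabla_{\mathbf{x}}\log p_t(\mathbf{x}(t))$, and the final denoising step costs one extra score evaluation. A sketch under the same assumed `score_fn` interface:

```python
def tweedie_denoise_ve(x, sigma, score_fn):
    """Final denoising step for VE models: E[x(0) | x] = x + sigma^2 * score(x, sigma)."""
    return x + sigma ** 2 * score_fn(x, sigma)
```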

Figure 9: PC sampling for LSUN bedroom and church. The vertical axis corresponds to the total computation, and the horizontal axis represents the amount of computation allocated to the corrector. Samples are the best when computation is split between the predictor and corrector.

Training

We use the same architecture as in Ho et al. (2020) for our score-based models. For the VE SDE, we train a model with the original SMLD objective in Eq. (1); similarly, for the VP SDE, we use the original DDPM objective in Eq. (3). We use a total of 1000 noise scales for training both models. For the results in Fig. 9, we train an NCSN++ model (defined in Appendix H) on $256 \times 256$ LSUN bedroom and church_outdoor (Yu et al., 2015) datasets with the VE SDE and our continuous objective in Eq. (7). The batch size is fixed to 128 on CIFAR-10 and 64 on LSUN.

Ad-hoc interpolation methods for noise scales

Models in this experiment are all trained with 1000 noise scales. To get results for P2000 (the predictor-only sampler using 2000 steps), which requires 2000 noise scales, we interpolate between the 1000 training noise scales at test time. The specific architecture of the noise-conditional score-based model in Ho et al. (2020) uses sinusoidal positional embeddings for conditioning on integer time steps, which allows us to interpolate between noise scales at test time in an ad-hoc way (this is hard to do for other architectures, such as the one in Song & Ermon (2019)). Specifically, for SMLD models, we keep $\sigma_{\min}$ and $\sigma_{\max}$ fixed and double the number of time steps. For DDPM models, we halve $\beta_{\min}$ and $\beta_{\max}$ before doubling the number of time steps. Suppose $\{\mathbf{s}_\theta(\mathbf{x}, i)\}_{i=0}^{N-1}$ is a score-based model trained on $N$ time steps, and let $\{\mathbf{s}_\theta'(\mathbf{x}, i)\}_{i=0}^{2N-1}$ denote the corresponding interpolated score-based model at $2N$ time steps. We test two different interpolation strategies for time steps: linear interpolation, where $\mathbf{s}_\theta'(\mathbf{x}, i) = \mathbf{s}_\theta(\mathbf{x}, i/2)$, and rounding interpolation, where $\mathbf{s}_\theta'(\mathbf{x}, i) = \mathbf{s}_\theta(\mathbf{x}, \lfloor i/2 \rfloor)$. We provide results with linear interpolation in Table 1, and results with rounding interpolation in Table 4. We observe that different interpolation methods lead to performance differences but preserve the general trend: predictor-corrector methods perform on par with or better than predictor-only or corrector-only samplers.
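
The two strategies amount to thin wrappers around the model trained on $N$ steps; a sketch, assuming the underlying model can be queried at non-integer indices thanks to its sinusoidal embeddings:

```python
import math

def make_interpolated_score_fn(score_fn, strategy="linear"):
    """Wrap a model trained on N steps so it can be queried at 2N steps.

    linear:   s'(x, i) = s(x, i / 2)          (non-integer index, relying on sinusoidal embeddings)
    rounding: s'(x, i) = s(x, floor(i / 2))
    """
    if strategy == "linear":
        return lambda x, i: score_fn(x, i / 2)
    elif strategy == "rounding":
        return lambda x, i: score_fn(x, math.floor(i / 2))
    raise ValueError(f"unknown strategy: {strategy}")
```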

Hyper-parameters of the samplers

For Predictor-Corrector and corrector-only samplers on CIFAR-10, we search for the best signal-to-noise ratio $r$ over a grid with increments of 0.01. We report the best $r$ in Table 5. For LSUN bedroom/church_outdoor, we fix $r$ to 0.075. Unless otherwise noted, we use one corrector step per noise scale for all PC samplers, and two corrector steps per noise scale for corrector-only samplers on CIFAR-10. For sample generation, the batch size is 1024 on CIFAR-10 and 8 on LSUN bedroom/church_outdoor.

Table 4: Comparing different samplers on CIFAR-10, where “P2000” uses the rounding interpolation between noise scales. Shaded regions are obtained with the same computation (number of score function evaluations). Mean and standard deviation are reported over five sampling runs.

Variance Exploding SDE (SMLD):

| Sampler | P1000 | P2000 | C2000 | PC1000 |
|---|---|---|---|---|
| ancestral sampling | 4.98 ± 0.06 | 4.92 ± 0.82 | – | 3.62 ± 0.30 |
| reverse diffusion | 4.79 ± 0.07 | 4.72 ± 0.82 | 20.43 ± 0.07 | 3.60 ± 0.32 |
| probability flow | 15.41 ± 0.15 | 12.87 ± 0.83 | – | 3.51 ± 0.34 |

Variance Preserving SDE (DDPM):

| Sampler | P1000 | P2000 | C2000 | PC1000 |
|---|---|---|---|---|
| ancestral sampling | 3.24 ± 0.02 | 3.11 ± 0.33 | – | 3.21 ± 0.82 |
| reverse diffusion | 3.21 ± 0.02 | 3.10 ± 0.33 | 19.05 ± 0.06 | 3.18 ± 0.81 |
| probability flow | 3.59 ± 0.04 | 3.25 ± 0.84 | – | 3.06 ± 0.33 |

Table 5: Optimal signal-to-noise ratios of different samplers. “P1000” or “P2000”: predictor-only samplers using 1000 or 2000 steps. “C2000”: corrector-only samplers using 2000 steps. “PC1000”: PC samplers using 1000 predictor and 1000 corrector steps.

VE SDE (SMLD):

| Sampler | P1000 | P2000 | C2000 | PC1000 |
|---|---|---|---|---|
| ancestral sampling | – | – | – | 0.17 |
| reverse diffusion | – | – | 0.22 | 0.16 |
| probability flow | – | – | – | 0.17 |

VP SDE (DDPM):

| Sampler | P1000 | P2000 | C2000 | PC1000 |
|---|---|---|---|---|
| ancestral sampling | – | – | – | 0.01 |
| reverse diffusion | – | – | 0.27 | 0.01 |
| probability flow | – | – | – | 0.04 |

H ARCHITECTURE IMPROVEMENTS

We explored several architecture designs to improve score-based models for both VE and VP SDEs. Our endeavor gives rise to new state-of-the-art sample quality on CIFAR-10, new state-of-the-art likelihood on uniformly dequantized CIFAR-10, and enables the first high-fidelity image samples of resolution $1024 \times 1024$ from score-based generative models. Code and checkpoints are open-sourced at https://github.com/yang-song/score_sde.

H.1 SETTINGS FOR ARCHITECTURE EXPLORATION

Unless otherwise noted, all models are trained for 1.3M iterations, and we save one checkpoint every 50k iterations. For VE SDEs, we consider two datasets: $32 \times 32$ CIFAR-10 (Krizhevsky et al., 2009) and $64 \times 64$ CelebA (Liu et al., 2015), pre-processed following Song & Ermon (2020). We compare different configurations based on their FID scores averaged over checkpoints after 0.5M iterations. For VP SDEs, we only consider the CIFAR-10 dataset to save computation, and compare models based on the average FID scores over checkpoints obtained between 0.25M and 0.5M iterations, because FIDs tend to increase after 0.5M iterations for VP SDEs.
All FIDs are computed on 50k samples with tensorflow_gan. For sampling, we use the PC sampler discretized at 1000 time steps. We choose reverse diffusion (see Appendix E) as the predictor. We use one corrector step per predictor update for VE SDEs with a signal-to-noise ratio of 0.16, but omit the corrector step for VP SDEs, since correctors there only give slightly better results while requiring double the computation. We follow Ho et al. (2020) for optimization, including the learning rate, gradient clipping, and learning rate warm-up schedules. Unless otherwise noted, models are trained with the original discrete SMLD and DDPM objectives in Eqs. (1) and (3) and use a batch size of 128. The optimal architectures found under these settings are subsequently transferred to continuous objectives and deeper models. We also directly transfer the best architecture for VP SDEs to sub-VP SDEs, given the similarity of these two SDEs.

Figure 10: The effects of different architecture components for score-based models trained with VE perturbations.

Our architecture is mostly based on Ho et al. (2020). We additionally introduce the following components to maximize the potential improvement of score-based models.

  1. Upsampling and downsampling images with anti-aliasing based on Finite Impulse Response (FIR) (Zhang, 2019). We follow the same implementation and hyper-parameters in StyleGAN-2 (Karras et al., 2020b).
  2. Rescaling all skip connections by $1/\sqrt{2}$. This has been demonstrated effective in several best-in-class GAN models, including ProgressiveGAN (Karras et al., 2018), StyleGAN (Karras et al., 2019) and StyleGAN-2 (Karras et al., 2020b).
  3. Replacing the original residual blocks in DDPM with residual blocks from BigGAN (Brock et al., 2018).
  4. Increasing the number of residual blocks per resolution from 2 to 4.
  5. Incorporating progressive growing architectures. We consider two progressive architectures for input: “input skip” and “residual”, and two progressive architectures for output: “output skip” and “residual”. These progressive architectures are defined and implemented according to StyleGAN-2.

We also tested equalized learning rates, a trick used in very successful models like ProgressiveGAN (Karras et al., 2018) and StyleGAN (Karras et al., 2019). However, we found it harmful at an early stage of our experiments, and therefore decided not to explore it further.
The exponential moving average (EMA) rate has a significant impact on performance. For models trained with VE perturbations, we notice that 0.999 works better than 0.9999, whereas for models trained with VP perturbations the opposite is true. We therefore use an EMA rate of 0.999 for VE models and 0.9999 for VP models.
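
For reference, the EMA used here is the standard exponential moving average of parameters; a minimal sketch with NumPy arrays standing in for model parameters:

```python
def ema_update(ema_params, params, rate):
    """In-place EMA update: ema <- rate * ema + (1 - rate) * param.

    rate = 0.999 for VE models and 0.9999 for VP models, per the text above.
    """
    for ema_p, p in zip(ema_params, params):
        ema_p *= rate
        ema_p += (1.0 - rate) * p
```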

H.2 RESULTS ON CIFAR-10

All architecture components introduced above can improve the performance of score-based models trained with VE SDEs, as shown in Fig. 10. The box plots demonstrate the importance of each component when other components can vary freely. On both CIFAR-10 and CelebA, the additional components that we explored always improve the performance on average for VE SDEs. For progressive growing, it is not clear which combination of configurations consistently performs the best, but the results are typically better than when no progressive growing architecture is used. Our best score-based model for VE SDEs 1) uses FIR upsampling/downsampling, 2) rescales skip connections, 3) employs BigGAN-type residual blocks, 4) uses 4 residual blocks per resolution instead of 2, and 5) uses “residual” for input and no progressive growing architecture for output. We name this model “NCSN++”, following the naming convention of previous SMLD models (Song & Ermon, 2019; 2020).

We followed a similar procedure to examine these architecture components for VP SDEs, except that we skipped experiments on CelebA due to limited computing resources. The NCSN++ architecture worked decently well for VP SDEs, ranking 4th among all 144 possible configurations. The top configuration, however, has a slightly different structure: compared to NCSN++, it uses no FIR upsampling/downsampling and no progressive growing architecture. We name this model "DDPM++", following the naming convention of Ho et al. (2020).

The basic NCSN++ model with 4 residual blocks per resolution achieves an FID of 2.45 on CIFAR-10, whereas the basic DDPM++ model achieves an FID of 2.78. Here, to match the convention used in Karras et al. (2018), Song & Ermon (2019), and Ho et al. (2020), we report the lowest FID value over the course of training, rather than the average FID over checkpoints after 0.5M iterations (used for comparing VE SDE models) or between 0.25M and 0.5M iterations (used for comparing VP SDE models) in our architecture exploration.

Switching from the discrete training objectives to the continuous ones in Eq. (7) further improves the FID values for all SDEs. To condition the NCSN++ model on continuous time variables, we replace the positional embeddings, the layers in Ho et al. (2020) for conditioning on discrete time steps, with random Fourier feature embeddings (Tancik et al., 2020). The scale parameter of these random Fourier feature embeddings is fixed to 16. We also reduce the number of training iterations to 0.95M to suppress overfitting. These changes improve the FID on CIFAR-10 from 2.45 to 2.38 for NCSN++ trained with the VE SDE, resulting in a model called "NCSN++ cont.". In addition, we can further improve the FID from 2.38 to 2.20 by doubling the number of residual blocks per resolution for NCSN++ cont., resulting in the model denoted as "NCSN++ cont. (deep)". All quantitative results are summarized in Table 3, and we provide random samples from our best model in Fig. 11.

Similarly, we can condition the DDPM++ model on continuous time steps, resulting in the model "DDPM++ cont.". When trained with the VP SDE, it improves the FID from 2.78 (DDPM++) to 2.55. When trained with the sub-VP SDE, it achieves an FID of 2.61. To get better performance, we used the Euler-Maruyama solver as the predictor for continuously-trained models, instead of the ancestral sampling or reverse diffusion predictors, because the discretization strategy of the original DDPM method does not match the variance of the continuous process well when $t \rightarrow 0$, which significantly hurts FID scores. As shown in Table 2, the likelihood values are 3.21 and 3.05 bits/dim for VP and sub-VP SDEs respectively. Doubling the depth and training for 0.95M iterations improves both FID and bits/dim for the VP and sub-VP SDEs, leading to the model "DDPM++ cont. (deep)". Its FID score is 2.41 for both VP and sub-VP SDEs. When trained with the sub-VP SDE, it achieves a likelihood of 2.99 bits/dim. All likelihood values here are reported for the last checkpoint during training.

Figure 11: Unconditional CIFAR-10 samples from NCSN++ cont. (deep, VE).

H.3 HIGH RESOLUTION IMAGES

Encouraged by the success of NCSN++ on CIFAR-10, we proceed to test it on $1024 \times 1024$ CelebA-HQ (Karras et al., 2018), a task that was previously only achievable by some GAN models and VQ-VAE-2 (Razavi et al., 2019). We used a batch size of 8, increased the EMA rate to 0.9999, and trained a model similar to NCSN++ with the continuous objective (Eq. (7)) for around 2.4M iterations (please find the detailed architecture in our code release). We use the PC sampler discretized at 2000 steps with the reverse diffusion predictor, one Langevin corrector step per predictor update, and a signal-to-noise ratio of 0.15. The scale parameter for the random Fourier feature embeddings is fixed to 16. We use the "input skip" progressive architecture for the input, and the "output skip" progressive architecture for the output. We provide samples in Fig. 12. Although these samples are not perfect (e.g., there are visible flaws in facial symmetry), we believe the results are encouraging and demonstrate the scalability of our approach. Future work on more effective architectures is likely to significantly advance the performance of score-based generative models on this task.

Figure 12: Samples on $1024 \times 1024$ CelebA-HQ from a modified NCSN++ model trained with the VE SDE.

I CONTROLLABLE GENERATION

Consider a forward SDE with the following general form

$$\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + \mathbf{G}(\mathbf{x}, t)\,\mathrm{d}\mathbf{w},$$

and suppose the initial state distribution is $p_0(\mathbf{x}(0) \mid \mathbf{y})$. The density at time $t$ is $p_t(\mathbf{x}(t) \mid \mathbf{y})$ when conditioned on $\mathbf{y}$. Therefore, using Anderson (1982), the reverse-time SDE is given by

$$\mathrm{d}\mathbf{x} = \Big\{\mathbf{f}(\mathbf{x}, t) - \nabla \cdot \big[\mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^{\top}\big] - \mathbf{G}(\mathbf{x}, t)\mathbf{G}(\mathbf{x}, t)^{\top}\nabla_{\mathbf{x}}\log p_t(\mathbf{x} \mid \mathbf{y})\Big\}\mathrm{d}t + \mathbf{G}(\mathbf{x}, t)\,\mathrm{d}\bar{\mathbf{w}}. \tag{48}$$

Since $p_t(\mathbf{x}(t) \mid \mathbf{y}) \propto p_t(\mathbf{x}(t))\,p(\mathbf{y} \mid \mathbf{x}(t))$, the score $\nabla_{\mathbf{x}}\log p_t(\mathbf{x}(t) \mid \mathbf{y})$ can be computed easily by

$$\nabla_{\mathbf{x}}\log p_t(\mathbf{x}(t) \mid \mathbf{y}) = \nabla_{\mathbf{x}}\log p_t(\mathbf{x}(t)) + \nabla_{\mathbf{x}}\log p(\mathbf{y} \mid \mathbf{x}(t)). \tag{49}$$

This subsumes the conditional reverse-time SDE in Eq. (14) as a special case. All sampling methods we have discussed so far can be applied to the conditional reverse-time SDE for sample generation.
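
In code form, Eq. (49) means that any of the samplers above can be made conditional by swapping in a combined score function; both arguments below are assumed interfaces (an unconditional score model and a gradient of the conditioning log-likelihood):

```python
def conditional_score(x, t, y, score_fn, log_likelihood_grad_fn):
    """Eq. (49): grad_x log p_t(x | y) = grad_x log p_t(x) + grad_x log p(y | x)."""
    return score_fn(x, t) + log_likelihood_grad_fn(x, t, y)
```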

I.1 Class-conditional sampling

When $\mathbf{y}$ represents class labels, we can train a time-dependent classifier $p_t(\mathbf{y} \mid \mathbf{x}(t))$ for class-conditional sampling. Since the forward SDE is tractable, we can easily create a pair of training data $(\mathbf{x}(t), \mathbf{y})$ by first sampling $(\mathbf{x}(0), \mathbf{y})$ from a dataset and then obtaining $\mathbf{x}(t) \sim p_{0t}(\mathbf{x}(t) \mid \mathbf{x}(0))$. Afterwards, we may employ a mixture of cross-entropy losses over different time steps, analogous to Eq. (7), to train the time-dependent classifier $p_t(\mathbf{y} \mid \mathbf{x}(t))$.
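
A sketch of such a training objective for the VE setting: perturb a data batch at a randomly drawn noise scale and apply the usual cross-entropy. The `classifier_logits` function, conditioned on $\log\sigma_i$, is an assumed interface, not the released training code.

```python
import numpy as np

def noise_conditional_classifier_loss(x0, y, sigmas, classifier_logits):
    """Cross-entropy at a randomly sampled VE noise scale; average over a batch (x0, y)."""
    i = np.random.randint(1, len(sigmas))                   # pick a noise scale sigma_i
    x_t = x0 + sigmas[i] * np.random.randn(*x0.shape)       # x(t) ~ p_{0t}(x(t) | x(0)) for the VE SDE
    logits = classifier_logits(x_t, np.log(sigmas[i]))      # classifier conditioned on log(sigma_i)
    logits = logits - logits.max(axis=-1, keepdims=True)    # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()          # cross-entropy with the true labels
```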

To test this idea, we trained a Wide ResNet (Zagoruyko & Komodakis, 2016) (Wide-ResNet-28-10) on CIFAR-10 with VE perturbations. The classifier is conditioned on $\log\sigma_i$ using random Fourier features (Tancik et al., 2020), and the training objective is a simple sum of cross-entropy losses sampled at different noise scales. We provide a plot of the accuracy of this classifier over noise scales in Fig. 13. The score-based model is the unconditional NCSN++ (4 blocks/resolution) in Table 3, and we generate samples using the PC algorithm with 2000 discretization steps. The class-conditional samples are provided in Fig. 4, and an extended set of conditional samples is given in Fig. 13.

Figure 13: Class-conditional image generation by solving the conditional reverse-time SDE with PC. The curve shows the accuracy of our noise-conditional classifier over different noise scales.

I.2 IMPUTATION

Imputation is a special case of conditional sampling. Denote by $\Omega(\mathbf{x})$ and $\bar{\Omega}(\mathbf{x})$ the known and unknown dimensions of $\mathbf{x}$ respectively, and let $\mathbf{f}_{\bar{\Omega}}(\cdot, t)$ and $\mathbf{G}_{\bar{\Omega}}(\cdot, t)$ denote $\mathbf{f}(\cdot, t)$ and $\mathbf{G}(\cdot, t)$ restricted to the unknown dimensions. For VE/VP SDEs, the drift coefficient $\mathbf{f}(\cdot, t)$ is element-wise, and the diffusion coefficient $\mathbf{G}(\cdot, t)$ is diagonal. When $\mathbf{f}(\cdot, t)$ is element-wise, $\mathbf{f}_{\bar{\Omega}}(\cdot, t)$ denotes the same element-wise function applied only to the unknown dimensions. When $\mathbf{G}(\cdot, t)$ is diagonal, $\mathbf{G}_{\bar{\Omega}}(\cdot, t)$ denotes the sub-matrix restricted to the unknown dimensions.

For imputation, our goal is to sample from $p(\bar{\Omega}(\mathbf{x}(0)) \mid \Omega(\mathbf{x}(0)) = \mathbf{y})$. Define a new diffusion process $\mathbf{z}(t) = \bar{\Omega}(\mathbf{x}(t))$, and note that the SDE for $\mathbf{z}(t)$ can be written as

$$\mathrm{d}\mathbf{z} = \mathbf{f}_{\bar{\Omega}}(\mathbf{z}, t)\,\mathrm{d}t + \mathbf{G}_{\bar{\Omega}}(\mathbf{z}, t)\,\mathrm{d}\mathbf{w}.$$

The reverse-time SDE, conditioned on $\Omega(\mathbf{x}(0)) = \mathbf{y}$, is given by

$$\mathrm{d}\mathbf{z} = \Big\{\mathbf{f}_{\bar{\Omega}}(\mathbf{z}, t) - \nabla \cdot \big[\mathbf{G}_{\bar{\Omega}}(\mathbf{z}, t)\mathbf{G}_{\bar{\Omega}}(\mathbf{z}, t)^{\top}\big] - \mathbf{G}_{\bar{\Omega}}(\mathbf{z}, t)\mathbf{G}_{\bar{\Omega}}(\mathbf{z}, t)^{\top}\nabla_{\mathbf{z}}\log p_t(\mathbf{z} \mid \Omega(\mathbf{x}(0)) = \mathbf{y})\Big\}\mathrm{d}t + \mathbf{G}_{\bar{\Omega}}(\mathbf{z}, t)\,\mathrm{d}\bar{\mathbf{w}}.$$

Although $p_t(\mathbf{z}(t) \mid \Omega(\mathbf{x}(0)) = \mathbf{y})$ is in general intractable, it can be approximated. Let $A$ denote the event $\Omega(\mathbf{x}(0)) = \mathbf{y}$. We have

$$\begin{aligned}
p_t(\mathbf{z}(t) \mid \Omega(\mathbf{x}(0)) = \mathbf{y}) = p_t(\mathbf{z}(t) \mid A) &= \int p_t(\mathbf{z}(t) \mid \Omega(\mathbf{x}(t)), A)\,p_t(\Omega(\mathbf{x}(t)) \mid A)\,\mathrm{d}\Omega(\mathbf{x}(t)) \\
&= \mathbb{E}_{p_t(\Omega(\mathbf{x}(t)) \mid A)}\big[p_t(\mathbf{z}(t) \mid \Omega(\mathbf{x}(t)), A)\big] \\
&\approx \mathbb{E}_{p_t(\Omega(\mathbf{x}(t)) \mid A)}\big[p_t(\mathbf{z}(t) \mid \Omega(\mathbf{x}(t)))\big] \\
&\approx p_t(\mathbf{z}(t) \mid \Omega(\mathbf{x}(t))),
\end{aligned}$$

where $\Omega(\mathbf{x}(t))$ is a random sample from $p_t(\Omega(\mathbf{x}(t)) \mid A)$, which is typically a tractable distribution. Therefore,

$$\nabla_{\mathbf{z}}\log p_t(\mathbf{z}(t) \mid \Omega(\mathbf{x}(0)) = \mathbf{y}) \approx \nabla_{\mathbf{z}}\log p_t(\mathbf{z}(t) \mid \Omega(\mathbf{x}(t))) = \nabla_{\mathbf{z}}\log p_t([\mathbf{z}(t); \Omega(\mathbf{x}(t))]),$$

where $[\mathbf{z}(t); \Omega(\mathbf{x}(t))]$ denotes the vector $\mathbf{u}(t)$ such that $\Omega(\mathbf{u}(t)) = \Omega(\mathbf{x}(t))$ and $\bar{\Omega}(\mathbf{u}(t)) = \mathbf{z}(t)$, and the identity holds because $\nabla_{\mathbf{z}}\log p_t([\mathbf{z}(t); \Omega(\mathbf{x}(t))]) = \nabla_{\mathbf{z}}\log p_t(\mathbf{z}(t) \mid \Omega(\mathbf{x}(t))) + \nabla_{\mathbf{z}}\log p(\Omega(\mathbf{x}(t))) = \nabla_{\mathbf{z}}\log p_t(\mathbf{z}(t) \mid \Omega(\mathbf{x}(t)))$.
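
Concretely, at each reverse step we draw $\Omega(\mathbf{x}(t))$ by perturbing $\mathbf{y}$ with the forward process, stitch it together with the current unknown dimensions $\mathbf{z}(t)$, and evaluate the unconditional score on the combined vector, keeping only the components for the unknown dimensions. A sketch for the VE SDE, where `known_mask`, `score_fn`, and the flat-vector layout are assumptions for illustration:

```python
import numpy as np

def imputation_score_ve(z, t_index, y, known_mask, sigmas, score_fn):
    """Approximate grad_z log p_t(z(t) | Omega(x(0)) = y) for the VE SDE."""
    # Sample Omega(x(t)) ~ p_t(Omega(x(t)) | Omega(x(0)) = y): perturb y with the forward kernel.
    y_t = y + sigmas[t_index] * np.random.randn(*y.shape)
    # Build u(t) = [z(t); Omega(x(t))] in the original data space.
    u = np.empty(known_mask.shape, dtype=float)
    u[known_mask] = y_t
    u[~known_mask] = z
    # Evaluate the unconditional score and keep the unknown dimensions only.
    return score_fn(u, t_index)[~known_mask]
```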

We provide an extended set of inpainting results in Figs. 14 and 15.

I.3 COLORIZATION

Colorization is a special case of imputation, except that the known data dimensions are coupled. We can decouple these data dimensions by using an orthogonal linear transformation to map the gray-scale image to a separate channel in a different space, and then perform imputation to complete the other channels before transforming everything back to the original image space. The orthogonal matrix we used to decouple color channels is

$$\begin{pmatrix} 0.577 & -0.816 & 0 \\ 0.577 & 0.408 & 0.707 \\ 0.577 & 0.408 & -0.707 \end{pmatrix}.$$

Because the transformation is an orthogonal matrix, the standard Wiener process $\mathbf{w}(t)$ remains a standard Wiener process in the transformed space, allowing us to build an SDE there and use the same imputation method as in Appendix I.2. We provide an extended set of colorization results in Figs. 16 and 17.
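
A sketch of the decoupling step under these conventions: applying the matrix above to the color channels sends the (scaled) gray-scale content to one channel, the remaining two channels are then imputed as in Appendix I.2, and the transpose maps the result back. The channel-last image layout is an assumption for illustration.

```python
import numpy as np

# Orthogonal matrix from the text; its first column is proportional to the RGB average.
M = np.array([[0.577, -0.816,  0.000],
              [0.577,  0.408,  0.707],
              [0.577,  0.408, -0.707]])

def decouple(rgb):
    """Map an (H, W, 3) image so that output channel 0 carries the scaled gray-scale content."""
    return rgb @ M

def couple(decoupled):
    """Inverse map; M is (numerically) orthogonal, so its inverse is its transpose."""
    return decoupled @ M.T
```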

I.4 SOLVING GENERAL INVERSE PROBLEMS

Suppose we have two random variables $\mathbf{x}$ and $\mathbf{y}$, and we know the forward process of generating $\mathbf{y}$ from $\mathbf{x}$, given by $p(\mathbf{y} \mid \mathbf{x})$. The inverse problem is to obtain $\mathbf{x}$ from $\mathbf{y}$, that is, to generate samples from $p(\mathbf{x} \mid \mathbf{y})$. In principle, we can estimate the prior distribution $p(\mathbf{x})$ and obtain $p(\mathbf{x} \mid \mathbf{y})$ using Bayes' rule: $p(\mathbf{x} \mid \mathbf{y}) = p(\mathbf{x})p(\mathbf{y} \mid \mathbf{x})/p(\mathbf{y})$. In practice, however, both estimating the prior and performing Bayesian inference are non-trivial.

Leveraging Eq. (48), score-based generative models provide one way to solve the inverse problem. Suppose we have a diffusion process $\{\mathbf{x}(t)\}_{t=0}^{T}$ generated by perturbing $\mathbf{x}$ with an SDE, and a time-dependent score-based model $\mathbf{s}_{\theta^*}(\mathbf{x}(t), t)$ trained to approximate $\nabla_{\mathbf{x}}\log p_t(\mathbf{x}(t))$. Once we have an estimate of $\nabla_{\mathbf{x}}\log p_t(\mathbf{x}(t) \mid \mathbf{y})$, we can simulate the reverse-time SDE in Eq. (48) to sample from $p_0(\mathbf{x}(0) \mid \mathbf{y}) = p(\mathbf{x} \mid \mathbf{y})$. To obtain this estimate, we first observe that

$$\nabla_{\mathbf{x}}\log p_t(\mathbf{x}(t) \mid \mathbf{y}) = \nabla_{\mathbf{x}}\log \int p_t(\mathbf{x}(t) \mid \mathbf{y}(t), \mathbf{y})\,p(\mathbf{y}(t) \mid \mathbf{y})\,\mathrm{d}\mathbf{y}(t),$$

where $\mathbf{y}(t)$ is defined via $\mathbf{x}(t)$ and the forward process $p(\mathbf{y}(t) \mid \mathbf{x}(t))$. Now assume two conditions:

  • $p(\mathbf{y}(t) \mid \mathbf{y})$ is tractable. We can often derive this distribution from the interaction between the forward process and the SDE, as in the case of image imputation and colorization.
  • $p_t(\mathbf{x}(t) \mid \mathbf{y}(t), \mathbf{y}) \approx p_t(\mathbf{x}(t) \mid \mathbf{y}(t))$. For small $t$, $\mathbf{y}(t)$ is almost the same as $\mathbf{y}$, so the approximation holds. For large $t$, $\mathbf{y}$ is further away from $\mathbf{x}(t)$ in the Markov chain and thus has a smaller impact on $\mathbf{x}(t)$. Moreover, the approximation error for large $t$ matters less for the final sample, since it enters early in the sampling process.

Given these two assumptions, we have

$$\begin{aligned}
\nabla_{\mathbf{x}}\log p_t(\mathbf{x}(t) \mid \mathbf{y}) &\approx \nabla_{\mathbf{x}}\log \int p_t(\mathbf{x}(t) \mid \mathbf{y}(t))\,p(\mathbf{y}(t) \mid \mathbf{y})\,\mathrm{d}\mathbf{y}(t) \\
&\approx \nabla_{\mathbf{x}}\log p_t(\mathbf{x}(t) \mid \hat{\mathbf{y}}(t)) \\
&= \nabla_{\mathbf{x}}\log p_t(\mathbf{x}(t)) + \nabla_{\mathbf{x}}\log p_t(\hat{\mathbf{y}}(t) \mid \mathbf{x}(t)) \\
&\approx \mathbf{s}_{\theta^*}(\mathbf{x}(t), t) + \nabla_{\mathbf{x}}\log p_t(\hat{\mathbf{y}}(t) \mid \mathbf{x}(t)),
\end{aligned} \tag{50}$$

where $\hat{\mathbf{y}}(t)$ is a sample from $p(\mathbf{y}(t) \mid \mathbf{y})$. We can now plug Eq. (50) into Eq. (48) and solve the resulting reverse-time SDE to generate samples from $p(\mathbf{x} \mid \mathbf{y})$.
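
Putting Eq. (50) into code form: at time $t$ we draw $\hat{\mathbf{y}}(t) \sim p(\mathbf{y}(t) \mid \mathbf{y})$ and add the gradient of $\log p_t(\hat{\mathbf{y}}(t) \mid \mathbf{x}(t))$ to the unconditional score. Both `sample_y_t` and `log_likelihood_grad_fn` below are assumed, problem-specific interfaces.

```python
def inverse_problem_score(x, t, y, score_fn, sample_y_t, log_likelihood_grad_fn):
    """Eq. (50): grad_x log p_t(x(t) | y) ~= s_theta*(x(t), t) + grad_x log p_t(y_hat(t) | x(t))."""
    y_hat_t = sample_y_t(y, t)                       # y_hat(t) ~ p(y(t) | y), assumed tractable
    return score_fn(x, t) + log_likelihood_grad_fn(x, t, y_hat_t)
```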

Figure 14: Extended inpainting results for $256 \times 256$ bedroom images.

Figure 15: Extended inpainting results for $256 \times 256$ church images.

Figure 16: Extended colorization results for $256 \times 256$ bedroom images.

Figure 17: Extended colorization results for $256 \times 256$ church images.
