A detailed breakdown of the diffusion model training objective: why the variational lower bound (VLB) is used and how it is derived step by step

1. Goal: maximize the likelihood of the data under the model

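Our ultimate goal is to make the model distribution $p_\theta(\mathbf{x}_0)$ as close as possible to the true data distribution $q(\mathbf{x}_0)$. A natural criterion is to maximize the log-likelihood that the model assigns to real data: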

$\max_\theta \log p_\theta(\mathbf{x}_0)$

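However, for a latent-variable model such as a diffusion model, computing $p_\theta(\mathbf{x}_0)$ directly is intractable, because it requires integrating over all possible latent variables, that is, over all possible noise trajectories $\mathbf{x}_{1:T}$: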

$p_\theta(\mathbf{x}_0) = \int p_\theta(\mathbf{x}_{0:T}) d\mathbf{x}_{1:T} = \int p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) d\mathbf{x}_{1:T}$

This integral runs over a space of dimension $T \times d$, where $d$ is the dimension of the data, and cannot be computed directly.

2. Solution: introduce the variational lower bound (VLB)

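When an objective such as $\log p_\theta(\mathbf{x}_0)$ cannot be optimized directly, the standard technique in variational inference is to optimize a lower bound on it instead. The bound is easier to compute, and raising the bound indirectly raises the original objective.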

To obtain the bound, we introduce a proposal distribution $q(\mathbf{x}_{1:T} | \mathbf{x}_0)$, which is the familiar, fixed forward process. The derivation then proceeds as follows:

$\begin{aligned} \log p_\theta(\mathbf{x}_0) &= \log \int p_\theta(\mathbf{x}_{0:T}) d\mathbf{x}_{1:T} \quad \quad \quad \quad \quad \quad \text{(intractable integral)} \\ &= \log \int \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} q(\mathbf{x}_{1:T} | \mathbf{x}_0) d\mathbf{x}_{1:T} \quad \text{(multiply and divide by } q) \\ &= \log \left( \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \right) \end{aligned}$

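Now we apply Jensen's inequality. Because $\log$ is concave, the logarithm of an expectation is at least the expectation of the logarithm (the reverse of the direction that holds for convex functions):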

$\log \left( \mathbb{E}[\ldots] \right) \geq \mathbb{E} \left[ \log (\ldots) \right]$

Applying this:

$\log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right]$

The right-hand side, $\mathbb{E}_{q} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right]$, is the evidence lower bound (ELBO), also called the variational lower bound (VLB). We define it to be $-L_{\text{VLB}}$, so the goal changes from maximizing the log-likelihood to minimizing $L_{\text{VLB}}$.

$\log p_\theta(\mathbf{x}_0) \geq -L_{\text{VLB}} \quad \Rightarrow \quad \text{objective: } \min_\theta L_{\text{VLB}}$

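Why the VLB?

Tractability: $L_{\text{VLB}}$ is an expectation under $q$, so it can be approximated by Monte Carlo sampling, which makes optimization feasible.
Theoretical guarantee: the gap between the log-likelihood and the bound is exactly a KL divergence, $\log p_\theta(\mathbf{x}_0) = -L_{\text{VLB}} + D_{\text{KL}}(q(\mathbf{x}_{1:T} | \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{1:T} | \mathbf{x}_0))$, where $q$ is the fixed forward process (the approximate posterior) and $p_\theta(\mathbf{x}_{1:T} | \mathbf{x}_0)$ is the model's true posterior. Minimizing $L_{\text{VLB}}$ therefore raises the log-likelihood, pulls the model posterior toward $q$, or both.
Interpretability: it decomposes into terms that each have a clear meaning.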
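The three properties above can be checked on a toy model where everything is available in closed form. The sketch below is only an illustration under stated assumptions: a single diffusion step ($T = 1$), scalar data, and a hypothetical linear-Gaussian reverse model with hand-picked parameters `a` and `s2`. None of these values come from the text. It estimates the ELBO by Monte Carlo, compares it with the exact $\log p_\theta(\mathbf{x}_0)$, and compares the gap with the KL divergence between $q$ and the model posterior.

```python
import math
import random


def log_normal(x, mean, var):
    """Log-density of a scalar Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def gaussian_kl(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


# Hypothetical toy values, chosen only for illustration.
x0 = 1.0          # observed data point
beta = 0.5        # forward step: q(x1 | x0) = N(sqrt(1 - beta) * x0, beta)
a, s2 = 0.5, 1.0  # reverse model: p(x0 | x1) = N(a * x1, s2), prior p(x1) = N(0, 1)

# Exact log-likelihood: marginalizing x1 gives p(x0) = N(0, a^2 + s2).
log_px0 = log_normal(x0, 0.0, a * a + s2)

# Monte Carlo estimate of ELBO = E_q[ log p(x1) + log p(x0 | x1) - log q(x1 | x0) ].
rng = random.Random(0)
q_mean, q_var = math.sqrt(1.0 - beta) * x0, beta
n_samples = 20000
total = 0.0
for _ in range(n_samples):
    x1 = rng.gauss(q_mean, math.sqrt(q_var))
    total += (log_normal(x1, 0.0, 1.0)
              + log_normal(x0, a * x1, s2)
              - log_normal(x1, q_mean, q_var))
elbo = total / n_samples

# Exact model posterior p(x1 | x0) of this linear-Gaussian model.
post_var = s2 / (a * a + s2)
post_mean = a * x0 / (a * a + s2)
kl_gap = gaussian_kl(q_mean, q_var, post_mean, post_var)

print(f"log p(x0)        = {log_px0:.4f}")
print(f"ELBO (MC)        = {elbo:.4f}")
print(f"log p(x0) - ELBO = {log_px0 - elbo:.4f}")
print(f"KL(q || post)    = {kl_gap:.4f}")
```

Up to Monte Carlo noise, the last two printed numbers should agree, which is the identity stated in the "theoretical guarantee" item.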

3. Derivation: decomposing $L_{\text{VLB}}$ into interpretable terms

We now derive the decomposition of $L_{\text{VLB}}$ in detail.

$\begin{aligned} L_{\text{VLB}} &= \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \log \frac{q(\mathbf{x}_{1:T} | \mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \right] \\ &= \mathbb{E}_{q} \left[ \log \frac{ \prod_{t=1}^T q(\mathbf{x}_t | \mathbf{x}_{t-1}) }{ p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) } \right] \quad \text{(substitute the Markov chain factorizations)} \\ &= \mathbb{E}_{q} \left[ -\log p(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{q(\mathbf{x}_t | \mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)} \right] \end{aligned}$

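The $t = 1$ term needs separate treatment. Both $q(\mathbf{x}_1 | \mathbf{x}_0)$ and $p_\theta(\mathbf{x}_0 | \mathbf{x}_1)$ are well defined, but the Bayes rewrite used in the next step would require $q(\mathbf{x}_0 | \mathbf{x}_1, \mathbf{x}_0)$ at $t = 1$, which is degenerate. We therefore pull the $t = 1$ term out of the sum and let the summation index start at $t = 2$:

$L_{\text{VLB}} = \mathbb{E}_{q} \left[ -\log p(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t | \mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)} + \log \frac{q(\mathbf{x}_1 | \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 | \mathbf{x}_1)} \right]$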

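Now comes the key trick: rewrite $q(\mathbf{x}_t | \mathbf{x}_{t-1})$ with Bayes' theorem. By the Markov property, $q(\mathbf{x}_t | \mathbf{x}_{t-1}) = q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0)$, so every factor can be conditioned on $\mathbf{x}_0$: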

$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) q(\mathbf{x}_t | \mathbf{x}_0)}{q(\mathbf{x}_{t-1} | \mathbf{x}_0)}$
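The Bayes identity above can be verified numerically in the scalar case. This is a minimal sketch, assuming the usual Gaussian forward kernels $q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$ and the standard closed-form posterior mean and variance. The three-step schedule and the sample values are hypothetical.

```python
import math


def log_normal(x, mean, var):
    """Log-density of a scalar Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


# Hypothetical schedule; betas[t - 1] holds beta_t.
betas = [0.1, 0.2, 0.3]
alpha_bars = []
running = 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bars.append(running)  # alpha_bars[t - 1] holds alpha_bar_t

t = 3
beta_t = betas[t - 1]
alpha_t = 1.0 - beta_t
ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]

# Arbitrary values for x_0, x_{t-1}, x_t; the identity holds pointwise.
x0, x_prev, x_t = 1.3, 0.4, -0.2

# Left-hand side: log q(x_t | x_{t-1}).
lhs = log_normal(x_t, math.sqrt(alpha_t) * x_prev, beta_t)

# Closed-form posterior q(x_{t-1} | x_t, x_0).
post_mean = (math.sqrt(ab_prev) * beta_t * x0
             + math.sqrt(alpha_t) * (1.0 - ab_prev) * x_t) / (1.0 - ab_t)
post_var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t

# Right-hand side: log of q(x_{t-1} | x_t, x_0) q(x_t | x_0) / q(x_{t-1} | x_0).
rhs = (log_normal(x_prev, post_mean, post_var)
       + log_normal(x_t, math.sqrt(ab_t) * x0, 1.0 - ab_t)
       - log_normal(x_prev, math.sqrt(ab_prev) * x0, 1.0 - ab_prev))

print(lhs, rhs)
```

The two printed values should coincide up to floating-point error for any choice of `x0`, `x_prev`, and `x_t`.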

Substituting this into the $\sum_{t=2}^T$ part of the expression above:

$\begin{aligned} \sum_{t=2}^T \log \frac{q(\mathbf{x}_t | \mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)} &= \sum_{t=2}^T \left[ \log \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)} + \log \frac{q(\mathbf{x}_t | \mathbf{x}_0)}{q(\mathbf{x}_{t-1} | \mathbf{x}_0)} \right] \\ &= \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)} + \log \frac{q(\mathbf{x}_T | \mathbf{x}_0)}{q(\mathbf{x}_1 | \mathbf{x}_0)} \quad \text{(telescoping sum)} \end{aligned}$
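The telescoping step is easy to confirm numerically: the intermediate marginals $q(\mathbf{x}_t | \mathbf{x}_0)$ cancel in pairs, leaving only the $t = T$ and $t = 1$ terms. The sketch below assumes scalar data, the Gaussian marginal $q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$, and a hypothetical four-step schedule and trajectory.

```python
import math


def log_normal(x, mean, var):
    """Log-density of a scalar Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


# Hypothetical schedule and trajectory; index t - 1 holds the value for step t.
betas = [0.1, 0.2, 0.3, 0.4]
alpha_bars = []
running = 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bars.append(running)

x0 = 0.7
xs = [0.5, -0.1, 0.9, 0.2]  # arbitrary x_1 .. x_T
T = len(betas)


def log_q_marginal(t):
    """log q(x_t | x_0) for a 1-based step index t."""
    ab = alpha_bars[t - 1]
    return log_normal(xs[t - 1], math.sqrt(ab) * x0, 1.0 - ab)


# Sum of log-ratios over t = 2 .. T versus the two surviving boundary terms.
summed = sum(log_q_marginal(t) - log_q_marginal(t - 1) for t in range(2, T + 1))
collapsed = log_q_marginal(T) - log_q_marginal(1)

print(summed, collapsed)
```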

Now substitute this result back into the expression for $L_{\text{VLB}}$:

$\begin{aligned} L_{\text{VLB}} &= \mathbb{E}_{q} \left[ \log \frac{q(\mathbf{x}_1 | \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 | \mathbf{x}_1)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)} + \log \frac{q(\mathbf{x}_T | \mathbf{x}_0)}{q(\mathbf{x}_1 | \mathbf{x}_0)} - \log p(\mathbf{x}_T) \right] \\ &= \mathbb{E}_{q} \left[ -\log p_\theta(\mathbf{x}_0 | \mathbf{x}_1) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)} + \log \frac{q(\mathbf{x}_T | \mathbf{x}_0)}{p(\mathbf{x}_T)} \right] \quad \text{(combine terms)} \end{aligned}$

Finally, using the definition of the KL divergence, $D_{\text{KL}}(p \parallel q) = \mathbb{E}_p[\log \frac{p}{q}]$, we can rewrite this in a more compact form:

$L_{\text{VLB}} = \mathbb{E}_{q} \left[ \underbrace{D_{\text{KL}}(q(\mathbf{x}_T | \mathbf{x}_0) \parallel p(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_{\text{KL}}(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t))}_{L_{t-1}} \underbrace{-\log p_\theta(\mathbf{x}_0 | \mathbf{x}_1)}_{L_0} \right]$

This is the final decomposition of $L_{\text{VLB}}$.
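The step from an expected log-ratio to a KL divergence can be spelled out for a single term of the sum. It relies on the factorization $q(\mathbf{x}_{t-1}, \mathbf{x}_t | \mathbf{x}_0) = q(\mathbf{x}_t | \mathbf{x}_0)\, q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$ and on the fact that the integrand depends only on $\mathbf{x}_{t-1}$ and $\mathbf{x}_t$:

$\begin{aligned} \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \log \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)} \right] &= \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \left[ \mathbb{E}_{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)} \left[ \log \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)} \right] \right] \\ &= \mathbb{E}_{q(\mathbf{x}_t | \mathbf{x}_0)} \left[ D_{\text{KL}}(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)) \right] \end{aligned}$

The outer expectation over $\mathbf{x}_t$ is what remains of $\mathbb{E}_q$ in the compact form. The $L_T$ term follows in the same way, with the inner expectation taken over $q(\mathbf{x}_T | \mathbf{x}_0)$.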

4. Meaning of the terms

Each of the three terms now has a clear interpretation:

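$L_T$: prior matching term.
- This is the KL divergence between the final distribution of the forward process, $q(\mathbf{x}_T | \mathbf{x}_0)$, and the prior $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ from which the reverse process starts.
- Note: if the variance schedule $\beta_t$ of the forward process is chosen so that $\bar{\alpha}_T \approx 0$, then $q(\mathbf{x}_T | \mathbf{x}_0)$ is very close to $\mathcal{N}(\mathbf{0}, \mathbf{I})$. The term has no learnable parameters and its value is close to zero, so it can be ignored during training.
$L_0$: reconstruction term.
- This is analogous to the reconstruction term in a VAE. It measures how well the model reconstructs the original image $\mathbf{x}_0$ given the first-step latent $\mathbf{x}_1$.
- In the original DDPM this term is handled by a separate discrete decoder, which yields proper log-likelihoods for discrete pixel values. In practice its value is much smaller than the $L_{t-1}$ terms, and its effect can usually be neglected.
$L_{t-1}$ ($2 \le t \le T$): denoising matching term.
- This is the core of the whole loss function.
- It compares the tractable posterior of the forward process, $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$, with the learned denoising distribution $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$.
- The goal is to make $p_\theta$ as close to $q$ as possible. If $p_\theta$ matched $q$ perfectly, the reverse process would exactly invert the forward process in distribution.
- $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$ is Gaussian, with mean and variance available in closed form. We can therefore match it with another Gaussian $p_\theta$ (whose variance can be fixed to $\sigma_t^2 \mathbf{I}$). Once the variances agree, matching the two Gaussians comes down to matching their means.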
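Two of the claims in the list above lend themselves to a quick numerical check: that $L_T$ is close to zero when $\bar{\alpha}_T \approx 0$, and that the KL divergence between two Gaussians with equal variance depends only on the squared difference of their means. The sketch below is an illustration, not the author's code. It assumes scalar data and a linear schedule from $10^{-4}$ to $0.02$ over $T = 1000$ steps (the values commonly used with DDPM, which the text does not specify). The means and variance in the second part are hypothetical.

```python
import math


def gaussian_kl(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


# Part 1: L_T under an assumed linear beta schedule.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar_T = 1.0
for b in betas:
    alpha_bar_T *= 1.0 - b

x0 = 1.0  # hypothetical data value (one pixel, scaled to [-1, 1])
# q(x_T | x_0) = N(sqrt(alpha_bar_T) * x0, 1 - alpha_bar_T); prior p(x_T) = N(0, 1).
L_T = gaussian_kl(math.sqrt(alpha_bar_T) * x0, 1.0 - alpha_bar_T, 0.0, 1.0)
print(f"alpha_bar_T = {alpha_bar_T:.3e}, L_T per dimension = {L_T:.3e}")

# Part 2: with equal variances the KL reduces to a scaled squared mean difference.
sigma2 = 0.05            # hypothetical shared variance sigma_t^2
mu_q, mu_p = 0.30, 0.25  # hypothetical posterior mean and model mean
kl_equal_var = gaussian_kl(mu_q, sigma2, mu_p, sigma2)
mean_gap = (mu_q - mu_p) ** 2 / (2.0 * sigma2)
print(kl_equal_var, mean_gap)
```

In the second part the two printed values agree, which is why fixing the variance turns $L_{t-1}$ into a mean-matching problem.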

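The subsequent derivation (reparameterization, converting mean prediction into noise prediction) builds entirely on minimizing this $L_{t-1}$ term, and ends with the compact noise-prediction loss $\| \epsilon - \epsilon_\theta \|^2$.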
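The link between mean matching and noise prediction can be sketched in the scalar case. The code assumes the standard reparameterization $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and the usual parameterization of the model mean in terms of a predicted noise $\epsilon_\theta$. The schedule is hypothetical, `eps_theta` is a stand-in value rather than the output of a real network, and setting $\sigma_t^2$ to the posterior variance is one possible choice.

```python
import math
import random

# Hypothetical schedule; index t - 1 holds the value for step t.
betas = [0.1, 0.2, 0.3]
alpha_bars = []
running = 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bars.append(running)

t = 2
beta_t = betas[t - 1]
alpha_t = 1.0 - beta_t
ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]

# Sample x_t from q(x_t | x_0) via the reparameterization trick.
rng = random.Random(0)
x0 = 0.8
eps = rng.gauss(0.0, 1.0)
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1.0 - ab_t) * eps

# Posterior mean written in terms of (x_0, x_t) and in terms of (x_t, eps).
mean_from_x0 = (math.sqrt(ab_prev) * beta_t * x0
                + math.sqrt(alpha_t) * (1.0 - ab_prev) * x_t) / (1.0 - ab_t)
mean_from_eps = (x_t - beta_t / math.sqrt(1.0 - ab_t) * eps) / math.sqrt(alpha_t)

# Stand-in for a network prediction (assumption: off by 0.1 from the true noise).
eps_theta = eps + 0.1
mu_theta = (x_t - beta_t / math.sqrt(1.0 - ab_t) * eps_theta) / math.sqrt(alpha_t)

# L_{t-1} with sigma_t^2 set to the posterior variance (one possible choice).
sigma2 = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
kl = (mean_from_x0 - mu_theta) ** 2 / (2.0 * sigma2)

# The same quantity as a weighted squared error between true and predicted noise.
weight = beta_t ** 2 / (2.0 * sigma2 * alpha_t * (1.0 - ab_t))
weighted_noise_error = weight * (eps - eps_theta) ** 2

print(mean_from_x0, mean_from_eps)
print(kl, weighted_noise_error)
```

Both pairs of printed values agree. Dropping the time-dependent `weight` leaves the plain squared error $\| \epsilon - \epsilon_\theta \|^2$ mentioned above.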

Summary

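The whole derivation can be summarized as:
Ultimate goal (intractable) -> introduce the VLB (tractable) -> decompose the VLB into three terms -> identify the core term $L_{t-1}$ -> derive the true posterior $q$ -> parameterize $p_\theta$ to match $q$ -> simplify to the noise-prediction objective.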