# 2. Denoising Diffusion Probabilistic Models (DDPM): Principles

DDPM performs noising and denoising through two Markov chains. The forward (noising) chain transforms the original data into an easy-to-handle noise distribution, usually taken to be a normal (Gaussian) distribution; the reverse (denoising) chain transforms samples drawn from that noise distribution into newly generated data.

## 2.1. DDPM Forward Markov Chain (noising chain): $q(x_t|x_{t-1})$

$$q(x_0,x_1,\dots,x_T)=q(x_0)\prod_{t=1}^{T} q(x_t|x_{t-1})$$

$$q(x_t|x_{t-1})=N\!\left(\sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right) \iff x_t=\sqrt{1-\beta_t}\,x_{t-1}+\sqrt{\beta_t}\,z_{t-1},\quad z_{t-1}\sim N(0,I)$$

The $\beta_t \in (0,1)$ are hyperparameters fixed before training. This process is easy to interpret: at each noising step, one part of the new state is a scaled copy of the previous data, and the other part is freshly injected Gaussian noise.
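The single noising step above can be sketched in a few lines of numpy (the concrete $\beta_t$ value and vector size here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_prev, beta_t, rng):
    """One step of the noising chain:
    x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * z,  z ~ N(0, I)."""
    z = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * z

x0 = rng.standard_normal(4)                  # toy "data" vector
x1 = forward_step(x0, beta_t=0.02, rng=rng)  # one noising step
```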

Iterating this step yields a closed form for noising directly from $x_0$ (writing $a_i := 1-\beta_i$):

$$q(x_k|x_0)=N\!\left(\left(\prod_{i=1}^{k}\sqrt{a_i}\right)x_0,\ \left(1-\prod_{i=1}^{k}a_i\right)I\right)$$

To see this, unroll the first two steps:

$$x_1=\sqrt{1-\beta_1}\,x_0+\sqrt{\beta_1}\,z_0$$
$$x_2=\sqrt{1-\beta_2}\,x_1+\sqrt{\beta_2}\,z_1=\sqrt{1-\beta_2}\left(\sqrt{1-\beta_1}\,x_0+\sqrt{\beta_1}\,z_0\right)+\sqrt{\beta_2}\,z_1$$

Expanding $x_2$ gives:

$$x_2=\sqrt{1-\beta_2}\sqrt{1-\beta_1}\,x_0+\sqrt{1-\beta_2}\sqrt{\beta_1}\,z_0+\sqrt{\beta_2}\,z_1$$

The two noise terms are independent Gaussians, so their sum is Gaussian with variance $(1-\beta_2)\beta_1+\beta_2=\beta_1+\beta_2-\beta_1\beta_2$:

$$q(x_2|x_0)=N\!\left(\sqrt{1-\beta_1}\sqrt{1-\beta_2}\,x_0,\ (\beta_1+\beta_2-\beta_1\beta_2)I\right)$$

which can be rewritten as

$$q(x_2|x_0)=N\!\left(\sqrt{1-\beta_1}\sqrt{1-\beta_2}\,x_0,\ \big(1-(1-\beta_1)(1-\beta_2)\big)I\right)$$

By induction, the general closed form follows:

$$q(x_k|x_0)=N\!\left(\left(\prod_{i=1}^{k}\sqrt{a_i}\right)x_0,\ \left(1-\prod_{i=1}^{k}a_i\right)I\right)$$

Since every $a_i<1$, the product vanishes as $k\to\infty$, so

$$\lim_{k \to \infty} N\!\left(\left(\prod_{i=1}^{k}\sqrt{a_i}\right)x_0,\ \left(1-\prod_{i=1}^{k}a_i\right)I\right)=N(0,I)$$
$$\lim_{k \to \infty} q(x_k)=\int q(x_k|x_0)\,q(x_0)\,dx_0=N(0,I)$$
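The closed form and the limit can be checked deterministically by propagating the per-step coefficient of $x_0$ and the accumulated noise variance through the recursion (a numpy sketch; the linear $\beta$ schedule from $10^{-4}$ to $0.02$ is an assumed choice, common in DDPM implementations):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # running product of a_i = 1 - beta_i

# Propagate x_t = sqrt(1-beta_t) x_{t-1} + sqrt(beta_t) z step by step:
# m tracks the coefficient of x_0, v tracks the total noise variance.
m, v = 1.0, 0.0
for beta in betas:
    m = np.sqrt(1.0 - beta) * m
    v = (1.0 - beta) * v + beta
```

After the loop, `m` matches $\sqrt{\prod a_i}$ and `v` matches $1-\prod a_i$, and `v` is already very close to 1, i.e. $q(x_T)$ is nearly $N(0,I)$.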

## 2.2. DDPM Reverse Markov Chain (denoising chain): $p(x_{t-1}|x_t)$

$$p(x_T)=q(x_T)\sim N(0,I)$$

$$x_{t-1}\sim N\!\left(\mu_\theta(x_t,t),\ \Sigma_\theta(x_t,t)\right)$$

This evidently requires two networks, $\mu_\theta(x_t,t)$ and $\Sigma_\theta(x_t,t)$: they take the step-$t$ state $x_t$ together with the timestep index $t$, and output a normal distribution from which $x_{t-1}$ is sampled.

$$[x_0 \rightarrow x_1 \rightarrow x_2 \rightarrow \dots \rightarrow x_{T-1}] \rightarrow x_T \rightarrow [\hat{x}_{T-1} \rightarrow \dots \rightarrow \hat{x}_{2}\rightarrow \hat{x}_{1}\rightarrow \hat{x}_{0}]$$

The goal of DDPM is to recover the original distribution as closely as possible. Note that this means recovering the *distribution*, not the individual data points: DDPM learns the reverse of the noising process. The objective is therefore clear: we want the joint distributions of $(x_0,x_1,\dots,x_{T-1},x_T)$ and $(\hat{x}_0,\hat{x}_1,\dots,\hat{x}_{T-1},\hat{x}_T)$ to be as similar as possible:

$$q^{*}=q(x_0,x_1,\dots,x_{T-1},x_T)=q(x_0)\prod_{i=1}^{T}q(x_i|x_{i-1})$$
$$p_{\theta}^{*}=p_{\theta}(\hat{x}_0,\hat{x}_1,\dots,\hat{x}_{T-1},\hat{x}_T)=p(\hat{x}_T)\prod_{i=1}^{T}p_{\theta}(\hat{x}_{i-1}|\hat{x}_{i})$$

$$Loss=KL\left[q^*\,\|\,p_{\theta}^*\right]=\sum q^*\log\frac{q^*}{p_{\theta}^*}=\sum q^*\log q^*-\sum q^*\log p_{\theta}^*$$

The first term does not depend on $\theta$, so it can be absorbed into a constant $K$:

$$Loss=-\sum q^*\log p_{\theta}^*+K=-E_{q^*}\left[\log p_{\theta}^*\right]+K$$

$$Loss=-E_{q^*}\left[\log p(\hat{x}_T)+\sum_{i=1}^{T}\log\frac{p_{\theta}(\hat{x}_{i-1}|\hat{x}_{i})}{q(x_i|x_{i-1})}\right]+K+M \ \ge\ -E_{q^*}\left[\log p_\theta(x_0)\right]$$

(Readers not interested in the proof of this inequality can skip the short analysis below.)

Jensen's inequality: if $f(x)$ is convex (concave), $\lambda_i>0$ and $\sum_{i=1}^N\lambda_i=1$, then

$$f\left(\sum_{i=1}^N\lambda_i x_i\right) \le (\ge) \sum_{i=1}^N\lambda_i f(x_i)$$
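Jensen's inequality can be illustrated numerically with the convex function $f(x)=-\log x$ (the weights and points below are arbitrary examples):

```python
import numpy as np

# f(x) = -log(x) is convex, so f(sum(lam*x)) <= sum(lam * f(x)).
lam = np.array([0.2, 0.3, 0.5])   # weights, sum to 1
x = np.array([1.0, 2.0, 4.0])

lhs = -np.log(np.dot(lam, x))     # f applied to the weighted mixture
rhs = np.dot(lam, -np.log(x))     # weighted mixture of f values
```

Here `lhs` ≈ −1.03 and `rhs` ≈ −0.90, so the convex direction of the inequality holds.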

$$Loss \ge -E_{q^*}\left[\sum_{i=1}^{T}\log\frac{p_{\theta}(\hat{x}_{i-1}|\hat{x}_{i})}{q(x_i|x_{i-1})}\right] \ge -\log\left(\sum_{k=1}^{M}\sum_{i=1}^{T} q(x_0^k)\cdots q(x_T^k|x_{T-1}^k)\,\frac{p_{\theta}(\hat{x}_{i-1}^k|\hat{x}_{i}^k)}{q(x_i^k|x_{i-1}^k)}\right)$$

$$-\log\left(\sum_{k=1}^{M}\sum_{i=1}^{T} q(x_0^k)\cdots q(x_T^k|x_{T-1}^k)\,\frac{p_{\theta}(\hat{x}_{i-1}^k|\hat{x}_{i}^k)}{q(x_i^k|x_{i-1}^k)}\right) \ge -\log\left(\sum_{k=1}^{M}\sum_{i=1}^{T} q^{*(k)}\,p_{\theta}(\hat{x}_{i-1}^k|\hat{x}_{i}^k)\right)$$

$$-\log\left(\sum_{k=1}^{M}\sum_{i=1}^{T} q^{*(k)}\,p_{\theta}(\hat{x}_{i-1}^k|\hat{x}_{i}^k)\right) \ge -\sum_{k=1}^{M} q^{*(k)}\sum_{i=1}^{T}\log p_{\theta}(\hat{x}_{i-1}^k|\hat{x}_{i}^k)$$

$$Loss \ge -\sum_{k=1}^{M} q^{*(k)}\sum_{i=1}^{T}\log p_{\theta}(\hat{x}_{i-1}^k|\hat{x}_{i}^k) = -E_{q^*}\left[\log p_\theta(x_0)\right]$$

## 2.3. DDPM Loss Function and Optimization Objective

### 2.3.1. High-Variance Loss Update via Direct Monte Carlo Sampling

$$Loss=-E_{q^*}\left[\log p(\hat{x}_T)+\sum_{i=1}^{T}\log\frac{p_{\theta}(\hat{x}_{i-1}|\hat{x}_{i})}{q(x_i|x_{i-1})}\right]+K+M$$

$$\theta = \arg\max_{\theta}\ E_{q^*}\left[\log p(\hat{x}_T)+\sum_{i=1}^{T}\log\frac{p_{\theta}(\hat{x}_{i-1}|\hat{x}_{i})}{q(x_i|x_{i-1})}\right]$$
$$Loss=-E_{q^*}\left[\log p(\hat{x}_T)+\sum_{i=1}^{T}\log\frac{p_{\theta}(\hat{x}_{i-1}|\hat{x}_{i})}{q(x_i|x_{i-1})}\right] \quad (1^*)$$

Monte Carlo simulation can estimate this loss and is (to a degree) unbiased, but it has a huge defect and fatal weakness: extremely high variance, which is clearly harmful to the model. To train a model stably, we need to rewrite $(1^*)$ into a low-variance form; this is the core of DDPM and is covered in Section 2.3.2.

### 2.3.2. Low-Variance (Prior-Posterior) Loss Update (Key Part)

$$Loss=-E_{q^*}\left[\log p(\hat{x}_T)+\sum_{i=1}^{T}\log\frac{p_{\theta}(\hat{x}_{i-1}|\hat{x}_{i})}{q(x_i|x_{i-1})}\right] \quad (1^*)$$

Splitting off the $i=1$ term:

$$Loss=-E_{q^*}\left[\log p(\hat{x}_T)+\sum_{i=2}^{T}\log\frac{p_{\theta}(\hat{x}_{i-1}|\hat{x}_{i})}{q(x_i|x_{i-1})}+\log\frac{p_{\theta}(\hat{x}_{0}|\hat{x}_{1})}{q(x_1|x_{0})}\right] \quad (2^*)$$

By the Markov property, $q(x_i|x_{i-1})=q(x_i|x_{i-1},x_0)$, so Bayes' rule conditioned on $x_0$ gives:

$$q(x_i|x_{i-1})=\frac{q(x_{i-1}|x_i,x_0)\,q(x_i|x_0)}{q(x_{i-1}|x_0)}$$

$$Loss=-E_{q^*}\left[\log p(\hat{x}_T)+\sum_{i=2}^{T}\log\frac{p_{\theta}(\hat{x}_{i-1}|\hat{x}_{i})}{q(x_{i-1}|x_i,x_0)}\cdot\frac{q(x_{i-1}|x_0)}{q(x_i|x_0)}+\log\frac{p_{\theta}(\hat{x}_{0}|\hat{x}_{1})}{q(x_1|x_{0})}\right] \quad (3^*)$$

The ratio term telescopes:

$$\sum_{i=2}^{T}\log\frac{q(x_{i-1}|x_0)}{q(x_i|x_0)}=\log q(x_1|x_0)-\log q(x_T|x_0)$$

so $(3^*)$ can be rewritten in the following simple form, using $\hat{x}_T=x_T$ as stated above:

$$Loss=-E_{q^*}\left[\log\frac{p(x_T)}{q(x_T|x_0)}+\sum_{i=2}^{T}\log\frac{p_{\theta}(\hat{x}_{i-1}|\hat{x}_{i})}{q(x_{i-1}|x_i,x_0)}+\log p_{\theta}(\hat{x}_{0}|\hat{x}_{1})\right] \quad (4^*)$$

Pulling the minus sign inside turns each ratio into a KL divergence:

$$Loss=E_{q^*}\left[KL\big[q(x_T|x_0)\,\|\,p(x_T)\big]+\sum_{i=2}^{T}KL\big[q(x_{i-1}|x_i,x_0)\,\|\,p_{\theta}(\hat{x}_{i-1}|\hat{x}_{i})\big]-\log p_{\theta}(\hat{x}_{0}|\hat{x}_{1})\right]$$

Writing $L_T:=KL\big[q(x_T|x_0)\,\|\,p(x_T)\big]$, $L_{i-1}:=KL\big[q(x_{i-1}|x_i,x_0)\,\|\,p_{\theta}(\hat{x}_{i-1}|\hat{x}_{i})\big]$, and $L_0:=\log p_{\theta}(\hat{x}_{0}|\hat{x}_{1})$:

$$Loss=E_{q^*}\left[L_T+\sum_{i=2}^{T}L_{i-1}-L_0\right]=E_{q^*}\left[\sum_{i=1}^{T}L_{i}-L_0\right] \quad (5^*)$$

We now need the posterior $q(x_{i-1}|x_i,x_0)$ in closed form. Rearranging the Bayes relation above:

$$q(x_i|x_{i-1})=\frac{q(x_{i-1}|x_i,x_0)\,q(x_i|x_0)}{q(x_{i-1}|x_0)} \ \Rightarrow\ q(x_{i-1}|x_i,x_0)=\frac{q(x_i|x_{i-1})\,q(x_{i-1}|x_0)}{q(x_i|x_0)}$$

$$q(x_i|x_{i-1})\sim N\!\left(\sqrt{a_i}\,x_{i-1},\ (1-a_i)I\right)$$
$$q(x_{i-1}|x_0)\sim N\!\left(\left(\prod_{k=1}^{i-1}\sqrt{a_k}\right)x_0,\ \left(1-\prod_{k=1}^{i-1}a_k\right)I\right)$$
$$q(x_{i}|x_0)\sim N\!\left(\left(\prod_{k=1}^{i}\sqrt{a_k}\right)x_0,\ \left(1-\prod_{k=1}^{i}a_k\right)I\right)$$

Combining the densities (the third one sits in the denominator, hence the minus sign), the exponent is

$$\exp\left(-\frac{1}{2}\left[\frac{(x_i-\sqrt{a_i}\,x_{i-1})^2}{1-a_i}+\frac{\left(x_{i-1}-\sqrt{\prod_{k=1}^{i-1}a_k}\,x_0\right)^2}{1-\prod_{k=1}^{i-1}a_k}-\frac{\left(x_{i}-\sqrt{\prod_{k=1}^{i}a_k}\,x_0\right)^2}{1-\prod_{k=1}^{i}a_k}\right]\right) \quad (**)$$

Collecting the terms in $x_{i-1}$, with $C(x_i,x_0)$ absorbing everything that does not involve $x_{i-1}$:

$$\exp\left(-\frac{1}{2}\left[\left(\frac{a_i}{1-a_i}+\frac{1}{1-\prod_{k=1}^{i-1}a_k}\right)x_{i-1}^2-\left(\frac{2\sqrt{a_i}}{1-a_i}x_i+\frac{2\sqrt{\prod_{k=1}^{i-1}a_k}}{1-\prod_{k=1}^{i-1}a_k}x_0\right)x_{i-1}\right]+C(x_i,x_0)\right)$$

Completing the square shows that the posterior is Gaussian:

$$q(x_{i-1}|x_i,x_0)\sim N(\mu_i,\ \sigma_i^2 I)$$

$$\mu_i=\left(\frac{\sqrt{a_i}}{1-a_i}x_i+\frac{\sqrt{\prod_{k=1}^{i-1}a_k}}{1-\prod_{k=1}^{i-1}a_k}x_0\right)\Bigg/\left(\frac{a_i}{1-a_i}+\frac{1}{1-\prod_{k=1}^{i-1}a_k}\right)$$
$$\sigma_i^2=1\Bigg/\left(\frac{a_i}{1-a_i}+\frac{1}{1-\prod_{k=1}^{i-1}a_k}\right)$$
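These posterior formulas can be sanity-checked numerically: $\sigma_i^2$ simplifies to the familiar $\tilde\beta_i=\frac{1-\prod_{k=1}^{i-1}a_k}{1-\prod_{k=1}^{i}a_k}\beta_i$, and $\mu_i$ matches its usual closed form. The schedule and test values below are arbitrary assumptions:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 100)  # assumed schedule
a = 1.0 - betas
abar = np.cumprod(a)

i = 50                                # an arbitrary interior step
ai, abar_prev, abar_i = a[i], abar[i - 1], abar[i]

# sigma_i^2 = 1 / (a_i/(1-a_i) + 1/(1-abar_{i-1}))
var = 1.0 / (ai / (1.0 - ai) + 1.0 / (1.0 - abar_prev))
# ...which equals beta_tilde_i = (1-abar_{i-1})/(1-abar_i) * beta_i
beta_tilde = (1.0 - abar_prev) / (1.0 - abar_i) * (1.0 - ai)

# The mean formula, and its standard simplified form:
xi, x0 = 0.7, -1.3
mu = (np.sqrt(ai) / (1 - ai) * xi
      + np.sqrt(abar_prev) / (1 - abar_prev) * x0) * var
mu_std = (np.sqrt(ai) * (1 - abar_prev) * xi
          + np.sqrt(abar_prev) * (1 - ai) * x0) / (1 - abar_i)
```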

## 2.4. The DDPM Algorithm

DDPM is built from the following key components:

$$Loss^*=E_{q^*}\left[\sum_{i=2}^{T}L_{i-1}-L_0\right]$$

$$q(x_{i-1}|x_i,x_0)\sim N(\mu_i,\ \sigma_i^2 I)$$
$$\mu_i=\left(\frac{\sqrt{a_i}}{1-a_i}x_i+\frac{\sqrt{\prod_{k=1}^{i-1}a_k}}{1-\prod_{k=1}^{i-1}a_k}x_0\right)\Bigg/\left(\frac{a_i}{1-a_i}+\frac{1}{1-\prod_{k=1}^{i-1}a_k}\right)$$
$$\sigma_i^2=1\Bigg/\left(\frac{a_i}{1-a_i}+\frac{1}{1-\prod_{k=1}^{i-1}a_k}\right)$$

$$L_{t-1}=KL\big[q(x_{t-1}|x_t,x_0)\,\|\,p_{\theta}(\hat{x}_{t-1}|\hat{x}_{t})\big]=KL\big[N(\mu_t,\sigma_t^2 I)\,\|\,N(\mu_\theta,\sigma_t^2 I)\big]$$

For two Gaussians with the same covariance, the KL divergence is

$$KL\big[N(\mu_t,\sigma_t^2 I)\,\|\,N(\mu_\theta,\sigma_t^2 I)\big]=\int_x \frac{1}{\sqrt{2\pi}\,\sigma_t}\,e^{-\frac{(x-\mu_t)^2}{2\sigma_t^2}}\cdot\frac{1}{2\sigma_t^2}\left[(x-\mu_\theta)^2-(x-\mu_t)^2\right]dx$$

The two pieces evaluate to

$$\int_x \frac{1}{\sqrt{2\pi}\,\sigma_t}\,e^{-\frac{(x-\mu_t)^2}{2\sigma_t^2}}\cdot\frac{(x-\mu_t)^2}{2\sigma_t^2}\,dx=\frac{1}{2\sigma_t^2}E\big[(x-E(x))^2\big]=\frac{1}{2\sigma_t^2}D(x)=\frac{1}{2}$$
$$\int_x \frac{1}{\sqrt{2\pi}\,\sigma_t}\,e^{-\frac{(x-\mu_t)^2}{2\sigma_t^2}}\cdot\frac{(x-\mu_\theta)^2}{2\sigma_t^2}\,dx=\frac{\sigma_t^2+\|\mu_\theta-\mu_t\|^2}{2\sigma_t^2}$$
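The equal-variance Gaussian KL above can be verified by numerical integration (a numpy sketch; the concrete $\mu$ and $\sigma$ values are arbitrary):

```python
import numpy as np

# q = N(mu_t, sigma^2), p = N(mu_theta, sigma^2), same variance.
mu_t, mu_theta, sigma = 0.3, 1.1, 0.5
x = np.linspace(-8.0, 9.0, 200001)
dx = x[1] - x[0]
q = np.exp(-(x - mu_t) ** 2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# log(q/p) written out exactly, avoiding underflow in the tails
log_ratio = ((x - mu_theta) ** 2 - (x - mu_t) ** 2) / (2 * sigma**2)
kl_numeric = np.sum(q * log_ratio) * dx   # Riemann-sum approximation

kl_closed = (mu_theta - mu_t) ** 2 / (2 * sigma**2)
```

The numeric integral and the closed form $\|\mu_\theta-\mu_t\|^2/(2\sigma^2)$ agree to high precision.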

Subtracting the two pieces leaves exactly the squared difference of the means:

$$E_{q^*}[L_{t-1}]=E_{q^*}\left[\frac{\|\mu_t(x_t,x_0)-\mu_\theta(x_t,t)\|^2}{2\sigma_t^2}\right]+C$$

Substituting the reparameterized forward sample $x_0=\big(x_t-\sqrt{1-\prod_{i=1}^{t}a_i}\,z\big)\big/\sqrt{\prod_{i=1}^{t}a_i}$ into $\mu_t$ gives

$$\mu_t(x_t,x_0)=\frac{1}{\sqrt{a_t}}\left(x_t-\frac{1-a_t}{\sqrt{1-\prod_{i=1}^{t}a_i}}\,z\right)$$
$$E_{q^*}[L_{t-1}]-C=E_{x_0,z}\left[\frac{1}{2\sigma_t^2}\left\|\frac{1}{\sqrt{a_t}}\left(x_t-\frac{1-a_t}{\sqrt{1-\prod_{i=1}^{t}a_i}}\,z\right)-\mu_\theta(x_t,t)\right\|^2\right]$$

Parameterizing the model mean the same way, with a network $z_\theta$ that predicts the noise:

$$\mu_\theta(x_t,t)=\frac{1}{\sqrt{a_t}}\left(x_t-\frac{1-a_t}{\sqrt{1-\prod_{i=1}^{t}a_i}}\,z_\theta(x_t,t)\right)$$

$$E_{x_0,z}\left[\frac{1}{2\sigma_t^2}\left\|\frac{1}{\sqrt{a_t}}\left(x_t-\frac{1-a_t}{\sqrt{1-\prod_{i=1}^{t}a_i}}\,z\right)-\frac{1}{\sqrt{a_t}}\left(x_t-\frac{1-a_t}{\sqrt{1-\prod_{i=1}^{t}a_i}}\,z_\theta\right)\right\|^2\right]$$
$$E_{q^*}[L_{t-1}]-C=E_{x_0,z}\left[\frac{1}{2\sigma_t^2}\cdot\frac{(1-a_t)^2}{a_t\left(1-\prod_{i=1}^{t}a_i\right)}\,\|z-z_\theta(x_t,t)\|^2\right]$$
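The rewriting of the posterior mean $\mu_t$ in terms of the sampled noise $z$ can be verified numerically (a numpy sketch; schedule and step values are assumptions):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 100)  # assumed schedule
a = 1.0 - betas
abar = np.cumprod(a)

t = 60
at, abar_prev, abar_t = a[t], abar[t - 1], abar[t]

rng = np.random.default_rng(1)
x0 = rng.standard_normal()
z = rng.standard_normal()
xt = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * z  # reparameterized x_t

# Posterior mean written in terms of (x_t, x_0)...
mu_posterior = (np.sqrt(at) * (1 - abar_prev) * xt
                + np.sqrt(abar_prev) * (1 - at) * x0) / (1 - abar_t)
# ...equals the noise-parameterized form 1/sqrt(a_t)(x_t - (1-a_t)/sqrt(1-abar_t) z)
mu_noise = (xt - (1 - at) / np.sqrt(1 - abar_t) * z) / np.sqrt(at)
```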

Averaging over $t$ and dropping the $t$-dependent weighting factor yields the simplified training objective:

$$Loss=E_t\big(E_{q^*}[L_{t-1}]\big)\ \approx\ E_{x_0,z,t}\left[\|z-z_\theta(x_t,t)\|^2\right]$$
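One Monte Carlo sample of this training objective can be sketched as follows; `z_theta` is a hypothetical stand-in for the noise-prediction network (in practice a U-Net or similar):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.02, T)  # assumed schedule
abar = np.cumprod(1.0 - betas)

def z_theta(x_t, t):
    # Hypothetical stand-in for the trained noise predictor.
    return np.zeros_like(x_t)

def training_loss(x0, rng):
    """One Monte Carlo sample of E_{x0,z,t}[||z - z_theta(x_t, t)||^2]."""
    t = rng.integers(0, T)                   # random timestep
    z = rng.standard_normal(x0.shape)        # fresh noise
    x_t = np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * z
    return np.sum((z - z_theta(x_t, t)) ** 2)

loss = training_loss(rng.standard_normal(8), rng)
```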

Finally, one reverse sampling step draws fresh noise $z\sim N(0,I)$ scaled by $\sigma_t$:

$$x_{t-1}=\mu_\theta(x_t,t)+\sigma_t z=\frac{1}{\sqrt{a_t}}\left(x_t-\frac{1-a_t}{\sqrt{1-\prod_{i=1}^{t}a_i}}\,z_\theta(x_t,t)\right)+\sigma_t z$$
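The full ancestral sampling loop can be sketched as follows; again `z_theta` is a hypothetical stand-in for the trained network, and $\sigma_t^2=\beta_t$ is one common choice of reverse variance:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.02, T)  # assumed schedule
a = 1.0 - betas
abar = np.cumprod(a)

def z_theta(x_t, t):
    # Hypothetical stand-in for the trained noise predictor.
    return np.zeros_like(x_t)

x = rng.standard_normal(8)          # start from x_T ~ N(0, I)
for t in range(T - 1, -1, -1):
    mu = (x - (1 - a[t]) / np.sqrt(1 - abar[t]) * z_theta(x, t)) / np.sqrt(a[t])
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0  # no noise at t = 0
    x = mu + np.sqrt(betas[t]) * noise   # sigma_t^2 = beta_t (common choice)
x0_hat = x
```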