DDPM Model: Formula Derivation

Paper: Denoising Diffusion Probabilistic Models
Code implementation: DDPM model (PyTorch implementation)
Recommended video: 54. Probabilistic Diffusion Model: theory and a complete PyTorch code walkthrough

Mathematical prerequisites:

Joint probability:
$$P(A, B, C) = P(C \mid B, A)\, P(B, A) = P(C \mid B, A)\, P(B \mid A)\, P(A)$$
Conditional probability:
$$P(B, C \mid A) = P(B \mid A)\, P(C \mid A, B)$$
Markov chain:
$$p(X_{t+1} \mid X_t, \ldots, X_1) = p(X_{t+1} \mid X_t)$$
Bayes' rule:
$$P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_j P(B \mid A_j)\, P(A_j)}$$
Probability density function of the normal distribution $X \sim N(\mu, \sigma^2)$:
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Sum of two independent normal distributions $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$:
$$U = X + Y \sim N(\mu_X + \mu_Y,\ \sigma_X^2 + \sigma_Y^2)$$
KL divergence between two normal distributions $p$ and $q$:
$$KL(p, q) = \log \frac{\sigma_q}{\sigma_p} + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2} - \frac{1}{2}$$
Reparameterization trick:
$$\text{If } X \sim N(\mu, \sigma^2), \text{ then } Y = \frac{X - \mu}{\sigma} \sim N(0, 1)$$
Sampling $z$ from the normal distribution $X$ is equivalent to sampling $z'$ from the standard normal distribution $Y$ and setting $z = \mu + \sigma z'$ (see the sketch after this list).
Completing the square:
$$a x^2 + b x = a\left(x + \frac{b}{2a}\right)^2 + C$$
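
The reparameterization trick is used throughout the derivation below, so here is a minimal PyTorch sketch of it (the function name `sample_normal` is mine, not from the paper):

```python
import torch

def sample_normal(mu, sigma, n=10_000):
    # z = mu + sigma * z', with z' ~ N(0, I): the sample depends on
    # (mu, sigma) only through a deterministic, differentiable transform.
    z_prime = torch.randn(n)
    return mu + sigma * z_prime

samples = sample_normal(mu=3.0, sigma=2.0)
print(samples.mean().item(), samples.std().item())  # approx. 3.0 and 2.0
```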

Notation:

$t$: timestep (number of noising steps applied so far)
$T$: total number of timesteps (total number of noising steps)
$\mathbf{x}$: image
$\mathbf{x}_0$: image at the initial timestep
$\mathbf{x}_t$: image at timestep $t$
$\mathbf{x}_T$: image at the final timestep
$\mathbf{x}_0 \sim q(\mathbf{x}_0)$, where $q(\mathbf{x}_0)$ is the distribution of real images
$p_\theta(\mathbf{x}_0) := \int p_\theta(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T}$, where $p_\theta(\mathbf{x}_0)$ is the distribution of generated images
$\theta$: (network) parameters
$\beta_t$: variance of the noise added by the forward (diffusion) process at timestep $t$
$\beta$: noise variance schedule of length $T$, monotonically increasing within the interval $(0, 1)$ (see the sketch below)
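
A sketch of such a schedule in PyTorch, assuming the linear schedule from the paper ($\beta_1 = 10^{-4}$ up to $\beta_T = 0.02$ with $T = 1000$); $\alpha_t$ and $\bar{\alpha}_t$ are defined in the forward-process section below:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # beta_t, increasing within (0, 1)
alphas = 1.0 - betas                        # alpha_t := 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)   # alpha_bar_t := prod_{s<=t} alpha_s
```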

Reverse process:

The mathematical expression of the reverse (denoising) process:
$$p_\theta(\mathbf{x}_{0:T}) := p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$$
This is the joint probability distribution $p_\theta(\mathbf{x}_{0:T})$ of the images at all timesteps; the whole process is a Markov chain. Here,
$$p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$$
$p(\mathbf{x}_T)$ is a standard normal distribution; $\mathbf{x}_T$ is sampled from it and does not depend on the network parameters.
The mathematical expression of denoising at timestep $t$:
$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) := \mathcal{N}\left(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\right)$$
$\mathbf{x}_{t-1}$ follows a normal distribution with mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ and variance $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$. In the paper the authors set the variance $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$ to $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$ (experimentally, $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde{\beta}_t$ give similar results), so the variance does not depend on the model parameters ($\tilde{\beta}_t$ appears again in the derivation below).
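
A minimal sketch of one reverse step under this definition, assuming the mean has already been produced by the network (how $\boldsymbol{\mu}_\theta$ is parameterized is derived in the Loss section):

```python
import torch

def p_sample(mu_theta, sigma_t):
    # Draw x_{t-1} ~ N(mu_theta, sigma_t^2 I) via reparameterization:
    # x_{t-1} = mu_theta + sigma_t * z, with z ~ N(0, I).
    # (At the final step t = 1, the paper's sampling algorithm adds no noise.)
    z = torch.randn_like(mu_theta)
    return mu_theta + sigma_t * z
```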

Forward process:

The mathematical expression of the diffusion (forward) process:
$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$$
Given the initial image $\mathbf{x}_0$, this is the joint probability distribution of all later timesteps ($t > 0$); the whole process is a Markov chain.
The mathematical expression of adding noise at timestep $t$:
$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) := \mathcal{N}\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\, \mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right)$$
$\mathbf{x}_t$ follows a normal distribution with mean $\sqrt{1-\beta_t}\, \mathbf{x}_{t-1}$ and variance $\beta_t$.
Using the reparameterization trick, the image $\mathbf{x}_t$ at any timestep can be determined from the initial image $\mathbf{x}_0$ and the noise variance schedule $\beta$. To simplify notation, define $\alpha_t := 1 - \beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$; then:
$$\begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\, \mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\, \boldsymbol{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\, \mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}}\, \boldsymbol{\epsilon}_{t-2}\right) + \sqrt{1-\alpha_t}\, \boldsymbol{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}}\, \mathbf{x}_{t-2} + \sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\, \boldsymbol{\epsilon}_{t-2} + \sqrt{1-\alpha_t}\, \boldsymbol{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}}\, \mathbf{x}_{t-2} + \sqrt{\alpha_t - \alpha_t \alpha_{t-1} + 1 - \alpha_t}\, \overline{\boldsymbol{\epsilon}}_{t-2} \\ &= \sqrt{\alpha_t \alpha_{t-1}}\, \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \overline{\boldsymbol{\epsilon}}_{t-2} \\ &= \ldots \\ &= \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon} \end{aligned}$$
Rearranging, $\mathbf{x}_0$ can be expressed in terms of $\mathbf{x}_t$ and $\boldsymbol{\epsilon}$:
$$\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}\right)$$
(the step that merges the two noise terms uses the sum formula for two independent normal distributions)
where $\boldsymbol{\epsilon}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
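
This closed form means any $\mathbf{x}_t$ can be produced in a single step. A sketch, reusing the linear schedule from above and a scalar timestep `t` for simplicity:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, eps=None):
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)
    if eps is None:
        eps = torch.randn_like(x0)
    ab = alpha_bars[t]  # for a batched t, reshape to broadcast over image dims
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
```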

Loss:

An upper bound on the negative log-likelihood:
$$\begin{aligned} \mathbb{E}_q\left[-\log p_\theta(\mathbf{x}_0)\right] &\leq \mathbb{E}_q\left[-\log p_\theta(\mathbf{x}_0)\right] + D_{\mathrm{KL}}\left(q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{1:T} \mid \mathbf{x}_0)\right) \\ &= \mathbb{E}_q\left[-\log p_\theta(\mathbf{x}_0)\right] + \mathbb{E}_q\left[\log \frac{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T}) / p_\theta(\mathbf{x}_0)}\right] \\ &= \mathbb{E}_q\left[-\log p_\theta(\mathbf{x}_0)\right] + \mathbb{E}_q\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} + \log p_\theta(\mathbf{x}_0)\right] \\ &= \mathbb{E}_q\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right] \end{aligned}$$
(The bound holds because the KL divergence is non-negative.)
Define the loss function $L$:
$$\mathbb{E}_q\left[-\log p_\theta(\mathbf{x}_0)\right] \leq \mathbb{E}_q\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right] := L$$
Further derivation of $L$:
$$\begin{aligned} L &= \mathbb{E}_q\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right] \\ &= \mathbb{E}_q\left[-\log p(\mathbf{x}_T) - \sum_{t \geq 1} \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}\right] \\ &= \mathbb{E}_q\left[-\log p(\mathbf{x}_T) - \sum_{t > 1} \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})} - \log \frac{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}{q(\mathbf{x}_1 \mid \mathbf{x}_0)}\right] \\ &= \mathbb{E}_q\left[-\log p(\mathbf{x}_T) - \sum_{t > 1} \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} \cdot \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)} - \log \frac{p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}{q(\mathbf{x}_1 \mid \mathbf{x}_0)}\right] \\ &= \mathbb{E}_q\left[-\log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T \mid \mathbf{x}_0)} - \sum_{t > 1} \log \frac{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)} - \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\right] \\ &= \mathbb{E}_q\left[D_{\mathrm{KL}}\left(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\right) + \sum_{t > 1} D_{\mathrm{KL}}\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right) - \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)\right] \end{aligned}$$
(The fourth line rewrites $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ as $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0)$ by the Markov property and applies Bayes' rule; in the fifth line the ratios $q(\mathbf{x}_{t-1} \mid \mathbf{x}_0) / q(\mathbf{x}_t \mid \mathbf{x}_0)$ telescope, and the $q(\mathbf{x}_1 \mid \mathbf{x}_0)$ term cancels.)
Define $L_T$, $L_{t-1}$, and $L_0$ as follows:
$$L = \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\left(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p(\mathbf{x}_T)\right)}_{L_T} + \sum_{t > 1} \underbrace{D_{\mathrm{KL}}\left(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\right)}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}_{L_0}\Big]$$
The first term $L_T$ does not depend on the network parameters $\theta$ and can be ignored.
Analysis of the third term $L_0$:
$$\begin{aligned} p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) &= \prod_{i=1}^{D} \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}\left(x;\ \mu_\theta^i(\mathbf{x}_1, 1),\ \sigma_1^2\right) dx \\ \delta_+(x) &= \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases} \qquad \delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases} \end{aligned}$$
Here $D$ is the data dimensionality and $i$ indexes the coordinates (pixels). This amounts to a change from a continuous space to a discrete one: the continuous Gaussian is integrated over each pixel's bin to yield a discrete distribution matching the (discrete) input image data.
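
A sketch of this discretization in PyTorch, assuming images scaled to $[-1, 1]$ with 256 levels (the function name is mine):

```python
import torch
from torch.distributions import Normal

def discretized_gaussian_prob(x0, mu, sigma):
    # Per-pixel probability mass: integrate N(x; mu, sigma^2) over the bin
    # [x0 - 1/255, x0 + 1/255]; the edge values -1 and 1 get open-ended bins.
    dist = Normal(mu, sigma)
    upper = torch.where(x0 >= 1.0, torch.full_like(x0, float("inf")), x0 + 1.0 / 255)
    lower = torch.where(x0 <= -1.0, torch.full_like(x0, float("-inf")), x0 - 1.0 / 255)
    # Multiplying these masses over all pixels gives p_theta(x0 | x1).
    return dist.cdf(upper) - dist.cdf(lower)
```
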
Analysis of the second term $L_{t-1}$:
For the first argument of the KL divergence, $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, set
$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_{t-1};\ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\ \tilde{\beta}_t \mathbf{I}\right)$$
Using Bayes' rule and completing the square, compute $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$:
$$\begin{aligned} q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) &= q(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0)\, \frac{q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)}{q(\mathbf{x}_t \mid \mathbf{x}_0)} \\ &\propto \exp\left(-\frac{1}{2}\left(\frac{\left(\mathbf{x}_t - \sqrt{\alpha_t}\, \mathbf{x}_{t-1}\right)^2}{\beta_t} + \frac{\left(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\, \mathbf{x}_0\right)^2}{1-\bar{\alpha}_{t-1}} - \frac{\left(\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0\right)^2}{1-\bar{\alpha}_t}\right)\right) \\ &= \exp\left(-\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right) \mathbf{x}_{t-1}^2 - \left(\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}} \mathbf{x}_0\right) \mathbf{x}_{t-1} + C(\mathbf{x}_t, \mathbf{x}_0)\right)\right) \end{aligned}$$
Reading off the quadratic coefficient gives the variance $\tilde{\beta}_t$ (using $\alpha_t + \beta_t = 1$):
$$\tilde{\beta}_t = 1 \Big/ \left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right) = \frac{1-\bar{\alpha}_{t-1}}{\alpha_t + \beta_t - \bar{\alpha}_t} \cdot \beta_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \cdot \beta_t$$
and, via completing the square, the mean $\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)$:
$$\begin{aligned} \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) &= \left(\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}} \mathbf{x}_0\right) \Big/ \left(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right) \\ &= \frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{\alpha_t + \beta_t - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{\alpha_t + \beta_t - \bar{\alpha}_t} \mathbf{x}_0 \\ &= \frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1-\bar{\alpha}_t} \mathbf{x}_0 \end{aligned}$$
Expressing $\mathbf{x}_0$ in terms of $\mathbf{x}_t$ and $\boldsymbol{\epsilon}$:
$$\begin{aligned} \tilde{\boldsymbol{\mu}}_t &= \frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1-\bar{\alpha}_t} \cdot \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}\right) \\ &= \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \boldsymbol{\epsilon}\right) \end{aligned}$$
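
These posterior quantities translate directly into code. A sketch, reusing the schedule from above and taking $\bar{\alpha}_0 = 1$ (the empty product) for the $t = 1$ edge case:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)
alpha_bars_prev = torch.cat([torch.ones(1), alpha_bars[:-1]])  # alpha_bar_{t-1}

# beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
posterior_variance = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas

def posterior_mean(x_t, x0, t):
    # mu_tilde_t = sqrt(alpha_t)(1 - alpha_bar_{t-1})/(1 - alpha_bar_t) * x_t
    #            + sqrt(alpha_bar_{t-1}) * beta_t / (1 - alpha_bar_t) * x0
    coef_t = alphas[t].sqrt() * (1.0 - alpha_bars_prev[t]) / (1.0 - alpha_bars[t])
    coef_0 = alpha_bars_prev[t].sqrt() * betas[t] / (1.0 - alpha_bars[t])
    return coef_t * x_t + coef_0 * x0
```
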
The second argument of the KL divergence, $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, was already defined in the reverse process; the authors set the variance $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$ to $\sigma_t^2 = \tilde{\beta}_t$:
$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) := \mathcal{N}\left(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\right) = \mathcal{N}\left(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \sigma_t^2\right)$$
Using the KL-divergence formula for two normal distributions (here with equal variances, so only the mean term survives):
$$L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\right\|^2\right] + C$$
where $C$ is a constant that does not depend on $\theta$.
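
A quick numerical check that the Gaussian-KL formula from the prerequisites collapses to a scaled squared mean difference when the variances are equal, using `torch.distributions.kl_divergence`:

```python
import torch
from torch.distributions import Normal, kl_divergence

mu_p, mu_q, sigma = 0.7, 0.3, 0.5
kl = kl_divergence(Normal(mu_p, sigma), Normal(mu_q, sigma))
closed_form = (mu_p - mu_q) ** 2 / (2 * sigma ** 2)
print(kl.item(), closed_form)  # both approx. 0.32
```
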
Express $\mathbf{x}_t$ in terms of $\mathbf{x}_0$ and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$:
$$\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}) = \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}$$
Then $L_{t-1} - C$ can be written as:
$$\begin{aligned} L_{t-1} - C &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{1}{2\sigma_t^2}\left\|\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}),\ \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}) - \sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}\right)\right) - \boldsymbol{\mu}_\theta\left(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}), t\right)\right\|^2\right] \\ &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{1}{2\sigma_t^2}\left\|\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}) - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \boldsymbol{\epsilon}\right) - \boldsymbol{\mu}_\theta\left(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}), t\right)\right\|^2\right] \end{aligned}$$
The above formula shows that, given $\mathbf{x}_t$, the network output $\boldsymbol{\mu}_\theta\left(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}), t\right)$ should predict $\frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}) - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \boldsymbol{\epsilon}\right)$.
However, the authors did not build the network this way. Instead, they parameterize $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ as:
$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t,\ \frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t)\right)\right) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right)$$
That is, given the inputs $\mathbf{x}_t$ and $t$, what the network actually outputs is $\boldsymbol{\epsilon}_\theta$ (the predicted noise), while $\mathbf{x}_t$ itself can be expressed through $\mathbf{x}_0$. The final $L_{t-1} - C$ becomes:
$$\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[\frac{\beta_t^2}{2\sigma_t^2\, \alpha_t\left(1-\bar{\alpha}_t\right)}\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon},\ t\right)\right\|^2\right]$$
In practice, the authors found it better to drop the weighting coefficient in front and train on the simplified loss:
$$L_{\text{simple}}(\theta) := \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon},\ t\right)\right\|^2\right]$$
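
A sketch of one training step under $L_{\text{simple}}$, where `model` stands for a hypothetical $\boldsymbol{\epsilon}_\theta$ network taking $(\mathbf{x}_t, t)$:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def l_simple(model, x0):
    # t ~ Uniform{1..T} (0-indexed here), eps ~ N(0, I); form x_t in
    # closed form and regress the predicted noise against the true noise.
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    return ((eps - model(x_t, t)) ** 2).mean()
```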

Origin: blog.csdn.net/Peach_____/article/details/128694125