DDPM Diffusion Model Formula Derivation: Loss Function

2.3 Loss function derivation

We construct the loss function following the idea of maximum likelihood estimation:

$$\mathcal{L} = -\log p_\theta(x_0)$$

That is, we seek the parameters $\theta$ of the reverse diffusion network that make the original sample $x_0$ most likely to occur.

Next, we need to transform this formula, using ideas from the ELBO and VAEs.

2.3.1 ELBO

Given the likelihood $p_\theta(x_0)$, the observation $x_0$, and the latents $x_{1:T}$ obtained by diffusing the observation, the marginal probability formula gives:

$$p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}$$
therefore
$$
\begin{aligned}
\log p_\theta(x_0) &= \log \int p_\theta(x_{0:T})\, dx_{1:T} \\
&= \log \int \frac{p_\theta(x_{0:T})\, q_\phi(x_{1:T}\mid x_0)}{q_\phi(x_{1:T}\mid x_0)}\, dx_{1:T} \\
&= \log \mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right] \\
&\geq \mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right]
\end{aligned}
$$
The last step follows from Jensen's inequality.
This does not seem very intuitive, and there is another way of derivation that is simpler:
$$
\begin{aligned}
\log p(x) &= \log p(x) \int q_\phi(z\mid x)\, dz \\
&= \int q_\phi(z\mid x)\, \log p(x)\, dz \\
&= \mathbb{E}_{q_\phi(z\mid x)}\left[\log p(x)\right] \\
&= \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{p(x, z)}{p(z\mid x)}\right] \\
&= \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{p(x, z)\, q_\phi(z\mid x)}{p(z\mid x)\, q_\phi(z\mid x)}\right] \\
&= \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{p(x, z)}{q_\phi(z\mid x)}\right] + \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{q_\phi(z\mid x)}{p(z\mid x)}\right] \\
&= \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{p(x, z)}{q_\phi(z\mid x)}\right] + D_{\mathrm{KL}}\left(q_\phi(z\mid x)\,\|\, p(z\mid x)\right) \\
&\geq \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{p(x, z)}{q_\phi(z\mid x)}\right] \\
&= \mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right]
\end{aligned}
$$
Here $z$ stands for $x_{1:T}$ and $x$ stands for $x_0$; Bayes' rule is used in the middle step.

The conclusions derived by the two methods here are:
$$\log p_\theta(x_0) \geq \mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right]$$

where $\mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right]$ is the ELBO (Evidence Lower Bound), i.e., the variational lower bound.
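As a quick numeric sanity check of this bound (a toy sketch, not from the source: a 1-D latent model with $p(z)=\mathcal{N}(0,1)$ and $p(x\mid z)=\mathcal{N}(z,1)$, so $p(x)=\mathcal{N}(0,2)$ is known in closed form), a Monte Carlo estimate of the ELBO stays below the true log-likelihood for a mismatched $q$:

```python
import numpy as np

# Toy 1-D check that the ELBO lower-bounds log p(x).
# Model: p(z) = N(0,1), p(x|z) = N(z,1)  =>  p(x) = N(0,2).
rng = np.random.default_rng(0)

def log_normal(v, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

x = 1.3                                  # an arbitrary observation
true_logpx = log_normal(x, 0.0, 2.0)     # exact log p(x)

# a deliberately mismatched variational q(z|x) = N(0.2, 1.0)
m, v = 0.2, 1.0
z = rng.normal(m, np.sqrt(v), size=200_000)
elbo = np.mean(log_normal(z, 0.0, 1.0)   # log p(z)
               + log_normal(x, z, 1.0)   # log p(x|z)
               - log_normal(z, m, v))    # - log q(z|x)
print(true_logpx, elbo)                  # the ELBO sits below log p(x)
```

The gap is exactly $D_{\mathrm{KL}}(q(z\mid x)\,\|\,p(z\mid x))$, which is why maximizing the ELBO over $q$ tightens the bound.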

To minimize the loss function $-\log p_\theta(x_0)$, we equivalently maximize the ELBO $\mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right]$.

We define

$$L_{\mathrm{VLB}} = -\mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right] = \mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{q_\phi(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right]$$

and minimize it in place of the original loss function. Breaking it down further:

$$
\begin{aligned}
L_{\mathrm{VLB}} &= \mathbb{E}_{q(x_{0:T})}\left[\log \frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right] \\
&= \mathbb{E}_q\left[\log \frac{\prod_{t=1}^T q(x_t\mid x_{t-1})}{p_\theta(x_T)\prod_{t=1}^T p_\theta(x_{t-1}\mid x_t)}\right] \\
&= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=1}^T \log \frac{q(x_t\mid x_{t-1})}{p_\theta(x_{t-1}\mid x_t)}\right] \\
&= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=2}^T \log \frac{q(x_t\mid x_{t-1})}{p_\theta(x_{t-1}\mid x_t)} + \log \frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \\
&= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=2}^T \log\left(\frac{q(x_{t-1}\mid x_t, x_0)}{p_\theta(x_{t-1}\mid x_t)} \cdot \frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)}\right) + \log \frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \\
&= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=2}^T \log \frac{q(x_{t-1}\mid x_t, x_0)}{p_\theta(x_{t-1}\mid x_t)} + \sum_{t=2}^T \log \frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)} + \log \frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \\
&= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=2}^T \log \frac{q(x_{t-1}\mid x_t, x_0)}{p_\theta(x_{t-1}\mid x_t)} + \log \frac{q(x_T\mid x_0)}{q(x_1\mid x_0)} + \log \frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \\
&= \mathbb{E}_q\left[\log \frac{q(x_T\mid x_0)}{p_\theta(x_T)} + \sum_{t=2}^T \log \frac{q(x_{t-1}\mid x_t, x_0)}{p_\theta(x_{t-1}\mid x_t)} - \log p_\theta(x_0\mid x_1)\right] \\
&= \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\left(q(x_T\mid x_0)\,\|\, p_\theta(x_T)\right)}_{L_T} + \sum_{t=2}^T \underbrace{D_{\mathrm{KL}}\left(q(x_{t-1}\mid x_t, x_0)\,\|\, p_\theta(x_{t-1}\mid x_t)\right)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0\mid x_1)}_{L_0}\Big]
\end{aligned}
$$

Note that the expectation's subscript changes from $q_\phi(x_{1:T}\mid x_0)$ to $q(x_{0:T})$ here; since $x_0$ is known, the two expressions are equivalent (and the forward process has no learnable parameters, so the subscript $\phi$ is dropped).

Another potentially confusing point: $q(x_{1:T}\mid x_0)$ describes the forward diffusion process, while $q(x_{t-1}\mid x_t)$ describes the true distribution of the reverse process. The symbol $q$ itself carries no forward or reverse meaning; it simply denotes a probability distribution, like the $P$ of probability theory. By contrast, $p_\theta(x_{t-1}\mid x_t)$ denotes the reverse diffusion distribution we want to learn.

Next we discuss the three terms $L_T$, $L_{t-1}$, and $L_0$ case by case:

2.3.2 $L_T$

$q(x_{1:T}\mid x_0)$ represents the forward diffusion process, which has no learnable parameters. In $p_\theta(x_T)$, $x_T$ is noise following a standard Gaussian distribution, and for the reverse diffusion process $p_\theta$, $x_T$ is known. Therefore the term $L_T$ can be treated as a constant.
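This claim can be checked numerically; a small sketch (assuming the linear $\beta_t$ schedule from the DDPM paper, $\beta_1=10^{-4}$ to $\beta_T=0.02$, $T=1000$) computing $L_T = D_{\mathrm{KL}}\left(q(x_T\mid x_0)\,\|\,\mathcal{N}(0,\mathbf{I})\right)$ with $q(x_T\mid x_0)=\mathcal{N}(\sqrt{\bar\alpha_T}\,x_0,\,(1-\bar\alpha_T)\mathbf{I})$:

```python
import numpy as np

# Numerically check that L_T is negligible under a standard linear
# beta schedule, so it can indeed be treated as a constant.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule from the DDPM paper
alphas_bar = np.cumprod(1.0 - betas)

x0 = rng.uniform(-1, 1, size=1024)      # a normalized "image"
mu = np.sqrt(alphas_bar[-1]) * x0       # mean of q(x_T | x_0)
var = 1.0 - alphas_bar[-1]              # variance of q(x_T | x_0)

# KL( N(mu, var*I) || N(0, I) ), summed over dimensions
L_T = 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))
print(L_T)                              # tiny: q(x_T | x_0) is almost N(0, I)
```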

2.3.3 $L_{t-1}$

$L_{t-1}$ is the KL divergence between the true reverse distribution $q(x_{t-1}\mid x_t, x_0)$ and the reverse diffusion distribution $p_\theta(x_{t-1}\mid x_t)$ that we want to learn.

  1. For the distribution $q(x_{t-1}\mid x_t, x_0)$ we have already obtained the mean and variance:
    $$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_t\right), \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\cdot\beta_t$$
  2. The second distribution $p_\theta(x_{t-1}\mid x_t)$ is the target we want to fit. It is also Gaussian: its mean is estimated by the network and its variance is fixed to $\beta_t$:
    $$p_\theta(x_{t-1}\mid x_t) = \mathcal{N}\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right)$$
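As a sanity check on $\tilde{\mu}_t$ above, it can be compared against the equivalent form of the posterior mean written directly in terms of $x_0$ and $x_t$ (a sketch; the linear $\beta_t$ schedule and the $x_0$-form coefficients $\frac{\sqrt{\bar\alpha_{t-1}}\beta_t}{1-\bar\alpha_t}$ and $\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}$ are taken from the DDPM paper, not derived in this section):

```python
import numpy as np

# Check that the epsilon-form of the posterior mean matches the x0-form.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

t = 500                                   # an arbitrary timestep (0-indexed)
x0 = rng.uniform(-1, 1, size=16)
eps = rng.standard_normal(16)
xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps

# epsilon-form: (x_t - (1-alpha_t)/sqrt(1-alpha_bar_t) * eps) / sqrt(alpha_t)
mu_eps = (xt - (1 - alphas[t]) / np.sqrt(1 - alphas_bar[t]) * eps) / np.sqrt(alphas[t])

# x0-form of the same posterior mean of q(x_{t-1} | x_t, x_0)
mu_x0 = (np.sqrt(alphas_bar[t - 1]) * betas[t] / (1 - alphas_bar[t]) * x0
         + np.sqrt(alphas[t]) * (1 - alphas_bar[t - 1]) / (1 - alphas_bar[t]) * xt)

beta_tilde = (1 - alphas_bar[t - 1]) / (1 - alphas_bar[t]) * betas[t]
print(np.max(np.abs(mu_eps - mu_x0)))     # ~0: the two forms agree
```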

Therefore, to bring these two distributions close, we can ignore the variance and only minimize the distance between the two means, expressed as a squared $\ell_2$ norm:
$$
\begin{aligned}
L_{t-1} &= \mathbb{E}_q\left[\left\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right] \\
&= \mathbb{E}_{x_0,\epsilon}\left[\left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t(x_0,\epsilon) - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right) - \mu_\theta\left(x_t(x_0,\epsilon),\, t\right)\right\|^2\right], \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})
\end{aligned}
$$
Observing this formula, we need $\mu_\theta\left(x_t(x_0,\epsilon),\, t\right)$ to fit $\frac{1}{\sqrt{\alpha_t}}\left(x_t(x_0,\epsilon) - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right)$, so we define:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

That is, the neural network $\epsilon_\theta(x_t, t)$ directly predicts the noise $\epsilon$; the predicted noise is then substituted into this expression to obtain the predicted mean.

So the loss function becomes:
$$
\begin{aligned}
L_{t-1} &= \mathbb{E}_{x_0,\epsilon}\left[\left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right) - \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)\right\|^2\right] \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \\
&= \mathbb{E}_{x_0,\epsilon}\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right] \quad \text{(dropping the constant coefficient; the authors found this trains better)} \\
&= \mathbb{E}_{x_0,\epsilon}\left[\left\|\epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\, t\right)\right\|^2\right], \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})
\end{aligned}
$$
The input to the network is the image $x_t$, a linear combination of the clean image and noise; the ground truth is the noise $\epsilon$ mixed into it, and the network $\epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\, t\right)$ is trained to fit this noise.
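Putting the simplified loss together, one training step might be sketched as follows (the `eps_model` below is a hypothetical stand-in for the real U-Net denoiser; the linear schedule is assumed from the DDPM paper):

```python
import numpy as np

# Sketch of one simplified-loss training step: sample t and noise, form x_t,
# and score the noise prediction with a mean squared error.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    # hypothetical stand-in: a real implementation would be a neural network
    return np.zeros_like(x_t)

x0 = rng.uniform(-1, 1, size=(4, 32 * 32))  # a batch of flattened "images"
t = rng.integers(0, T, size=(4, 1))         # uniform random timestep per sample
eps = rng.standard_normal(x0.shape)         # ground-truth noise

x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
loss = np.mean((eps - eps_model(x_t, t)) ** 2)   # L_simple = E||eps - eps_theta||^2
print(loss)   # ~1 for the zero predictor, since eps ~ N(0, I)
```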

2.3.4 $L_0$

Finally, $L_0 = -\log p_\theta(x_0\mid x_1)$ is the negative log-likelihood of generating the denoised image $x_0$ from the last noisy image $x_1$. To generate good images, we apply the likelihood to each pixel, so that every pixel value of the image satisfies a discrete log-likelihood.

To achieve this, the last step of the reverse diffusion process, from $x_1$ to $x_0$, is set up as independent discrete computations. That is, in this last transition, the image $x_0$ obtained from a given $x_1$ satisfies a log-likelihood in which the pixels are assumed independent of each other:

$$p_\theta(x_0\mid x_1) = \prod_{i=1}^{D} p_\theta\left(x_0^i\mid x_1^i\right)$$
$D$ is the dimension of $x$, and the superscript $i$ denotes a pixel position in the image. The goal is now to determine how likely each pixel value is, i.e., the distribution of the corresponding pixel value of the noisy image $x$ at timestep $t=1$:

$$\mathcal{N}\left(x;\, \mu_\theta^i(x_1, 1),\, \sigma_1^2\right)$$

where the pixel distribution at $t=1$ comes from a multivariate Gaussian whose diagonal covariance matrix lets us split it into a product of univariate Gaussians:

$$\mathcal{N}\left(x;\, \mu_\theta(x_1, 1),\, \sigma_1^2 \mathbf{I}\right) = \prod_{i=1}^{D} \mathcal{N}\left(x;\, \mu_\theta^i(x_1, 1),\, \sigma_1^2\right)$$
Now assume the image has been normalized from pixel values in $\{0, 1, \ldots, 255\}$ to the range $[-1, 1]$. Given the value of each pixel at $t=0$, the transition probability $p_\theta(x_0\mid x_1)$ is the product of the per-pixel probabilities. So:

$$
\begin{aligned}
p_\theta(x_0\mid x_1) &= \prod_{i=1}^{D} \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}\left(x;\, \mu_\theta^i(x_1, 1),\, \sigma_1^2\right) dx \\
\delta_+(x) &= \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases}
\qquad
\delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases}
\end{aligned}
$$
This formula comes from the original paper; here is an analysis of its meaning. We want the last noisy image $x_1$ to fit the denoised image $x_0$, modeling each of the image's $D$ pixels as a Gaussian. The original value range of each pixel of $x_0$ is $\{0, 1, \ldots, 255\}$, mapped to $[-1, 1]$ after normalization.

Now take a single pixel $x_1^i$ of $x_1$. Its distribution $\mathcal{N}\left(x;\, \mu_\theta^i(x_1, 1),\, \sigma_1^2\right)$ is given, and the target to fit is the pixel $x_0^i$ at the corresponding position of $x_0$. Since the original discrete range $\{0, 1, \ldots, 255\}$ of $x_0^i$ is mapped to the continuous space $[-1, 1]$, each original discrete value corresponds to an interval of the continuous space, whose endpoints are given by:

$$\delta_+(x) = \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases}
\qquad
\delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases}$$
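Concretely, the probability assigned to one discrete pixel value is the Gaussian mass over its bin $[\delta_-(x_0^i), \delta_+(x_0^i)]$. A small sketch (the $\sigma$ and the predicted means below are made-up illustration values, not from the source):

```python
import numpy as np
from math import erf, sqrt

# Per-pixel probability = Gaussian mass over the pixel's bin; edge bins
# at x = -1 and x = 1 extend to -inf / +inf respectively.
def gauss_cdf(x, mu, sigma):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def pixel_log_prob(x, mu, sigma):
    # x: normalized pixel value in [-1, 1] on the 1/255 half-width grid
    upper = gauss_cdf(x + 1 / 255, mu, sigma) if x < 1.0 else 1.0
    lower = gauss_cdf(x - 1 / 255, mu, sigma) if x > -1.0 else 0.0
    return np.log(max(upper - lower, 1e-12))

x0 = np.array([-1.0, 0.0, 0.5, 1.0])   # a toy 4-pixel "image"
mu = np.array([-0.9, 0.0, 0.4, 0.9])   # hypothetical predicted means
log_p = sum(pixel_log_prob(x, m, 0.1) for x, m in zip(x0, mu))
print(log_p)                           # log p_theta(x_0 | x_1): sum of per-pixel log masses
```

A better-matched mean gives a pixel more probability mass, which is exactly what minimizing $L_0$ encourages.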

This completes the analysis and derivation of the formulas involved in DDPM's forward and reverse diffusion processes, including the construction of the loss function.

References and blogs

Understanding Diffusion Models: A Unified Perspective
Denoising Diffusion Probabilistic Models
https://yinglinzheng.netlify.app/diffusion-model-tutorial
https://zhuanlan.zhihu.com/p/549623622

Origin blog.csdn.net/weixin_45453121/article/details/131223653