DDPM Diffusion Model Formula Derivation: Loss Function

2.3 Loss function derivation

We construct the loss function following the idea of maximum likelihood estimation:

$$\mathcal{L} = -\log p_\theta(x_0)$$

That is, we seek the parameters $\theta$ of the reverse diffusion network that make the original sample $x_0$ most likely to occur.

Next, we need to transform this formula, using ideas from the ELBO and VAEs.

2.3.1 ELBO

Given the likelihood $p_\theta(x_0)$, the observation $x_0$, and the latents $x_{1:T}$ obtained by diffusing the observation, the marginal probability formula gives:

$$p_\theta(x_0) = \int p_\theta(x_{0:T})\, dx_{1:T}$$
therefore
$$
\begin{aligned}
\log p_\theta(x_0) &= \log \int p_\theta(x_{0:T})\, dx_{1:T} \\
&= \log \int \frac{p_\theta(x_{0:T})\, q_\phi(x_{1:T}\mid x_0)}{q_\phi(x_{1:T}\mid x_0)}\, dx_{1:T} \\
&= \log \mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right] \\
&\geq \mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right]
\end{aligned}
$$
The last step follows from Jensen's inequality.
This does not seem very intuitive, and there is another way of derivation that is simpler:
$$
\begin{aligned}
\log p(x) &= \log p(x) \int q_\phi(z\mid x)\, dz \\
&= \int q_\phi(z\mid x)\, \log p(x)\, dz \\
&= \mathbb{E}_{q_\phi(z\mid x)}\left[\log p(x)\right] \\
&= \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{p(x, z)}{p(z\mid x)}\right] \\
&= \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{p(x, z)\, q_\phi(z\mid x)}{p(z\mid x)\, q_\phi(z\mid x)}\right] \\
&= \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{p(x, z)}{q_\phi(z\mid x)}\right] + \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{q_\phi(z\mid x)}{p(z\mid x)}\right] \\
&= \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{p(x, z)}{q_\phi(z\mid x)}\right] + D_{\mathrm{KL}}\left(q_\phi(z\mid x)\,\|\, p(z\mid x)\right) \\
&\geq \mathbb{E}_{q_\phi(z\mid x)}\left[\log \frac{p(x, z)}{q_\phi(z\mid x)}\right] \\
&= \mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right]
\end{aligned}
$$
Here $z$ stands for $x_{1:T}$ and $x$ stands for $x_0$; Bayes' rule is used in the middle step.

The conclusions derived by the two methods here are:
$$\log p_\theta(x_0) \geq \mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right]$$

where $\mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right]$ is the ELBO (Evidence Lower Bound), i.e., the variational lower bound.
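As a quick numeric sanity check of this bound (a toy sketch, not from the source: a 1-D latent model with $p(z)=\mathcal{N}(0,1)$ and $p(x\mid z)=\mathcal{N}(z,1)$, so $p(x)=\mathcal{N}(0,2)$ is known in closed form), a Monte Carlo estimate of the ELBO stays below the true log-likelihood for a mismatched $q$:

```python
import numpy as np

# Toy 1-D check that the ELBO lower-bounds log p(x).
# Model: p(z) = N(0,1), p(x|z) = N(z,1)  =>  p(x) = N(0,2).
rng = np.random.default_rng(0)

def log_normal(v, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

x = 1.3                                  # an arbitrary observation
true_logpx = log_normal(x, 0.0, 2.0)     # exact log p(x)

# a deliberately mismatched variational q(z|x) = N(0.2, 1.0)
m, v = 0.2, 1.0
z = rng.normal(m, np.sqrt(v), size=200_000)
elbo = np.mean(log_normal(z, 0.0, 1.0)   # log p(z)
               + log_normal(x, z, 1.0)   # log p(x|z)
               - log_normal(z, m, v))    # - log q(z|x)
print(true_logpx, elbo)                  # the ELBO sits below log p(x)
```

The gap is exactly $D_{\mathrm{KL}}(q(z\mid x)\,\|\,p(z\mid x))$, which is why maximizing the ELBO over $q$ tightens the bound.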

To minimize the loss function $-\log p_\theta(x_0)$, we equivalently maximize the ELBO $\mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right]$.

We define

$$L_{\mathrm{VLB}} = -\mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{p_\theta(x_{0:T})}{q_\phi(x_{1:T}\mid x_0)}\right] = \mathbb{E}_{q_\phi(x_{1:T}\mid x_0)}\left[\log \frac{q_\phi(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right]$$

and minimize it in place of the original loss function. Breaking it down further:

$$
\begin{aligned}
L_{\mathrm{VLB}} &= \mathbb{E}_{q(x_{0:T})}\left[\log \frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right] \\
&= \mathbb{E}_q\left[\log \frac{\prod_{t=1}^T q(x_t\mid x_{t-1})}{p_\theta(x_T)\prod_{t=1}^T p_\theta(x_{t-1}\mid x_t)}\right] \\
&= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=1}^T \log \frac{q(x_t\mid x_{t-1})}{p_\theta(x_{t-1}\mid x_t)}\right] \\
&= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=2}^T \log \frac{q(x_t\mid x_{t-1})}{p_\theta(x_{t-1}\mid x_t)} + \log \frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \\
&= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=2}^T \log\left(\frac{q(x_{t-1}\mid x_t, x_0)}{p_\theta(x_{t-1}\mid x_t)} \cdot \frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)}\right) + \log \frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \\
&= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=2}^T \log \frac{q(x_{t-1}\mid x_t, x_0)}{p_\theta(x_{t-1}\mid x_t)} + \sum_{t=2}^T \log \frac{q(x_t\mid x_0)}{q(x_{t-1}\mid x_0)} + \log \frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \\
&= \mathbb{E}_q\left[-\log p_\theta(x_T) + \sum_{t=2}^T \log \frac{q(x_{t-1}\mid x_t, x_0)}{p_\theta(x_{t-1}\mid x_t)} + \log \frac{q(x_T\mid x_0)}{q(x_1\mid x_0)} + \log \frac{q(x_1\mid x_0)}{p_\theta(x_0\mid x_1)}\right] \\
&= \mathbb{E}_q\left[\log \frac{q(x_T\mid x_0)}{p_\theta(x_T)} + \sum_{t=2}^T \log \frac{q(x_{t-1}\mid x_t, x_0)}{p_\theta(x_{t-1}\mid x_t)} - \log p_\theta(x_0\mid x_1)\right] \\
&= \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\left(q(x_T\mid x_0)\,\|\, p_\theta(x_T)\right)}_{L_T} + \sum_{t=2}^T \underbrace{D_{\mathrm{KL}}\left(q(x_{t-1}\mid x_t, x_0)\,\|\, p_\theta(x_{t-1}\mid x_t)\right)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0\mid x_1)}_{L_0}\Big]
\end{aligned}
$$

Note that the expectation's subscript changes from $q_\phi(x_{1:T}\mid x_0)$ to $q(x_{0:T})$ here; since $x_0$ is known, the two expressions are equivalent (and the forward process has no learnable parameters, so the subscript $\phi$ is dropped).

Another potentially confusing point: $q(x_{1:T}\mid x_0)$ describes the forward diffusion process, while $q(x_{t-1}\mid x_t)$ describes the true distribution of the reverse process. The symbol $q$ itself carries no forward or reverse meaning; it simply denotes a probability distribution, like the $P$ of probability theory. By contrast, $p_\theta(x_{t-1}\mid x_t)$ denotes the reverse diffusion distribution we want to learn.

Next we discuss the three terms $L_T$, $L_{t-1}$, and $L_0$ case by case:

2.3.2 $L_T$

$q(x_{1:T}\mid x_0)$ represents the forward diffusion process, which has no learnable parameters. In $p_\theta(x_T)$, $x_T$ is noise following a standard Gaussian distribution, and for the reverse diffusion process $p_\theta$, $x_T$ is known. Therefore the term $L_T$ can be treated as a constant.
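This claim can be checked numerically; a small sketch (assuming the linear $\beta_t$ schedule from the DDPM paper, $\beta_1=10^{-4}$ to $\beta_T=0.02$, $T=1000$) computing $L_T = D_{\mathrm{KL}}\left(q(x_T\mid x_0)\,\|\,\mathcal{N}(0,\mathbf{I})\right)$ with $q(x_T\mid x_0)=\mathcal{N}(\sqrt{\bar\alpha_T}\,x_0,\,(1-\bar\alpha_T)\mathbf{I})$:

```python
import numpy as np

# Numerically check that L_T is negligible under a standard linear
# beta schedule, so it can indeed be treated as a constant.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule from the DDPM paper
alphas_bar = np.cumprod(1.0 - betas)

x0 = rng.uniform(-1, 1, size=1024)      # a normalized "image"
mu = np.sqrt(alphas_bar[-1]) * x0       # mean of q(x_T | x_0)
var = 1.0 - alphas_bar[-1]              # variance of q(x_T | x_0)

# KL( N(mu, var*I) || N(0, I) ), summed over dimensions
L_T = 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))
print(L_T)                              # tiny: q(x_T | x_0) is almost N(0, I)
```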

2.3.3 $L_{t-1}$

$L_{t-1}$ is the KL divergence between the true reverse distribution $q(x_{t-1}\mid x_t, x_0)$ and the reverse diffusion distribution $p_\theta(x_{t-1}\mid x_t)$ that we want to learn.

  1. For the distribution $q(x_{t-1}\mid x_t, x_0)$ we have already obtained the mean and variance:
    $$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\varepsilon_t\right), \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\cdot\beta_t$$
  2. The second distribution $p_\theta(x_{t-1}\mid x_t)$ is the target we want to fit. It is also Gaussian: its mean is estimated by the network and its variance is fixed to $\beta_t$:
    $$p_\theta(x_{t-1}\mid x_t) = \mathcal{N}\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right)$$
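As a sanity check on $\tilde{\mu}_t$ above, it can be compared against the equivalent form of the posterior mean written directly in terms of $x_0$ and $x_t$ (a sketch; the linear $\beta_t$ schedule and the $x_0$-form coefficients $\frac{\sqrt{\bar\alpha_{t-1}}\beta_t}{1-\bar\alpha_t}$ and $\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}$ are taken from the DDPM paper, not derived in this section):

```python
import numpy as np

# Check that the epsilon-form of the posterior mean matches the x0-form.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

t = 500                                   # an arbitrary timestep (0-indexed)
x0 = rng.uniform(-1, 1, size=16)
eps = rng.standard_normal(16)
xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps

# epsilon-form: (x_t - (1-alpha_t)/sqrt(1-alpha_bar_t) * eps) / sqrt(alpha_t)
mu_eps = (xt - (1 - alphas[t]) / np.sqrt(1 - alphas_bar[t]) * eps) / np.sqrt(alphas[t])

# x0-form of the same posterior mean of q(x_{t-1} | x_t, x_0)
mu_x0 = (np.sqrt(alphas_bar[t - 1]) * betas[t] / (1 - alphas_bar[t]) * x0
         + np.sqrt(alphas[t]) * (1 - alphas_bar[t - 1]) / (1 - alphas_bar[t]) * xt)

beta_tilde = (1 - alphas_bar[t - 1]) / (1 - alphas_bar[t]) * betas[t]
print(np.max(np.abs(mu_eps - mu_x0)))     # ~0: the two forms agree
```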

Therefore, to bring these two distributions close, we can ignore the variance and only minimize the distance between the two means, expressed as a squared $\ell_2$ norm:
$$
\begin{aligned}
L_{t-1} &= \mathbb{E}_q\left[\left\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right] \\
&= \mathbb{E}_{x_0,\epsilon}\left[\left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t(x_0,\epsilon) - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right) - \mu_\theta\left(x_t(x_0,\epsilon),\, t\right)\right\|^2\right], \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})
\end{aligned}
$$
Observing this formula, we need $\mu_\theta\left(x_t(x_0,\epsilon),\, t\right)$ to fit $\frac{1}{\sqrt{\alpha_t}}\left(x_t(x_0,\epsilon) - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right)$, so we define:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

That is, the neural network $\epsilon_\theta(x_t, t)$ directly predicts the noise $\epsilon$; the predicted noise is then substituted into this expression to obtain the predicted mean.

So the loss function becomes:
$$
\begin{aligned}
L_{t-1} &= \mathbb{E}_{x_0,\epsilon}\left[\left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right) - \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)\right\|^2\right] \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \\
&= \mathbb{E}_{x_0,\epsilon}\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right] \quad \text{(dropping the constant coefficient; the authors found this trains better)} \\
&= \mathbb{E}_{x_0,\epsilon}\left[\left\|\epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\, t\right)\right\|^2\right], \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})
\end{aligned}
$$
The input to the network is the image $x_t$, a linear combination of the clean image and noise; the ground truth is the noise $\epsilon$ mixed into it, and the network $\epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\, t\right)$ is trained to fit this noise.
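Putting the simplified loss together, one training step might be sketched as follows (the `eps_model` below is a hypothetical stand-in for the real U-Net denoiser; the linear schedule is assumed from the DDPM paper):

```python
import numpy as np

# Sketch of one simplified-loss training step: sample t and noise, form x_t,
# and score the noise prediction with a mean squared error.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    # hypothetical stand-in: a real implementation would be a neural network
    return np.zeros_like(x_t)

x0 = rng.uniform(-1, 1, size=(4, 32 * 32))  # a batch of flattened "images"
t = rng.integers(0, T, size=(4, 1))         # uniform random timestep per sample
eps = rng.standard_normal(x0.shape)         # ground-truth noise

x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
loss = np.mean((eps - eps_model(x_t, t)) ** 2)   # L_simple = E||eps - eps_theta||^2
print(loss)   # ~1 for the zero predictor, since eps ~ N(0, I)
```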

2.3.4 $L_0$

Finally, $L_0 = -\log p_\theta(x_0\mid x_1)$ is the negative log-likelihood of generating the denoised image $x_0$ from the last noisy image $x_1$. To generate good images, we apply the likelihood to each pixel, so that every pixel value of the image satisfies a discrete log-likelihood.

To achieve this, the last step of the reverse diffusion process, from $x_1$ to $x_0$, is set up as independent discrete computations. That is, in this last transition, the image $x_0$ obtained from a given $x_1$ satisfies a log-likelihood in which the pixels are assumed independent of each other:

$$p_\theta(x_0\mid x_1) = \prod_{i=1}^{D} p_\theta\left(x_0^i\mid x_1^i\right)$$
$D$ is the dimension of $x$, and the superscript $i$ denotes a pixel position in the image. The goal is now to determine how likely each pixel value is, i.e., the distribution of the corresponding pixel value of the noisy image $x$ at timestep $t=1$:

$$\mathcal{N}\left(x;\, \mu_\theta^i(x_1, 1),\, \sigma_1^2\right)$$

where the pixel distribution at $t=1$ comes from a multivariate Gaussian whose diagonal covariance matrix lets us split it into a product of univariate Gaussians:

$$\mathcal{N}\left(x;\, \mu_\theta(x_1, 1),\, \sigma_1^2 \mathbf{I}\right) = \prod_{i=1}^{D} \mathcal{N}\left(x;\, \mu_\theta^i(x_1, 1),\, \sigma_1^2\right)$$
Now assume the image has been normalized from pixel values in $\{0, 1, \ldots, 255\}$ to the range $[-1, 1]$. Given the value of each pixel at $t=0$, the transition probability $p_\theta(x_0\mid x_1)$ is the product of the per-pixel probabilities. So:

$$
\begin{aligned}
p_\theta(x_0\mid x_1) &= \prod_{i=1}^{D} \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}\left(x;\, \mu_\theta^i(x_1, 1),\, \sigma_1^2\right) dx \\
\delta_+(x) &= \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases}
\qquad
\delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases}
\end{aligned}
$$
This formula comes from the original paper; here is an analysis of its meaning. We want the last noisy image $x_1$ to fit the denoised image $x_0$, modeling each of the image's $D$ pixels as a Gaussian. The original value range of each pixel of $x_0$ is $\{0, 1, \ldots, 255\}$, mapped to $[-1, 1]$ after normalization.

Now take a single pixel $x_1^i$ of $x_1$. Its distribution $\mathcal{N}\left(x;\, \mu_\theta^i(x_1, 1),\, \sigma_1^2\right)$ is given, and the target to fit is the pixel $x_0^i$ at the corresponding position of $x_0$. Since the original discrete range $\{0, 1, \ldots, 255\}$ of $x_0^i$ is mapped to the continuous space $[-1, 1]$, each original discrete value corresponds to an interval of the continuous space, whose endpoints are given by:

$$\delta_+(x) = \begin{cases} \infty & \text{if } x = 1 \\ x + \frac{1}{255} & \text{if } x < 1 \end{cases}
\qquad
\delta_-(x) = \begin{cases} -\infty & \text{if } x = -1 \\ x - \frac{1}{255} & \text{if } x > -1 \end{cases}$$
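Concretely, the probability assigned to one discrete pixel value is the Gaussian mass over its bin $[\delta_-(x_0^i), \delta_+(x_0^i)]$. A small sketch (the $\sigma$ and the predicted means below are made-up illustration values, not from the source):

```python
import numpy as np
from math import erf, sqrt

# Per-pixel probability = Gaussian mass over the pixel's bin; edge bins
# at x = -1 and x = 1 extend to -inf / +inf respectively.
def gauss_cdf(x, mu, sigma):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def pixel_log_prob(x, mu, sigma):
    # x: normalized pixel value in [-1, 1] on the 1/255 half-width grid
    upper = gauss_cdf(x + 1 / 255, mu, sigma) if x < 1.0 else 1.0
    lower = gauss_cdf(x - 1 / 255, mu, sigma) if x > -1.0 else 0.0
    return np.log(max(upper - lower, 1e-12))

x0 = np.array([-1.0, 0.0, 0.5, 1.0])   # a toy 4-pixel "image"
mu = np.array([-0.9, 0.0, 0.4, 0.9])   # hypothetical predicted means
log_p = sum(pixel_log_prob(x, m, 0.1) for x, m in zip(x0, mu))
print(log_p)                           # log p_theta(x_0 | x_1): sum of per-pixel log masses
```

A better-matched mean gives a pixel more probability mass, which is exactly what minimizing $L_0$ encourages.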

This completes the analysis and derivation of the formulas involved in DDPM's forward and reverse diffusion processes, including the construction of the loss function.

References and blogs

Understanding Diffusion Models: A Unified Perspective
Denoising Diffusion Probabilistic Models
https://yinglinzheng.netlify.app/diffusion-model-tutorial
https://zhuanlan.zhihu.com/p/549623622

Origin blog.csdn.net/weixin_45453121/article/details/131223653