Detailed explanation of the principle of Diffusion Model

The Diffusion Model is positioned as a rival to the Generative Adversarial Network (GAN): essentially anything a GAN can do, a diffusion model can do as well. When I previously used GANs for image generation tasks, the results were often unsatisfactory and training was very unstable. After switching to a Diffusion Model, however, the generated images look very realistic, and the results clearly improve with each round of training; in other words, training is much more stable.

This article introduces the Diffusion Model in plain language together with the necessary formulas, and then walks through the code alongside those formulas to explore how the model is actually implemented.

Overall idea

The overall idea of the Diffusion Model is shown in the figure below:

It is mainly divided into forward process and reverse process. The forward process is similar to encoding, and the reverse process is similar to decoding.

  • Forward process
    First, for an original picture x_0, we add Gaussian noise to it, and the picture becomes x_1. [Note: the noise added here must be Gaussian, because Gaussian noise obeys a Gaussian distribution, and some of the operations below rely on properties of the Gaussian distribution.] Then we add Gaussian noise to x_1, and the picture becomes x_2. We repeat this noise-adding step until the picture becomes x_n; because enough Gaussian noise has been added, x_n now approximately obeys a Gaussian distribution (also known as a normal distribution).
    Here is a question to think about: is the amount of Gaussian noise we add the same at every step? The answer is no. The amount of Gaussian noise added varies from step to step, and each step adds more than the one before it. The picture above should make this easy to understand: at the beginning the original picture is relatively clean, so a small amount of Gaussian noise is enough to perturb it; but once the picture is already heavily corrupted, adding only a small amount of Gaussian noise would barely change the result of the previous step, so the later steps need to add more. [Note: the "image at each moment" used later and the "image at each step" used here mean the same thing; for example, the image at moment x_1 means the image x_1.]
  • The reverse process
    First, we randomly generate a noise image that obeys a Gaussian distribution, and then remove the noise step by step until the desired image is generated. For now it is enough to understand the reverse process at this level; the details are introduced later.

Implementation details

This part introduces the details of the forward process and reverse process of the Diffusion Model, mainly by deriving some formulas to express the relationship between images before and after adding noise.

Forward process

In the overall-idea section we saw that the forward process is simply a process of repeatedly adding noise, so let us consider whether the relationship between the images before and after adding noise can be expressed with a formula. Think about what determines the image at the next moment; more concretely, what quantities is x_2 determined by? The answer is simple: x_2 is determined by x_1 together with the noise that is added. In other words, the image at the next moment is mainly determined by two quantities: the image at the previous moment and the amount of noise added. Knowing this, we can write a formula that relates the images at time x_t and time x_{t-1}, as follows:

X_t=\sqrt{a_t} X_{t-1}+\sqrt{1-a_t} Z_1 ——Formula 1

Here, X_t represents the image at time t, X_{t-1} represents the image at time t-1, and Z_1 represents the added Gaussian noise, which obeys the N(0,1) distribution. [Note: N(0,1) denotes the standard Gaussian distribution, with mean 0 and variance 1.] You can see that X_t is related to both X_{t-1} and Z_1, which is consistent with what we said earlier: the image at the next moment is determined by the image at the previous moment and the noise. The coefficients \sqrt{a_t} and \sqrt{1-a_t} in front of these two quantities are their weights, and the sum of their squares is 1.

I think Formula 1 itself is clear by now, but you may still have doubts about \sqrt{a_t} and \sqrt{1-a_t}: why set the weights this way, and are they chosen by us in advance? In fact, a_t is related to another quantity \beta_t, as follows:

a_t=1-\beta_t ——Formula 2

Here \beta_t is a predefined value that increases with time; in the paper its range is [0.0001, 0.02]. Since \beta_t keeps growing, a_t keeps shrinking and 1-a_t keeps growing. Looking back at Formula 1, the weight \sqrt{1-a_t} of Z_1 therefore increases with time, meaning that more and more Gaussian noise is added, which is consistent with the overall idea above: the later the step, the more noise is added.
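As a concrete illustration of such a schedule, here is a minimal PyTorch sketch (not necessarily the exact code in ddpm.py; the names noise_steps, beta, alpha and alpha_hat are my own choices here). alpha_hat is the cumulative product \bar{a}_t that will appear in Formula 7 below:

import torch

noise_steps = 1000                                 # number of diffusion steps T (assumed)
beta = torch.linspace(1e-4, 0.02, noise_steps)     # beta_t grows linearly with t
alpha = 1.0 - beta                                 # Formula 2: a_t = 1 - beta_t
alpha_hat = torch.cumprod(alpha, dim=0)            # cumulative product a_bar_t, used in Formula 7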

Now we have the relationship between the images at time x_t and time x_{t-1}, but the image at time x_{t-1} is itself unknown. [Note: only the image at time x_0 is known, i.e. the original image.] To use it we would have to derive the image at time x_{t-1} from the one at time x_{t-2}, then the image at time x_{t-2} from the one at time x_{t-3}, and so on, until the image at time x_1 is derived from the one at time x_0. So let us first write down the relationship between the images at time x_{t-1} and time x_{t-2}, as follows:

X_{t-1}=\sqrt{a_{t-1}} X_{t-2}+\sqrt{1-a_{t-1}} Z_2 ——Formula 3

This is simply Formula 1 written for the previous step. Substituting Formula 3 into Formula 1, we get:

\begin{aligned} X_t & =\sqrt{a_t}\left(\sqrt{a_{t-1}} X_{t-2}+\sqrt{1-a_{t-1}} Z_2\right)+\sqrt{1-a_t} Z_1 \\ & =\sqrt{a_t a_{t-1}} X_{t-2}+\sqrt{a_t\left(1-a_{t-1}\right)} Z_2+\sqrt{1-a_t} Z_1 \\ & =\sqrt{a_t a_{t-1}} X_{t-2}+\sqrt{1-a_t a_{t-1}} \hat{Z}_2 \end{aligned} ——Formula 4

Can you follow Formula 4? The last line is probably where doubts arise: why does \sqrt{a_t\left(1-a_{t-1}\right)} Z_2+\sqrt{1-a_t} Z_1 equal \sqrt{1-a_t a_{t-1}} \hat{Z}_2? This uses the properties of the Gaussian distribution listed in the appendix: Z_1 and Z_2 are independent standard Gaussians, so \sqrt{a_t\left(1-a_{t-1}\right)} Z_2 obeys N\left(0, a_t\left(1-a_{t-1}\right)\right) and \sqrt{1-a_t} Z_1 obeys N\left(0,1-a_t\right); their sum therefore obeys N\left(0, a_t\left(1-a_{t-1}\right)+1-a_t\right)=N\left(0,1-a_t a_{t-1}\right), which can be rewritten as \sqrt{1-a_t a_{t-1}} \hat{Z}_2 with \hat{Z}_2 obeying N(0,1). After reading the relevant properties of the Gaussian distribution in the appendix, this step should be clear; the figure below sorts it out as well:

Now Formula 4 should be fully understood. Note that \hat{Z}_2 here also obeys the N(0,1) Gaussian distribution, while \sqrt{a_t\left(1-a_{t-1}\right)} Z_2+\sqrt{1-a_t} Z_1 obeys N\left(0,1-a_t a_{t-1}\right). Look at what Formula 4 gives us: the relationship between the image at time x_t and the image at time x_{t-2}. Following the same reasoning, we can write the relationship between the images at time x_{t-2} and time x_{t-3}, as follows:

X_{t-2}=\sqrt{a_{t-2}} X_{t-3}+\sqrt{1-a_{t-2}} Z_3 ——Formula 5

Similarly, substituting Formula 5 into Formula 4 gives the relationship between the image at time x_t and the image at time x_{t-3}, as follows:

X_t=\sqrt{a_t a_{t-1} a_{t-2}} X_{t-3}+\sqrt{1-a_t a_{t-1} a_{t-2}} \hat{Z}_3 ——Formula 6

I have not walked through this substitution step by step and have only written the final result; you can work it out yourself, it is very simple and only uses the same properties of the Gaussian distribution. Note that \hat{Z}_3 above also obeys the N(0,1) Gaussian distribution. Formula 6 gives the relationship between the image at time x_t and the image at time x_{t-3}. If we kept going like this, we would eventually obtain the relationship between the image at time x_t and the image at time x_0, but the derivation would be very long. As you work backwards you will notice a clear pattern: compare the results of Formula 4 and Formula 6 and it becomes obvious. Following this pattern, I will directly write down the relationship between the image at time x_t and the image at time x_0, as follows:

X_t=\sqrt{\bar{a}_t} X_0+\sqrt{1-\bar{a}_t} \hat{Z}_t ——Formula 7

Here \bar{a}_t denotes the cumulative product, that is, \bar{a}_t=a_t \cdot a_{t-1} \cdot a_{t-2} \cdots a_1, and \hat{Z}_t obeys the N(0,1) Gaussian distribution. [Here \hat{Z}_t is just a notation; any symbol will do as long as it obeys the standard Gaussian distribution.] Formula 7 is the core formula of the whole forward process: the image at time x_t can be expressed using only the image at time x_0 and a standard Gaussian noise. Keep this formula in mind; it will be used later and in the code.
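As a quick numerical sanity check of Formula 7 (my own sketch, not part of the original code), we can apply Formula 1 step by step to a constant image value and compare the empirical mean and standard deviation of X_T with what Formula 7 predicts:

import torch

T = 200
beta = torch.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_hat = torch.cumprod(alpha, dim=0)

x0 = torch.full((100000,), 0.5)          # many copies of the same pixel value x_0 = 0.5
x = x0.clone()
for t in range(T):                        # Formula 1 applied step by step
    x = torch.sqrt(alpha[t]) * x + torch.sqrt(1 - alpha[t]) * torch.randn_like(x)

print(x.mean(), x.std())                  # empirical mean / std of x_T
print(torch.sqrt(alpha_hat[-1]) * 0.5,    # Formula 7 predicts this mean ...
      torch.sqrt(1 - alpha_hat[-1]))      # ... and this standard deviation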


Reverse process

The reverse process restores Gaussian noise to the desired picture. Let us first see what is known: it is simply the Gaussian noise at time x_t. We hope to turn this noise at time x_t into the image at time x_0, which is hard to do in one step. So, as in the forward process, we can first consider the relationship between the image at time x_t and the image at time x_{t-1}, and then extend the conclusion step by step. With this idea in place, let us think about how to obtain the image at time x_{t-1} from the known image at time x_t.

Here we need the conclusions from the forward process: in the forward process we can obtain the image at time x_t from the image at time x_{t-1}, so we can turn this around with Bayes' formula, whose expression is P(A \mid B)=P(B \mid A) \frac{P(A)}{P(B)}.

We then use Bayes' formula to find the image at time x_{t-1}; the formula is as follows:

q\left(X_{t-1} \mid X_t\right)=q\left(X_t \mid X_{t-1}\right) \frac{q\left(X_{t-1}\right)}{q\left(X_t\right)} ——Formula 8

In Formula 8, q\left(X_t \mid X_{t-1}\right) is exactly what we found in the forward process, but q\left(X_{t-1}\right) and q\left(X_t\right) are unknown. However, Formula 7 shows that the image at any moment can be obtained from X_0, so X_t and of course X_{t-1} can both be expressed once X_0 is given. We therefore add X_0 to Formula 8 as a known condition, turning Formula 8 into Formula 9, as follows:

q\left(X_{t-1} \mid X_t, X_0\right)=q\left(X_t \mid X_{t-1}, X_0\right) \frac{q\left(X_{t-1} \mid X_0\right)}{q\left(X_t \mid X_0\right)} ——Formula 9

Now all three terms on the right-hand side of Formula 9 can be computed. We list their formulas and the corresponding distributions, as shown in the figure below:
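Since the figure is not reproduced here, the three distributions can be read off directly from Formula 1 and Formula 7 (my own write-up in the same notation):

q\left(X_t \mid X_{t-1}, X_0\right)=\sqrt{a_t} X_{t-1}+\sqrt{1-a_t} Z \sim N\left(\sqrt{a_t} X_{t-1}, 1-a_t\right)

q\left(X_{t-1} \mid X_0\right)=\sqrt{\bar{a}_{t-1}} X_0+\sqrt{1-\bar{a}_{t-1}} Z \sim N\left(\sqrt{\bar{a}_{t-1}} X_0, 1-\bar{a}_{t-1}\right)

q\left(X_t \mid X_0\right)=\sqrt{\bar{a}_t} X_0+\sqrt{1-\bar{a}_t} Z \sim N\left(\sqrt{\bar{a}_t} X_0, 1-\bar{a}_t\right)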

Knowing the distributions of the three terms on the right-hand side of Formula 9, we can compute the left-hand side q\left(X_{t-1} \mid X_t, X_0\right). This calculation is straightforward, with no tricks involved. From the appendix section on Gaussian distribution properties, the expression of the Gaussian distribution is f(x)=\frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{(x-u)^2}{2 \sigma^2}}. So we only need to write out the three Gaussian expressions on the right-hand side of Formula 9 and then carry out the multiplication and division to obtain q\left(X_{t-1} \mid X_t, X_0\right).

The figure above shows the three Gaussian expressions on the right-hand side of the equation; they are obtained simply by substituting the respective mean and variance into the Gaussian density. Now we only need to perform the corresponding multiplication and division on these three expressions, as shown in the figure below:

The expression obtained in the figure above, M \cdot e^{-\frac{1}{2}\left[\left(\frac{a_t}{\beta_t}+\frac{1}{1-\bar{a}_{t-1}}\right) X_{t-1}^2-\left(\frac{2 \sqrt{a_t}}{\beta_t} X_t+\frac{2 \sqrt{\bar{a}_{t-1}}}{1-\bar{a}_{t-1}} X_0\right) X_{t-1}+C\left(X_t, X_0\right)\right]}, is in fact the expression of q\left(X_{t-1} \mid X_t, X_0\right). What is this expression useful for? Mainly for finding the mean and variance. First, note that multiplying and dividing Gaussian densities still yields a Gaussian, so q\left(X_{t-1} \mid X_t, X_0\right) obeys a Gaussian distribution and its expression can be written as f(x)=\frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{(x-u)^2}{2 \sigma^2}}=\frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2}\left[\frac{x^2}{\sigma^2}-\frac{2 u x}{\sigma^2}+\frac{u^2}{\sigma^2}\right]}. Comparing the two expressions term by term, we can solve for u and \sigma^2, as shown in the figure below:
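Again, since the figure is not reproduced here, here is my own write-up of that comparison: matching the coefficients of X_{t-1}^2 and X_{t-1} gives

\sigma^2=\frac{1}{\frac{a_t}{\beta_t}+\frac{1}{1-\bar{a}_{t-1}}}=\frac{\beta_t\left(1-\bar{a}_{t-1}\right)}{1-\bar{a}_t}, \quad u=\frac{\sqrt{a_t}\left(1-\bar{a}_{t-1}\right)}{1-\bar{a}_t} X_t+\frac{\sqrt{\bar{a}_{t-1}} \beta_t}{1-\bar{a}_t} X_0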

Now that we have the mean u and the variance \sigma^2, we can write down q\left(X_{t-1} \mid X_t, X_0\right), that is, the distribution of the image at time x_{t-1}. How much of the derivation have you followed so far? If you actually do the algebra by hand, you will find it is still quite simple. But perhaps you have noticed a problem: the u and \sigma^2 we just obtained contain X_0. What is X_0? It is the final result we want, so how can it be treated as a known quantity here? This does look strange, so let us see where X_0 was introduced. Scrolling up, you will find that when applying Bayes' formula we used Formula 7 from the forward process to express q\left(X_{t-1}\right) and q\left(X_t\right), and that is exactly where the new unknown X_0 came in. What can we do? We can use Formula 7 in reverse to estimate X_0, that is, rearrange Formula 7 to obtain an expression for X_0, as follows:

X_0=\frac{1}{\sqrt{\bar{a}_t}}\left(X_t-\sqrt{1-\bar{a}_t} \hat{Z}_t\right) ——Formula 10

With this estimate of X_0, we substitute Formula 10 into the expression for u above; after simplification we obtain the final estimate \tilde{u}, whose expression is as follows:

\tilde{u}=\frac{1}{\sqrt{a_t}}\left(X_t-\frac{\beta_t}{\sqrt{1-\bar{a}_t}} \hat{Z}_t\right) ——Formula 11

Ok, now let us collect the mean u and the variance \sigma^2 of the image at time t-1, as shown in the figure below:
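For readers without the figure, what it summarizes (referred to below as Formula 12) is, reconstructed from Formula 11 and the variance derived above:

q\left(X_{t-1} \mid X_t, X_0\right) \sim N\left(\tilde{u}, \sigma^2\right), \quad \tilde{u}=\frac{1}{\sqrt{a_t}}\left(X_t-\frac{\beta_t}{\sqrt{1-\bar{a}_t}} \hat{Z}_t\right), \quad \sigma^2=\frac{\beta_t\left(1-\bar{a}_{t-1}\right)}{1-\bar{a}_t} ——Formula 12

so that sampling is simply X_{t-1}=\tilde{u}+\sigma Z with Z \sim N(0,1).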

With Formula 12, we can estimate the image at time X_{t-1}, and then step by step obtain the images at X_{t-2}, X_{t-3}, \ldots, X_1, X_0.


Principle summary

That is all for the detailed explanation of the principle. How much of it did you follow? I believe that after reading this part you already have a general understanding of the Diffusion Model, though some doubts are bound to remain. Don't worry, the code part will help you further.

Diffusion Model source code analysis

Code download and use

The code can be downloaded here: Diffusion Model code

Let's talk about how to use the code first. It contains two scripts, ddpm.py and ddpm_condition.py. You can think of ddpm.py as the simplest diffusion model and ddpm_condition.py as an optimized version of ddpm.py. This section explains ddpm.py. Using the code is very simple: first specify the dataset path in ddpm.py, which is the dataset_path value, and then run the script. Note that if you are running on a CPU, you may also need to modify the device parameter in the code.


A brief note on the name ddpm: it is short for Denoising Diffusion Probabilistic Model (in Chinese, 去噪扩散概率模型).


Code Flowchart

Here we directly look at the flow chart given in the paper, as follows:

This figure shows that the process of the entire algorithm is divided into the training phase (Training) and the sampling phase (Sampling).
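Since the flowchart itself is not reproduced here, the two algorithms it shows can be paraphrased as follows (my own paraphrase of Algorithm 1 and Algorithm 2 in the DDPM paper, written in the notation of this article, where \hat{Z}_\theta(\cdot, t) denotes the noise predicted by the model):

Training: (1) repeat; (2) sample a clean image X_0 from the dataset; (3) sample t uniformly from \{1, \ldots, T\}; (4) sample noise \hat{Z} \sim N(0,1); (5) take a gradient step on \left\|\hat{Z}-\hat{Z}_\theta\left(\sqrt{\bar{a}_t} X_0+\sqrt{1-\bar{a}_t} \hat{Z}, t\right)\right\|^2; (6) until converged.

Sampling: (1) X_T \sim N(0,1); (2) for t=T, \ldots, 1: (3) Z \sim N(0,1) if t>1, else Z=0; (4) X_{t-1}=\frac{1}{\sqrt{a_t}}\left(X_t-\frac{1-a_t}{\sqrt{1-\bar{a}_t}} \hat{Z}_\theta\left(X_t, t\right)\right)+\sigma_t Z; (5) end for; (6) return X_0.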

  • Training
    As we all know, training requires a ground truth and a predicted value, so what are they in this example? Is the ground truth the picture we input, and the prediction the picture we output? Actually no: in this example both the ground truth and the prediction are noise. Let us again use the picture below as a demonstration.

The noise we add in the forward process is known, so it can serve as the ground truth. The reverse process is essentially a denoising process: we use a model to predict the noise, so that the noise added at each step of the forward process matches, as closely as possible, the noise predicted at the corresponding step of the reverse process. Predicting the noise in the reverse process simply means feeding the noisy image (and the timestep) into the model. What we actually train is step 5 of the Training algorithm.

  • Sampling
    Once the training process is understood, the sampling process is very simple. It corresponds to the reverse process introduced in the theory part: starting from Gaussian noise, we iterate step by step until we obtain the image at time X_0.

Code analysis

First of all, according to the theory part, there should be a forward process, and its most important result is the final Formula 7, as follows:

X_t=\sqrt{\bar{a}_t} X_0+\sqrt{1-\bar{a}_t} \hat{Z}_t

Then let's take a look at how Formula 7 is used in the code:

def noise_images(self, x, t):
    sqrt_alpha_hat = torch.sqrt(self.alpha_hat[t])[:, None, None, None]                # sqrt(a_bar_t) in Formula 7
    sqrt_one_minus_alpha_hat = torch.sqrt(1 - self.alpha_hat[t])[:, None, None, None]  # sqrt(1 - a_bar_t)
    Ɛ = torch.randn_like(x)                                                             # standard Gaussian noise Z_hat_t
    return sqrt_alpha_hat * x + sqrt_one_minus_alpha_hat * Ɛ, Ɛ                         # Formula 7, plus the noise itself

Ɛ is random standard Gaussian noise, and it is in fact the ground truth. As you can see, the return value sqrt_alpha_hat * x + sqrt_one_minus_alpha_hat * Ɛ corresponds exactly to Formula 7. [Note: I have omitted many details in this code and only show the key parts; to understand it fully, remember to step through it with a debugger.]

Then we predict the noise through a model, as follows:

predicted_noise = model(x_t, t)

The model structure is very simple: it is a U-Net with several Transformer (self-attention) blocks nested inside, so I will not dive into it here. Now that we have the predicted value predicted_noise and the ground truth Ɛ [after being returned, Ɛ is referred to as noise], we can compute their loss and keep iterating:

loss = mse(noise, predicted_noise)
optimizer.zero_grad()
loss.backward()
optimizer.step()

The above is the general structure of the training process; I have omitted a lot. A consolidated sketch of one training iteration is given below, and after that we look at the code of the sampling process.
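This is only my own minimal sketch of how the pieces above fit together, not the full source file; the names dataloader, diffusion, device, epochs and the sample_timesteps helper (which draws t uniformly from [1, noise_steps)) are assumptions:

from torch import nn, optim

mse = nn.MSELoss()
optimizer = optim.AdamW(model.parameters(), lr=3e-4)   # learning rate is an assumption

for epoch in range(epochs):
    for images, _ in dataloader:
        images = images.to(device)
        t = diffusion.sample_timesteps(images.shape[0]).to(device)   # one random timestep per image
        x_t, noise = diffusion.noise_images(images, t)               # Formula 7: add noise, keep the ground truth
        predicted_noise = model(x_t, t)                              # U-Net predicts the noise
        loss = mse(noise, predicted_noise)                           # step 5 of the Training algorithm
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

With training covered, here is the sampling code: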

def sample(self, model, n):
    logging.info(f"Sampling {n} new images....")
    model.eval()
    with torch.no_grad():
        x = torch.randn((n, 3, self.img_size, self.img_size)).to(self.device)
        for i in tqdm(reversed(range(1, self.noise_steps)), position=0):   # iterate t from noise_steps-1 down to 1
            t = (torch.ones(n) * i).long().to(self.device)
            predicted_noise = model(x, t)
            alpha = self.alpha[t][:, None, None, None]
            alpha_hat = self.alpha_hat[t][:, None, None, None]
            beta = self.beta[t][:, None, None, None]
            if i > 1:
                noise = torch.randn_like(x)      # the extra term sigma_t * z for t > 1
            else:
                noise = torch.zeros_like(x)      # no noise is added at the final step
            x = 1 / torch.sqrt(alpha) * (x - ((1 - alpha) / (torch.sqrt(1 - alpha_hat))) * predicted_noise) + torch.sqrt(beta) * noise   # Formula 11 plus sigma_t * z
    model.train()
    x = (x.clamp(-1, 1) + 1) / 2
    x = (x * 255).type(torch.uint8)
    return x

The key to the above code is the line x = 1 / torch.sqrt(alpha) * (x - ((1 - alpha) / (torch.sqrt(1 - alpha_hat))) * predicted_noise) + torch.sqrt(beta) * noise, which corresponds to step 4 of the Sampling phase in the code flowchart. Note that the standard deviation \sigma_t used here is \sqrt{\beta_t}, whereas our theoretical calculation gave \sqrt{\frac{\beta_t\left(1-\bar{a}_{t-1}\right)}{1-\bar{a}_t}}. The code makes an approximation: \bar{a}_{t-1} and \bar{a}_t are both very small and close to 0, so the factor \frac{1-\bar{a}_{t-1}}{1-\bar{a}_t} is treated as 1. Just be aware of this point.

Code summary

As you can see, this part takes up very little space and only lists the key parts; many details are left for you to work through. For example, the way the timestep T is used in the code is actually not easy to understand: it is treated as a sine-cosine positional encoding. If you are not familiar with positional encoding, you can take a look at this article, which has a detailed introduction to it.
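If positional encoding is unfamiliar, here is a small illustrative sketch of how a timestep t can be turned into a sine-cosine embedding (my own version for illustration; the function name time_encoding and the dimension channels are assumptions, and the real code may differ in detail):

import torch

def time_encoding(t, channels):
    # sine-cosine positional encoding of the timestep t, one frequency per channel pair
    inv_freq = 1.0 / (10000 ** (torch.arange(0, channels, 2).float() / channels))
    pos_enc_sin = torch.sin(t.repeat(1, channels // 2) * inv_freq)
    pos_enc_cos = torch.cos(t.repeat(1, channels // 2) * inv_freq)
    return torch.cat([pos_enc_sin, pos_enc_cos], dim=-1)

# example: encode timesteps 1..4 into 256-dimensional vectors
t = torch.arange(1, 5).unsqueeze(-1).float()   # shape (4, 1)
print(time_encoding(t, 256).shape)             # torch.Size([4, 256])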

Appendix

Gaussian distribution properties

The Gaussian distribution is also known as the normal distribution, and its expression is:

f(x)=\frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{(x-u)^2}{2 \sigma^2}}

where u is the mean and \sigma^2 is the variance. If a random variable X obeys a Gaussian distribution with mean u and variance \sigma^2, it is usually written X \sim N\left(u, \sigma^2\right). One more thing to know: if we know that a random variable obeys a Gaussian distribution and we know its mean and variance, then we can write down its density expression directly.

The Gaussian distribution also has some very good properties. Here are some examples to help you understand.

  • If X \sim N\left(u, \sigma^2\right), then a X \sim N\left(a u,(a \sigma)^2\right).
  • If X \sim N\left(u_1, \sigma_1^2\right), Y \sim N\left(u_2, \sigma_2^2\right), and X and Y are independent, then X+Y \sim N\left(u_1+u_2, \sigma_1^2+\sigma_2^2\right).
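If you want to convince yourself of the second property numerically, here is a quick check (my own sketch):

import torch

x = torch.randn(1000000) * 2.0 + 1.0      # X ~ N(1, 2^2)
y = torch.randn(1000000) * 3.0 - 1.0      # Y ~ N(-1, 3^2), independent of X
s = x + y
print(s.mean(), s.var())                  # roughly 0 and 13 = 2^2 + 3^2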

 

Reference content: Detailed explanation of Diffusion Model principle and source code analysis - Zhihu (zhihu.com)

Origin blog.csdn.net/weixin_42620109/article/details/129156101