The Diffusion Model is positioned against the Generative Adversarial Network (GAN): broadly, whatever a GAN can do, a diffusion model can do too. When I previously used GANs for image generation tasks, the results were not ideal and training was often very unstable. After switching to the Diffusion Model, the generated pictures were very realistic, and the results of each training round were visibly better than the last; that is, training was more stable.
This article introduces the Diffusion Model in plain language together with the formulas, then uses those formulas to walk through the Diffusion Model code and see how it is implemented.
Overall idea
The overall idea of the Diffusion Model is shown in the figure below:
It consists of a forward process and a reverse process. The forward process is similar to encoding, and the reverse process is similar to decoding.
- Forward process
First, we take an original picture $X_0$ and add Gaussian noise to it, which turns it into $X_1$. [Note: the noise must be Gaussian, because some of the later steps rely on properties of the Gaussian distribution.] Then we add Gaussian noise on top of $X_1$ to obtain $X_2$. We repeat this noise-adding step until the picture becomes $X_T$; because enough Gaussian noise has been added by then, $X_T$ approximately follows a Gaussian distribution (also known as a normal distribution).
Here is a question to think about: is the amount of Gaussian noise added at each step constant? The answer is no. The amount varies, and each step adds more Gaussian noise than the step before it. This is easy to see from the picture above. At the beginning the original picture is relatively clean, so a small amount of Gaussian noise is enough to disturb it. Later, once the picture already contains a lot of noise, adding only a small amount of Gaussian noise would have almost no effect on the result of the previous step. [Note: "the image at each time" used below means the same as "the image at each step" here; for example, the image at time $t$ is the image $X_t$.]
- Reverse process
First, we randomly generate a noise image that follows a Gaussian distribution, and then remove noise step by step until the expected image is produced. This level of understanding of the reverse process is enough for now; the details are introduced later.
Implementation details
This part introduces the details of the forward and reverse processes of the Diffusion Model, mainly by deriving formulas that express the relationship between images before and after noise is added.
Forward process
From the overall idea we already know that the forward process keeps adding noise, so let's see whether a formula can express the relationship between the image before and after a noise step. What determines the image at the next time step? The answer is simple: it is determined by two quantities, the image at the previous time step and the amount of noise added. Knowing this, we can write the relationship between the images at time $t$ and time $t-1$ as follows:
$X_t = \sqrt{\alpha_t}\,X_{t-1} + \sqrt{1-\alpha_t}\,Z_1$ (Formula 1)
Here $X_t$ is the image at time $t$, $X_{t-1}$ is the image at time $t-1$, and $Z_1$ is the added Gaussian noise, which follows the N(0,1) distribution. [Note: N(0,1) is the standard Gaussian distribution, with mean 0 and variance 1.] You can see that $X_t$ depends on both $X_{t-1}$ and $Z_1$, which matches what we said earlier: the image at the next time step is determined by the image at the previous time step and the noise. The factors $\sqrt{\alpha_t}$ and $\sqrt{1-\alpha_t}$ in front are the weights of these two quantities, and their squares sum to 1.
Formula 1 should be clear by now, but you may still wonder about $\sqrt{\alpha_t}$ and $\sqrt{1-\alpha_t}$: why set the weights this way, and are they preset by us? In fact, $\alpha_t$ is tied to another quantity $\beta_t$, as follows:
$\alpha_t = 1 - \beta_t$ (Formula 2)
Here $\beta_t$ is a predetermined value that increases with time; in the paper its range is [0.0001, 0.02]. Since $\beta_t$ keeps growing, $\alpha_t$ keeps shrinking and $1-\alpha_t$ keeps growing. Looking back at formula 1, the weight of the noise $Z_1$, namely $\sqrt{1-\alpha_t}$, increases with time, which means we add more and more Gaussian noise. This is consistent with the overall idea: the later the step, the more noise is added.
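To make the schedule concrete, here is a minimal sketch in plain Python. It assumes the values are spaced linearly between 0.0001 and 0.02 and that there are 1000 steps; the function name `linear_beta_schedule` and both of those choices are illustrative assumptions, not something stated above.

```python
import math


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return beta_1..beta_T, evenly spaced from beta_start to beta_end (assumed linear)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    gap = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * gap for i in range(num_steps)]


T = 1000  # assumed number of diffusion steps
betas = linear_beta_schedule(T)
alphas = [1.0 - beta for beta in betas]                # Formula 2
image_weights = [math.sqrt(a) for a in alphas]         # weight of X_{t-1} in Formula 1
noise_weights = [math.sqrt(1.0 - a) for a in alphas]   # weight of the noise in Formula 1

# Print the two weights at an early, middle, and late step.
for t in (1, 500, 1000):
    print(t, round(image_weights[t - 1], 5), round(noise_weights[t - 1], 5))
```

The noise weight grows from step to step while the squares of the two weights always sum to 1, which is the behaviour described in the paragraph above.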
We now have the relationship between the images $X_t$ and $X_{t-1}$, but $X_{t-1}$ is unknown. [Note: only $X_0$, the original image, is known.] We need to derive $X_1$ from $X_0$, then $X_2$ from $X_1$, and so on, until $X_t$ is derived from $X_{t-1}$. So let's first write down the relationship between $X_{t-1}$ and $X_{t-2}$:
$X_{t-1} = \sqrt{\alpha_{t-1}}\,X_{t-2} + \sqrt{1-\alpha_{t-1}}\,Z_2$ (Formula 3)
This is formula 1 written one step earlier. Substituting formula 3 into formula 1 gives:
$X_t = \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,X_{t-2} + \sqrt{1-\alpha_{t-1}}\,Z_2\right) + \sqrt{1-\alpha_t}\,Z_1 = \sqrt{\alpha_t\alpha_{t-1}}\,X_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})}\,Z_2 + \sqrt{1-\alpha_t}\,Z_1 = \sqrt{\alpha_t\alpha_{t-1}}\,X_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\hat{Z}_2$ (Formula 4)
Does formula 4 make sense? The step most likely to raise doubts is the last equality: why does $\sqrt{\alpha_t(1-\alpha_{t-1})}\,Z_2 + \sqrt{1-\alpha_t}\,Z_1$ equal $\sqrt{1-\alpha_t\alpha_{t-1}}\,\hat{Z}_2$? This uses some properties of the Gaussian distribution; see the appendix. After reading those properties, the step should be clear. It is laid out in the figure below:
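The merging of the two noise terms can also be written out as a short derivation. This is a sketch that assumes $Z_1$ and $Z_2$ are independent standard Gaussian variables, and it uses only the scaling and addition properties of Gaussian variables:

```latex
\begin{aligned}
\sqrt{\alpha_t(1-\alpha_{t-1})}\,Z_2 &\sim \mathcal{N}\bigl(0,\ \alpha_t(1-\alpha_{t-1})\bigr) \\
\sqrt{1-\alpha_t}\,Z_1 &\sim \mathcal{N}\bigl(0,\ 1-\alpha_t\bigr) \\
\sqrt{\alpha_t(1-\alpha_{t-1})}\,Z_2 + \sqrt{1-\alpha_t}\,Z_1
  &\sim \mathcal{N}\bigl(0,\ \alpha_t - \alpha_t\alpha_{t-1} + 1 - \alpha_t\bigr)
   = \mathcal{N}\bigl(0,\ 1-\alpha_t\alpha_{t-1}\bigr)
\end{aligned}
```

A variable with distribution $\mathcal{N}(0,\ 1-\alpha_t\alpha_{t-1})$ can be written as $\sqrt{1-\alpha_t\alpha_{t-1}}\,\hat{Z}_2$ with $\hat{Z}_2 \sim \mathcal{N}(0,1)$, which is the last term of formula 4.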
That covers formula 4. Note that $\hat{Z}_2$ also follows the N(0,1) Gaussian distribution, so $\sqrt{1-\alpha_t\alpha_{t-1}}\,\hat{Z}_2$ follows $N(0,\ 1-\alpha_t\alpha_{t-1})$. What has formula 4 given us? The relationship between the images $X_t$ and $X_{t-2}$. Following the same pattern, we write the relationship between $X_{t-2}$ and $X_{t-3}$ as follows:
$X_{t-2} = \sqrt{\alpha_{t-2}}\,X_{t-3} + \sqrt{1-\alpha_{t-2}}\,Z_3$ (Formula 5)
Similarly, substituting formula 5 into formula 4 gives the relationship between $X_t$ and $X_{t-3}$:
$X_t = \sqrt{\alpha_t\alpha_{t-1}\alpha_{t-2}}\,X_{t-3} + \sqrt{1-\alpha_t\alpha_{t-1}\alpha_{t-2}}\,\hat{Z}_3$ (Formula 6)
I did not go through the calculation of formula 6 step by step and only wrote the final result. You can work it out yourself; it is simple and uses only the same properties of the Gaussian distribution. Note that $\hat{Z}_3$ also follows the N(0,1) Gaussian distribution. Formula 6 gives the relationship between $X_t$ and $X_{t-3}$, and continuing in this way would eventually give the relationship between $X_t$ and $X_0$. That derivation would be long, but as you work backwards you will see that it follows a pattern. Comparing the results of formula 4 and formula 6 makes the pattern obvious, so here I write the relationship between $X_t$ and $X_0$ directly:
$X_t = \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,\hat{Z}_t$ (Formula 7)
Here $\bar{\alpha}_t$ denotes a cumulative product, that is, $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i = \alpha_1\alpha_2\cdots\alpha_t$, and $\hat{Z}_t$ follows the N(0,1) Gaussian distribution. [Here $\hat{Z}_t$ is just a notation; any symbol will do as long as it follows the standard Gaussian distribution.] Formula 7 is the core formula of the whole forward process: the image $X_t$ at time $t$ can be expressed using the image $X_0$ at time 0 and a single standard Gaussian noise. Keep this formula in mind; it is used later and in the code.
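The closed form can be checked numerically. The sketch below works on a single scalar "pixel" in plain Python rather than on image tensors; the linear schedule, the 1000 steps, and the helper names (`cumulative_alphas`, `noise_pixel`, `stepwise_noise_variance`) are assumptions made for illustration. It also confirms that adding noise one step at a time accumulates the same noise variance as the one-shot formula.

```python
import math
import random


def cumulative_alphas(betas):
    """Return alpha_bar_t = alpha_1 * ... * alpha_t for every t, with alpha_t = 1 - beta_t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_pixel(x0, t, alpha_bars, rng):
    """Jump from x_0 straight to x_t (t is 1-based) with one Gaussian draw."""
    a_bar = alpha_bars[t - 1]
    z = rng.gauss(0.0, 1.0)
    return math.sqrt(a_bar) * x0 + math.sqrt(1.0 - a_bar) * z, z


def stepwise_noise_variance(betas):
    """Noise variance accumulated by applying the one-step update T times."""
    variance = 0.0
    for beta in betas:
        alpha = 1.0 - beta
        variance = alpha * variance + (1.0 - alpha)
    return variance


T = 1000  # assumed number of steps
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed linear schedule
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
x_t, z = noise_pixel(0.5, 500, alpha_bars, rng)
print("x_500 =", x_t)
print("variance, step by step:", stepwise_noise_variance(betas))
print("variance, closed form: ", 1.0 - alpha_bars[-1])
```

Because the cumulative product shrinks toward 0 as the step index grows, the last image is almost pure noise, in line with the description of the forward process.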
Reverse process
The reverse process restores Gaussian noise to the expected picture. Let's first look at what we know: the image $X_T$, which is Gaussian noise. We want to turn the Gaussian noise $X_T$ into the image $X_0$, which is hard to do in one step. So, as in the forward process, we first consider the relationship between $X_t$ and $X_{t-1}$ and then work toward the result step by step. With that idea in place, let's think about how to obtain $X_{t-1}$ from a known $X_t$.
Here we use the conclusion of the forward process. In the forward process we can obtain $X_t$ from $X_{t-1}$, so we can use Bayes' formula to go the other way. Bayes' formula is:
$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$
Applying Bayes' formula to find the image $X_{t-1}$ gives:
$q(X_{t-1} \mid X_t) = \dfrac{q(X_t \mid X_{t-1})\,q(X_{t-1})}{q(X_t)}$ (Formula 8)
In formula 8, $q(X_t \mid X_{t-1})$ is known; it is what we just found in the forward process. But $q(X_{t-1})$ and $q(X_t)$ are unknown. Formula 7 shows that the image at any time can be obtained from $X_0$, which of course includes $X_t$ and $X_{t-1}$, so we add $X_0$ to formula 8 as a known condition. Formula 8 then becomes formula 9:
$q(X_{t-1} \mid X_t, X_0) = \dfrac{q(X_t \mid X_{t-1}, X_0)\,q(X_{t-1} \mid X_0)}{q(X_t \mid X_0)}$ (Formula 9)
All three terms on the right side of formula 9 can now be calculated. Their formulas and corresponding distributions are listed in the figure below:
Knowing the distributions of the three terms on the right side of formula 9, we can calculate the left side, $q(X_{t-1} \mid X_t, X_0)$. There is no trick to this; it is pure calculation. From the appendix section on Gaussian distribution properties, the expression of the Gaussian distribution is $f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. So we only need to write out the three Gaussian expressions on the right side of formula 9 and then multiply and divide them to obtain $q(X_{t-1} \mid X_t, X_0)$.
The figure above shows the three Gaussian expressions on the right side of the equation. They are obtained by substituting each term's mean and variance into the Gaussian expression. Now we only need to multiply and divide these three expressions, as shown in the figure below:
The formula in the figure above is the expression of $q(X_{t-1} \mid X_t, X_0)$. What is it for? Mainly for finding the mean and variance. The result of multiplying and dividing these Gaussian expressions is again of Gaussian form, so $q(X_{t-1} \mid X_t, X_0)$ follows a Gaussian distribution and its expression is $\dfrac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(X_{t-1}-\mu)^2}{2\sigma^2}}$. Comparing the two expressions gives $\mu$ and $\sigma^2$, as shown in the figure below:
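Written out, the calculation goes roughly as follows. This is a reconstruction from the forward-process formulas rather than a copy of the figures; it assumes $\beta_t = 1-\alpha_t$, takes $\bar{\alpha}_t = \alpha_1\alpha_2\cdots\alpha_t$ as the cumulative product, and drops every factor that does not depend on $X_{t-1}$.

```latex
\begin{aligned}
q(X_t \mid X_{t-1}, X_0) &= \mathcal{N}\bigl(\sqrt{\alpha_t}\,X_{t-1},\ 1-\alpha_t\bigr) \\
q(X_{t-1} \mid X_0) &= \mathcal{N}\bigl(\sqrt{\bar{\alpha}_{t-1}}\,X_0,\ 1-\bar{\alpha}_{t-1}\bigr) \\
q(X_t \mid X_0) &= \mathcal{N}\bigl(\sqrt{\bar{\alpha}_t}\,X_0,\ 1-\bar{\alpha}_t\bigr) \\[4pt]
q(X_{t-1} \mid X_t, X_0)
 &\propto \exp\Bigl(-\tfrac{1}{2}\Bigl[\tfrac{(X_t-\sqrt{\alpha_t}\,X_{t-1})^2}{\beta_t}
   + \tfrac{(X_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\,X_0)^2}{1-\bar{\alpha}_{t-1}}
   - \tfrac{(X_t-\sqrt{\bar{\alpha}_t}\,X_0)^2}{1-\bar{\alpha}_t}\Bigr]\Bigr) \\
 &\propto \exp\Bigl(-\tfrac{1}{2}\Bigl[\Bigl(\tfrac{\alpha_t}{\beta_t}+\tfrac{1}{1-\bar{\alpha}_{t-1}}\Bigr)X_{t-1}^2
   - 2\Bigl(\tfrac{\sqrt{\alpha_t}}{\beta_t}\,X_t + \tfrac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\,X_0\Bigr)X_{t-1}\Bigr]\Bigr) \\[4pt]
\sigma^2 &= \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t \\
\mu &= \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,X_t
     + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,X_0
\end{aligned}
```

The last two lines come from matching the exponent against $-\frac{1}{2}\bigl(\frac{1}{\sigma^2}X_{t-1}^2 - \frac{2\mu}{\sigma^2}X_{t-1}\bigr)$: the coefficient of $X_{t-1}^2$ gives $1/\sigma^2$, and the coefficient of $X_{t-1}$ gives $\mu/\sigma^2$.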
Now that we have the mean and variance, we can find $X_{t-1}$, the image at time $t-1$. How much of the derivation have you followed so far? If you work through the math by hand, you will find it is still quite simple. But you may have noticed a problem: the $\mu$ we just obtained contains $X_0$. What is $X_0$? It is the final result we want, so how can it be treated as a known quantity? This is indeed a bit strange, so let's see where it was introduced. Scrolling up, when applying Bayes' formula we used formula 7 from the forward process to express $q(X_{t-1} \mid X_0)$ and $q(X_t \mid X_0)$, and that is where the new unknown $X_0$ came in. What should we do? We use formula 7 to estimate $X_0$ in reverse, that is, we solve formula 7 for $X_0$:
$X_0 = \dfrac{1}{\sqrt{\bar{\alpha}_t}}\left(X_t - \sqrt{1-\bar{\alpha}_t}\,\hat{Z}_t\right)$ (Formula 10)
This gives an estimate of $X_0$. Substituting formula 10 into the $\mu$ from the figure above gives, after some calculation, the final estimate of $\mu$:
$\mu = \dfrac{1}{\sqrt{\alpha_t}}\left(X_t - \dfrac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{Z}_t\right)$ (Formula 11)
Now let's collect the mean and variance of the image at time $t-1$, where $\bar{\alpha}_t$ is the cumulative product $\alpha_1\alpha_2\cdots\alpha_t$ and $\beta_t = 1-\alpha_t$:
$\mu = \dfrac{1}{\sqrt{\alpha_t}}\left(X_t - \dfrac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{Z}_t\right), \quad \sigma^2 = \dfrac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$ (Formula 12)
With formula 12 we can estimate the image $X_{t-1}$, and then find the images $X_{t-2}, \ldots, X_1, X_0$ step by step.
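As a sanity check on the algebra, the sketch below compares the two ways of writing the mean for a single scalar pixel: the form that still contains the original image, and the form that contains only the noise. The function names and the numeric values are hypothetical; only the formulas come from the derivation.

```python
import math


def posterior_mean_from_x0(x_t, x0, alpha_t, a_bar_t, a_bar_prev):
    """Mean of x_{t-1} given x_t and x_0 (the form that still contains x_0)."""
    beta_t = 1.0 - alpha_t
    coef_xt = math.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    coef_x0 = math.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    return coef_xt * x_t + coef_x0 * x0


def posterior_mean_from_noise(x_t, z, alpha_t, a_bar_t):
    """Same mean after x_0 has been replaced by its estimate from x_t and the noise z."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - a_bar_t) * z) / math.sqrt(alpha_t)


def posterior_variance(alpha_t, a_bar_t, a_bar_prev):
    """Variance of x_{t-1} given x_t and x_0."""
    return (1.0 - a_bar_prev) / (1.0 - a_bar_t) * (1.0 - alpha_t)


# Hypothetical values for one step.
alpha_t = 0.98
a_bar_prev = 0.5                  # cumulative product up to step t-1
a_bar_t = alpha_t * a_bar_prev    # cumulative product up to step t
x0, z = 0.3, -1.2

x_t = math.sqrt(a_bar_t) * x0 + math.sqrt(1.0 - a_bar_t) * z          # forward jump
x0_est = (x_t - math.sqrt(1.0 - a_bar_t) * z) / math.sqrt(a_bar_t)     # inverted forward jump

mean_a = posterior_mean_from_x0(x_t, x0_est, alpha_t, a_bar_t, a_bar_prev)
mean_b = posterior_mean_from_noise(x_t, z, alpha_t, a_bar_t)
print(mean_a, mean_b, posterior_variance(alpha_t, a_bar_t, a_bar_prev))
```

In this check the exact noise is known, so the two means agree to rounding error. During sampling the noise is not known and has to come from a model's prediction, which is why training a noise predictor matters.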
Principle summary
That concludes the detailed explanation of the principle. How much did you follow? After reading this part you should have a general understanding of how the Diffusion Model works, though some doubts surely remain. Don't worry, the code part will help further.
Diffusion Model source code analysis
Code download and use
Code download link: Diffusion Model code
Let's first talk about how to use the code. It contains two files, ddpm.py and ddpm_condition.py. You can think of ddpm.py as the simplest diffusion model and ddpm_condition.py as an optimized version of ddpm.py. This section explains ddpm.py. The code is very simple to use: first specify the dataset path in ddpm.py, that is, set the value of dataset_path, and then run the code. Note that if you are using a CPU, you may also need to change the device parameter in the code.
A brief word on the name ddpm: it stands for Denoising Diffusion Probabilistic Model.
Code Flowchart
Here we directly look at the flow chart given in the paper, as follows:
This figure shows that the process of the entire algorithm is divided into the training phase (Training) and the sampling phase (Sampling).
- Training
As we all know, training needs ground-truth values and predicted values. What are they in this case? Is the ground truth the picture we input and the prediction the picture we output? In fact, no. Here the ground truth and the prediction are both noise. The picture below again serves as a demonstration.
The noise we add in the forward process is known, so it can serve as the ground truth. The reverse process amounts to denoising: we use a model to predict the noise, so that the noise added at each step of the forward process is as close as possible to the noise predicted at the corresponding step of the reverse process. The noise is predicted by feeding the noisy image and its time step into the model. The actual training update is step 5 of the Training phase.
- Sampling
Once the training process is clear, the sampling process is simple. It corresponds to the reverse process from the theory part: starting from Gaussian noise, it iterates step by step and finally obtains the image $X_0$ at time 0.
Code analysis
First, according to the theory part, there should be a forward process, whose most important result is formula 7:
$X_t = \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,\hat{Z}_t$
Let's see how formula 7 is used in the code:
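def noise_images(self, x, t):
    # Method excerpt: jump from the clean batch x (X_0) straight to X_t with formula 7.
    # [:, None, None, None] reshapes the per-sample scalars so they broadcast over (C, H, W).
    sqrt_alpha_hat = torch.sqrt(self.alpha_hat[t])[:, None, None, None]
    sqrt_one_minus_alpha_hat = torch.sqrt(1 - self.alpha_hat[t])[:, None, None, None]
    Ɛ = torch.randn_like(x)  # standard Gaussian noise with the same shape as x
    return sqrt_alpha_hat * x + sqrt_one_minus_alpha_hat * Ɛ, Ɛ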
Ɛ is random noise drawn from the standard Gaussian distribution, and it serves as the ground-truth value. As you can see, the returned expression sqrt_alpha_hat * x + sqrt_one_minus_alpha_hat * Ɛ is exactly formula 7. [Note: I have omitted many details here and only show the key code. To understand it fully, step through it in a debugger.]
Then we predict the noise through a model, as follows:
predicted_noise = model(x_t, t)
The structure of model is very simple: it is a Unet with several Transformer mechanisms nested in it, so I will not take you through it in detail. Now that we have the predicted value predicted_noise and the ground-truth value Ɛ [once returned, Ɛ is held in the variable noise], we can compute the loss between them and iterate:
loss = mse(noise, predicted_noise)
optimizer.zero_grad()
loss.backward()
optimizer.step()
The above is the general structure of the training process, with a lot omitted. Now let's look at the code of the sampling process.
def sample(self, model, n):
    logging.info(f"Sampling {n} new images....")
    model.eval()
    with torch.no_grad():
        # Start from pure Gaussian noise.
        x = torch.randn((n, 3, self.img_size, self.img_size)).to(self.device)
        # Walk the time steps backwards over the full schedule
        # (the original listing used a shortened debugging range of 1..4).
        for i in tqdm(reversed(range(1, self.noise_steps)), position=0):
            t = (torch.ones(n) * i).long().to(self.device)
            predicted_noise = model(x, t)
            alpha = self.alpha[t][:, None, None, None]
            alpha_hat = self.alpha_hat[t][:, None, None, None]
            beta = self.beta[t][:, None, None, None]
            if i > 1:
                noise = torch.randn_like(x)
            else:
                # No fresh noise is added on the last step.
                noise = torch.zeros_like(x)
            x = 1 / torch.sqrt(alpha) * (x - ((1 - alpha) / (torch.sqrt(1 - alpha_hat))) * predicted_noise) + torch.sqrt(beta) * noise
    model.train()
    # Map from [-1, 1] to [0, 255] for saving as an image.
    x = (x.clamp(-1, 1) + 1) / 2
    x = (x * 255).type(torch.uint8)
    return x
The key to the code above is the line
x = 1 / torch.sqrt(alpha) * (x - ((1 - alpha) / (torch.sqrt(1 - alpha_hat))) * predicted_noise) + torch.sqrt(beta) * noise
which corresponds to step 4 of the Sampling phase in the code flowchart. Note that the variance used here is $\sigma^2 = \beta_t$, whereas our theoretical calculation gave $\sigma^2 = \dfrac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$. The code makes an approximation: $\bar{\alpha}_{t-1}$ and $\bar{\alpha}_t$ are both treated as very small and close to 0, so the ratio $\dfrac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}$ is taken to be 1. Just be aware of this.
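How good the approximation is can be checked directly. The sketch below computes the ratio between the theoretical variance and the simplified one at a few time steps; it assumes a linear schedule from 0.0001 to 0.02 over 1000 steps, and the helper name `variance_ratio` is made up for this example.

```python
T = 1000  # assumed number of steps
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed linear schedule

# Cumulative products of (1 - beta); index 0 holds the value for step 1.
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)


def variance_ratio(t):
    """Theoretical variance divided by beta_t, for a 1-based step t >= 2."""
    return (1.0 - alpha_bars[t - 2]) / (1.0 - alpha_bars[t - 1])


ratios = {t: variance_ratio(t) for t in (2, 10, 100, 1000)}
for t, ratio in ratios.items():
    print(t, ratio)
```

Under these assumptions the ratio is very close to 1 for late time steps, where the cumulative product has decayed toward 0, and noticeably below 1 for the first few steps, where the cumulative product is still close to 1. So the simplification holds well over most of the schedule rather than at every step.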
Code summary
This part is short and lists only the key pieces; many details are left for you to work through. For example, the way the time step t is used in the code can be hard to follow: it is turned into a sine-cosine positional encoding. If you are not familiar with positional encoding, you can take a look at this article, which introduces it in detail.
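For readers who want a feel for what a sine-cosine encoding of the time step looks like, here is a minimal sketch in plain Python. The function name `timestep_encoding`, the base of 10000, and the layout (all sines followed by all cosines) are assumptions chosen for illustration and may differ from the downloaded code.

```python
import math


def timestep_encoding(t, channels):
    """Encode an integer time step t as `channels` floats using sines and cosines."""
    if channels % 2 != 0:
        raise ValueError("channels must be even")
    half = channels // 2
    # One frequency per sine/cosine pair, decreasing geometrically.
    inv_freq = [1.0 / (10000 ** (2 * i / channels)) for i in range(half)]
    sines = [math.sin(t * f) for f in inv_freq]
    cosines = [math.cos(t * f) for f in inv_freq]
    return sines + cosines


print(timestep_encoding(0, 8))
print(timestep_encoding(10, 8))
```

Each time step maps to a distinct vector with entries in [-1, 1], which gives the network a richer signal about "how noisy is this input" than a single raw integer would.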
Appendix
Gaussian distribution properties
The Gaussian distribution is also known as the normal distribution, and its expression is:
$f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
where $\mu$ is the mean and $\sigma^2$ is the variance. If a random variable $X$ follows a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, this is written $X \sim N(\mu, \sigma^2)$. One more thing to know: if a random variable follows a Gaussian distribution and we know its mean and variance, we can write down its expression.
The Gaussian distribution also has some very good properties. Here are some examples to help you understand.
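- If $X \sim N(\mu, \sigma^2)$ and $a$, $b$ are real numbers, then $aX + b \sim N(a\mu + b,\ a^2\sigma^2)$.
- If $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, then $X + Y \sim N(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$.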
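The scaling and addition behaviour of Gaussian variables can be checked by simulation. The sketch below uses only the Python standard library; the particular means, variances, sample size, and the helper name `mean_and_variance` are arbitrary choices for illustration.

```python
import random


def mean_and_variance(values):
    """Sample mean and (population) variance of a list of floats."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v


rng = random.Random(42)
n = 100_000
a, b = 3.0, -1.0

xs = [rng.gauss(2.0, 0.5) for _ in range(n)]    # X with mean 2, variance 0.25
ys = [rng.gauss(-1.0, 1.5) for _ in range(n)]   # Y with mean -1, variance 2.25, independent of X

# Scaling and shifting: expect mean a*2 + b = 5 and variance a^2 * 0.25 = 2.25.
scaled_mean, scaled_var = mean_and_variance([a * x + b for x in xs])

# Adding independent variables: expect mean 2 + (-1) = 1 and variance 0.25 + 2.25 = 2.5.
summed_mean, summed_var = mean_and_variance([x + y for x, y in zip(xs, ys)])

print(scaled_mean, scaled_var)
print(summed_mean, summed_var)
```

The simulated means and variances land close to the predicted values, which is the behaviour the forward-process derivation relies on when it merges two noise terms into one.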
Reference content: Detailed explanation of Diffusion Model principle and source code analysis - Zhihu (zhihu.com)