Stable Diffusion Core Algorithm DDPM Analysis

DDPM stands for Denoising Diffusion Probabilistic Model.

Reference for this article: "A video to understand the principle derivation of the diffusion model DDPM | the underlying model of AI painting" (哔哩哔哩 bilibili).

1. General principle

Reading the usual diagram from right to left, x_0 \rightarrow x_T is the forward noise-adding process; from left to right, x_T \rightarrow x_0 is the reverse denoising process.

Noise is added continuously in the forward process; after T steps we reach x_T, and we want x_T \sim N(0,1).

In this way, during inference we can sample a random x_T{'} from N(0,1) (the prime ' indicates that this is a new value).

If we can learn the denoising step x_t \rightarrow x_{t-1}, we can eventually go x_T{'} \rightarrow x_0{'} and obtain a new image.

2. What does the denoising step of the diffusion model predict?

What needs to be learned is the denoising step x_t \rightarrow x_{t-1}. The DDPM algorithm does not learn to predict the value of x_{t-1} directly; instead it predicts the conditional probability distribution p(x_{t-1}|x_t) and then obtains x_{t-1} from that distribution. This is similar to the DeepAR forecasting method, in which distributions are predicted instead of point values.

So why predict a distribution instead of the exact value of x_{t-1}?

Because x_{t-1}{'} can be sampled from the distribution, the model has randomness.

Furthermore, once we have p(x_{t-2}|x_{t-1}) we can obtain x_{t-2}{'} by sampling, and step by step we can go x_T{'} \rightarrow x_0{'}. Therefore, what we want to learn is the distribution p, not an exact image.

Conclusion: The whole learning process is predicting the distribution p .

As we will see later, the model actually predicts noise, but this is not the noise between x_t and x_{t-1}; it is the noise \varepsilon involved in computing the mean \mu of the normal distribution p.

So we predict \varepsilon to obtain \mu, and from \mu we obtain p. This again confirms our conclusion: the whole learning process is predicting the distribution p.

3. Decomposing the conditional probability distribution

Formula 1: p(x_{t-1}|x_t)=\frac{p(x_t|x_{t-1})\cdot p(x_{t-1})}{p(x_t)}. The original conditional distribution is rewritten using Bayes' rule, and the new expression contains three probability distributions.

(1) Calculation of the first p

The first p is p(x_t|x_{t-1}).

This is the probability distribution of the noise-adding step from x_{t-1} to x_t. Because the noise-adding process is defined in advance, this distribution can also be written down directly.

We now define the noise-adding process as follows:

Formula 2: x_t=\sqrt{\alpha_t}\cdot x_{t-1} + \sqrt{\beta_t}\cdot\varepsilon_t, where \varepsilon_t\sim N(0,1) is the noise and \beta_t=1-\alpha_t.

Because \varepsilon_t\sim N(0,1), we have \sqrt{\beta_t} \cdot \varepsilon_t \sim N(0, \beta_t) (note: scaling by \sqrt{\beta_t} multiplies the variance by its square, i.e. by \beta_t).

\beta_t can be viewed as the variance of the added noise and needs to be very small, close to 0. Only when the added noise is small do both the forward and the reverse steps follow normal distributions.

It follows that x_t \sim N(\sqrt{\alpha_t} \cdot x_{t-1}, \beta_t), i.e.:

Formula 3: p(x_t|x_{t-1}) \sim N(\sqrt{\alpha_t} \cdot x_{t-1}, \beta_t).
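As an illustration of Formula 2, here is a minimal sketch of one noise-adding step in PyTorch. The linear schedule for \beta_t (from 1e-4 to 0.02 over 1000 steps, as in the DDPM paper) and all variable names are our own illustrative assumptions, not part of the original post.

```python
import torch

# Minimal sketch of Formula 2: one noise-adding step x_{t-1} -> x_t.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # beta_t: small variances close to 0
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t

def add_noise_step(x_prev, t):
    """x_t = sqrt(alpha_t) * x_{t-1} + sqrt(beta_t) * eps_t, with eps_t ~ N(0, 1)."""
    eps_t = torch.randn_like(x_prev)
    return alphas[t].sqrt() * x_prev + betas[t].sqrt() * eps_t

x0 = torch.rand(1, 3, 64, 64) * 2 - 1   # dummy image scaled to [-1, 1]
x1 = add_noise_step(x0, t=0)            # p(x_1|x_0) ~ N(sqrt(alpha_1) * x_0, beta_1)
```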

(2) Calculation of the third p

The third p is p(x_t), and it is similar to the second p, p(x_{t-1}): if we find a way to compute one of them, the other can be obtained in the same way.

In the previous step we obtained Formula 2 for each noise-adding step and Formula 3 for the conditional probability distribution of that step.

The noise-adding process is x_0 \rightarrow x_1 \rightarrow \cdots \rightarrow x_{t-1} \rightarrow x_t, so x_t can be expressed in terms of x_0, and we can condition everything on x_0.

Rewriting Formula 1 with every distribution conditioned on x_0:

Formula 4: p(x_{t-1}|x_t,x_0)=\frac{p(x_t|x_{t-1},x_0)\cdot p(x_{t-1}|x_0)}{p(x_t|x_0)}

Because the noise-adding process is a Markov process, p(x_t|x_{t-1},x_0) depends only on the previous step x_{t-1} and not on x_0, so p(x_t|x_{t-1},x_0)=p(x_t|x_{t-1}).

p(x_t|x_0) describes going from x_0 to x_t step by step and cannot be simplified further. Formula 4 therefore simplifies to:

Formula 5: p(x_{t-1}|x_t,x_0)=\frac{p(x_t|x_{t-1})\cdot p(x_{t-1}|x_0)}{p(x_t|x_0)}

Now compute the new third p, p(x_t|x_0), by repeatedly expanding Formula 2 (note: the empty parentheses () stand for coefficients that are not written out; unimportant details are omitted):

x_t=\sqrt{\alpha_t} \cdot x_{t-1} + \sqrt{\beta_t} \cdot \varepsilon_t \\ =\sqrt{\alpha_t}(\sqrt{\alpha_{t-1}} \cdot x_{t-2}+\sqrt{\beta_{t-1} }\cdot \varepsilon_{t-1}) + \sqrt{\beta_t} \cdot \varepsilon_t \\ = \cdots \\ =\sqrt{\alpha_t \cdots \alpha_1} \cdot x_0 + ()\varepsilon_t + \cdots + ()\varepsilon_2 + ()\varepsilon_1 \\ = \sqrt{\alpha_t \cdots \alpha_1} \cdot x_0 + ()\varepsilon
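To make the omitted coefficients explicit, here is the two-step case written out (our own filling-in of the skipped step, following the formulas above): the sum of two independent zero-mean Gaussians is again Gaussian, with the variances added.

x_t = \sqrt{\alpha_t \alpha_{t-1}} \cdot x_{t-2} + \sqrt{\alpha_t \beta_{t-1}} \cdot \varepsilon_{t-1} + \sqrt{\beta_t} \cdot \varepsilon_t, \quad \sqrt{\alpha_t \beta_{t-1}} \cdot \varepsilon_{t-1} + \sqrt{\beta_t} \cdot \varepsilon_t \sim N(0,\ \alpha_t\beta_{t-1} + \beta_t) = N(0,\ 1-\alpha_t\alpha_{t-1})

So the two noise terms merge into a single \varepsilon with variance 1-\alpha_t\alpha_{t-1}; repeating this all the way down to x_0 gives the variance 1-\bar{\alpha_t} that appears in Formula 6 below.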

Finally, skipping the non-rigorous details, the standard result is:

Formula 6: p(x_t|x_0) \sim N(\sqrt{\bar{\alpha_t}} \cdot x_0, 1- \bar{\alpha_t}), where \bar{\alpha_t} = \alpha_1 \alpha_2 \cdots \alpha_t denotes the cumulative product.

(3) Solving the diffusion formulas

Having obtained p(x_t|x_0) in the previous step, p(x_{t-1}|x_0) can be obtained in the same way.

The standard result for Formula 4 is given directly:

Formula 7: p(x_{t-1}|x_t, x_0) \sim N(\bar{\mu}(x_0,x_t), \tilde{\beta_t})

Here \tilde{\beta_t} is a fixed quantity determined by the noise schedule (in DDPM, \tilde{\beta_t}=\frac{1-\bar{\alpha_{t-1}}}{1-\bar{\alpha_t}} \cdot \beta_t), and \bar{\mu}(x_0,x_t) is given by:

Formula 8: \bar{\mu}(x_0,x_t)=\frac{\sqrt{\bar{\alpha_{t-1}}} \cdot \beta_t}{1-\bar{\alpha_t}} \cdot x_0 + \frac{\sqrt{\alpha_t} \cdot(1-\bar{\alpha_{t-1}})}{1-\bar{\alpha_t}} \cdot x_t

Because \tilde{\beta_t} is fixed, the task of finding p(x_{t-1}|x_t,x_0) reduces to finding \bar{\mu}(x_0,x_t).

Once \bar{\mu}(x_0,x_t) is known, the predicted value at inference time can be obtained from the following formula:

Formula 9: x_{t-1}=\bar{\mu}(x_0, x_t)+ \sqrt{\tilde{\beta_t}} \cdot \varepsilon_t, where \varepsilon_t \sim N(0,1).

If x_{t-1} were drawn from p(x_{t-1}|x_t,x_0) directly (e.g. by passing the mean and variance to a sampling routine of a Python package), the sampling operation would not be differentiable, which causes a problem for the reverse process. The reparameterization trick therefore converts the sampling into Formula 9, a differentiable expression for x_{t-1}.
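As a small illustration of this point (a toy PyTorch example of our own, not from the original post): sampling through a distribution object cuts the computation graph, while the reparameterized form mean + std * noise keeps it.

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)
sigma = torch.tensor([0.1])

# Sampling through the distribution object: the graph is cut, no gradient reaches mu.
x_sampled = torch.distributions.Normal(mu, sigma).sample()
print(x_sampled.requires_grad)   # False

# Reparameterized form (the shape of Formula 9): x = mu + sigma * eps with eps ~ N(0, 1).
eps = torch.randn_like(mu)
x_reparam = mu + sigma * eps
print(x_reparam.requires_grad)   # True: gradients can flow back to mu
x_reparam.sum().backward()
print(mu.grad)                   # tensor([1.])
```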

In the inference stage, x_0 is the value we ultimately want and is unknown, so we need a formula that expresses it through known quantities.

Applying the reparameterization trick to Formula 6 gives:

Formula 10: x_t=\sqrt{\bar{\alpha_t}} \cdot x_0 + \sqrt{1-\bar{\alpha_t}} \cdot \varepsilon_t, and rearranging gives:

Formula 11: x_0=\frac{1}{\sqrt{\bar{\alpha_t}}}(x_t - \sqrt{1-\bar{\alpha_t}} \cdot \varepsilon_t), where t is the current noise-adding step and changes over time. Note that this x_0 is only an intermediate quantity and cannot be used as the final prediction, because the reverse process must follow the Markov chain and reach x_0 step by step.

In Formula 7 the unknown is \bar{\mu}(x_0,x_t); within it the unknown is x_0; and within x_0 (Formula 11) the unknown is \varepsilon_t, which cannot be computed from the existing formulas.

So we use a UNet network: the input is x_t and the output is \varepsilon_t.

Substituting Formula 11 into Formula 8, we get:

Formula 12: \bar{\mu}(x_0, x_t)=\frac{1}{\sqrt{\alpha_t}}(x_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha_t}}} \cdot \varepsilon_t), in which everything except \varepsilon is known.

\varepsilon is predicted by the UNet network and can be written as \varepsilon_\theta(x_t, t), where \theta denotes the parameters of the UNet model.

The process by which the diffusion model obtains the predicted image through the UNet network:

UNet \rightarrow \varepsilon_t \rightarrow x_0 \rightarrow \bar{\mu}(x_0,x_t) \rightarrow p(x_{t-1}|x_t,x_0) \rightarrow x_{t-1}{'} \rightarrow \cdots \rightarrow x_0{'}
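This chain can be sketched as a single denoising step. The following is a minimal sketch under our own assumptions (not from the original post): eps_model stands for any UNet-like noise predictor \varepsilon_\theta(x_t, t), and alphas, betas, alpha_bars are schedule tensors precomputed as in the earlier snippet; the variance here simply uses \beta_t (the post's \tilde{\beta_t} could be used instead).

```python
import torch

# Minimal sketch of one denoising step x_t -> x_{t-1}, following the chain above.
# Assumed to exist: eps_model, alphas, betas, and alpha_bars = torch.cumprod(alphas, dim=0).
@torch.no_grad()
def denoise_step(eps_model, x_t, t):
    eps_pred = eps_model(x_t, t)                        # UNet predicts the noise
    alpha_t, beta_t, alpha_bar_t = alphas[t], betas[t], alpha_bars[t]

    # Formula 12: mean of p(x_{t-1}|x_t, x_0), written with the predicted noise.
    mu = (x_t - beta_t / (1.0 - alpha_bar_t).sqrt() * eps_pred) / alpha_t.sqrt()
    if t == 0:
        return mu                                       # final step: no extra noise is added
    # Formula 9: reparameterized sample; sigma_t^2 = beta_t is used as a simple choice.
    z = torch.randn_like(x_t)
    return mu + beta_t.sqrt() * z
```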

The above is the most important logic of the diffusion model DDPM .

4. Model training

According to Formula 12, the UNet network only needs to be trained to predict the normally distributed noise \varepsilon.

Question 1: What is the input and output during model training?

Answer: the input is x_t (together with the step t), and the output is \varepsilon_t.

Question 2: So which process performs the training of UNet network parameters?

Answer: the noise-adding process. The noise-adding process is the training phase, and the denoising process is the inference phase.

According to Formula 2, the noise used in the noise-adding process is defined by us, so we can compare the predicted noise \hat{\varepsilon} with the true \varepsilon via KL divergence to compute the loss. In the original paper, the KL divergence simplifies to the MSE between the two values.

Question 3: Is it deduced step by step during training?

Answer: no. During training, according to Formula 10, x_t=\sqrt{\bar{\alpha_t}} \cdot x_0 + \sqrt{1-\bar{\alpha_t}} \cdot \varepsilon_t, so x_t can be computed directly from the four values \bar{\alpha_t}, x_0, \varepsilon_t and t.

\bar{\alpha_t} can be computed in advance and stored in memory; x_0 is an image from the input data set; \varepsilon_t is the input noise; and t is the noise-adding step.

Therefore x_t can be obtained directly for any forward step, without looping over the intermediate steps.

5. Pseudo-code implementation of training and inference

(1) Training stage

Interpretation:

x_0 \sim q(x_0) means sampling an image from the data set.

t \sim Uniform(\left\{1,...,T\right\}) means a noise-adding step is selected at random; as mentioned earlier, the noise-adding process does not need to be carried out step by step.

\sqrt{\bar{\alpha_t}} \cdot x_0 + \sqrt{1-\bar{\alpha_t}} \cdot \varepsilon_t is x_t, i.e. Formula 10.
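A minimal training-step sketch matching this pseudocode, under our own assumptions (not from the original post): model(x_t, t) is the UNet noise predictor, optimizer is its optimizer, x0 is a batch of images, and the \beta_t schedule is the linear one assumed earlier.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)            # \bar{alpha}_t, precomputed once

def train_step(model, optimizer, x0):
    t = torch.randint(0, T, (x0.shape[0],))               # t ~ Uniform({1, ..., T})
    eps = torch.randn_like(x0)                            # eps ~ N(0, 1)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # Formula 10: jump straight to x_t
    loss = F.mse_loss(model(x_t, t), eps)                 # KL simplified to MSE on the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```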

(2) Inference stage

 

 Interpretation:

for t=T,...,1 do means the reverse process must be carried out step by step.

The more complex calculation in step 4 corresponds to Formula 9, and the first term inside it corresponds to Formula 12.
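A minimal sampling-loop sketch matching this pseudocode, reusing the denoise_step function sketched earlier; the image shape and function names are illustrative assumptions.

```python
import torch

# Reverse process: start from pure noise and denoise step by step.
@torch.no_grad()
def sample(eps_model, shape=(1, 3, 64, 64), T=1000):
    x_t = torch.randn(shape)                  # x_T' drawn from N(0, 1)
    for t in reversed(range(T)):              # for t = T, ..., 1: step-by-step reverse process
        x_t = denoise_step(eps_model, x_t, t)
    return x_t                                # x_0': the generated image
```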


Source: https://blog.csdn.net/benben044/article/details/132331725