Paper: Denoising Diffusion Implicit Models
Code: https://github.com/CompVis/stable-diffusion/blob/main/ldm/models/diffusion/ddim.py
Source: ICLR 2021
Time: 2021.10.05
DDIM contributions:
- Points out that the DDPM objective $L_{sample}$ does not depend on the specific form of the joint distribution of the diffusion process, so training with $L_{sample}$ is equivalent to training a whole family of latent-variable diffusion models
- Constructs a more general, non-Markovian diffusion process that satisfies the same marginal distributions, so a trained DDPM model can be reused directly
- Derives a more general sampling algorithm from this process
- Proposes the respacing technique to reduce the number of sampling steps: since $L_{sample}$ does not depend on the joint distribution, some steps can be skipped during sampling, and DDIM matches DDPM's sample quality with 10x~50x fewer steps
Related posts:
Station B: https://www.bilibili.com/read/cv25126637/
DDPM: https://zhuanlan.zhihu.com/p/638442430
DDIM: https://www.zhangzhenhu.com/aigc/ddim.html#equation-eq-ddim-226
Non-Markov: https://zhuanlan.zhihu.com/p/627616358
1. Background
DDPM achieves strong image generation quality without an adversarial network, but producing a good sample requires many denoising steps in the reverse process (about 1000).
The forward noising process of DDPM is a Markov chain, that is, the state at the current step depends only on the state at the previous step; the reverse denoising process is the corresponding inverse Markov chain.
This follows from the principle of the diffusion model: generation gradually restores the image from noise step by step, and however many steps the forward diffusion takes, the reverse denoising needs the same number. Compared with a GAN, which produces an image in a single forward pass, this is very slow.
- DDPM: about 20 h to sample 50k 32x32 images (about 1000 h for 50k 256x256 images)
- GAN: about 1 min for the same task
To improve the sampling speed of DDPM, the authors propose denoising diffusion implicit models (DDIMs), which share DDPM's objective function. DDIM constructs the forward process as a non-Markov process, so the reverse process is the inverse of a non-Markov process as well. Precisely this non-Markov construction makes DDIM's generation process more deterministic, able to generate high-quality samples faster, and better balanced between computation and sample quality.
Why DDIM can accelerate:
DDIM can speed up DDPM by 10x~50x. The acceleration comes from the non-Markov process it uses: DDPM needs the same number of steps forward and backward because its Markov construction forces step-by-step recursion, whereas DDIM's non-Markov process allows some steps to be skipped, i.e. skip sampling.
Characteristics of DDIM: DDIM is very similar to DDPM and trains with the same objective function, so a model trained as a DDPM can be reused directly, modifying only the sampling procedure.
- The reverse process is non-Markovian: in DDPM each forward and reverse step is Markovian, i.e. the result at the current step $x_{t-1}$ depends only on $x_t$ from the previous step, whereas DDIM's reverse step conditions on both $x_t$ and $x_0$, so it can generate images in fewer steps and improve sampling efficiency (a 10x~50x speedup)
- Better consistency: with the same initial value, samples generated with different step counts share similar high-level features, because the generation process is deterministic
- Interpolation is possible: thanks to this consistency of DDIM's generated results, meaningful semantic interpolation can be obtained
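Because deterministic DDIM maps each initial noise $x_T$ to a fixed sample, interpolating between two $x_T$ latents and decoding each point yields semantic interpolation. A minimal sketch of the spherical linear interpolation (slerp) commonly used for Gaussian latents; the function name and signature are illustrative, not from the post:

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical linear interpolation between two latent noise vectors z0, z1
    at fraction t in [0, 1]; preserves the norm structure of Gaussian latents."""
    # Angle between the two (normalized) latents.
    cos_theta = np.dot(z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * z0 + np.sin(t * theta) * z1) / np.sin(theta)
```

Each interpolated latent would then be run through the same deterministic DDIM sampler to obtain the intermediate images.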
2. How to improve DDIM
2.1 Principle review of DDPM
DDIM and DDPM differ in notation, which needs attention:
- $\overline{\alpha}_t$ in DDPM is written $\alpha_t$ in DDIM
Given a data distribution $q(x_0)$ from which samples can be drawn, a generative model is concerned with learning this distribution: it learns a $p_{\theta}(x_0)$ to approximate $q(x_0)$, and once the approximation is good enough, new samples can be generated from $p_{\theta}(x_0)$.
In DDPM, the data distribution is modeled as a latent-variable model:

$$p_{\theta}(x_0) = \int p_{\theta}(x_{0:T})\,dx_{1:T}, \qquad p_{\theta}(x_{0:T}) = p_{\theta}(x_T)\prod_{t=1}^{T} p_{\theta}^{(t)}(x_{t-1}|x_t)$$

The joint distribution here is the product on the right, a chain of Markov transitions; the generative process simulates the inverse of the forward diffusion's Markov process. Here $x_1$ through $x_T$ have the same dimensionality as $x_0$.
The objective function in DDPM maximizes the log-likelihood; since this is a variational model, a variational lower bound is optimized instead.
In DDPM the noising process is a Markov chain, i.e. a product of per-step transitions, and each conditional $q(x_t|x_{t-1})$ is Gaussian, with mean and variance as shown in formula 3.
Since the forward diffusion is a Markov chain, the reverse denoising is also a Markov chain, and the model approximates the inverse transitions $q(x_{t-1}|x_t)$.
The forward diffusion process has a special property: $q(x_t|x_0)$ can be written in closed form as a new Gaussian, which is also a marginal distribution (formula 4, in DDIM notation):

$$q(x_t|x_0) = N\big(\sqrt{\alpha_t}\,x_0,\ (1-\alpha_t)I\big)$$

Given this form, $x_t$ can be expressed from $x_0$ by introducing noise $\epsilon$:

$$x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon, \qquad \epsilon \sim N(0, I)$$
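The one-shot noising of formula 4 can be sketched in NumPy; the linear beta schedule and function names here are illustrative assumptions, not taken from the post:

```python
import numpy as np

def make_alphas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative alpha_t in DDIM notation (DDPM's alpha-bar) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alphas, eps=None):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_t) x_0, (1 - alpha_t) I) in a single shot,
    without iterating over the t intermediate Markov steps."""
    if eps is None:
        eps = np.random.randn(*x0.shape)
    a_t = alphas[t]
    return np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
```

This closed form is what lets training pick a random timestep directly instead of simulating the whole chain.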
When $\alpha_T$ is set close to 0, $q(x_T|x_0)$ approaches the standard Gaussian, so one sets $p_{\theta}(x_T) := N(0, I)$; that is, the $p_{\theta}(x_T)$ distribution is close to the standard Gaussian, so sampling can be initialized from a Gaussian and then recover $x_0$.
The objective function of DDPM, a rewriting of Equation 2 under the assumption that the learned Gaussians have fixed variance and learned mean, can be written as the squared difference between predicted and true noise:

$$L_{\gamma}(\epsilon_{\theta}) = \sum_{t=1}^{T} \gamma_t\, \mathbb{E}\big[\|\epsilon_{\theta}^{(t)}(x_t) - \epsilon_t\|^2\big]$$

where $\epsilon_t$ is the true noise and DDPM takes the coefficients $\gamma_t = 1$. With $\gamma_t = 1$ this is in fact a score-based model, which further shows that to train with this objective the forward diffusion need not be a Markov process; it only has to satisfy the marginal in formula 4.
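The objective above can be sketched as a per-sample Monte Carlo estimate; `eps_model` stands for any noise-prediction network and the schedule is an assumed linear one, so this is a sketch, not the paper's training code:

```python
import numpy as np

def ddpm_loss(eps_model, x0, alphas, rng=np.random):
    """One stochastic estimate of L_{gamma} with gamma_t = 1: sample a uniform
    timestep t and true noise eps, form x_t via the closed-form marginal,
    and take the MSE between eps and the model's prediction."""
    T = len(alphas)
    t = rng.randint(T)
    eps = rng.randn(*x0.shape)
    x_t = np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * eps
    return np.mean((eps_model(x_t, t) - eps) ** 2)
```

Note that only the marginal $q(x_t|x_0)$ appears here: nothing in the loss touches the joint distribution, which is exactly the observation DDIM exploits.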
In DDPM, T is set relatively large (1000) so that $p_{\theta}(x_T)$ tends to the standard Gaussian $N(0, I)$ and so that each noising step and its inverse can both be treated as Gaussian; this also makes the generation process approximately Gaussian, but it makes sampling very time-consuming.
Characteristics of the DDPM loss function:
- Since the noise is sampled from $q(x_t|x_0)$, the loss function is determined only by $q(x_t|x_0)$; it can also be shown that the loss is a form of score matching
- That is, the loss depends only on the marginal distribution and not directly on the joint distribution, so the form of the joint distribution does not affect model training
Is there, then, a non-Markov process that achieves the same kind of noising? It only needs $q(x_t|x_0)$ to match DDPM, while $q(x_{1:T}|x_0)$ may differ. Can we avoid the step-by-step Markov recursion and instead use the more general form $q(x_t|x_{t-1}, x_0)$, as long as $q(x_t|x_0)$ stays unchanged?
The DDIM authors therefore give a non-Markovian forward diffusion process together with the expression for its posterior distribution, and this posterior exactly preserves the marginal $q(x_t|x_0)$.
2.2 Non-Markov forward diffusion process of DDIM
Consider a new family of distributions with a new parameter $\sigma$: a real number, greater than or equal to 0, a hyperparameter that is not trained and whose value can be chosen freely.
The forward diffusion process is defined in a new way (Equation 6), and the posterior is still Gaussian (Equation 7):

$$q_{\sigma}(x_{1:T}|x_0) = q_{\sigma}(x_T|x_0)\prod_{t=2}^{T} q_{\sigma}(x_{t-1}|x_t, x_0)$$

$$q_{\sigma}(x_{t-1}|x_t, x_0) = N\Big(\sqrt{\alpha_{t-1}}\,x_0 + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\frac{x_t - \sqrt{\alpha_t}\,x_0}{\sqrt{1-\alpha_t}},\ \sigma_t^2 I\Big)$$
Based on formulas 6 and 7, it can be proved that at every step $q_{\sigma}(x_t|x_0)$ is still the Gaussian $N(\sqrt{\alpha_t}\,x_0, (1-\alpha_t)I)$, the same marginal as in DDPM.
proving process:
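The analytic proof proceeds by induction over t (see the paper's appendix). The marginal-preservation property can also be checked numerically: pushing samples of $q(x_t|x_0)$ through the $q_{\sigma}$ posterior must reproduce the moments of $q(x_{t-1}|x_0)$. The concrete values below are arbitrary illustrations:

```python
import numpy as np

# Arbitrary illustrative values with sigma^2 <= 1 - a_prev.
a_t, a_prev, sigma = 0.5, 0.7, 0.1
x0 = 1.3                                  # a fixed scalar data point
n = 200_000

# Sample x_t ~ q(x_t | x_0) = N(sqrt(a_t) x0, 1 - a_t).
x_t = np.sqrt(a_t) * x0 + np.sqrt(1 - a_t) * np.random.randn(n)

# Apply the DDIM posterior q_sigma(x_{t-1} | x_t, x_0) from Eq. 7.
mean = (np.sqrt(a_prev) * x0
        + np.sqrt(1 - a_prev - sigma**2) * (x_t - np.sqrt(a_t) * x0) / np.sqrt(1 - a_t))
x_prev = mean + sigma * np.random.randn(n)

# Empirical moments should match N(sqrt(a_prev) x0, 1 - a_prev).
print(x_prev.mean(), x_prev.std())
```

The empirical mean and standard deviation land on $\sqrt{\alpha_{t-1}}\,x_0$ and $\sqrt{1-\alpha_{t-1}}$ regardless of the chosen $\sigma$, which is the marginal-invariance claim.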
Summarize:
- The DDIM authors design a non-Markov forward diffusion process whose marginal distribution is still the same as in DDPM, so the network can continue to be trained with the DDPM objective
Comparison of the DDPM and DDIM posterior distributions:
- DDIM has one extra hyperparameter $\sigma$, which affects the posterior distribution and therefore the reparameterization in the sampling process: the distribution constructed by the neural network approximates the posterior, and the sample at step t-1 is drawn from it, so one first computes the predicted posterior mean and variance and then applies the reparameterization trick to the noise. Different values of $\sigma$ give different means and variances, hence different reparameterized samples.
2.3 Sampling of non-Markov diffusion inverse process
The next step is to define a trainable generative process $p_{\theta}(x_{0:T})$, where each $p_{\theta}^{(t)}(x_{t-1}|x_t)$ approximates $q_{\sigma}(x_{t-1}|x_t, x_0)$. That is, given a noisy sample $x_t$, first predict $x_0$, then use the conditional distribution $q_{\sigma}(x_{t-1}|x_t, x_0)$ to obtain $x_{t-1}$.
DDPM does not predict $x_0$; it predicts the noise $\epsilon$.
But in fact, as shown in Equation 4, once the noise $\epsilon$ is obtained, $x_0$ can be deduced.
So, via formula 4, plugging the predicted noise and $x_t$ in yields $f_{\theta}$, the denoised observation given $x_t$, i.e. the $x_0$ predicted at the current step t (Equation 9):

$$f_{\theta}^{(t)}(x_t) = \frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_{\theta}^{(t)}(x_t)}{\sqrt{\alpha_t}}$$

With this predicted $x_0$, the posterior can be used as the approximation target. It is defined piecewise: for t > 1 it is the $q_{\sigma}$ posterior with $x_0$ replaced by $f_{\theta}^{(t)}(x_t)$, and for t = 1 it is a Gaussian centered on the prediction, as shown in Equation 10.
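Equation 9 is just the one-shot noising formula solved for $x_0$; a minimal sketch (function name assumed):

```python
import numpy as np

def predict_x0(x_t, eps_pred, a_t):
    """Eq. 9: recover the denoised observation f_theta(x_t) by inverting
    x_t = sqrt(a_t) x_0 + sqrt(1 - a_t) eps for x_0."""
    return (x_t - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)
```

If the noise prediction were exact, this would return the original $x_0$ exactly; in practice it is the model's current best guess at step t.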
A special case of sampling, DDIM: $\sigma = 0$
Formula 12 below is the reparameterized form of the proposed posterior, i.e. the process of generating $x_{t-1}$ from $x_t$ as mean + standard deviation * noise:

$$x_{t-1} = \sqrt{\alpha_{t-1}}\left(\frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_{\theta}^{(t)}(x_t)}{\sqrt{\alpha_t}}\right) + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\,\epsilon_{\theta}^{(t)}(x_t) + \sigma_t\,\epsilon_t$$

Different $\sigma$ lead to different means and standard deviations, so the sampled results differ, but the objective function, and hence the model $\epsilon_{\theta}$, stays the same: for different $\sigma$ there is no need to retrain the model, $\sigma$ only affects the sampling.
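One reverse step of Equation 12 can be sketched as follows. The $\eta$ knob that scales $\sigma_t$ between the DDIM and DDPM extremes is an assumption borrowed from common implementations (e.g. the linked ddim.py), not notation from this post:

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_t, a_prev, eta=0.0, noise=None):
    """Eq. 12: x_{t-1} = sqrt(a_prev)*x0_pred
    + sqrt(1 - a_prev - sigma^2)*eps_pred + sigma*z.
    eta=0 gives deterministic DDIM; eta=1 gives the DDPM-like sigma."""
    sigma = eta * np.sqrt((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - a_t / a_prev)
    x0_pred = (x_t - np.sqrt(1 - a_t) * eps_pred) / np.sqrt(a_t)   # Eq. 9
    direction = np.sqrt(1 - a_prev - sigma**2) * eps_pred          # points back toward x_t
    if noise is None:
        noise = np.random.randn(*np.shape(x_t))
    return np.sqrt(a_prev) * x0_pred + direction + sigma * noise
```

With `eta=0` the noise term is multiplied by zero, so the step is a deterministic function of $x_t$, which is exactly the DDIM special case discussed next.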
When $\sigma_t = \sqrt{(1-\alpha_{t-1})/(1-\alpha_t)}\,\sqrt{1-\alpha_t/\alpha_{t-1}}$, this degenerates into the DDPM generation process (the Markov-chain generation process).
When $\sigma = 0$, the random term vanishes and sampling is deterministic; this is DDIM.
The author also deduces why DDIM can be trained directly using the objective function of DDPM:
To summarize DDIMs:
Construct a more general non-Markov process; setting the hyperparameter $\sigma$ to 0 turns it into a deterministic sampling process.
2.4 Accelerated sampling - Respacing
DDIM as described above is just a model with no built-in acceleration; the speedup comes from a trick that can be applied on top of the model with very little quality loss.
In DDPM the forward and backward processes each have T steps, but L1 (that is, $L_{sample}$) does not actually depend on the forward process, Markov chain or not; it only requires $q_{\sigma}(x_t|x_0)$ to be fixed.
To speed up sampling, instead of iterating step by step over the full sequence $x_{1:T}$, the author selects a subset $\{x_{\tau_1}, x_{\tau_2}, ..., x_{\tau_S}\}$ of S elements.
So, one defines the forward process on the subsequence, $q(x_{\tau_i}|x_0) = N(\sqrt{\alpha_{\tau_i}}\,x_0, (1-\alpha_{\tau_i})I)$, to match the previously defined marginal distribution, so that the generation process can run directly on the subsequence. In other words, training uses the full sequence while generation uses a subsequence; as long as the subsequence is short and the quality degrades little, sampling is accelerated.
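Respacing plus the deterministic step gives the whole accelerated sampler. A minimal sketch with evenly spaced timesteps; the even-spacing choice, `eps_model` interface, and function names are illustrative assumptions:

```python
import numpy as np

def respaced_timesteps(T=1000, S=50):
    """Pick an evenly spaced subsequence tau_1..tau_S of the T training steps."""
    return np.linspace(0, T - 1, S).round().astype(int)

def ddim_sample(eps_model, alphas, shape, S=50):
    """Deterministic DDIM sampling (sigma = 0) over the subsequence only:
    each update jumps directly from tau_i to tau_{i-1}."""
    taus = respaced_timesteps(len(alphas), S)
    x = np.random.randn(*shape)                      # x_T ~ N(0, I)
    for i in range(S - 1, 0, -1):
        t, t_prev = taus[i], taus[i - 1]
        a_t, a_prev = alphas[t], alphas[t_prev]
        eps = eps_model(x, t)
        x0_pred = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)    # Eq. 9
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1 - a_prev) * eps  # Eq. 12, sigma = 0
    return x
```

The model is still trained on all T steps; only the sampling loop runs over the S chosen indices, which is where the 10x~50x wall-clock saving comes from.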