[Diffusion Model] 3. DDIM | Accelerate the sampling speed of DDPM


Paper: Denoising Diffusion Implicit Models

Code: https://github.com/CompVis/stable-diffusion/blob/main/ldm/models/diffusion/ddim.py

Source: ICLR 2021

Time: 2021.10.05

DDIM Contributions:

  • Points out that the DDPM training objective $L_{\text{simple}}$ depends only on the marginals $q(x_t\mid x_0)$ and is not directly tied to the specific form of the joint distribution of the diffusion process, so training with $L_{\text{simple}}$ is in effect training a whole family of latent-variable diffusion models
  • Constructs a more general, non-Markovian diffusion process that still satisfies the same marginal distributions, so the trained DDPM model can be reused directly
  • Derives the corresponding, more general sampling algorithm
  • Proposes the respacing technique to reduce the number of sampling steps: because $L_{\text{simple}}$ does not depend on the joint distribution, some steps can be skipped during sampling, and DDIM obtains sample quality comparable to DDPM with far fewer steps (roughly 10x~50x faster sampling)

Other introductions (Zhihu, WeChat official account, Bilibili):

  • Bilibili article: https://www.bilibili.com/read/cv25126637/
  • DDPM: https://zhuanlan.zhihu.com/p/638442430
  • DDIM: https://www.zhangzhenhu.com/aigc/ddim.html#equation-eq-ddim-226
  • Non-Markovian diffusion: https://zhuanlan.zhihu.com/p/627616358
  • Bilibili video: https://www.bilibili.com/video/BV1JY4y1N7dn/?spm_id_from=333.337.search-card.all.click&vd_source=dff5a38233d0daec447c275bf4070791

1. Background

DDPM achieves good image generation quality without using an adversarial network, but it needs many sampling steps (about 1000) in the reverse denoising process to produce good results, which makes generation slow.

The forward diffusion (noising) process of DDPM is a Markov process: the state at the current step depends only on the state at the previous step. The reverse denoising process of DDPM is the inverse of this Markov chain.

This is inherent to how diffusion models work: generation restores an image from noise step by step, and reverse denoising takes as many steps as forward diffusion. Compared with GANs, which generate an image in a single forward pass, this is very slow.

  • DDPM: sampling 50k 32×32 images takes about 20 hours (about 1000 hours for 50k 256×256 images)
  • GAN: about 1 minute

To improve the sampling speed of DDPM, the authors propose denoising diffusion implicit models (DDIMs), which have the same training objective as DDPM. DDIM constructs the forward process as a non-Markovian process, so the reverse (generative) process is the inverse of a non-Markovian process. Precisely because of this non-Markovian construction, the generation process of DDIM can be made deterministic, can generate high-quality samples faster, and offers a better trade-off between computation and sample quality.

Why DDIM can accelerate sampling:

DDIM can speed up DDPM by roughly 10x~50x. The acceleration comes from the non-Markovian process it uses: DDPM needs the same number of reverse steps as forward steps because its Markovian construction forces step-by-step recursion, whereas once DDIM drops the Markov assumption, some steps can be skipped during sampling (skip-step sampling).

Characteristics of DDIM: DDIM is very similar to DDPM and has the same training objective, so a model trained as a DDPM can be used directly; only the sampling procedure is modified.

  • The reverse process uses a non-Markovian chain: in DDPM each forward and reverse step is Markovian, i.e., the result at the current step $x_{t-1}$ depends only on $x_t$ from the previous step, whereas DDIM's reverse step conditions on both $x_t$ and (a prediction of) $x_0$. This allows images to be generated with fewer steps and improves sampling efficiency (a 10x~50x speedup)
  • Better consistency: as long as the initial noise is the same, samples generated with different numbers of steps share similar high-level features, because the generation process is deterministic
  • Interpolation is possible: thanks to the good consistency of DDIM's generated results, meaningful semantic interpolation can be done in the initial noise, as sketched below
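Because DDIM's generation is deterministic given the initial noise $x_T$, semantically meaningful interpolation can be done between two noise samples and then decoded. Below is a minimal NumPy sketch of the spherical linear interpolation (slerp) typically used for this; it is purely illustrative, and the deterministic decoding of each interpolated $x_T$ is sketched later in Section 2.3/2.4.

```python
import numpy as np

def slerp(z0, z1, lam):
    """Spherical linear interpolation between two noise tensors z0 and z1 (0 <= lam <= 1)."""
    v0, v1 = z0.ravel(), z1.ravel()
    cos_theta = np.clip(np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1)), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return (np.sin((1.0 - lam) * theta) * z0 + np.sin(lam * theta) * z1) / np.sin(theta)

# Example: interpolate between two initial noise maps x_T; decoding each interpolated
# x_T with the same deterministic DDIM sampler gives images that morph smoothly.
rng = np.random.default_rng(0)
xT_a = rng.standard_normal((3, 32, 32))
xT_b = rng.standard_normal((3, 32, 32))
xT_mid = slerp(xT_a, xT_b, 0.5)
```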

2. How DDIM improves on DDPM

2.1 Review of DDPM

Note that DDIM and DDPM use slightly different notation:

  • $\overline{\alpha}_t$ in DDPM is written as $\alpha_t$ in DDIM

Given a data distribution $q(x_0)$ from which samples can be drawn, a generative model is concerned with learning this distribution. The way to learn it is to fit a model distribution $p_\theta(x_0)$ that approximates $q(x_0)$; when the approximation is good enough, new samples can be generated from $p_\theta(x_0)$.

In DDPM, the data distribution is modeled as a latent-variable model:

$$p_\theta(x_0) = \int p_\theta(x_{0:T})\,dx_{1:T}, \qquad p_\theta(x_{0:T}) := p_\theta(x_T)\prod_{t=1}^{T} p_\theta^{(t)}(x_{t-1}\mid x_t) \tag{1}$$

The joint distribution is the second expression above: a product of conditionals forming a (reverse-direction) Markov chain. The generative process needs to simulate the inverse of the forward diffusion Markov chain. Here $x_1$ through $x_T$ are latent variables with the same dimensionality as $x_0$.

The objective function of DDPM is to maximize the log-likelihood; as a variational model, it optimizes the variational lower bound:

$$\max_\theta\ \mathbb{E}_{q(x_0)}\left[\log p_\theta(x_0)\right] \ \ge\ \max_\theta\ \mathbb{E}_{q(x_0, x_1, \dots, x_T)}\left[\log p_\theta(x_{0:T}) - \log q(x_{1:T}\mid x_0)\right] \tag{2}$$

In DDPM, noise is added through a Markov chain, i.e., the joint distribution factorizes as a product of one-step conditionals, and each conditional $q(x_t\mid x_{t-1})$ is Gaussian with the mean and variance shown in Eq. 3 (written in DDIM's notation):

$$q(x_{1:T}\mid x_0) := \prod_{t=1}^{T} q(x_t\mid x_{t-1}), \qquad q(x_t\mid x_{t-1}) := \mathcal{N}\!\left(\sqrt{\tfrac{\alpha_t}{\alpha_{t-1}}}\,x_{t-1},\ \left(1-\tfrac{\alpha_t}{\alpha_{t-1}}\right) I\right) \tag{3}$$

Since the forward diffusion process is a Markov chain, the reverse denoising process is also modeled as a Markov chain, and what is approximated is the inverse conditional $q(x_{t-1}\mid x_t)$.

The forward diffusion process has a special property: the marginal $q(x_t\mid x_0)$ can be written out in closed form, and it is again a Gaussian:

$$q(x_t\mid x_0) := \mathcal{N}\!\left(\sqrt{\alpha_t}\,x_0,\ (1-\alpha_t)\,I\right) \tag{4}$$

Because of this form, $x_t$ can be expressed in terms of $x_0$ by introducing a noise variable $\epsilon$:

$$x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

When $\alpha_T$ is set close to 0, $q(x_T\mid x_0)$ approaches the standard Gaussian, so one can set $p_\theta(x_T) := \mathcal{N}(0, I)$; that is, sampling can start from a standard Gaussian and then be denoised step by step back to $x_0$.
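As a quick illustration of the reparameterization in Eq. 4, here is a minimal NumPy sketch; the linear beta schedule is only an assumption for the example, not something DDIM prescribes.

```python
import numpy as np

# Noise schedule in DDIM notation: alpha_t is the cumulative product of (1 - beta_s),
# i.e. what DDPM would call alpha_bar_t. A linear beta schedule is assumed for illustration.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng=np.random.default_rng()):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_t) * x_0, (1 - alpha_t) * I)  (Eq. 4)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * eps
    return xt, eps
```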

The objective function of DDPM is shown below; it is a rewriting of Eq. 2 under the assumption that only the Gaussian mean is learned (the variance is fixed). It can be written as the squared difference between the predicted noise and the true noise $\epsilon_t$, weighted by coefficients $\gamma_t$; DDPM uses $\gamma_t = 1$. When $\gamma_t = 1$ this is essentially a score-matching (score-based) model, which further suggests that, to train with this objective, the forward diffusion does not have to be a Markov process, as long as the marginal distribution satisfies Eq. 4.

$$L_\gamma(\epsilon_\theta) := \sum_{t=1}^{T}\gamma_t\,\mathbb{E}_{x_0\sim q(x_0),\ \epsilon_t\sim\mathcal{N}(0,I)}\left[\left\|\epsilon_\theta^{(t)}\!\left(\sqrt{\alpha_t}\,x_0+\sqrt{1-\alpha_t}\,\epsilon_t\right)-\epsilon_t\right\|_2^2\right] \tag{5}$$
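Continuing the sketch above, one Monte-Carlo estimate of this objective with $\gamma_t = 1$ might look as follows; `eps_model(xt, t)` is a placeholder for the noise-prediction network, not an API from the paper's code.

```python
def ddpm_simple_loss(eps_model, x0, rng=np.random.default_rng()):
    """Single-sample estimate of L_gamma with gamma_t = 1 (Eq. 5)."""
    t = int(rng.integers(0, T))            # uniformly sampled timestep
    xt, eps = q_sample(x0, t, rng)         # noise x0 forward to step t (Eq. 4)
    eps_pred = eps_model(xt, t)            # network's noise prediction
    return np.mean((eps_pred - eps) ** 2)  # squared error between predicted and true noise
```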

In DDPM, $T$ is set relatively large (1000) so that $p_\theta(x_T)$ is close to the standard Gaussian $\mathcal{N}(0, I)$ and so that each noising step and each reverse step can be well approximated by a Gaussian; but this makes the sampling process very time-consuming.

The characteristics of the DDPM loss function:

  • Since the noise is sampled from $q(x_t\mid x_0)$, the loss function is determined only by $q(x_t\mid x_0)$; it can also be shown that the loss is a form of score matching
  • In other words, the loss depends only on the marginal distributions and not directly on the joint distribution, so the form of the joint distribution does not affect model training

This raises the question: is there a non-Markovian process that achieves the same kind of noising? We only need the marginals $q(x_t\mid x_0)$ to match DDPM's, while the joint $q(x_{1:T}\mid x_0)$ may differ. Can we avoid the step-by-step Markov recursion and instead use the more general conditional $q(x_t\mid x_{t-1}, x_0)$, as long as $q(x_t\mid x_0)$ stays unchanged?

The DDIM authors therefore construct a non-Markovian forward diffusion process, together with an expression for its posterior distribution, such that the marginal distribution $q(x_t\mid x_0)$ is exactly preserved.

2.2 DDIM's non-Markovian forward diffusion process

Consider a new family of distributions that introduces a parameter $\sigma$: a vector of non-negative real hyperparameters (one per step) that is not trained and can be chosen freely.

The forward diffusion process is redefined as in Eq. 6 below, and the posterior $q_\sigma(x_{t-1}\mid x_t, x_0)$ is still a Gaussian (Eq. 7).

Based on Eqs. 6 and 7, it can be proved that at every step $t$ the marginal $q_\sigma(x_t\mid x_0)$ is still the same Gaussian as in DDPM:

$$q_\sigma(x_{1:T}\mid x_0) := q_\sigma(x_T\mid x_0)\prod_{t=2}^{T} q_\sigma(x_{t-1}\mid x_t, x_0), \qquad q_\sigma(x_T\mid x_0)=\mathcal{N}\!\left(\sqrt{\alpha_T}\,x_0,\ (1-\alpha_T)\,I\right) \tag{6}$$

$$q_\sigma(x_{t-1}\mid x_t, x_0) = \mathcal{N}\!\left(\sqrt{\alpha_{t-1}}\,x_0 + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\frac{x_t-\sqrt{\alpha_t}\,x_0}{\sqrt{1-\alpha_t}},\ \sigma_t^2 I\right) \tag{7}$$

$$q_\sigma(x_t\mid x_0) = \mathcal{N}\!\left(\sqrt{\alpha_t}\,x_0,\ (1-\alpha_t)\,I\right)$$
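A small sketch of the posterior in Eq. 7, reusing the `alphas` array defined earlier (illustrative only):

```python
def q_sigma_posterior(x0, xt, t, sigma_t):
    """Mean and std of q_sigma(x_{t-1} | x_t, x_0) from Eq. 7."""
    a_t, a_prev = alphas[t], alphas[t - 1]
    # the component of x_t that points along the noise direction
    noise_dir = (xt - np.sqrt(a_t) * x0) / np.sqrt(1.0 - a_t)
    mean = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev - sigma_t**2) * noise_dir
    return mean, sigma_t
```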

Proof sketch:

By induction from $t = T$ down to $t = 1$: assume $q_\sigma(x_t\mid x_0) = \mathcal{N}(\sqrt{\alpha_t}\,x_0, (1-\alpha_t)I)$; combining this with the Gaussian conditional in Eq. 7 and marginalizing out $x_t$ gives $q_\sigma(x_{t-1}\mid x_0) = \mathcal{N}(\sqrt{\alpha_{t-1}}\,x_0, (1-\alpha_{t-1})I)$, so the marginals match DDPM's at every step (see Lemma 1 in the paper's appendix).

Summary:

  • The DDIM authors design a non-Markovian forward diffusion process whose marginal distributions are still the same as in DDPM, so the network can continue to be trained with DDPM's objective function

Comparison of DDPM and DDIM posterior probability distributions:

  • DDIM has one extra hyperparameter $\sigma$, which changes the posterior distribution and therefore the reparameterized sampling step: the distribution constructed by the neural network approximates this posterior, and the sample at step $t-1$ is drawn from it, i.e., one first computes the posterior mean and variance for the prediction and then applies the reparameterization trick with fresh noise. Different hyperparameter values give different means and variances, and hence different sampling results

2.3 Sampling from the non-Markovian reverse process

Next, define a trainable generative process $p_\theta(x_{0:T})$ in which each $p_\theta^{(t)}(x_{t-1}\mid x_t)$ is used to approximate $q_\sigma(x_{t-1}\mid x_t, x_0)$: given a noisy sample $x_t$, first predict $x_0$, then use the conditional $q_\sigma(x_{t-1}\mid x_t, x_0)$ to obtain $x_{t-1}$.

DDPM does not predict $x_0$ directly; it predicts the noise $\epsilon$.

But as Eq. 4 shows, once the noise $\epsilon$ is known, $x_0$ can be recovered from it:

$$x_0 = \frac{x_t - \sqrt{1-\alpha_t}\,\epsilon}{\sqrt{\alpha_t}}$$

So, substituting the predicted noise $\epsilon_\theta^{(t)}(x_t)$ and $x_t$ into Eq. 4 gives $f_\theta^{(t)}(x_t)$, the denoised observation given $x_t$, i.e., the prediction of $x_0$ at the current step $t$, as shown in Eq. 9:

$$f_\theta^{(t)}(x_t) := \frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}} \tag{9}$$
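In code, Eq. 9 is a single line; `eps_model` is again the placeholder noise predictor from the sketches above:

```python
def predict_x0(eps_model, xt, t):
    """f_theta^(t)(x_t): the model's estimate of x_0 at step t (Eq. 9)."""
    eps_pred = eps_model(xt, t)
    return (xt - np.sqrt(1.0 - alphas[t]) * eps_pred) / np.sqrt(alphas[t])
```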

Using the predicted $x_0$, the posterior can be taken as the approximation target. The generative conditional is defined piecewise: for $t > 1$ it is the $q_\sigma$ distribution with $x_0$ replaced by $f_\theta^{(t)}(x_t)$, and for $t = 1$ it is a Gaussian centered at $f_\theta^{(1)}(x_1)$, as shown in Eq. 10:

$$p_\theta^{(t)}(x_{t-1}\mid x_t) = \begin{cases} \mathcal{N}\!\left(f_\theta^{(1)}(x_1),\ \sigma_1^2 I\right) & \text{if } t = 1 \\[4pt] q_\sigma\!\left(x_{t-1}\mid x_t,\ f_\theta^{(t)}(x_t)\right) & \text{otherwise} \end{cases} \tag{10}$$

A special case of sampling, DDIM: $\sigma = 0$

Eq. 12 below is the reparameterized sampling step for the posterior proposed in this paper; it generates $x_{t-1}$ from $x_t$ as mean + standard deviation × noise:

$$x_{t-1} = \sqrt{\alpha_{t-1}}\underbrace{\left(\frac{x_t-\sqrt{1-\alpha_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}}\right)}_{\text{predicted } x_0} + \underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\epsilon_\theta^{(t)}(x_t)}_{\text{direction pointing to } x_t} + \underbrace{\sigma_t\,\epsilon_t}_{\text{random noise}} \tag{12}$$

Different values of $\sigma$ lead to different means and standard deviations, so the sampling results differ, but the training objective, and hence the trained model $\epsilon_\theta$, is the same: for different $\sigma$ there is no need to retrain the model; $\sigma$ only affects the sampling result.

When $\sigma_t$ takes the following value, the process degenerates into the DDPM generative process (Markov-chain generation):

$$\sigma_t = \sqrt{\frac{1-\alpha_{t-1}}{1-\alpha_t}}\sqrt{1-\frac{\alpha_t}{\alpha_{t-1}}}$$

When $\sigma_t = 0$, the random noise term vanishes and sampling becomes deterministic; this deterministic generative process is DDIM:

$$x_{t-1} = \sqrt{\alpha_{t-1}}\left(\frac{x_t-\sqrt{1-\alpha_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\alpha_t}}\right) + \sqrt{1-\alpha_{t-1}}\cdot\epsilon_\theta^{(t)}(x_t)$$
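Putting Eqs. 9 and 12 together, a minimal sketch of one reverse step is shown below. The `eta` knob follows the paper's convention (`eta = 0` gives deterministic DDIM, `eta = 1` gives the DDPM-like stochastic case), but the code itself is illustrative, not the reference implementation:

```python
def ddim_step(eps_model, xt, t, t_prev, eta=0.0, rng=np.random.default_rng()):
    """One reverse step x_t -> x_{t_prev} following Eq. 12."""
    a_t = alphas[t]
    a_prev = alphas[t_prev] if t_prev >= 0 else 1.0         # alpha at step 0 taken as 1
    eps_pred = eps_model(xt, t)

    # sigma_t controls how much fresh noise is injected (eta = 0 => deterministic DDIM)
    sigma_t = eta * np.sqrt((1.0 - a_prev) / (1.0 - a_t)) * np.sqrt(1.0 - a_t / a_prev)

    x0_pred = (xt - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)   # predicted x_0 (Eq. 9)
    dir_xt = np.sqrt(1.0 - a_prev - sigma_t**2) * eps_pred          # direction pointing to x_t
    noise = sigma_t * rng.standard_normal(xt.shape)                 # random noise term
    return np.sqrt(a_prev) * x0_pred + dir_xt + noise               # Eq. 12
```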

The author also deduces why DDIM can be trained directly using the objective function of DDPM:

Theorem 1 in the paper states that for all $\sigma > 0$ there exist $\gamma \in \mathbb{R}_{>0}^{T}$ and $C \in \mathbb{R}$ such that $J_\sigma = L_\gamma + C$, where $J_\sigma$ is the variational objective of the non-Markovian model; and since the optimum of $L_\gamma$ does not depend on $\gamma$ when the noise predictors $\epsilon_\theta^{(t)}$ do not share parameters across $t$, DDPM's objective $L_1$ can be used as a surrogate.

To summarize DDIMs:

DDIM constructs a more general non-Markovian process and, by setting the hyperparameter $\sigma$ to 0, turns generation into a deterministic sampling process.

2.4 Accelerated sampling - Respacing

The DDIM model described above by itself does not bring any acceleration. The speedup comes from a trick that can be layered on top of the model with very little loss in quality.

In DDPM there are $T$ forward steps and $T$ reverse steps, but the objective $L_1$ (i.e., $L_{\text{simple}}$) does not actually depend on the specific forward process, Markovian or not, as long as the marginal $q_\sigma(x_t\mid x_0)$ is fixed.

To speed up sampling, instead of iterating step by step over the full sequence $x_{1:T}$, the authors select a subsequence $\{x_{\tau_1}, x_{\tau_2}, \dots, x_{\tau_S}\}$ of length $S$.

The forward process on the subsequence is then defined as $q(x_{\tau_i}\mid x_0) = \mathcal{N}\!\left(\sqrt{\alpha_{\tau_i}}\,x_0,\ (1-\alpha_{\tau_i})\,I\right)$, matching the previously defined marginals, so the generation process can run directly on the subsequence. In other words, training uses the full sequence while generation uses only the subsequence; as long as the subsequence is short and quality barely degrades, sampling is accelerated. A sketch of such a respaced sampling loop is given below.
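Building on `ddim_step` above, respaced sampling just iterates over a short subsequence of timesteps. The uniform spacing chosen here is only one possible choice (other spacings, e.g. quadratic, are also used in practice); this is a sketch, not the reference code.

```python
def ddim_sample(eps_model, shape, num_steps=50, eta=0.0, rng=np.random.default_rng()):
    """Generate a sample by running ddim_step over a subsequence tau of length num_steps."""
    taus = np.linspace(0, T - 1, num_steps, dtype=int)    # uniformly spaced subsequence
    x = rng.standard_normal(shape)                        # start from x_T ~ N(0, I)
    for i in range(num_steps - 1, 0, -1):
        x = ddim_step(eps_model, x, int(taus[i]), int(taus[i - 1]), eta=eta, rng=rng)
    return ddim_step(eps_model, x, int(taus[0]), -1, eta=eta, rng=rng)  # final step to x_0
```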

(Figure from the paper: graphical model of the accelerated generation process, where sampling proceeds only over the subsequence $\{x_{\tau_1}, \dots, x_{\tau_S}\}$.)

3. Results

In the paper's experiments, DDIM achieves sample quality comparable to DDPM while using far fewer sampling steps, giving roughly a 10x~50x speedup in wall-clock time.


Origin: blog.csdn.net/jiaoyangwm/article/details/132656332