Understanding the Diffusion Model from a Unified Perspective (3)

7.『Drawbacks to Consider』

Although diffusion models have risen to prominence over the past two years, drawing the attention of industry, academia, and even the general public to text-to-image AI models, the diffusion model itself still has some drawbacks:

  1. Although the theoretical framework of the diffusion model is fairly complete and its derivations are elegant, the model remains quite unintuitive. At the very least, the process of iteratively refining a completely noisy input is far removed from how humans think.
  2. Compared with a GAN or a VAE, the latent vectors learned by a diffusion model have no semantic or structural interpretability. As mentioned above, the diffusion model can be regarded as a special MHVAE, but the latent vectors at every layer are restricted to linear Gaussian transitions, so the variation they can express is limited.
  3. The latent vector of a diffusion model must have the same dimensionality as the input, which further limits its representational capacity.
  4. Because sampling requires many iterative steps, generation with a diffusion model is often slow.

However, the research community has already proposed solutions to several of the problems above. Take the interpretability problem of diffusion models: the author has recently seen work that applies score matching directly to sampling the latent vectors of an ordinary VAE, a very natural extension, much like the flow-based VAE work of a few years ago. As for the sampling cost, this year's ICLR best paper compressed and accelerated sampling down to a few dozen steps while still producing very high-quality results.

Recently, however, there do not seem to be many applications of diffusion models in text generation, apart from a paper by Xiang Lisa Li, the author of prefix-tuning [3].

The author has not followed any other work in this direction. Concretely, applying the diffusion model directly to text generation still brings many inconveniences. For example, the size of the input must stay the same throughout the diffusion process, which means the user must decide in advance how long the generated text will be. Guided conditional generation is manageable, but training an open-domain text generation model with a diffusion model is probably not easy.

This note has focused on the inference of the diffusion model from a unified perspective. How to actually train with score matching, and how to guide a diffusion model toward the conditional distribution we want, has not been written up yet. The author plans to record and compare these in detail in the next article, which discusses some recent methods that apply diffusion models to controllable text generation.

8.『Supplement』

As for why the reverse of the Gaussian forward diffusion process is itself also a Gaussian transition, a Zhihu answer from a Tsinghua researcher [4] gives a fairly intuitive explanation. The second line approximates p_{t-1} by p_t; the third line applies a first-order Taylor expansion to log p_t(x_{t-1}) so that log p_t(x_t) can be eliminated; the fourth line substitutes the expression for q(x_t | x_{t-1}) directly. The result is again the expression of a Gaussian distribution.

[Figure: the reverse of the diffusion process is also a Gaussian distribution]
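Below is a minimal sketch of those steps written out in LaTeX, under the assumptions stated above (a small step so that p_{t-1} ≈ p_t, and a first-order Taylor expansion); the exact presentation in [4] may differ.

```latex
\begin{aligned}
q(x_{t-1}\mid x_t)
 &= \frac{q(x_t\mid x_{t-1})\,p_{t-1}(x_{t-1})}{p_t(x_t)} \\
 &\approx \frac{q(x_t\mid x_{t-1})\,p_t(x_{t-1})}{p_t(x_t)}
   \qquad\text{(approximate } p_{t-1}\approx p_t\text{)} \\
 &= q(x_t\mid x_{t-1})\,\exp\!\big(\log p_t(x_{t-1})-\log p_t(x_t)\big) \\
 &\approx q(x_t\mid x_{t-1})\,\exp\!\big((x_{t-1}-x_t)^{\top}\nabla_{x_t}\log p_t(x_t)\big)
   \qquad\text{(first-order Taylor expansion)} \\
 &\propto \exp\!\Big(-\tfrac{1}{2\beta_t}\big\|x_t-\sqrt{1-\beta_t}\,x_{t-1}\big\|^2
   +(x_{t-1}-x_t)^{\top}\nabla_{x_t}\log p_t(x_t)\Big)
\end{aligned}
```

The exponent is quadratic in x_{t-1}, so up to normalization this is again a Gaussian density in x_{t-1}.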

In both Equations 94 and 125, we model the approximate mean mu_theta so that it has the same form as the mean mu_q we derived for the true Gaussian distribution q, and we set the variance to match the variance of q. Intuitively, this kind of modeling has several benefits. On the one hand, in the analytic KL divergence between two Gaussian distributions, most of the terms cancel out, which simplifies the objective. On the other hand, both the true distribution and the approximate distribution then depend on x_t; during training our input is x_t, and since the expression has the same form as the true distribution, no information is leaked. In engineering terms, DDPM has also verified that this kind of simplification actually works. But the deeper reason why it is allowed has been explained by fairly involved mathematical proofs in a series of papers since 2021. Quoting the Tsinghua researcher's answer again [4]:

[Quote from [4]: the way DDPM simplifies the denoising Gaussian distribution actually contains a profound principle]
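To make the cancellation concrete, here is the standard closed-form KL divergence between two d-dimensional Gaussians, specialized to the case used in Equations 94 and 125 where the approximate distribution copies the variance sigma_q^2(t) I of q (that shared isotropic covariance is the only assumption here):

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(\mathcal{N}(\mu_q,\Sigma_q)\,\big\|\,\mathcal{N}(\mu_\theta,\Sigma_\theta)\big)
 &= \tfrac12\Big[\log\tfrac{|\Sigma_\theta|}{|\Sigma_q|}-d
   +\operatorname{tr}\!\big(\Sigma_\theta^{-1}\Sigma_q\big)
   +(\mu_\theta-\mu_q)^{\top}\Sigma_\theta^{-1}(\mu_\theta-\mu_q)\Big] \\
\text{with }\Sigma_\theta=\Sigma_q=\sigma_q^2(t)\,I:\qquad
 D_{\mathrm{KL}} &= \frac{1}{2\sigma_q^2(t)}\,\big\|\mu_\theta-\mu_q\big\|^2
\end{aligned}
```

The log-determinant and trace terms cancel exactly, leaving only the squared difference of the means, which is what reduces the training objective to a simple regression on the mean.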

In DDPM, the final optimization target is epsilon_t rather than epsilon_0; that is, should we predict the noise of the initial input or the noise at a particular time step? Which is right? In fact, the confusion comes from misreading the expression for x_t in terms of x_0. The successive derivation steps starting from Equation 63 all use a property of Gaussians: the sum of two independent Gaussian variables is again Gaussian, with mean equal to the sum of the means and variance equal to the sum of the variances. In applying the re-parameterization trick to express x_t, we recursively introduce a new epsilon to replace the epsilons appearing in the recursion. So the epsilon we end up with is simply one noise variable that absorbs the entire diffusion process. You can call this noise epsilon_t, or epsilon_0; to be precise it does not correspond to any single time step, so it is best just called the noise!

[Figure: the optimization objective of DDPM]
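As a small numerical sketch of that noise-merging argument (the linear noise schedule below is an arbitrary illustrative choice, not DDPM's exact one), running the forward process step by step with a fresh epsilon at every step gives the same distribution as jumping directly to x_t with a single merged noise variable:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50
betas = np.linspace(1e-4, 0.02, T)   # hypothetical noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # \bar{alpha}_t = prod_s alpha_s

x0 = rng.normal(size=100_000)        # toy scalar "data"

# Step-by-step forward diffusion: x_t = sqrt(alpha_t) x_{t-1} + sqrt(1 - alpha_t) eps
x = x0.copy()
for a in alphas:
    x = np.sqrt(a) * x + np.sqrt(1.0 - a) * rng.normal(size=x.shape)

# Closed form with one merged noise: x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps
x_jump = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * rng.normal(size=x0.shape)

# The two agree in distribution (compare the first two moments).
print(f"stepwise: mean={x.mean():+.3f}, var={x.var():.3f}")
print(f"merged:   mean={x_jump.mean():+.3f}, var={x_jump.var():.3f}")
```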

  • On the different simplified forms of the evidence lower bound: we mentioned that the second form, approximating the noise, is the modeling choice used by DDPM. The first form, approximating the initial input, is also used in practice; it is the form adopted by the controllable-text-generation paper [3] mentioned above, which directly predicts the initial word embedding at every step. For the third perspective, score matching, see Dr. Yang Song's series of papers [5], whose optimization objective takes this third form. A minimal sketch of the DDPM-style noise-prediction objective follows this list.
  • This note has focused on deriving the variational lower bound of the diffusion model. It does not cover the relationship between diffusion models and energy-based models, Langevin dynamics, stochastic differential equations, or related concepts; the author will sort out that material in another note.
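For completeness, here is a minimal sketch of the DDPM-style noise-prediction objective referred to in the first bullet above, written in PyTorch; `eps_model` and the shape handling are illustrative assumptions rather than any particular paper's implementation.

```python
import torch

def ddpm_loss(eps_model, x0, alpha_bar):
    """Simplified DDPM objective: regress the merged noise eps from (x_t, t).

    eps_model : callable (x_t, t) -> predicted noise, same shape as x_t
    x0        : batch of clean data, shape (B, ...)
    alpha_bar : 1-D tensor of cumulative products bar(alpha)_t, length T
    """
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)     # one random time step per sample
    # Broadcast bar(alpha)_t over the trailing dimensions of x0.
    a_bar = alpha_bar.to(x0.device)[t].reshape(B, *([1] * (x0.dim() - 1)))

    eps = torch.randn_like(x0)                           # the single merged noise variable
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps

    return torch.mean((eps_model(x_t, t) - eps) ** 2)    # simple MSE on the noise
```

Sampling then runs the learned eps_model backwards from pure Gaussian noise; that loop is omitted here.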

References

  1. Improving Variational Inference with Inverse Autoregressive Flow. https://arxiv.org/abs/1606.04934
  2. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. https://arxiv.org/abs/1503.03585
  3. Diffusion-LM Improves Controllable Text Generation. https://arxiv.org/abs/2205.14217
  4. "Diffusion models have recently become very popular in image generation; what do you make of their momentum starting to overtake GANs?" - answer by "I want to sing high C" on Zhihu. https://www.zhihu.com/question/536012286/answer/2533146567
  5. Score-Based Generative Modeling through Stochastic Differential Equations. https://arxiv.org/abs/2011.13456
