Diffusion model paper reading, on combining diffusion models with knowledge distillation to speed up sampling: Progressive Distillation for Fast Sampling of Diffusion Models

Paper address and code

Work from Google Research, ICLR 2022
https://arxiv.org/abs/2202.00512
Official TensorFlow code: https://github.com/google-research/google-research/tree/master/diffusion_distillation
Unofficial PyTorch code: https://github.com/lucidrains/imagen-pytorch

Quick Facts

The main problem to be solved: slow sampling from diffusion models

  • 1. Although diffusion models achieve good results, sampling is slow.
  • 2. The authors propose a progressive distillation method, as shown in the figure below:

[Figure: overview of the progressive distillation procedure]

0. Abstract

0.1 Sentence-by-sentence translation

Diffusion models have recently shown great promise for generative modeling, outperforming GANs on perceptual quality and autoregressive models at density estimation. A remaining downside is their slow sampling time: generating high quality samples takes many hundreds or thousands of model evaluations. Here we make two contributions to help eliminate this downside: First, we present new parameterizations of diffusion models that provide increased stability when using few sampling steps. Second, we present a method to distill a trained deterministic diffusion sampler, using many steps, into a new diffusion model that takes half as many sampling steps. We then keep progressively applying this distillation procedure to our model, halving the number of required sampling steps each time. On standard image generation benchmarks like CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers taking as many as 8192 steps, and are able to distill down to models taking as few as 4 steps without losing much perceptual quality; achieving, for example, a FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive distillation procedure does not take more time than it takes to train the original model, thus representing an efficient solution for generative modeling using diffusion at both train and test time.

Diffusion models have recently shown great strength as generative models, outperforming GANs in perceptual quality and autoregressive models in density estimation, but a remaining downside is their long sampling time: hundreds or thousands of model evaluations are required to generate high-quality samples. Here we make two contributions to help eliminate this shortcoming. First, we propose new parameterizations of the diffusion model that provide greater stability when using a small number of sampling steps. Second, we propose a method that distills a trained deterministic diffusion sampler, which uses many steps, into a new diffusion model that needs only half as many sampling steps. We then repeatedly apply this distillation procedure to our model, halving the required number of sampling steps each time. On standard image generation benchmarks such as CIFAR-10, ImageNet, and LSUN, we start with state-of-the-art samplers using up to 8192 steps and are able to distill them into models that require as few as 4 steps without losing much perceptual quality; for example, on CIFAR-10 we achieve FID = 3.0 in 4 steps. Finally, we show that the full progressive distillation procedure takes no more time than training the original model, so it is an efficient solution for generative modeling with diffusion at both training and test time.

Summary

There are two main contributions:

  • 1. New parameterizations of the diffusion model that remain stable when only a few sampling steps are used.
  • 2. A knowledge distillation method, progressive distillation, that repeatedly halves the number of sampling iterations needed (a quick check on the numbers follows this list).
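
To get a feel for the second point, using the numbers quoted in the abstract: going from 8192 sampling steps down to 4 by repeated halving takes log2(8192/4) = 11 distillation rounds. A quick check:

```python
import math

start_steps, target_steps = 8192, 4                 # numbers quoted in the abstract
rounds = int(math.log2(start_steps / target_steps)) # each distillation round halves the step count
print(rounds)                                        # 11
```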

1.INTRODUCTION

1.1 Sentence-by-sentence translation

First paragraph (diffusion models achieve very good results across many tasks)

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are an emerging class of generative models that has recently delivered impressive results on many standard generative modeling benchmarks. These models have achieved ImageNet generation results outperforming BigGAN-deep and VQ-VAE-2 in terms of FID score and classification accuracy score (Ho et al., 2021; Dhariwal & Nichol, 2021), and they have achieved likelihoods outperforming autoregressive image models (Kingma et al., 2021; Song et al., 2021b). They have also succeeded in image super-resolution (Saharia et al., 2021; Li et al., 2021) and image inpainting (Song et al., 2021c), and there have been promising results in shape generation (Cai et al., 2020), graph generation (Niu et al., 2020), and text generation (Hoogeboom et al., 2021; Austin et al., 2021).
Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are an emerging class of generative models that have recently achieved impressive results on several standard generative modeling benchmarks. These models achieve ImageNet generation results that are better than BigGAN-deep and VQ-VAE-2 in terms of FID score and classification accuracy score (Ho et al., 2021; Dhariwal & Nichol, 2021), and achieve better likelihoods than autoregressive image models (Kingma et al., 2021; Song et al., 2021b). They have also been applied successfully to image super-resolution (Saharia et al., 2021; Li et al., 2021) and image inpainting (Song et al., 2021c), and promising results have been obtained in areas such as shape generation (Cai et al., 2020), graph generation (Niu et al., 2020), and text generation (Hoogeboom et al., 2021; Austin et al., 2021).

Second paragraph (raises the problem of slow sampling in diffusion models)

A major barrier remains to practical adoption of diffusion models: sampling speed. While sampling can be accomplished in relatively few steps in strongly conditioned settings, such as text-to-speech (Chen et al., 2021) and image super-resolution (Saharia et al., 2021), or when guiding the sampler using an auxiliary classifier (Dhariwal & Nichol, 2021), the situation is substantially different in settings in which there is less conditioning information available. Examples of such settings are unconditional and standard class-conditional image generation, which currently require hundreds or thousands of steps using network evaluations that are not amenable to the caching optimizations of other types of generative models (Ramachandran et al., 2017).

However, a major obstacle to the practical adoption of diffusion models remains: sampling speed. In strongly conditioned settings such as text-to-speech (Chen et al., 2021) and image super-resolution (Saharia et al., 2021), or when the sampler is guided by an auxiliary classifier (Dhariwal & Nichol, 2021), sampling can be done in relatively few steps, but the situation is quite different when less conditioning information is available. Examples are unconditional and standard class-conditional image generation, which currently require hundreds or thousands of network evaluations, and these evaluations cannot benefit from the caching optimizations available to other types of generative models (Ramachandran et al., 2017).

Third paragraph (the authors present their approach)

In this paper, we reduce the sampling time of diffusion models by orders of magnitude in unconditional and class-conditional image generation, which represent the setting in which diffusion models have been slowest in previous work. We present a procedure to distill the behavior of a N-step DDIM sampler (Song et al., 2021a) for a pretrained diffusion model into a new model with N/2 steps, with little degradation in sample quality. In what we call progressive distillation, we repeat this distillation procedure to produce models that generate in as few as 4 steps, still maintaining sample quality competitive with state-of-the-art models using thousands of steps.

In this paper, we reduce the sampling time of the diffusion model by orders of magnitude for unconditional and class-conditional image generation, the settings in which diffusion models have been slowest in previous work. We propose a procedure that distills the behavior of an N-step DDIM sampler (Song et al., 2021a) of a pretrained diffusion model into a new model that uses only N/2 steps, with little loss in sample quality. We call this progressive distillation, and by repeatedly applying it we produce models that generate images in as few as 4 steps while maintaining sample quality comparable to state-of-the-art models that use thousands of steps.

[Figure 1 from the paper; caption below]

Figure caption

Figure 1: A visualization of two iterations of our proposed progressive distillation algorithm. A sampler f(z;η), mapping random noise ε to samples x in 4 deterministic steps, is distilled into a new sampler f(z;θ) taking only a single step. The original sampler is derived by approximately integrating the probability flow ODE for a learned diffusion model, and distillation can thus be understood as learning to integrate in fewer steps, or amortizing this integration into the new sampler.

Figure 1: A visualization of two iterations of our proposed progressive distillation algorithm. A sampler f(z;η), which maps random noise ε to samples x in 4 deterministic steps, is distilled into a new sampler f(z;θ) that takes only a single step. The original sampler is obtained by approximately integrating the probability flow ODE of a learned diffusion model, so distillation can be understood as learning to integrate in fewer steps, or as amortizing this integration into the new sampler.
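
To make the "deterministic steps" of the sampler f(z;η) in the caption concrete, here is a minimal sketch of one DDIM update in (α, σ) notation, assuming an x-prediction network; `model`, `alpha_*`, and `sigma_*` are placeholders of my own, not the paper's code.

```python
import torch

def ddim_step(model, z_t, t, alpha_t, sigma_t, alpha_s, sigma_s):
    """One deterministic DDIM step from time t to an earlier time s < t (sketch).

    Assumes `model(z_t, t)` predicts the clean image x (x-prediction) and that the
    schedule is variance preserving, i.e. alpha^2 + sigma^2 = 1.
    """
    x_hat = model(z_t, t)                         # predicted clean image
    eps_hat = (z_t - alpha_t * x_hat) / sigma_t   # noise implied by that prediction
    return alpha_s * x_hat + sigma_s * eps_hat    # deterministic move toward time s
```

Running this update N times, starting from z_1 ~ N(0, I), is the N-step deterministic sampler that progressive distillation compresses.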

1.2 Summary

  • 1. Although diffusion models achieve good results, sampling is slow.
  • 2. The authors propose a progressive distillation method, as shown in the figure below:

[Figure: progressive distillation overview (same figure as in Quick Facts)]

3 PROGRESSIVE DISTILLATION

First paragraph (a brief introduction to reducing the number of sampling steps by distillation)

To make diffusion models more efficient at sampling time, we propose progressive distillation: an algorithm that iteratively halves the number of required sampling steps by distilling a slow teacher diffusion model into a faster student model. Our implementation of progressive distillation stays very close to the implementation for training the original diffusion model, as described by e.g. Ho et al. (2020). Algorithm 1 and Algorithm 2 present diffusion model training and progressive distillation side-by-side, with the relative changes in progressive distillation highlighted in green.
To improve the efficiency of diffusion models at sampling time, we propose progressive distillation: an algorithm that iteratively halves the number of required sampling steps by distilling a slow teacher diffusion model into a faster student model. Our implementation of progressive distillation stays very close to the implementation used to train the original diffusion model, as described for example in Ho et al. (2020). Algorithm 1 and Algorithm 2 show diffusion model training and progressive distillation side by side, with the changes introduced by progressive distillation highlighted in green.

[Figure: Algorithm 1 (diffusion model training) and Algorithm 2 (progressive distillation) shown side by side]
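
The algorithm listings themselves are not reproduced here. As a rough stand-in for the Algorithm 1 side, one iteration of ordinary denoising training might look like the sketch below (x-prediction, unweighted loss for simplicity; `model`, `alpha`, and `sigma` are placeholder names). Per the text, Algorithm 2 keeps this structure and mainly changes the sampled times to a discrete grid and swaps the target x for the two-step teacher target x̃ described in the next paragraph.

```python
import torch

def standard_training_step(model, x, alpha, sigma):
    """One iteration of ordinary denoising training (Algorithm 1 side, simplified sketch).

    x: batch of clean images, shape (B, C, H, W); alpha(t)/sigma(t): noise-schedule values.
    The model predicts the clean image x from the noisy input z_t (x-prediction).
    Loss weighting is omitted for brevity.
    """
    t = torch.rand(x.shape[0])                      # random time in [0, 1] per example
    a = alpha(t).view(-1, 1, 1, 1)                  # broadcast schedule over image dims
    s = sigma(t).view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    z_t = a * x + s * eps                           # noisy data
    x_pred = model(z_t, t)
    return ((x_pred - x) ** 2).mean()               # regress toward the clean data
```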

Second paragraph (how the distillation target x̃ is constructed)

We start the progressive distillation procedure with a teacher diffusion model that is obtained by training in the standard way. At every iteration of progressive distillation, we then initialize the student model with a copy of the teacher, using both the same parameters and same model definition. Like in standard training, we then sample data from the training set and add noise to it, before forming the training loss by applying the student denoising model to this noisy data z_t. The main difference in progressive distillation is in how we set the target for the denoising model: instead of the original data x, we have the student model denoise towards a target x̃ that makes a single student DDIM step match 2 teacher DDIM steps. We calculate this target value by running 2 DDIM sampling steps using the teacher, starting from z_t and ending at z_{t−1/N}, with N being the number of student sampling steps. By inverting a single step of DDIM, we then calculate the value the student model would need to predict in order to move from z_t to z_{t−1/N} in a single step, as we show in detail in Appendix G. The resulting target value x̃(z_t) is fully determined given the teacher model and starting point z_t, which allows the student model to make a sharp prediction when evaluated at z_t. In contrast, the original data point x is not fully determined given z_t, since multiple different data points x can produce the same noisy data z_t: this means that the original denoising model is predicting a weighted average of possible x values, which produces a blurry prediction. By making sharper predictions, the student model can make faster progress during sampling.

We start the progressive distillation process from a teacher diffusion model obtained by standard training. In each iteration of progressive distillation, we initialize the student model as a copy of the teacher, with the same parameters and the same model definition. As in standard training, we then sample data from the training set, add noise to it, and form the training loss by applying the student denoising model to this noisy data z_t. The main difference in progressive distillation lies in how the target of the denoising model is set: instead of the original data x, we let the student model denoise toward a target x̃ chosen so that a single student DDIM step matches 2 teacher DDIM steps. We compute this target by running 2 DDIM sampling steps with the teacher, starting at z_t and ending at z_{t−1/N}, where N is the number of student sampling steps. By inverting a single DDIM step, we then compute the prediction the student model would need in order to move from z_t to z_{t−1/N} in one step; the details are given in Appendix G. The resulting target value x̃(z_t) is fully determined given the teacher model and the starting point z_t, which allows the student model to make a sharp prediction when evaluated at z_t. In contrast, the original data point x is not fully determined given z_t, because multiple different data points x can produce the same noisy data z_t: this means the original denoising model predicts a weighted average of possible x values, which yields a blurry prediction. By making sharper predictions, the student model can make faster progress during sampling.
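
A minimal sketch of how the target x̃(z_t) described above could be computed, reusing the `ddim_step` helper sketched after Figure 1; `teacher` is the frozen x-prediction teacher, N is the student's number of sampling steps, and the closed-form inversion of a single student DDIM step follows the description here and the paper's Appendix G. Treat this as an illustration under those assumptions, not the reference implementation.

```python
import torch

@torch.no_grad()
def distillation_target(teacher, z_t, t, N, alpha, sigma):
    """Target x~(z_t): one student DDIM step from z_t should land where two teacher steps do.

    t is a single scalar time here for clarity; alpha/sigma map a time in [0, 1] to
    schedule values. Assumes the `ddim_step` helper defined earlier in this post.
    """
    t_mid, t_next = t - 0.5 / N, t - 1.0 / N

    # Two teacher DDIM steps: z_t -> z_{t - 0.5/N} -> z_{t - 1/N}
    z_mid = ddim_step(teacher, z_t, t, alpha(t), sigma(t), alpha(t_mid), sigma(t_mid))
    z_next = ddim_step(teacher, z_mid, t_mid, alpha(t_mid), sigma(t_mid),
                       alpha(t_next), sigma(t_next))

    # Invert a single student DDIM step: solve for the x-prediction that would move
    # z_t to z_next in one step (closed form, cf. Appendix G of the paper).
    ratio = sigma(t_next) / sigma(t)
    return (z_next - ratio * z_t) / (alpha(t_next) - ratio * alpha(t))
```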

Third paragraph (the procedure is applied recursively; the student becomes the new teacher)

After running distillation to learn a student model taking N sampling steps, we can repeat the procedure with N/2 steps: The student model then becomes the new teacher, and a new student model is initialized by making a copy of this model.

After running distillation to learn a student model that takes N sampling steps, we can repeat the procedure with N/2 steps: the student model becomes the new teacher, and a new student model is initialized as a copy of it.
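
Putting the pieces together, the outer loop could be sketched as below; `distill_one_round(student, teacher, N)` stands for the inner training loop of Algorithm 2 (training the student on targets like `distillation_target` above with N student steps) and is assumed, not shown.

```python
import copy

def progressive_distillation(teacher, start_steps, end_steps, distill_one_round):
    """Repeatedly halve the number of sampling steps (outer loop sketch)."""
    num_steps = start_steps
    while num_steps > end_steps:
        num_steps //= 2                             # the student uses half as many steps
        student = copy.deepcopy(teacher)            # same parameters and architecture
        student = distill_one_round(student, teacher, num_steps)
        teacher = student                           # the student becomes the new teacher
    return teacher, num_steps
```

Each pass through the loop is one iteration of progressive distillation as visualized in Figure 1.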

Fourth paragraph (I don't really understand the choice of α1 = 0 here; I need to look at the code)

Unlike our procedure for training the original model, we always run progressive distillation in discrete time: we sample this discrete time such that the highest time index corresponds to a signal-to-noise ratio of zero, i.e. α1 = 0, which exactly matches the distribution of input noise z1 ∼ N(0, I) that is used at test time. We found this to work slightly better than starting from a non-zero signal-to-noise ratio as used by e.g. Ho et al. (2020), both for training the original model as well as when performing progressive distillation.

Unlike our procedure for training the original model, we always perform progressive distillation in discrete time: we sample the discrete times such that the highest time index corresponds to a signal-to-noise ratio of zero, i.e. α1 = 0, which exactly matches the distribution of the input noise z1 ∼ N(0, I) used at test time. We found this to work slightly better than starting from a non-zero signal-to-noise ratio as used, for example, by Ho et al. (2020), both for training the original model and for progressive distillation.
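
One concrete way to read the α1 = 0 condition: with a cosine schedule α_t = cos(½πt), σ_t = sin(½πt) (a common choice), the highest time t = 1 gives α = 0 and σ = 1, so the most heavily noised training input z_1 = α_1 x + σ_1 ε = ε is exactly the pure Gaussian noise the sampler starts from at test time. A small illustrative check on a 4-step discrete grid (the grid construction here is my own illustration, not the paper's code):

```python
import math

def cosine_schedule(t):
    """alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2); at t = 1 this gives alpha = 0."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

N = 4                                        # number of discrete sampling steps
for i in range(1, N + 1):
    t = i / N                                # discrete grid whose highest time is exactly t = 1
    a, s = cosine_schedule(t)
    snr = (a / s) ** 2                       # signal-to-noise ratio; 0 at t = 1
    print(f"t={t:.2f}  alpha={a:.3f}  sigma={s:.3f}  SNR={snr:.4f}")
```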

Origin blog.csdn.net/qq_43210957/article/details/129948059