Notes on "High-Resolution Image Synthesis with Latent Diffusion Models"

I am just getting started, so I may not fully understand the paper if I read it on my own.

What is the source of the problem in this paper?

Since diffusion models usually operate directly in pixel space, optimizing a powerful DM typically consumes hundreds of GPU days, and inference is also expensive due to sequential evaluation.

What problem does this paper solve, and what functionality does it implement?

It addresses this resource problem: it reduces computational requirements while reaching a near-optimal point between complexity reduction and detail preservation, greatly improving visual fidelity. The proposed latent diffusion models (LDMs) achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis, and highly competitive performance on a variety of tasks, including text-to-image synthesis, unconditional image generation, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

How is it solved?

To train DMs on limited computational resources while preserving their quality and flexibility, the paper circumvents this shortcoming by explicitly separating the compressive learning phase from the generative learning phase (see Fig. 2 in the paper). To achieve this, it leverages an autoencoder that learns a space which is perceptually equivalent to the image space but offers significantly reduced computational complexity.
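The two-stage idea can be sketched in a few lines. This is a toy stand-in, not the paper's learned VAE: `encode`/`decode` here are simple average pooling and nearest-neighbour upsampling, used only to show how a downsampling factor `f` shrinks the space the diffusion model would operate in.

```python
import numpy as np

def encode(image: np.ndarray, f: int = 8) -> np.ndarray:
    """Crude stand-in for the learned encoder: average-pool by factor f."""
    h, w, c = image.shape
    return image.reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))

def decode(latent: np.ndarray, f: int = 8) -> np.ndarray:
    """Crude stand-in for the learned decoder: nearest-neighbour upsampling."""
    return latent.repeat(f, axis=0).repeat(f, axis=1)

image = np.random.rand(256, 256, 3)
z = encode(image)            # stage 2 (the DM) would be trained on z, not on image
print(z.shape)               # (32, 32, 3)
print(decode(z).shape)       # (256, 256, 3)
```

In the paper the encoder is trained first (with perceptual and adversarial losses) and then frozen, so every denoising step of the diffusion model runs in the small latent space.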

By introducing cross-attention layers into the model architecture, the authors turn the diffusion model into a powerful and flexible generator for general conditioning inputs such as text or bounding boxes, and high-resolution synthesis can be performed in a convolutional manner.
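The conditioning mechanism is standard scaled dot-product cross-attention: queries come from the (flattened) latent, keys and values from the conditioning tokens. A minimal single-head sketch, with random projection matrices standing in for the learned ones:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, cond_tokens, d_k=64, rng=None):
    """Single-head cross-attention: Q from the latent, K/V from the
    conditioning (e.g. text embeddings). Projections are random here;
    in the actual model they are learned weights."""
    rng = rng or np.random.default_rng(0)
    d_lat = latent_tokens.shape[-1]
    d_cond = cond_tokens.shape[-1]
    W_q = rng.standard_normal((d_lat, d_k)) / np.sqrt(d_lat)
    W_k = rng.standard_normal((d_cond, d_k)) / np.sqrt(d_cond)
    W_v = rng.standard_normal((d_cond, d_k)) / np.sqrt(d_cond)
    Q, K, V = latent_tokens @ W_q, cond_tokens @ W_k, cond_tokens @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # each latent token attends to cond tokens
    return attn @ V

z = np.random.rand(1024, 32)   # a 32x32 latent flattened into 1024 tokens
c = np.random.rand(77, 48)     # e.g. 77 text-encoder tokens
print(cross_attention(z, c).shape)   # (1024, 64)
```

Because the conditioning enters only through K and V, the same UNet backbone works for text, layouts, or other modalities by swapping the encoder that produces `cond_tokens`.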

Advantages of this approach
This approach offers several advantages:
(i) By leaving the high-dimensional image space, we obtain a computationally more efficient DM since the sampling is performed on the low-dimensional space.
(ii) We exploit the inductive bias of DMs inherited from their UNet architecture [71], which makes them particularly effective for data with spatial structure, thus alleviating the need for the aggressive, quality-reducing levels of compression required by previous methods.
(iii) Finally, the compression model is general-purpose: its latent space can be used to train multiple generative models, and it also supports other downstream applications such as single-image CLIP-guided synthesis.
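Advantage (i) is easy to quantify with back-of-envelope arithmetic. Assuming a downsampling factor of 8 and a 4-channel latent (the configuration commonly used with these models; the exact channel count is an assumption here), every denoising step touches far fewer elements than a pixel-space DM would:

```python
# Elements per tensor for a 512x512 RGB image vs. its f=8, 4-channel latent.
pixel_elems  = 512 * 512 * 3                 # pixel-space tensor size
latent_elems = (512 // 8) * (512 // 8) * 4   # latent-space tensor size
print(pixel_elems / latent_elems)            # 48.0, i.e. ~48x fewer elements
```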


Origin: blog.csdn.net/qq_45560230/article/details/130760685