Stable Diffusion (LDM): An Image Generation Model

1 Introduction

This article is a translation and summary (April 2022) of "High-Resolution Image Synthesis with Latent Diffusion Models". Paper: https://arxiv.org/pdf/2112.10752.pdf . Source code: GitHub - CompVis/latent-diffusion: High-Resolution Image Synthesis with Latent Diffusion Models .

Previous diffusion models (DMs) operated directly at the pixel level, which required hundreds of GPU days of training. Our latent diffusion models (LDMs) greatly reduce computational cost while preserving detail and achieving fidelity close to the state of the art.

Our latent diffusion models (LDMs) approach is a two-stage model: first compress the image into a latent representation to reduce computational complexity, then feed that latent into the diffusion model.

As shown in the figure below, our perceptual image compression loses little semantic information while reducing the amount of computation.

2 Related Work

Image generation models

  • The results of GAN models are largely limited to the datasets used for comparison, because the adversarial learning process does not easily scale to modeling complex, multimodal distributions. Although GANs can generate high-resolution images, they are difficult to optimize and struggle to capture the full data distribution.
  • Variational autoencoders (VAEs) and flow-based models can efficiently synthesize high-resolution images, but their sample quality is not as good as GANs.
  • Autoregressive models (ARMs) perform strongly in density estimation, but their computationally demanding architectures and sequential sampling process restrict them to low-resolution images. Because the pixel-level representation of an image contains almost imperceptible, high-frequency details, maximum-likelihood training spends a great deal of capacity modeling them, resulting in long training times. To scale to higher resolutions, some two-stage methods use ARMs to model a compressed latent representation of the image instead of the raw pixels.
  • The diffusion model is a likelihood-based model. Likelihood-based methods emphasize good density estimation, which gives them more stable training behavior and good coverage of the data distribution.

Two-stage image generation

VQ-VAEs use an autoregressive model (ARM) to learn an expressive prior over a discretized latent space.

Our latent diffusion models (LDMs) approach is likewise a two-stage model.

3 Methods

Our latent diffusion models (LDMs) are two-stage. The first part is the left half of the figure below (red): it compresses the image into a latent representation, which reduces computational complexity. The second part is the diffusion model (diffusion and denoising), the green part in the middle. In addition, a cross-attention mechanism is introduced (right half of the figure), which lets text or image drafts influence the diffusion model so that we can generate the images we want, for example images conditioned on text.
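To make the two stages concrete, here is a minimal toy sketch in PyTorch. All module choices, shapes, and the denoising update below are illustrative assumptions, not the actual code from the CompVis repository:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the stages; in the real model the encoder/decoder come from a
# pretrained autoencoder (VAE/VQGAN) and the denoiser is a time-conditioned UNet
# with cross-attention.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)     # 512x512x3 -> 64x64x4 latent
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)
denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)   # plays the role of eps_theta

x = torch.randn(1, 3, 512, 512)          # a stand-in "image"
z = encoder(x)                           # stage 1: compress into the latent space
print(z.shape)                           # torch.Size([1, 4, 64, 64])

# Stage 2: diffusion and denoising happen entirely in latent space.
for t in reversed(range(10)):            # toy reverse-diffusion loop
    eps = denoiser(z)                    # predict the noise at step t
    z = z - 0.1 * eps                    # placeholder update; the real step follows DDPM/DDIM

x_hat = decoder(z)                       # map the denoised latent back to pixels
print(x_hat.shape)                       # torch.Size([1, 3, 512, 512])
```

The key point is that the expensive iterative loop runs on a 64x64 latent instead of 512x512 pixels, which is where the savings in GPU time come from.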

3.1 Perceptual Image Compression

This subsection covers the left half (red) of the model above.

To avoid arbitrarily high-variance latent spaces, we experiment with two kinds of regularization. The first is KL-reg, which imposes a slight KL penalty toward a standard normal on the learned latent, similar to a VAE. The other is VQ-reg, which uses a vector quantization layer inside the decoder.
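As a rough illustration of the two options (the shapes, codebook size, and sampling below are made-up assumptions, not the paper's actual layers):

```python
import torch

h = torch.randn(1, 8, 64, 64)                 # stand-in encoder output features

# KL-reg: interpret the encoder output as mean/log-variance of a Gaussian and add
# a small KL(q || N(0, I)) penalty, as in a VAE.
mean, logvar = h.chunk(2, dim=1)
kl = 0.5 * torch.mean(mean.pow(2) + logvar.exp() - 1.0 - logvar)
z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()

# VQ-reg: snap each latent vector to its nearest entry in a learned codebook
# (in the paper this quantization layer lives inside the decoder).
codebook = torch.randn(512, 4)                           # 512 codes of dimension 4
flat = z.permute(0, 2, 3, 1).reshape(-1, 4)              # one vector per spatial position
idx = torch.cdist(flat, codebook).argmin(dim=1)          # nearest code per vector
z_q = codebook[idx].reshape(1, 64, 64, 4).permute(0, 3, 1, 2)
```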

This encoder/decoder only needs to be trained once and can then be reused when training different DM models.

3.2 Latent Diffusion Models

This subsection covers the middle part (green) of the model above.

  • The objective function of the diffusion model, expressed in terms of the latent variables, is as follows:

    $L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\big[\, \lVert \epsilon - \epsilon_\theta(z_t, t) \rVert_2^2 \,\big]$
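A sketch of one training step under this objective. The noise schedule, shapes, and the toy eps_theta below are assumptions; the real eps_theta is a timestep-conditioned UNet, and the timestep input is omitted here for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

eps_theta = nn.Conv2d(4, 4, kernel_size=3, padding=1)   # stand-in for the UNet eps_theta(z_t, t)
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)      # made-up noise schedule

z = torch.randn(8, 4, 64, 64)                 # a batch of latents z = E(x)
t = torch.randint(0, 1000, (8,))              # a random diffusion step per sample
eps = torch.randn_like(z)                     # the noise the model should predict

a = alphas_cumprod[t].view(-1, 1, 1, 1)
z_t = a.sqrt() * z + (1 - a).sqrt() * eps     # forward diffusion: noisy latent z_t

loss = F.mse_loss(eps_theta(z_t), eps)        # || eps - eps_theta(z_t, t) ||^2
loss.backward()
```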

3.3 Conditioning Mechanism (Cross-Attention)

We achieve flexible control over image generation by introducing cross-attention into the UNet backbone of the DM. This attention-based conditioning can be learned effectively for different input modalities (e.g. text or semantic layouts), as sketched below.
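A minimal single-head version of such a cross-attention layer might look like this; the dimensions, token counts, and the class itself are illustrative assumptions, not the repository's implementation:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Toy single-head cross-attention: queries come from the UNet feature map,
    keys/values come from the conditioning encoder tau_theta(y), e.g. a text encoder."""
    def __init__(self, dim=64, ctx_dim=128):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v = nn.Linear(ctx_dim, dim, bias=False)

    def forward(self, feat, ctx):
        # feat: (B, N, dim) flattened UNet activations; ctx: (B, M, ctx_dim) condition tokens
        q, k, v = self.to_q(feat), self.to_k(ctx), self.to_v(ctx)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                       # softmax(Q K^T / sqrt(d)) V

# Example: a 32x32 latent feature map attending to 77 text tokens (shapes are illustrative).
out = CrossAttention()(torch.randn(2, 32 * 32, 64), torch.randn(2, 77, 128))
print(out.shape)  # torch.Size([2, 1024, 64])
```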

With a conditioning input y (processed by a domain-specific encoder τ_θ), the final objective function becomes:

$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\big[\, \lVert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \rVert_2^2 \,\big]$

4 Experiments

4.1 Perceptual Compression Tradeoffs

For the encoder downsampling factor we take f ∈ {1, 2, 4, 8, 16, 32}, and LDM-f denotes the corresponding model. LDM-1 means no compression, i.e. the original pixel-based DM.
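For intuition, the spatial size of the latent for each factor is simple arithmetic (a 256x256 input is assumed here; the latent channel count depends on the trained autoencoder and is ignored):

```python
for f in [1, 2, 4, 8, 16, 32]:
    side = 256 // f
    print(f"LDM-{f}: {side}x{side} latent ({side * side / 256**2:.2%} of the pixel positions)")
```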

From the figure below, it can be seen that LDM-4 and LDM-8 offer the best trade-off for synthesizing high-quality images.

4.2 Image Generation

As shown below, the LDM model performs well.

The LDM also has relatively few parameters: 1.45B (1.45 billion).

4.3 Conditional Generation

As shown in the figure below, we can generate a large, high-resolution image from the spatial-layout draft in the upper left corner.

The figure below shows images generated from text; the results are quite good.

4.4 High-Resolution Generation

We can also generate high-resolution images from low-resolution ones (super-resolution), as shown in the middle part of the figure below.

4.5 Image Restoration (Inpainting)

Parts of an image can be restored (inpainting). The figure below shows the effect of removing objects from an image.

Original article: blog.csdn.net/zephyr_wang/article/details/130270026