Stable Diffusion (diffusion model) + Consistency Models

1 The transition from GAN to Stable Diffusion

With the continuous accumulation of artificial intelligence technologies in image generation, text generation and multi-modal generation, the main families of generative models include the generative adversarial network (GAN), the variational autoencoder (VAE), normalizing flow models, the autoregressive model (AR), energy-based models (EBM), and the diffusion model (Diffusion Model) that has taken off in recent years.

  • GAN: additional discriminator
  • VAE: aligning the posterior distribution
  • EBM (energy-based models): handling the partition function
  • Normalizing flow: imposing network constraints

In the generative field, GAN has become a bit outdated, and Stable Diffusion has taken its place.

  • GAN needs to train two networks, which makes training harder, prone to non-convergence, and weak in diversity, since the generator only has to fool the discriminator.
  • The Diffusion Model explains how a generative model learns and generates in a simpler way, and in practice it also feels simpler.

Representative examples: DALL·E 2 (built on the CLIP multimodal image-text model) and the Stable Diffusion diffusion model.

2 History from DDPM to Stable Diffusion

2.1 DDPM

Diffusion models are a class of generative models that generate images directly from random noise. [Diffusion Model DDPM: Denoising Diffusion Probabilistic Models]

Idea: train a noise estimation model and restore random input noise into an image; the noise serves as the label, and during restoration the model generates the corresponding image from the noise.

Training process: randomly sample noise $\epsilon$; over $N$ steps, the noise is gradually diffused into the original input image $x_0$, and the corrupted image is $x_n$. The model learns to estimate the noise of the corrupted image, $\epsilon_\theta(x_n, n)$, with an L2 loss constraining its distance to the original input noise $\epsilon$.

Inference process: input noise and restore it to an image through the noise estimation model.
Summary: how does the diffusion model work?
Training process (forward diffusion process): gradually add Gaussian noise to an image until the image becomes random noise.
Inference process (reverse generation process): start from random noise and gradually remove the noise until an image is generated.


Important formula of the forward diffusion process:
$x_t$ is the image distribution at time $t$ and $z_i$ is noise; from the initial distribution $x_0$ and the noise $z_i$, we perform $N$ diffusion steps to obtain the final noisy image $x_n$.
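For reference, in place of the original figures, the standard DDPM closed-form forward step (with $\epsilon$ denoting the sampled noise, matching the training description above) is:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}), \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s),$$

where $\beta_s$ is the noise (variance) schedule at step $s$.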

Important formula of the reverse generation process:
learn the noise prediction model $\epsilon_\theta(x_n, n)$; randomly sample an initial noise $x_n$ and, through this model, perform $N$ denoising steps to restore the image $x_0$.
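For reference, the standard DDPM reverse (denoising) update, written with the noise prediction $\epsilon_\theta$ defined above, is:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, \mathbf{I}),$$

where $\alpha_t = 1 - \beta_t$ and $\sigma_t$ controls the noise added at each step (no noise is added at the final step).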

The U-Net predicts the noise $Z_t$ at each denoising step.


The key to why Diffusion works:
it is a latent-variable model in which both the forward and reverse processes are parameterized Markov chains, modeled and solved with variational inference.
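To make the training and inference loops concrete, here is a minimal, simplified PyTorch sketch (not from the original post). `model` stands for any noise-prediction network, such as a U-Net, that takes a noisy image and a timestep; the linear beta schedule is just one common choice.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)         # linear noise schedule beta_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)     # cumulative product \bar{alpha}_t

def training_step(model, x0):
    """One DDPM training step: corrupt x0 with noise, then regress the noise with L2."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    ab = alpha_bars.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps        # closed-form forward diffusion
    return F.mse_loss(model(x_t, t), eps)               # L2 between true and predicted noise

@torch.no_grad()
def sample(model, shape, device="cpu"):
    """Reverse process: start from pure noise and denoise for T steps."""
    x = torch.randn(shape, device=device)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_hat = model(x, t_batch)
        a, ab, b = alphas[t], alpha_bars[t], betas[t]
        x = (x - (1 - a) / (1 - ab).sqrt() * eps_hat) / a.sqrt()
        if t > 0:
            x = x + b.sqrt() * torch.randn_like(x)       # add noise except at the last step
    return x
```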

2.2 Stable Diffusion

The biggest problem with the diffusion model is that it is extremely "expensive" in terms of both time and compute. Stable Diffusion emerged to solve this problem. If we want to generate an image of size 1024 × 1024, the U-Net has to work with noise of size 1024 × 1024 and generate the image from it. The amount of computation for a single diffusion step is already very large, let alone iterating it many times. One solution is to generate at a smaller resolution first, and then use an additional neural network to produce the larger-resolution image (super-resolution diffusion).

Latent Space
Latent space is simply a compressed representation of the data. Compression means encoding information with fewer bits than the original representation. For example, we can use a single grayscale channel (black, white and gray) to represent a picture originally composed of the three RGB channels; the color vector of each pixel then shrinks from 3 dimensions to 1. Dimensionality reduction loses some information, but in some cases that is not a bad thing: it lets us filter out the less important information and keep only what matters most.
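As a toy illustration of this kind of compression (the 4 × 4 image and the luminance weights are made up for the example, not from the original post):

```python
import numpy as np

rgb = np.random.rand(4, 4, 3)              # toy RGB image: 4x4 pixels, 3 channels per pixel
weights = np.array([0.299, 0.587, 0.114])  # standard luminance weights for grayscale
gray = rgb @ weights                       # each 3-d color vector collapses to 1 value
print(rgb.size, "->", gray.size)           # 48 values compressed to 16
```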

Suppose we train an image classification model with a convolutional neural network. When we say the model is "learning", we mean it is learning the specific properties captured at each layer of the network, such as edges, corners, shapes, and so on. Whenever the model learns from data (existing images), the image is first reduced in size and then restored to its original size: the model uses a decoder to reconstruct the image from the compressed data while retaining all the relevant information learned earlier. The space is made smaller precisely so that the most important attributes are extracted and preserved. This is why latent spaces are well suited to diffusion models.
Latent Diffusion
"Latent Diffusion Model" (Latent Diffusion Model) combines the perception ability of GAN, the detail preservation ability of diffusion model and the semantic ability of Transformer to create a more robust and efficient generation model than all the above models. Compared with other methods, Latent Diffusion not only saves memory, but also the generated images maintain diversity and high detail, while the images also preserve the semantic structure of the data.

Any generative learning method has two main phases: perceptual compression and semantic compression.

Perceptual Compression
In the perceptual compression learning phase, the learning method must remove high-frequency details and encapsulate the data into an abstract representation. This step is necessary to build a stable, robust representation of the environment. GANs are good at perceptual compression, which they achieve by projecting high-dimensional, redundant data from pixel space into a latent space. A latent vector in this space is a compressed form of the original pixel image and can effectively stand in for it. More concretely, perceptual compression is captured by an autoencoder structure: the encoder projects high-dimensional data into the latent space, and the decoder recovers the image from the latent space.

Semantic Compression
In the second stage of learning, an image generation method must be able to capture the semantic structure present in the data. This conceptual and semantic structure preserves the context and interrelationships of the objects in an image. Transformers are good at capturing semantic structure in text and images. Combining the generalization ability of Transformers with the detail-preservation ability of diffusion models offers the best of both worlds: a way to generate fine-grained, highly detailed images while preserving the semantic structure of the image.

Autoencoder VAE
The variational autoencoder (VAE) consists of two main parts: an encoder and a decoder. The encoder converts the image into a low-dimensional latent representation (pixel space -> latent space), which is then passed as input to the U-Net. The decoder does the opposite, transforming the latent representation back into an image (latent space -> pixel space).
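As a rough sketch of this pixel-space/latent-space round trip, assuming the Hugging Face diffusers package and the stabilityai/sd-vae-ft-mse VAE weights (illustrative choices, not from the original post):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

image = torch.randn(1, 3, 512, 512)                   # stand-in for a normalized image in [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # pixel space -> latent space, (1, 4, 64, 64)
    recon = vae.decode(latents).sample                # latent space -> pixel space, (1, 3, 512, 512)
```

The diffusion process itself runs entirely on the small 4 × 64 × 64 latents, which is where the compute savings come from.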
U-Net
The U-Net also consists of an encoder and a decoder, both composed of ResNet blocks. The encoder compresses the image representation to a lower resolution, and the decoder decodes it back to a higher-resolution image. To prevent the U-Net from losing important information during downsampling, shortcut (skip) connections are usually added between the downsampling ResNets of the encoder and the upsampling ResNets of the decoder.
Furthermore, U-Net in Stable Diffusion is able to condition its output on text embeddings via cross-attention layers. Cross-attention layers are added to the encoder and decoder parts of U-Net, usually between ResNet blocks.
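A toy sketch of the skip-connection idea, not the actual Stable Diffusion U-Net (the layer sizes and the 4-channel latent input are illustrative assumptions):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net: the decoder concatenates the matching encoder feature map (skip connection)."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = nn.Conv2d(4, ch, 3, padding=1)
        self.enc2 = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)  # downsample
        self.up   = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)    # upsample
        self.dec  = nn.Conv2d(ch * 2, 4, 3, padding=1)             # takes skip + upsampled features

    def forward(self, x):
        h1 = torch.relu(self.enc1(x))           # high-resolution features
        h2 = torch.relu(self.enc2(h1))          # low-resolution bottleneck
        u  = torch.relu(self.up(h2))            # back to high resolution
        return self.dec(torch.cat([u, h1], 1))  # skip connection preserves detail

noise_pred = TinyUNet()(torch.randn(1, 4, 64, 64))  # latent-space noise prediction, shape (1, 4, 64, 64)
```

In the real model, cross-attention layers between the ResNet blocks additionally attend to the text embeddings.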

Text Encoder
The text encoder converts the input prompt into an embedding space that the U-Net can understand. It is generally a simple Transformer-based encoder that maps a sequence of tokens to a sequence of latent text embeddings. High-quality prompts matter a great deal for output quality, which is why so much emphasis is placed on prompt design these days: finding the keywords or expressions that steer the model toward outputs with the desired properties or effects.
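A minimal sketch of how a prompt becomes the embeddings the U-Net cross-attention attends to, assuming the Hugging Face transformers package and the openai/clip-vit-large-patch14 checkpoint (the text encoder used by Stable Diffusion v1):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

prompt = "a photo of an astronaut riding a horse"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state  # shape (1, 77, 768)
# text_embeddings is what the U-Net's cross-attention layers condition on
```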

3 Consistency ends Diffusion

Diffusion models rely on an iterative generation process, which makes sampling slow and in turn limits their potential for real-time applications.

To overcome this limitation, OpenAI proposed Consistency Models, a new type of generative model that obtains high-quality samples quickly without adversarial training. Consistency Models support fast one-step generation while still allowing few-step sampling to trade computation for sample quality. They also support zero-shot data editing, for example image inpainting, colorization, and super-resolution, without task-specific training. Consistency Models can be trained by distilling pretrained diffusion models, or as standalone generative models.

As a generative model, the core design idea of Consistency Models is to support single-step generation while still allowing iterative generation, supporting zero-shot data editing and trading off sample quality against computation.

Consistency Models are built on the probability flow (PF) ordinary differential equation (ODE) of continuous-time diffusion models. Given a PF ODE that smoothly transforms data into noise, Consistency Models learn to map any point at any time step back to the trajectory's initial point for generative modeling. A notable property is self-consistency: points on the same trajectory map to the same initial point, which is why the models are called Consistency Models.
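Formally, following the paper, the consistency function $f_\theta$ is required to satisfy

$$f_\theta(x_t, t) = f_\theta(x_{t'}, t') \quad \text{for all } t, t' \in [\epsilon, T] \text{ on the same PF ODE trajectory}, \qquad f_\theta(x_\epsilon, \epsilon) = x_\epsilon,$$

i.e. any point on a trajectory maps to that trajectory's starting point, with the identity as the boundary condition at the smallest time $\epsilon$.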

Consistency Models can generate data samples (the initial point of the ODE trajectory, e.g. x_0 in Figure 1) by transforming a random noise vector (the end point of the ODE trajectory, e.g. x_T in Figure 1) with only one network evaluation. More importantly, by chaining the outputs of a Consistency Model over multiple time steps, the method can improve sample quality and perform zero-shot data editing at the cost of more computation, similar to the iterative refinement of diffusion models.
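A simplified Python sketch of this chained (multistep) sampling, following the multistep sampling procedure described in the paper; `f` stands for a trained consistency model, and `taus`, `sigma_max`, and `eps` are illustrative assumptions:

```python
import torch

@torch.no_grad()
def multistep_consistency_sampling(f, shape, taus, sigma_max=80.0, eps=0.002):
    """Chain the consistency model over a few time points to trade compute for quality."""
    x = f(torch.randn(shape) * sigma_max, sigma_max)  # one-step generation from pure noise
    for tau in taus:                                  # optional refinement steps, tau decreasing
        z = torch.randn(shape)
        x_tau = x + (tau**2 - eps**2) ** 0.5 * z      # re-noise the current sample to time tau
        x = f(x_tau, tau)                             # map back to the trajectory's initial point
    return x
```

With an empty `taus` this reduces to single-step generation; adding time points buys quality with extra evaluations.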

In terms of training, the research team provides two self-consistency-based methods for Consistency Models.

  • The first method relies on a numerical ODE solver and a pretrained diffusion model to generate pairs of adjacent points on the PF ODE trajectory. By minimizing the difference between the model outputs for these point pairs (a sketch of this loss follows the list below), diffusion models are effectively distilled into Consistency Models, which can then generate high-quality samples with a single network evaluation.

  • The second method completely eliminates the dependence on the pre-trained diffusion model, and can train Consistency Models independently. This approach positions Consistency Models as an independent class of generative models.
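As a sketch of the first (distillation) objective from the paper: for adjacent time points $t_n < t_{n+1}$ on the discretized trajectory, consistency distillation minimizes

$$\mathcal{L}_{CD} = \mathbb{E}\Big[ \lambda(t_n)\, d\big( f_\theta(x_{t_{n+1}}, t_{n+1}),\; f_{\theta^-}(\hat{x}^{\phi}_{t_n}, t_n) \big) \Big],$$

where $\hat{x}^{\phi}_{t_n}$ is obtained from $x_{t_{n+1}}$ by one step of the numerical ODE solver using the pretrained diffusion model $\phi$, $\theta^-$ is a running (EMA) copy of $\theta$, $d(\cdot,\cdot)$ is a distance metric, and $\lambda(\cdot)$ is a weighting function.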

It is worth noting that neither training method requires adversarial training, and both allow Consistency Models to use flexible neural network architectures.
