Technical Principles of Stable Diffusion Text-to-Image Generation

Introduction to image generation models

In the field of image generation, there are four mainstream generative models: the Generative Adversarial Network (GAN), the Variational Autoencoder (VAE), the Flow-based Model, and the Diffusion Model.

Since 2022, the dominant image generation model has been the Diffusion Model.

Diffusion Model: the core of DALL-E, Midjourney, and Stable Diffusion. It is a generative model that produces good results by repeatedly removing noise.

Early diffusion models did not work well for AI painting, and generating a single image took 10-15 minutes. Later, the British company Stability AI improved the model and open-sourced it, greatly improving the stability and quality of image generation. Generation speed improved by roughly 100x: a picture that used to take 10-15 minutes (600-900 seconds) now takes only 6-10 seconds.

Before Stable Diffusion appeared, there was a predecessor, Latent Diffusion, namely the text2image model from the Latent Diffusion paper.

Latent Diffusion Model: a variant of the diffusion model. Its biggest difference is that it compresses the image into a lower-dimensional space, called the latent space, which greatly reduces computation. With this technique, images can be generated on ordinary GPUs. In addition, diffusion models can generate not only images but also audio and video.

Stability AI improved Latent Diffusion, and the new model is called Stable Diffusion. The improvements include:

(1) Training data: Latent Diffusion was trained on the LAION-400M dataset, while Stable Diffusion was trained on the LAION-2B-en dataset. The latter uses far more training data and also applies data filtering to improve quality, such as removing watermarked images and selecting images with higher aesthetic scores.

(2) Text encoder: Latent Diffusion uses a randomly initialized transformer to encode text, while Stable Diffusion uses a pre-trained CLIP text encoder. Pre-trained text models are generally better than models trained from scratch.

(3) Training resolution: Latent Diffusion was trained only at 256x256 resolution, while Stable Diffusion was pre-trained at 256x256 and then fine-tuned at 512x512.

Summary: Stable Diffusion uses a better text encoder, trains on a larger dataset, and can generate higher-resolution images, so its image generation quality is currently better.

Stable Diffusion inference process

The inference process in more detail:

Process: prompt text (cat girl) -> CLIP -> text embedding -> diffusion (U-Net + Scheduler) -> VAE -> generated image
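As a minimal illustration of this pipeline, the sketch below uses the Hugging Face diffusers library; the model id and parameter values are assumptions chosen for illustration, not the only options.

```python
# Minimal sketch of the text-to-image inference pipeline with the diffusers library.
# The model id and parameters are illustrative assumptions.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# CLIP encodes the prompt, U-Net + scheduler denoise in latent space, the VAE decodes to pixels.
image = pipe("cat girl", num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("cat_girl.png")
```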

The underlying working mechanism of Stable Diffusion

Step 1. Input the prompt and encode it: the CLIP text encoder

Step 2. Generate an image representation from the prompt representation: the U-Net-based diffusion process (U-Net + Scheduler)

Step 3. Convert the image between latent and pixel space: VAE (the image decoder, responsible for going from latent space to pixel space)

The principles of each step are analyzed below.

CLIP

CLIP (Contrastive Language-Image Pre-training): a pre-trained model based on contrastive learning over image-text pairs

CLIP does not fully understand semantics; it simply learns a way to match text and images:

The text embedding output by the text encoder is just an intermediate product of CLIP.

CLIP training set: 400 million image-text pairs.

Training process

In contrastive learning, within a batch the diagonal entries are positive samples and all the others are negative samples. The training goal of CLIP is to maximize the similarity of the N positive pairs while minimizing the similarity of the N^2 - N negative pairs.

The hope is that through contrastive learning the model learns the matching relationship between text and images: matching image-text pairs get a high score, and mismatched pairs get a low score.

Put simply, text and images are mapped into the same embedding space so that text-to-image similarity can be computed, making it easy to find the distribution of images corresponding to a given text.

Trick: The larger the batch, the better the training effect.
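A minimal sketch of this contrastive objective, written as a symmetric cross-entropy over the N×N similarity matrix; the image and text features are assumed to be already extracted by the two encoders.

```python
# Sketch of CLIP's symmetric contrastive loss for one batch of N image-text pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    image_feats = F.normalize(image_feats, dim=-1)        # cosine similarity via normalized dot product
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature   # N x N similarity matrix
    labels = torch.arange(logits.size(0))                 # diagonal entries are the positive pairs
    # Symmetric cross-entropy: image-to-text direction plus text-to-image direction.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Example with random features standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```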

How big a CLIP model is needed?

Paper: "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding"

paper:https://arxiv.org/abs/2205.11487

Figure note: FID ↓ (lower is better), CLIP score ↑ (higher is better)

Increasing the encoder size of the language model improves image-text alignment more than increasing the size of the image diffusion model.

FID(Fréchet Inception Distance)

paper:https://arxiv.org/abs/1706.08500

FID measures the distance between the feature distributions (assumed Gaussian) of real images and generated images. It requires many samples (FID-10K uses 10K images). The smaller the FID, the better: it means the generated images are closer to real images.
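As a sketch, FID can be computed from the means and covariances of the two feature sets (Inception features assumed to be already extracted): ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^{1/2}).

```python
# Sketch of the FID computation between real and generated feature sets.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):      # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean))

# Toy example with random 64-dim features; real FID uses 2048-dim Inception-v3 features of >= 10K images.
score = fid(np.random.randn(1000, 64), np.random.randn(1000, 64))
```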

Diffusion Model

DDPM (Denoising Diffusion Probabilistic Models) is an implementation of the diffusion model idea.

DDPM learns the denoising process by continuously adding noise to data until it becomes pure noise, then training a model to denoise step by step and restore the original data. Once trained, it can sample random noise and restore (generate) it into diverse data.

The forward process (also called the diffusion process) gradually adds Gaussian noise to the data until it becomes random noise.

The reverse process is a denoising process: if we know the true distribution at each step of the reverse process, then starting from random noise and gradually denoising will produce a real sample. The reverse process is therefore the data-generation process.
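In the notation of the DDPM paper, the two processes can be written as:

```latex
% Forward (diffusion) process: add Gaussian noise with variance schedule \beta_t
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)

% Reverse (denoising) process: learned Gaussian transitions parameterized by \theta
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```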

Why add noise? Why add noise step by step?

1) Directly removing pixels will cause information loss, and adding noise can allow the model to learn image features;

2) Random noise can also increase the diversity of model generation;

3) This process can be controlled step by step while providing stability during the denoising process.

How much noise should be added at each step?

This depends on the noise schedule. Generally it works better to add less noise at the beginning and more later, so that image features are lost gradually.

The process of denoising can be compared to sculpture. Michelangelo said: The statue is originally in the stone, I just remove the unnecessary parts.

How to train?

Random noise is added to the image step by step. This is called the diffusion process (also the forward or noising process). Each step has a ground-truth image, and the model is trained to restore it.

The process of restoring pictures:

1) Input the noisy image (the original image covered with noise at step=50) together with the step number 50, and use U-Net to predict the noise in the image. All steps share the same U-Net.

2) When there is a lot of noise, U-Net cannot predict precise image details; it can only predict a rough outline.

3) Repeat the prediction in this way until the original image is recovered.

The actual noising process of DDPM does not need to be done step by step: the Gaussian noise for a given step can be added in one shot (in closed form), and the noise is then predicted and removed step by step.
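A minimal sketch of this one-shot noising and the resulting training step; `unet` here is a placeholder noise-prediction network (an assumption), and the linear schedule values follow the DDPM paper.

```python
# Sketch of DDPM training: noise an image in one shot, then train the model to predict that noise.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)    # cumulative product of (1 - beta_t)

def training_step(unet, x0):
    t = torch.randint(0, T, (x0.size(0),))            # a random step for each image in the batch
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    # Closed-form noising: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    pred_noise = unet(x_t, t)                         # the same U-Net is shared across all steps
    return F.mse_loss(pred_noise, noise)              # simple noise-prediction objective
```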

Paper: "Denoising Diffusion Probabilistic Models"

paper: https://arxiv.org/abs/2006.11239

Paper: "Understanding Diffusion Models: A Unified Perspective"

paper:https://arxiv.org/abs/2208.11970

Noise sampling scheme

A major cost of the diffusion model lies in noise sampling. Sampling must start from a pure-noise image and denoise step by step until a clear picture is obtained. In this process the model must run at least 50 to 100 steps serially to obtain a higher-quality image, so generating one image takes 50 to 100 times as long as other deep generative models, which greatly limits deployment.

In Stable Diffusion these sampling processes correspond to Schedulers. The main function of a Scheduler is to output, for the current step, the coefficient applied to the noise. A simplified formula is: image noise = randomly generated noise × coefficient output by the scheduler.
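For reference, the concrete update that a DDPM-style scheduler implements at each step is, in the paper's notation:

```latex
% One DDPM denoising step, with \alpha_t = 1-\beta_t, \bar{\alpha}_t = \prod_{s \le t} \alpha_s, and z \sim \mathcal{N}(0, \mathbf{I})
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z
```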

Driven by the requirements of sampling quality and speed, the noising/denoising sampling scheme is very important for diffusion models; common schemes include DDPM, DDIM, PLMS, DPM-Solver, etc.

DDPM (Denoising Diffusion Probabilistic Model) uses a linear noise sampling scheme (linear schedule) by default.

DDIM (Denoising Diffusion Implicit Models) has the same training objective as DDPM, but it no longer restricts the diffusion process to a Markov chain, which allows DDIM to use fewer sampling steps to speed up generation. Another feature of DDIM is that generating a sample from a given random noise is a deterministic process.

DPM-Solver, proposed by the TSAIL team led by Professor Zhu Jun at Tsinghua University, is an efficient solver designed specifically for diffusion probabilistic models. The algorithm requires no additional training, applies to both discrete-time and continuous-time diffusion models, converges in 20 to 25 steps, and can obtain very high-quality samples in only 10 to 15 steps. On Stable Diffusion it roughly doubles sampling speed.
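In the diffusers library, switching the sampling scheme is just a matter of swapping the scheduler object; a sketch, with the model id assumed for illustration:

```python
# Sketch: replace the default scheduler with DPM-Solver to sample in fewer steps.
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("cat girl", num_inference_steps=20).images[0]   # ~20 steps are usually enough
```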

Text embedding is added to the image generation process

1) To incorporate text features, U-Net adds an attention mechanism (QKV cross-attention) to its network structure.

2) To strengthen the guidance effect of the text, classifier-free guidance is used; the familiar 7.5 parameter is the guidance scale.
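Classifier-free guidance combines a text-conditioned and an unconditional noise prediction at every denoising step; a minimal sketch, where the `unet` call signature is a simplified assumption:

```python
# Sketch of classifier-free guidance at a single denoising step.
def guided_noise(unet, x_t, t, text_emb, uncond_emb, guidance_scale=7.5):
    noise_text = unet(x_t, t, text_emb)       # prediction conditioned on the prompt
    noise_uncond = unet(x_t, t, uncond_emb)   # prediction for an empty prompt
    # Push the prediction away from "unconditional" and toward "text-conditioned".
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)
```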

Picture effects of different guidance scales:

The core of the diffusion model is training the noise-prediction model. Since the noise has the same dimensions as the original data, an AutoEncoder-style architecture can be used as the noise-prediction model. DDPM uses a U-Net built from residual blocks and attention blocks.

U-Net is a model proposed in the 2015 paper "U-Net: Convolutional Networks for Biomedical Image Segmentation".

UNet is a semantic segmentation model, and its execution process is:

First, convolutions are used for downsampling, extracting features layer by layer; these features are then used for upsampling, finally producing an output in which each pixel is assigned its class.

U-Net network structure:

Advantages of U-Net:

1. The deeper the network layer, the larger the receptive field of the resulting feature map;

2. Shallow convolution focuses on texture features, while deep networks focus on essential features, so both deep and shallow features are meaningful;

3. Feature maps recovered by up-convolution alone lack edge information: every downsampling step inevitably loses some edge features, and the lost features cannot be recovered by upsampling, so edge information is recovered by concatenating the corresponding skip features;

4. U-Net is simple, efficient, easy to understand and to build; it can be trained on small datasets and is simple and practical to use in diffusion models.
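The downsample/upsample structure with skip connections described above can be sketched as a toy PyTorch module; this is only an illustrative skeleton, not the Stable Diffusion U-Net.

```python
# Toy U-Net skeleton: downsample, upsample, and concatenate skip features.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU())
        self.up1 = nn.ConvTranspose2d(ch * 2, ch, 2, stride=2)   # upsample back to input size
        self.out = nn.Conv2d(ch * 2, 3, 3, padding=1)            # applied after the skip concat

    def forward(self, x):
        d1 = self.down1(x)                   # shallow features: textures and edges
        d2 = self.down2(d1)                  # deeper features: larger receptive field
        u1 = self.up1(d2)
        u1 = torch.cat([u1, d1], dim=1)      # skip connection restores lost edge detail
        return self.out(u1)

y = TinyUNet()(torch.randn(1, 3, 64, 64))    # output has the same spatial size as the input
```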

VAE

The role of VAE: it is performance-friendly, allows interpolation and manipulation in latent space, and gives control over image generation.

The encoder and decoder here do not simply shrink or enlarge the image; they encode it. An analogy: music is encoded into sheet music, and the music is then played back from the sheet music.

Structure of VAE

During training, the VAE encoder learns to map the input data to a probability distribution in latent space, and the decoder learns to reconstruct the original data from latent vectors. The training objective minimizes the reconstruction error together with a KL divergence term that keeps the latent distribution close to the prior.
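Written as a single objective, the VAE minimizes a reconstruction term plus a KL regularizer on the encoder's posterior:

```latex
% VAE loss for one sample x: reconstruction error plus KL divergence to the prior p(z)
\mathcal{L}(\theta, \phi; x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
 + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
```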

VAE vs Diffusion Model:

The encoder of a VAE learns a probability distribution, so a VAE on its own can also sample randomly and generate images, but its generation quality is weak: the generated images are blurry, not as good as those of the diffusion model.

Benefits of VAE: reduced training and inference time, and lower GPU hardware requirements.

The original 512x512x3 image is compressed into a 64x64x4 latent. Stable Diffusion uses the KL-f8 VAE, whose downsampling factor is 8, which reduces the data size by a factor of 48.
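A sketch of this 8x spatial compression using the diffusers VAE; the model id is an assumption chosen for illustration, and the random tensor stands in for a normalized image.

```python
# Sketch: encode a 512x512 RGB image into a 64x64x4 latent and decode it back.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
x = torch.randn(1, 3, 512, 512)                     # stand-in for a normalized input image
with torch.no_grad():
    latent = vae.encode(x).latent_dist.sample()     # shape: (1, 4, 64, 64)
    recon = vae.decode(latent).sample               # back to shape: (1, 3, 512, 512)
print(latent.shape, recon.shape)
```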

Disadvantage of VAE: compressing and then restoring the image loses some detail.

Structure of Stable Diffusion

Latent Diffusion paper: "High-Resolution Image Synthesis with Latent Diffusion Models"

paper:https://arxiv.org/abs/2112.10752

A component-by-component breakdown:

 

The overall framework of Latent Diffusion Models is shown in the figure. First, an autoencoding model (AutoEncoder, consisting of an encoder E and a decoder D) is trained. The encoder compresses the image into a latent representation, the diffusion process then operates in this latent space, and finally the decoder restores the result to pixel space. The paper calls this approach perceptual compression.
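In the paper's notation, the Latent Diffusion training objective is the same noise-prediction loss as DDPM, but computed on the latent z = E(x) and conditioned on the text prompt y through the encoder τ_θ:

```latex
% Latent Diffusion Models training objective (conditioning injected via cross-attention on \tau_\theta(y))
L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}
\left[\, \left\lVert \epsilon - \epsilon_\theta\!\left(z_t,\ t,\ \tau_\theta(y)\right) \right\rVert_2^2 \,\right]
```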

Control methods for Stable Diffusion

Textual Inversion

Adjust CLIP so that it outputs text features that match new concepts, such as a twin-bell alarm clock or Ultraman Tiga. Only the learned embedding needs to be saved.
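In the diffusers library, a learned Textual Inversion embedding can be loaded into an existing pipeline; a sketch, where the concept repository and its placeholder token are illustrative assumptions:

```python
# Sketch: load a learned Textual Inversion embedding and use its placeholder token in the prompt.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_textual_inversion("sd-concepts-library/cat-toy")   # example concept repository
image = pipe("a photo of a <cat-toy> on a desk").images[0]   # <cat-toy> is the learned token
```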

ControlNet

Train a new network that adjusts the U-Net blocks. This new network takes as input an image used as the control condition, such as a Canny edge map or a skeleton (pose) map.
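A sketch of conditioning Stable Diffusion on a Canny edge map with diffusers; the model ids are assumptions for illustration, and the blank placeholder image stands in for a real edge map.

```python
# Sketch: generate an image conditioned on a Canny edge map via ControlNet.
import numpy as np
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
)
# Placeholder edge map; in practice, run a Canny edge detector on a reference photo.
canny_image = Image.fromarray(np.zeros((512, 512, 3), dtype=np.uint8))
image = pipe("cat girl", image=canny_image, num_inference_steps=20).images[0]
```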

 

Conclusion

The Diffusion Model works differently from earlier generative models such as GAN, VAE, and Flow. It gradually "samples" from Gaussian noise under certain conditions, and as the number of "sampling" rounds increases, the final picture emerges. In other words, the synthesis process of the Diffusion Model extracts the required image from noise iteration by iteration, and the synthesis quality improves as the number of iteration steps grows.

The benefit of this mechanism is obvious: the trade-off between synthesis quality and synthesis speed becomes controllable. When time is sufficient, high-quality samples can be obtained with many iterations, while fast synthesis with fewer iterations can still produce samples without obvious flaws. No retraining is needed to switch between high and low iteration counts; only a few step-related parameters need to be adjusted.

This may sound a bit strange, but there is solid mathematics behind it, mainly Markov chains and Langevin dynamics.

References

English

Latent Diffusion paper: https://arxiv.org/pdf/2112.10752.pdf

Diffusion Models detailed formula: What are Diffusion Models? | Lil'Log

Comparison of various fine-tuning model methods: https://www.youtube.com/watch?v=dVjMiJsuR5o

Scheduler comparison chart comes from the paper: https://arxiv.org/pdf/2102.09672.pdf

Source of VAE structure diagram: https://towardsdatascience.com/vae-variational-autoencoders-how-to-employ-neural-networks-to-generate-new-images-bdeb216ed2c0

The corgi diagram comes from the DALLE2 paper: https://cdn.openai.com/papers/dall-e-2.pdf

Introduction to CLIP model: https://github.com/openai/CLIP

OpenCLIP:https://github.com/mlfoundations/open_clip

Textual Inversion: An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

LoRA paper: https://arxiv.org/pdf/2106.09685.pdf

Dreambooth paper: https://arxiv.org/pdf/2208.12242.pdf

ControlNet paper: https://arxiv.org/pdf/2302.05543.pdf

Simple and easy-to-understand explanation of Diffusion Model: https://www.youtube.com/watch?v=1CIpzeNxIhU

A great Stable Diffusion explanation: The Illustrated Stable Diffusion – Jay Alammar – Visualizing machine learning one concept at a time.

Another great SD explanation: https://medium.com/@steinsfu/stable-diffusion-clearly-explained-ed008044e07e

GLIDE paper: https://arxiv.org/abs/2112.10741

Classifier-Free Diffusion Guidance paper: https://arxiv.org/pdf/2207.12598.pdf

Chinese

Stable Diffusion UNET structure: Stable Diffusion UNET structure - Zhihu

LoRA application experience: Do you really know how to use LORA? Super detailed explanation of LORA hierarchical control - Zhihu

Great explanation of Stable Diffusion: Diffusion Model [Translation]_Yu Jianmin's Blog-CSDN Blog

A very detailed introduction to Stable Diffusion: [Original] A long article of 10,000 words explaining the basic technical principles of Stable Diffusion’s AI painting - Zhihu

Explanation related to diffusion model: https://www.youtube.com/watch?v=hO57mntSMl0


Source: https://blog.csdn.net/shibing624/article/details/132486118