[Artificial Intelligence Frontier Trends] - Generative AI Series: Diffusion Model and Stable Diffusion Model

The emergence of VAEs and GANs made generative AI increasingly popular, and the rise of the diffusion model has pushed AIGC to the forefront of artificial intelligence; it is widely regarded as the key factor behind the breakthrough in AI-generated art. Compared with VAEs and GANs, diffusion models produce higher-quality images. With the transformer architecture and the rise of prompt engineering, generating images from text prompts has become much more mature, and the development of the stable diffusion model lets us easily create wonderful artistic illustrations from text prompts. In this article I will explain how these models work, not by piling up complicated formulas, but by describing the working principles of the diffusion model and the stable diffusion model in plain language.

1、Diffusion Model

1.1 Overview of Diffusion Model Principles

Unlike a GAN, which generates images by adversarially training a generator against a discriminator, the diffusion model generates an image by repeatedly denoising a random noise image, as shown in the figure below. It is a bit like carving: starting from a rough block of stone, the sculptor removes the excess bit by bit, and what remains is a finished work of art. The random noise must have the same height and width as the target image to be generated.


In the denoising process of the diffusion model, the number of denoising steps is set manually in advance, for example 1000. The step index is not just a counter: it also tells the model how severe the remaining noise is at that point.

The Denoise module used at every denoising step is the same network, reused repeatedly. Its input is not only the current noisy image but also the corresponding step index.

1.2 Implementation principle of the Denoise module

Inside the Denoise module there is a Noise Predictor. As shown in the figure below, its inputs are the current noisy image and the step index, and its output is a noise image. This output is an estimate of the pure noise contained in the input, not the denoised result itself; the predicted noise is then subtracted from the input noisy image to obtain the denoised result. The reason for this design is that learning to predict the pure noise is simpler and easier to train than directly generating the denoised image. Of course, you can also try having the network output the denoised image directly.
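
A minimal PyTorch-style sketch of this predict-and-subtract idea (the `noise_predictor` network is assumed, and real samplers rescale by the noise schedule rather than doing a plain subtraction):

```python
import torch
import torch.nn as nn

class Denoise(nn.Module):
    """One denoising module: predict the noise in the input, then subtract it."""
    def __init__(self, noise_predictor: nn.Module):
        super().__init__()
        self.noise_predictor = noise_predictor  # e.g. a U-Net taking (image, step)

    def forward(self, noisy_image: torch.Tensor, step: torch.Tensor) -> torch.Tensor:
        predicted_noise = self.noise_predictor(noisy_image, step)  # pure-noise estimate
        return noisy_image - predicted_noise                       # denoised result


def generate(denoise: Denoise, shape, num_steps: int = 1000) -> torch.Tensor:
    x = torch.randn(shape)                     # start from pure random noise
    for step in reversed(range(1, num_steps + 1)):
        x = denoise(x, torch.tensor([step]))   # the same module is reused at every step
    return x
```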

1.3 How to train the Noise Predictor?

First comes the training data, which for a diffusion model is constructed artificially from existing images. As shown in the figure below, noise is added to the original image step by step. This process is called the Forward Process, also known as the Diffusion Process.
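
A sketch of this forward process, assuming the common DDPM-style setup where a small amount of Gaussian noise is mixed in at every step (the linear schedule values are illustrative assumptions):

```python
import torch

def forward_process(x0: torch.Tensor, num_steps: int = 1000):
    """Gradually mix Gaussian noise into a clean image, keeping every noisy version."""
    betas = torch.linspace(1e-4, 0.02, num_steps)   # assumed linear noise schedule
    noisy_images, x = [], x0
    for beta in betas:
        noise = torch.randn_like(x)                 # the ground-truth noise for this step
        x = torch.sqrt(1 - beta) * x + torch.sqrt(beta) * noise
        noisy_images.append(x)
    return noisy_images
```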

So what are the inputs and outputs of the Noise Predictor? As shown in the figure, the input of the Noise Predictor is the noise-added image together with the corresponding step index, and the output is the predicted pure noise, which is compared against the noise we actually added when constructing the data; that added noise is our ground truth.
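
In code, one training step might look like the sketch below (the closed-form jump to step t and the MSE objective follow the standard DDPM recipe; `noise_predictor` and `alphas_cumprod`, the cumulative product of 1 - beta, are assumptions):

```python
import torch
import torch.nn.functional as F

def training_step(noise_predictor, x0, alphas_cumprod):
    """One training step: make the network recover exactly the noise that was added."""
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))         # a random step per sample
    noise = torch.randn_like(x0)                                      # ground-truth pure noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise  # noised image at step t
    predicted_noise = noise_predictor(noisy, t)                       # Noise Predictor output
    return F.mse_loss(predicted_noise, noise)                         # compare against ground truth
```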


For text-to-image generation, the training data must also contain corresponding text. In addition to the noisy image and the step index, the text description associated with the image is fed into the Denoise module (i.e. into the Noise Predictor) during training.
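
For the text-conditioned case the only change is one extra input (a sketch; how the embedding is fused inside the assumed network, typically via cross-attention, is not shown):

```python
import torch.nn.functional as F

def conditioned_training_step(noise_predictor, noisy_image, step, text_embedding, noise):
    # The caption's embedding is passed alongside the noisy image and the step index.
    predicted_noise = noise_predictor(noisy_image, step, text_embedding)
    return F.mse_loss(predicted_noise, noise)
```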


1.4 Algorithm flow pseudocode

The following is the algorithm flow of the diffusion model in pseudocode form.

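A Python-style sketch of the standard DDPM training and sampling loops (the noise schedule and the posterior-mean update follow the common formulation; all names here are assumptions):

```python
import torch
import torch.nn.functional as F

def train(noise_predictor, dataloader, optimizer, num_steps=1000):
    """Training loop: noise clean images to a random step and regress the added noise."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    for x0 in dataloader:
        t = torch.randint(0, num_steps, (x0.shape[0],))
        noise = torch.randn_like(x0)
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        noisy = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise
        loss = F.mse_loss(noise_predictor(noisy, t), noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

@torch.no_grad()
def sample(noise_predictor, shape, num_steps=1000):
    """Sampling loop: start from pure noise and denoise step by step."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(num_steps)):
        eps = noise_predictor(x, torch.tensor([t]))
        # posterior mean: remove the predicted noise, rescaled by the schedule
        x = (x - betas[t] / torch.sqrt(1.0 - alphas_cumprod[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # re-inject a little noise
    return x
```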

2、Stable Diffusion

The stable diffusion model generates images from text prompts and mainly consists of three parts: ① a text encoder, ② a generation model, and ③ a decoder. The three are trained separately and then combined. The overall structure of the model is shown in the figure below.

2.1 Text Encoder

The Text Encoder converts our prompt into the embedding required by the generation model. Models such as GPT or BERT can be used as the Text Encoder; we will not go into its details here.
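
As an illustration, a prompt can be turned into such an embedding with a pretrained CLIP text encoder from the `transformers` library (Stable Diffusion v1 uses a CLIP text model; the exact checkpoint name below is an assumption):

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(["a cat wearing a spacesuit"], padding="max_length",
                   truncation=True, return_tensors="pt")
text_embedding = text_encoder(**tokens).last_hidden_state  # one vector per token, fed to the generator
```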

As shown in the figure below, the quality of the Text Encoder has a great influence on the generated images.


Note: The smaller the FID and the larger the CLIP Score, the better the generated image.

FID: a pretrained CNN (typically an Inception network) is used to extract feature vectors from the generated images and from real images. Each set of features is modeled as a Gaussian distribution, and the distance between the two distributions is measured with the Fréchet distance; the smaller the distance, the better the quality of the generated images. FID is only reliable when computed over a large number of samples.
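
A sketch of the FID computation on features that have already been extracted (the feature extractor, usually an Inception network, is assumed to have been run on both real and generated images):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_features: np.ndarray, fake_features: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to real and generated feature sets."""
    mu_r, mu_f = real_features.mean(axis=0), fake_features.mean(axis=0)
    cov_r = np.cov(real_features, rowvar=False)
    cov_f = np.cov(fake_features, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):    # numerical error can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))
```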


CLIP Score: feed the generated image and its corresponding text prompt into CLIP. If the two resulting embedding vectors are close, the generated image matches the text well; if they are far apart, the result is poor.
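
A sketch of this check with a pretrained CLIP model (the checkpoint name and file path are assumptions); the cosine similarity between the image and text embeddings serves as the score:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")  # hypothetical generated image
inputs = processor(text=["a cat wearing a spacesuit"], images=image, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
clip_score = (img_emb * txt_emb).sum(dim=-1)   # higher means the image matches the text better
```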


2.2 Generation Model

The Generation Model is, in this case, the diffusion model described above. Its inputs are: ① the embedding produced by the text encoder, ② a random noise image (note that this noise does not have the size of the target image but of its compressed, reduced version), and ③ the step index. Its output is a compressed intermediate representation, which may or may not be human-readable.


When constructing the training data, noise is no longer added to the original image but to the compressed latent feature map produced by the encoder (which corresponds to the intermediate product the model outputs at inference time). The input and output of the Denoise module change accordingly: the input is the noised intermediate product together with the text-encoder embedding and the step index, and the output is the pure noise.
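
A sketch of one latent-space training step under this setup (the VAE encoder, the conditioned noise predictor, and `alphas_cumprod` are assumed components):

```python
import torch
import torch.nn.functional as F

def latent_training_step(vae_encoder, noise_predictor, text_embedding, x0, alphas_cumprod):
    latents = vae_encoder(x0)                                  # compressed latent feature map
    t = torch.randint(0, len(alphas_cumprod), (latents.shape[0],))
    noise = torch.randn_like(latents)                          # noise is added in latent space
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy_latents = torch.sqrt(a_bar) * latents + torch.sqrt(1.0 - a_bar) * noise
    predicted_noise = noise_predictor(noisy_latents, t, text_embedding)
    return F.mse_loss(predicted_noise, noise)                  # target is still the pure noise
```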


2.3 Decoder

The Decoder converts the intermediate product generated by the Generation Model into our target image. It is somewhat like the decoder of a semantic segmentation model or a super-resolution network. We can directly use a trained VAE decoder as the Decoder.
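
For example, the VAE decoder shipped with Stable Diffusion can be loaded through the `diffusers` library (the checkpoint name and the 0.18215 latent scaling constant are the conventional SD v1 values, used here as assumptions):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

latents = torch.randn(1, 4, 64, 64)                # stands in for the generation model's output
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample   # image tensor, roughly in [-1, 1]
```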


Decoder training only requires images, not corresponding text.

If the intermediate product generated by the Generation Model is a small image that humans can understand, then the Decoder's input is a downsampled image and its output is the full-size image. We can downsample whatever image data we have, use the downsampled result as the Decoder's input, and use the original image as its target output for training.


If the intermediate product generated by the Generation Model is a latent representation that humans cannot understand, we need an auto-encoder to help us map it back to the target image.
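
A sketch of how such an auto-encoder is trained: the encoder compresses the image into the latent representation, the decoder reconstructs the image, and a reconstruction loss ties them together (real latent-diffusion VAEs add KL and perceptual/adversarial terms; this simplified version is an assumption):

```python
import torch.nn.functional as F

def autoencoder_step(encoder, decoder, images, optimizer):
    latents = encoder(images)                 # human-unreadable latent representation
    reconstructed = decoder(latents)          # decoder learns to map latents back to images
    loss = F.mse_loss(reconstructed, images)  # reconstruction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```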


2.4 Example models

As shown in the figures below, Stable Diffusion, the DALL-E series, Imagen, and others all follow the three-part structure described above.



Origin: blog.csdn.net/qq_43456016/article/details/132222444