Stable Diffusion Principle Description

Reference for this article: Explain the principle of Stable Diffusion in simple terms, so that novices can understand it.

Table of contents

1. What can Stable Diffusion do?

2. Diffusion model

(1) Forward Diffusion

(2) Reverse Diffusion

(3) How to train

3. Stable Diffusion Model

(1) Latent diffusion model

(2) Variational Autoencoder (VAE)

(3) Why are latent spaces possible?

(4) Reverse diffusion in latent space

(5) What is a VAE file

(6) Conditioning

(6.1) Text conditions

(6.2) Tokenizer

(6.3) Send embeddings to the noise predictor

(6.4) Cross Attention Mechanism

4. Stable Diffusion generation steps

(1) Text to Image

(2) Image to Image

5. CFG value

(1) Classifier Guidance

(2) Classifier-free guidance

6. Summary of Stable Diffusion


1. What can Stable Diffusion do?

In its simplest form, Stable Diffusion is a text-to-image model: give it a text prompt, and it returns images that match the text.

2. Diffusion model

Stable Diffusion belongs to the family of diffusion models.

Diffusion models are generative models: their purpose is to generate new data similar to the data they were trained on. For Stable Diffusion, that data is images.

Why is it called a diffusion model?

The process is divided into two parts, forward diffusion and reverse diffusion, which correspond to adding noise and removing noise respectively.

(1) Forward Diffusion

Forward diffusion adds noise to a training image, gradually transforming it into a featureless noise image. The forward pass turns any cat or dog photo into a noisy image; eventually it becomes impossible to tell whether the original was a dog or a cat.

It is like a drop of ink falling into a glass of water: the ink spreads through the water, and after a few minutes it is randomly distributed throughout the glass. You can no longer tell whether it originally landed in the center or near the edge.

Below is an example of an image with forward diffusion, where the image of a cat becomes random noise.

(2) Reverse Diffusion

The reverse process is like playing a video backwards: going back in time, we can see where the ink drop was originally added. For images, reverse diffusion restores the image.

Starting from a noisy, meaningless image, reverse diffusion recovers an image of a cat or a dog.

Each reverse diffusion step has two parts: one is drift, or directed motion, which moves toward the image of a cat or dog; the other is random motion.

(3) How to train

For reverse diffusion, we need to know how much noise was added to the image. The answer is to teach a neural network to predict the added noise. In Stable Diffusion this is called the noise predictor, and it is a U-Net model. Training works as follows:

  1. Choose a training image, e.g. a photo of a cat
  2. Generate a random noise image
  3. Corrupt the training image by adding this noise, up to a certain number of steps
  4. Train the noise predictor to tell us how much noise was added; this is done by showing it the correct answer and adjusting its weights

Noise is added sequentially at each step, and after training we have a noise predictor capable of estimating the noise added to an image.
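As a minimal sketch of this idea (not the actual Stable Diffusion training code; the tiny `noise_predictor`, the linear noise schedule, and the omission of timestep conditioning are all simplifying assumptions), the training step could look like this in PyTorch:

```python
import torch
import torch.nn.functional as F

# Toy stand-in for the U-Net noise predictor (the real model is far larger and is
# also conditioned on the timestep t, which is omitted here for brevity).
noise_predictor = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(noise_predictor.parameters(), lr=1e-4)

num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)        # simple linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # how much of the image survives at step t

def training_step(image):                                 # image: (B, 3, H, W), values in [-1, 1]
    t = torch.randint(0, num_steps, (image.shape[0],))    # pick a random noising step
    noise = torch.randn_like(image)                       # the "correct answer" to learn
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy_image = a.sqrt() * image + (1 - a).sqrt() * noise   # corrupt the training image
    predicted_noise = noise_predictor(noisy_image)            # model guesses the added noise
    loss = F.mse_loss(predicted_noise, noise)                 # compare guess with the real noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```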

Now that we have a noise predictor, how do we use it?

We first generate a completely random image and ask the noise predictor how much noise it contains. We then subtract the estimated noise from the image. Repeating this process a number of times, we end up with an image of a cat or a dog.
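Continuing the toy sketch above (a real sampler such as DDPM or DDIM uses the noise schedule to decide exactly how much of the estimated noise to remove at each step), the loop might look like:

```python
import torch

@torch.no_grad()
def generate(shape=(1, 3, 64, 64), steps=50):
    image = torch.randn(shape)                    # start from a completely random image
    for _ in range(steps):
        predicted_noise = noise_predictor(image)  # ask the predictor what the noise looks like
        image = image - predicted_noise / steps   # remove a fraction of the estimated noise
    return image
```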

At this point we have no control over whether a cat or a dog is generated; image generation is unconditional. This will be addressed when we discuss conditioning later on.

3. Stable Diffusion Model

Part 2 describes how a diffusion model generates images, but not yet how the Stable Diffusion model works. The reason is that the diffusion process described above runs in image space, which is so computationally intensive that it cannot run on an ordinary single GPU.

Image space is huge: a 512×512 image with three color channels (RGB) lives in a 786,432-dimensional space (3 × 512 × 512).

Diffusion models such as Google's Imagen and OpenAI's DALL-E work in pixel space. They use some tricks to make the model faster, but it is still not enough.

Of course, pixel space also has advantages: the generated content can be controlled more precisely, for example when rendering text in an image.

Stable Diffusion aims to solve the speed problem.

(1) Latent diffusion model

Stable Diffusion is a latent diffusion model. It does not operate in the high-dimensional image space; instead it first compresses the image into a latent space (similar in spirit to the CenterNet algorithm in computer vision, which also trains and runs inference on a downsampled 512×512 → 128×128 feature map). The latent space is 48 times smaller than the original pixel space, which is why it is so much faster.

(2) Variational Autoencoder (VAE)

Stable Diffusion uses a variational autoencoder (VAE) to compress images into the latent space. The VAE neural network consists of two parts, an encoder and a decoder: the encoder compresses the image into a low-dimensional representation in the latent space, and the decoder recovers the image from the latent space.
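As an illustration only (the real Stable Diffusion VAE is a much larger convolutional network, and the variational mean/variance machinery is omitted here), an encoder/decoder pair that maps 3×512×512 images to 4×64×64 latents could be sketched like this:

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Toy stand-in for the SD VAE: 3x512x512 image <-> 4x64x64 latent (8x downsampling)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                  # three stride-2 convs: 512 -> 256 -> 128 -> 64
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 4, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(                  # mirror image: 64 -> 128 -> 256 -> 512
            nn.ConvTranspose2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def encode(self, image):   # (B, 3, 512, 512) -> (B, 4, 64, 64)
        return self.encoder(image)

    def decode(self, latent):  # (B, 4, 64, 64) -> (B, 3, 512, 512)
        return self.decoder(latent)
```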

The latent space of the Stable Diffusion model is 4×64×64, which is 48 times smaller than the original 3×512×512 pixel space. All of the forward and reverse diffusion we have talked about is actually done in this latent space.
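The 48× figure follows directly from the dimensions:

$$3 \times 512 \times 512 = 786{,}432, \qquad 4 \times 64 \times 64 = 16{,}384, \qquad \frac{786{,}432}{16{,}384} = 48.$$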

So, during training, instead of generating a noisy image, the model generates a random tensor in the latent space (latent noise). And instead of corrupting the image with noise, it corrupts the image's latent-space representation with latent noise. The reason is simply that this is much faster.

(3) Why are latent spaces possible?

Why can a VAE compress an image into a much smaller latent space without losing information? The reason is that natural images are not random; they are highly regular: a face follows specific spatial relationships between the eyes, nose, cheeks, and mouth; a dog has four legs and a particular shape.

This may be related to the manifold hypothesis in machine learning: natural data lies on a low-dimensional manifold embedded in a high-dimensional space, so the high dimensionality is in a sense an illusion, and computations can be carried out directly in a low-dimensional space to obtain results similar to those in high dimensions. Compressing high-dimensional images into a low-dimensional representation also seems to be how the human visual system operates. In other words, it helps the AI focus on low-frequency, overall form, which is equivalent to letting it concentrate on judging the overall structure of an image.

However, judging from the output, the manifold assumption is not completely correct. Latent diffusion does worse than pixel-level diffusion models such as DALL-E in regions that occupy only a small fraction of the image yet whose details cannot be ignored, such as faces and hands. So this low-dimensional computation is still lossy, which is why enhancement modules such as face restoration are added after Stable Diffusion's VAE decoding. Swapping in a better VAE also seems to be a way to improve Stable Diffusion's output.

(4) Reverse diffusion in latent space

  1. Generate a random latent-space matrix
  2. The noise predictor estimates the noise in the latent matrix
  3. The estimated noise is subtracted from the latent matrix
  4. Steps 2 and 3 are repeated up to a specific number of sampling steps
  5. The VAE decoder converts the latent-space matrix into the final image
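Putting these five steps together, and reusing the toy `TinyAutoencoder` sketched earlier (the `latent_noise_predictor` below is a placeholder; the real one is a text-conditioned U-Net driven by a proper sampler):

```python
import torch

# Hypothetical stand-ins, not the real Stable Diffusion networks.
latent_noise_predictor = torch.nn.Conv2d(4, 4, 3, padding=1)  # real SD: a text-conditioned U-Net
vae = TinyAutoencoder()                                       # from the sketch above

@torch.no_grad()
def generate_image(steps=20):
    latent = torch.randn(1, 4, 64, 64)              # 1. random latent-space matrix
    for _ in range(steps):                          # 4. repeat for the chosen sampling steps
        noise = latent_noise_predictor(latent)      # 2. estimate the noise in the latent matrix
        latent = latent - noise / steps             # 3. subtract (a fraction of) the estimate
    return vae.decode(latent)                       # 5. VAE decoder -> final 3x512x512 image
```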

(5) What is a VAE file

VAE files are used with Stable Diffusion v1 to improve the rendering of eyes and faces. They are the decoder of the autoencoder we just described: by further fine-tuning the decoder, the model can paint finer details.
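In practice, swapping in a fine-tuned VAE usually means loading it alongside the base model. A hedged example using the Hugging Face diffusers library (the repository names and API details are assumptions that may differ between versions):

```python
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

# A fine-tuned VAE intended to improve faces and eyes; the repo name is an example.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example SD v1 checkpoint
    vae=vae,                           # use the fine-tuned decoder instead of the bundled one
    torch_dtype=torch.float16,
).to("cuda")
```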

(6) Conditioning

Where does the text prompt get injected into the image? This is what conditioning is for.

The purpose of conditioning is to guide the noise predictor so that, when the predicted noise is subtracted from the image, we get what we asked for.

(6.1) Text conditions

The tokenizer first converts each word in the prompt into tokens, then converts each token into a 768-value vector called an embedding. The embeddings are then processed by the text transformer and are ready to be used by the noise predictor.

(6.2) Tokenizer

Text prompts are first tokenized by the CLIP tokenizer. CLIP is a deep learning model developed by OpenAI that can generate a text description for any image. Stable Diffusion v1 uses CLIP's tokenizer.

A tokenizer can only tokenize words it has seen during training. For example, the CLIP vocabulary contains "dream" and "beach" but not "dreambeach", so the tokenizer breaks the word "dreambeach" into the two tokens "dream" and "beach". In other words, one word does not always correspond to one token.

The Stable Diffusion model is limited to 75 tokens in hints.
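A quick way to see this, assuming the Hugging Face transformers version of the CLIP tokenizer (the exact sub-word split depends on the learned vocabulary, so treat the output as illustrative):

```python
from transformers import CLIPTokenizer

# Tokenizer used by Stable Diffusion v1 (CLIP ViT-L/14).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

print(tokenizer.tokenize("dream beach"))  # two known words -> two tokens
print(tokenizer.tokenize("dreambeach"))   # unseen word -> split into several sub-word tokens
print(tokenizer.model_max_length)         # 77 = 75 prompt tokens + start/end special tokens
```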

(6.3) Send embeddings to the noise predictor

The text transformer further processes the embeddings before feeding them to the noise predictor. Its input here is the text embedding vectors, but it could also be other things, such as class labels, images, or depth maps. The transformer not only processes the data further, it also provides a mechanism for plugging in different conditioning modalities.
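As a hedged illustration with the same library (again assuming the CLIP text encoder used by SD v1), the prompt ends up as a 1×77×768 tensor of embeddings for the noise predictor:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("man with blue eyes", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768]) -> what the noise predictor consumes
```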

(6.4) Cross Attention Mechanism

The output of the text transformer is used multiple times by the noise predictor throughout the U-Net, which consumes it through the cross-attention mechanism. This is where the prompt meets the image.

Take the cue "man with blue eyes," for example. Stable Diffusion pairs the words "blue" and "eyes" together via self- attention in the prompt , so that it generates a man with blue eyes rather than a man with a blue shirt. It then uses this information to guide backdiffusion to images containing blue eyes via a cross-attention mechanism between the prompt and the image prompt .

LoRA models modify the weights of the cross-attention module to change the style. The fact that Stable Diffusion can be fine-tuned by modifying only this module shows how important it is.
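A rough sketch of the LoRA idea (not any particular library's implementation): instead of changing a large weight matrix directly, LoRA adds a learned low-rank update on top of the frozen layer, for example the to_k / to_v projections of cross-attention.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer (e.g. to_k / to_v in cross-attention) with a low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # original weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)    # A: project down to rank r
        self.up = nn.Linear(rank, base.out_features, bias=False)     # B: project back up
        nn.init.zeros_(self.up.weight)                               # start as a no-op: W + 0
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```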

4. Stable Diffusion generation steps

(1) Text to Image

In step 1, Stable Diffusion generates a random tensor in the latent space. This tensor can be controlled by setting the seed of the random number generator. At this point the latent image is pure noise.

In step 2, the noise predictor (U-Net) takes the noisy latent image and the text prompt as input and predicts the noise, also as a 4×64×64 tensor in the latent space.

In step 3, the latent noise is subtracted from the latent image; the result becomes the new latent image.

 Steps 2 and 3 are repeated for a certain number of sampling steps, say 20 times.

In step 4, the VAE decoder converts the latent image back to pixel space. This is the image you get after running Stable Diffusion.
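For completeness, these four steps are what a single library call wraps up. A hedged example using Hugging Face diffusers (the checkpoint name and API details are assumptions that depend on your installed version):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example SD v1.5 checkpoint
    torch_dtype=torch.float16,
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)  # fixes the initial random latent tensor
image = pipe(
    "photo of a cat",
    num_inference_steps=20,   # how many times steps 2 and 3 are repeated
    guidance_scale=7.5,       # the CFG value discussed in section 5
    generator=generator,
).images[0]
image.save("cat.png")
```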

(2) Image to Image

Image-to-image was first proposed in SDEdit. SDEdit can be applied to any diffusion model, which is how Stable Diffusion gets its image-to-image capability.

Image-to-image takes an input image and a text prompt as input, and the generated image is conditioned on both. For example, using an amateur drawing and the prompt "photo of perfect green apple with stem, water droplets, dramatic lighting" as input, image-to-image can turn it into a professional-looking drawing.

In step 1, the input image is encoded into the latent space.

In step 2, noise is added to the latent image. The denoising strength controls how much noise is added: at 0, no noise is added; at 1, the maximum amount of noise is added and the latent image becomes a completely random tensor.

In step 3, the noise predictor U-Net takes the noisy latent image and the text prompt as input and predicts the noise in the latent space (a 4×64×64 tensor).

In step 4, the latent noise is subtracted from the latent image; the result becomes the new latent image.

Steps 3 and 4 are repeated for a certain number of sampling steps, say 20 times.

In step 5, the VAE decoder converts the latent image back to pixel space, which is the image obtained after running image-to-image.

In summary, all image-to-image does is set the initial latent image to a mix of the input image and some noise. Setting the denoising strength to 1 is equivalent to text-to-image, because the initial latent image is then entirely random noise.
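A small sketch of how the denoising strength is commonly mapped onto the sampler (simplified; actual implementations differ in the details):

```python
def img2img_schedule(strength: float, num_steps: int = 20):
    """How many steps of noise are added, and where denoising resumes, for a given strength."""
    noising_steps = int(strength * num_steps)  # how far the input latent is pushed toward noise
    start_step = num_steps - noising_steps     # denoising resumes from here
    return start_step, noising_steps

print(img2img_schedule(0.0))  # (20, 0)  -> no noise added, the input comes back almost unchanged
print(img2img_schedule(0.5))  # (10, 10) -> half noised, half of the denoising loop runs
print(img2img_schedule(1.0))  # (0, 20)  -> pure noise, behaves like text-to-image
```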

5. CFG value

CFG stands for Classifier-Free Guidance. The CFG scale is the value that AI artists adjust every day.

(1) Classifier Guidance

Classifier guidance is a way to incorporate image labels into a diffusion model: the labels are used to guide the diffusion process. For example, the label "cat" steers the reverse diffusion process toward generating pictures of cats.

The classifier guidance scale is a parameter that controls how closely the diffusion process should follow the labels.

Suppose there are three sets of images labelled "cat", "dog", and "human". If the diffusion is unguided, the model draws samples from each group's data somewhat evenly, and it may sometimes produce images that fit two labels at once, such as a boy petting a dog.

With strong classifier guidance, the images generated by the diffusion model are biased toward extreme or unambiguous examples. If you ask the model for a cat, it will return an image that is unambiguously a cat and nothing else.

(2) Classifier-free guidance

Although classifier guidance achieved record-breaking performance, it requires an additional classifier model to provide the guidance, which complicates training.

Classifier-free guidance is a way to achieve guidance "without a classifier": in text-to-image, the text prompt provides this guidance.

Instead of using a separate classifier, the conditioning is built into the noise predictor (the U-Net) itself, achieving so-called "classifier-free" guidance (i.e., guidance without a separate image classifier) in image generation.

Now that we have a classifier-free diffusion process via conditioning, how do we control how much guidance is followed? The classifier-free guidance (CFG) scale is a value that controls how strongly the diffusion process is steered by the text prompt. When it is set to 0, image generation is unconditional (the prompt is ignored); higher values steer the diffusion more strongly toward the prompt.
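Concretely, at each sampling step the noise predictor is run twice, once with the prompt and once with an empty prompt, and the two predictions are blended. The standard classifier-free guidance combination looks like this (a minimal sketch; `unet` is a placeholder for the conditioned noise predictor):

```python
def cfg_noise(unet, latent, t, text_emb, empty_emb, cfg_scale=7.5):
    """Classifier-free guidance: blend unconditional and conditional noise predictions."""
    noise_uncond = unet(latent, t, empty_emb)  # prediction with an empty ("") prompt
    noise_cond = unet(latent, t, text_emb)     # prediction with the actual prompt
    # cfg_scale = 0 -> unconditional (prompt ignored); larger values push toward the prompt.
    return noise_uncond + cfg_scale * (noise_cond - noise_uncond)
```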

6. Summary of Stable Diffusion


1. Stable Diffusion uses a diffusion model, which has two parts, forward diffusion and reverse diffusion, corresponding to adding noise and removing noise.
2. Forward diffusion: like an ink drop diffusing in water, the image turns into random noise. Reverse diffusion: the image is recovered.
3. Training: a U-Net neural network is taught to predict the added noise.
4. Reverse diffusion in latent space:
(1) Generate a random latent-space matrix
(2) The noise predictor estimates the noise in the latent matrix
(3) Subtract the estimated noise from the latent matrix
(4) Repeat steps (2) and (3) up to the chosen number of sampling steps
(5) The VAE decoder converts the latent-space matrix into the final image
5. Text-to-image steps:
(1) Generate a random tensor in the latent space
(2) The noise predictor takes the noisy latent image and the text prompt as input and predicts the noise
(3) Subtract the latent noise from the latent image to get a new latent image
(4) Repeat steps (2) and (3) up to the chosen number of sampling steps
(5) The VAE decoder converts the latent-space matrix into the final image
6. Image-to-image steps:
(1) The input image is encoded into the latent space and noise is added to the latent image
(2) The noise predictor takes the noisy latent image and the text prompt as input and predicts the noise
(3) Subtract the latent noise from the latent image to get a new latent image
(4) Repeat steps (2) and (3) up to the chosen number of sampling steps
(5) The VAE decoder converts the latent-space matrix into the final image
7. CFG: classifier-free guidance means no separate classifier model is required; the text prompt provides the guidance. The CFG value controls how strongly the text prompt modulates the diffusion process.


Origin: blog.csdn.net/benben044/article/details/130974891