How Stable Diffusion works

This article mainly describes how {Stable Diffusion|Stable Diffusion} works.


What is {Stable Diffusion|Stable Diffusion}

{Stable Diffusion|Stable Diffusion} is a series of image generation models developed by StabilityAI, CompVis, and RunwayML, originally released in 2022. Its main function is to generate detailed, aesthetically pleasing images from text input, but it can also perform other tasks such as filling in missing parts of an image (inpainting), extending an image (outpainting), and image-to-image generation.

In short, {Stable Diffusion|Stable Diffusion} is a text-to-image model. Give it a text prompt, and it will return an image that matches the text.


{Diffusion model|Diffusion model}

{Stable Diffusion|Stable Diffusion} belongs to a class of deep learning models called {Diffusion Model|Diffusion model}. They are generative models, meaning they are designed to generate new data similar to what they saw during training. In the case of Stable Diffusion, the data are images.

Why is it called {Diffusion Model|Diffusion model}? Because its mathematical model is very similar to the diffusion process in physics.


{Forward Diffusion|Forward Diffusion}

The {Forward Diffusion|Forward Diffusion} process adds noise to a training image, gradually turning it into a featureless noise image. The forward process will turn any image of a cat or a dog into a noisy image. Ultimately, you won't be able to tell whether it was originally a dog or a cat.

It is like a drop of ink falling into a glass of water. The ink spreads through the water, and after a few minutes it is randomly distributed throughout. You can no longer tell whether it originally fell in the center or near the edge.

Below is an example of an image going through {Forward Diffusion|Forward Diffusion}: an image of a cat is turned into random noise.
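To make this concrete, here is a minimal PyTorch sketch of the forward process, assuming a simple linear noise schedule and an illustrative image shape; the exact schedule Stable Diffusion uses differs in detail.

```python
import torch

# Assumed linear beta schedule over 1000 steps (illustrative, not SD's exact values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise added per step
alphas_cumprod = torch.cumprod(1.0 - betas, 0)   # how much of the original image survives

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Jump straight to step t: x_t = sqrt(a_bar_t)*x0 + sqrt(1-a_bar_t)*noise."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

cat_image = torch.rand(3, 512, 512)           # stand-in for a training image of a cat
slightly_noisy = add_noise(cat_image, t=50)   # still recognizable
pure_noise_like = add_noise(cat_image, t=999) # essentially featureless noise
```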


{Reverse Diffusion|Reverse Diffusion}

{Reverse Diffusion|Reverse Diffusion} is like playing a video backwards: going back in time, we see where the ink drop was first added.

Starting from a noisy, meaningless image, the {Reverse Diffusion|Reverse Diffusion} process can recover an image of a cat or a dog.

Technically, every diffusion process has two parts:

  1. drift, or directed motion
  2. random motion

The {Reverse Diffusion|Reverse Diffusion} process drifts toward either cat images or dog images, with no intermediate states. That's why the result can only be a cat or a dog.


Recap history

Before we dive into the architecture and mechanics of {Stable Diffusion|Stable Diffusion}, let's quickly review the history of image generation and the evolution of Stable Diffusion.

  • 2015: The University of Toronto presented alignDRAW, a text-to-image model. It could only generate blurry images, but it demonstrated the possibility of generating images "unseen" by the model from text input.
  • 2016: Reed, Scott, et al. proposed a method for generating images using generative adversarial networks (GANs, a neural network architecture). They successfully generated realistic bird and flower images from detailed textual descriptions. A series of GAN-based models followed this work.
  • 2021: OpenAI published DALL-E, based on the Transformer architecture (another neural network architecture), and it attracted public attention.
  • 2022: Google Brain released Imagen, competing with OpenAI's DALL-E.
  • 2022: {Stable Diffusion|Stable Diffusion} was announced as an improvement on latent diffusion models. Thanks to its open-source nature, many variants and fine-tuned models were built on top of it, and it attracted wide attention and many applications.
  • 2023: Many new models and applications have emerged, going beyond text-to-image into fields such as text-to-video and text-to-3D.

As you can see from the timeline, text-to-image is actually a fairly young field. The emergence of {Stable Diffusion|Stable Diffusion} is an important milestone: as an open-source model that requires fewer resources than its predecessors, it fueled the exponential growth of the field.


Components of {Stable Diffusion|Stable Diffusion}

{Stable Diffusion|Stable Diffusion} is not a single AI model; it is a pipeline that combines several different neural networks. We can break its whole text-to-image generation process into distinct steps and explain them one by one.

Let's start with an overview of the text-to-image generation process.

  • {Image Encoder | Image Encoder}: Converts training images into vectors in the latent space for further processing.
  • The latent space is a mathematical space where image information can be represented as vectors (i.e., arrays of numbers).
  • {Text Encoder|Text Encoder}: Converts text into a high-dimensional vector (which can be regarded as an array of numbers representing the meaning of the text) so that the machine learning model can understand it.
  • {Diffusion model|Diffusion model}: Generates a new image in the latent space, conditioned on the text (i.e., the input text guides image generation in the latent space).
  • {Image Decoder | Image Decoder}: Converts the image information in the latent space into an actual image made of pixels.

Each step is done using its own neural network.
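As a concrete (and hedged) reference point, the Hugging Face diffusers library packages all of these components behind a single pipeline call; the checkpoint id and options below are common choices, not the only ones.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion v1.5 checkpoint (a commonly used id; other checkpoints work too).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# One call runs the whole pipeline: text encoder -> diffusion in latent space -> VAE decoder.
image = pipe("a photo of a cat sitting on a windowsill").images[0]
image.save("cat.png")
```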


{Text Encoder | Text Encoder} converts text input into embedding vectors

Imagine you want a foreign artist to paint a picture for you, but you don't speak their language. You might use Google Translate or find someone to translate what you want to say to them.

The same is true for image generation models - machine learning models cannot directly understand text, so they need a {Text Encoder|Text Encoder} to convert your text instructions into numbers they can understand. These numbers are not random: they are called text embeddings, high-dimensional vectors that capture the semantic meaning of the text (i.e., the relationships between words and their context).

There are several ways to convert text into embedding vectors. {Stable Diffusion|Stable Diffusion} uses a Transformer language model for this task. If you know anything about language models, you may be familiar with the term Transformer - it is the basic architecture behind GPT language models. If you want to know more about how ChatGPT works, you can refer to ChatGPT: What you should know.

In fact, {Stable Diffusion|Stable Diffusion} v1 uses CLIP, a model developed by OpenAI and based on GPT. {Stable Diffusion|Stable Diffusion} v2 uses OpenCLIP, a larger version of CLIP.

These encoder models are trained on large datasets containing billions of text-image pairs so that they learn the meaning of the words and phrases we use to describe images. The data comes from images on the web together with the labels (alt tags) used to describe those images.


{Diffusion model|Diffusion model} draws an image through the diffusion process

{Diffusion model|Diffusion model} is the core component of Stable Diffusion; it is the part that actually generates the image.

The diffusion model involves two processes:

  1. {Forward Diffusion|Forward Diffusion} process is used to prepare training samples
  2. {Reverse Diffusion|Reverse Diffusion} process is used to generate the image

In {Stable Diffusion|Stable Diffusion}, both processes are performed in the latent space for speed.

During {Forward Diffusion|Forward Diffusion}, the model gradually adds Gaussian noise to the image, turning a clean image into a noisy one. At each step a small amount of noise is added, and the process is repeated over many steps.

As the word "diffusion" suggests, the process is like dropping a drop of ink into water: the ink gradually spreads until you can no longer see where the drop originally was. The noise pattern added to the image is random, just like the random diffusion of ink particles among water molecules, but the amount of noise is controlled. This process is performed on many images from the LAION dataset, each with a different amount of noise added, producing a large number of noisy samples for training the {Reverse Diffusion|Reverse Diffusion} model.


During {Reverse Diffusion|Reverse Diffusion}, a {Noise Predictor|Noise Predictor} is trained to predict the noise that was added to the original image, so that the model can subtract the predicted noise from the noisy image and obtain a clearer image (in the latent space). You can think of this as looking at a partially diffused drop of ink in water and trying to predict where it was before.

The {Noise Predictor|Noise Predictor} is a U-Net (a neural network architecture) with a ResNet backbone.

The {Noise Predictor | Noise Predictor} training process is as follows:

  1. Choose a training image, say a picture of a cat.
  2. Generate a random noisy image.
  3. Corrupt the training image by adding this noisy image to it, up to a certain number of steps.
  4. Teach the {Noise Predictor | Noise Predictor} how much noise was added. This is done by showing it the correct answer and adjusting its weights.

After training, we have a {Noise Predictor | Noise Predictor} capable of estimating the noise added to an image.

It is trained on the dataset prepared earlier via {Forward Diffusion|Forward Diffusion}, with the goal of estimating the noise as accurately as possible so that the denoised image ends up as close as possible to the original. Once trained, it "remembers" image representations in its weights and can be used to "generate" images from a random initial noise tensor. The resulting image and its quality depend heavily on the original image dataset, since the model tries to get back to something like the original images. This reverse diffusion proceeds step by step over multiple denoising steps; with more steps, the image becomes clearer and clearer.
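A rough sketch of one training step might look like the following; `unet`, `vae_encode`, and `text_embed` are placeholder names for the real networks and inputs, and many details (schedulers, batching, EMA, etc.) are omitted.

```python
import torch

def training_step(unet, vae_encode, text_embed, image, optimizer,
                  alphas_cumprod: torch.Tensor):
    """One hypothetical noise-predictor training step (illustrative names and shapes)."""
    latents = vae_encode(image)                      # e.g. shape (1, 4, 64, 64)
    t = torch.randint(0, len(alphas_cumprod), (1,))  # random diffusion step
    noise = torch.randn_like(latents)                # the "correct answer"
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy_latents = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * noise

    noise_pred = unet(noisy_latents, t, text_embed)  # predict the added noise
    loss = torch.nn.functional.mse_loss(noise_pred, noise)

    loss.backward()                                  # adjust the weights
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```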

To control the appearance of the image obtained during {Reverse Diffusion|Reverse Diffusion}, researchers use a process called conditioning. When the condition is text, the process is called text conditioning. It works by passing the text embedding from the text-encoding step into the U-Net, which consumes it through a cross-attention mechanism.

The cross-attention mechanism essentially merges the text embedding with the intermediate result of each {Reverse Diffusion|Reverse Diffusion} step. For example, with the input prompt "cat", you can interpret conditioning as telling the {Noise Predictor | Noise Predictor}: "For the next denoising step, the image should look more like a cat. Now proceed to the next step." Conditioning can also be driven by modalities other than text, such as images, semantic maps, and other representations.


{Image Decoder | Image Decoder} translates the image from the latent space into pixels

Since the diffusion and conditioning we just described all happen in the latent space, we cannot directly view the resulting images. We need to convert the latent image back into pixels we can see. This conversion is done by the {Image Decoder | Image Decoder}.

In {Stable Diffusion|Stable Diffusion}, this converter is a variational autoencoder (VAE). In the earlier {Forward Diffusion|Forward Diffusion} process, we used the encoder part of the VAE to convert the original training image from pixels into the latent space before adding noise. Now we use the decoder part of the VAE to convert the latent image back into pixels.

What is VAE

A {Variational Autoencoder|Variational Autoencoder} (VAE) neural network consists of two parts:

  1. Encoder: compresses the image into a lower dimensional representation in the latent space
  2. Decoder: restores the image from the latent space

The latent space of the {Stable Diffusion|Stable Diffusion} model is 4x64x64, which is 48 times smaller than the image pixel space. All forward and reverse diffusion actually takes place in the latent space.

So, during training, instead of generating a noisy image in pixel space, the model generates a random {tensor|Tensor} (latent noise) in the latent space. And instead of corrupting the image with noise, it corrupts the image's latent-space representation with latent noise. The reason for this is speed: the latent space is much smaller.

In deep learning, a tensor is simply a multidimensional array.
Tensors exist so that we can work with matrices and vectors of higher dimensions.

{Stable Diffusion|Stable Diffusion} does all processing, diffusion, and conditioning in the latent space rather than in pixel space because the latent space is smaller. This makes the whole process much faster without consuming a great deal of computing resources.
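The "48 times smaller" figure is easy to check: a 512x512 RGB image holds 3x512x512 values, while its 4x64x64 latent holds far fewer.

```python
# Quick sanity check of the size ratio between pixel space and latent space.
pixel_values = 3 * 512 * 512         # 786,432 values in a 512x512 RGB image
latent_values = 4 * 64 * 64          # 16,384 values in the 4x64x64 latent
print(pixel_values / latent_values)  # 48.0
```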


{Condition Control|Conditioning}

Without text prompts, {Stable Diffusion|Stable Diffusion} would not be a text-to-image model. You would just get an image of a cat or a dog with no way to control what it looks like.

This is what {conditional control|Conditioning} does.

The purpose of conditioning is to guide the noise predictor so that, after the predicted noise is subtracted from the image, we get the result we want.


Text Conditioning (Text to Image)

Below is an overview of how the text prompt is processed and fed into the noise predictor.

  • First, the tokenizer converts each word in the prompt into a number called a token.
  • Each token is then converted into a 768-dimensional vector called an {embedding|embedding}. These embedding vectors are then processed by the text transformer and are ready to be used by the noise predictor.

Tokenizer

The text prompt is first tokenized by the CLIP tokenizer. CLIP is a deep learning model developed by OpenAI to produce text descriptions of any image. {Stable Diffusion|Stable Diffusion} v1 uses CLIP's tokenizer.

Tokenization is how computers read words. We humans can read words, but computers can only read numbers. That's why the words in the text prompt are first converted into numbers.

A tokenizer can only tokenize words it has seen during training. For example, "dream" and "beach" are in the CLIP model, but "dreambeach" is not. The tokenizer will split the word "dreambeach" into two tokens, "dream" and "beach". So a word does not always correspond to a single token!

Another detail to note is that space characters are also part of tokenization. In the case above, the phrase "dream beach" produces the two tokens "dream" and "[space]beach". These are different from the tokens produced by "dreambeach", which are "dream" and "beach" (with no space before beach).

The {Stable Diffusion|Stable Diffusion} model limits the number of tokens in a text prompt to 75.
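If you want to see tokenization in action, the CLIP tokenizer is available through the Hugging Face transformers library; the checkpoint name below is the commonly used ViT-L/14 one, and the exact token strings printed are illustrative.

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

print(tokenizer.tokenize("dream beach"))  # two tokens, e.g. ['dream</w>', 'beach</w>']
print(tokenizer.tokenize("dreambeach"))   # an unseen word gets split into sub-tokens
print(tokenizer.model_max_length)         # 77 = 75 prompt tokens + start/end tokens
```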


{embed|embedding}

{Stable Diffusion|Stable Diffusion} v1 uses OpenAI's ViT-L/14 CLIP model. An {embedding|embedding} is a vector containing 768 values. Each token has its own unique {embedding|embedding} vector. The embeddings are fixed by the CLIP model and are learned during its training.

Why do we need {embed|embedding}? Because some words are closely related to each other, and we want to use this information. For example, the embeddings of man, gentleman, and guy are nearly identical because these words are used interchangeably.
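As a small sketch of where the 768-dimensional embeddings come from, the CLIP text encoder can be loaded alongside the tokenizer; the shapes shown match Stable Diffusion v1, and the checkpoint name is an assumption.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Pad to the model's maximum length (77 tokens) and encode.
tokens = tokenizer("a man with blue eyes", padding="max_length", return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768]) - one 768-dim vector per token
```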


Feed the embedding to the noise predictor

The embedding vectors need to be further processed by the text transformer before being fed into the noise predictor. This transformer acts like a universal adapter for conditioning. Here its input is a text embedding vector, but it could also be something else, such as class labels, images, or depth maps. The transformer not only processes the data further but also provides a mechanism to include different conditioning modalities.


{Cross-attention|Cross-attention}

The output of the text transformer is used multiple times by the noise predictor throughout the U-Net, which consumes it through the {Cross-attention|Cross-attention} mechanism. This is where the prompt meets the image.

Take the prompt "a man with blue eyes" as an example. {Stable Diffusion|Stable Diffusion} pairs the words "blue" and "eyes" (self-attention within the prompt) so that it generates a man with blue eyes rather than a man in a blue shirt. It then uses this information to steer {Reverse Diffusion|Reverse Diffusion} toward images containing blue eyes (cross-attention between the prompt and the image).
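Conceptually, cross-attention looks like the following minimal sketch: the queries come from the image latents and the keys and values come from the text embedding. The dimensions are illustrative, and the real U-Net blocks are more involved.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Toy cross-attention block: image features attend to text embeddings."""
    def __init__(self, latent_dim=320, text_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, image_tokens, text_embeddings):
        # image_tokens: (batch, H*W, latent_dim); text_embeddings: (batch, 77, text_dim)
        out, _ = self.attn(query=image_tokens,
                           key=text_embeddings,
                           value=text_embeddings)
        return out

attn = CrossAttention()
image_tokens = torch.randn(1, 64 * 64, 320)
text_embeddings = torch.randn(1, 77, 768)
print(attn(image_tokens, text_embeddings).shape)  # torch.Size([1, 4096, 320])
```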


Optimizing {Stable Diffusion|Stable Diffusion}

So far, we have a better understanding of how training works via {Forward Diffusion|Forward Diffusion} and how images are generated from text via {Reverse Diffusion|Reverse Diffusion}. But that is just the beginning; the more interesting part is how we can tune this process to our needs and produce higher-quality images. Researchers and hobbyists have proposed many techniques to improve Stable Diffusion's results.

Most of these methods build on already trained stable diffusion models . A trained model means it has seen and learned how to generate images using its model weights (the numbers that guide the model's work).

Optimize {Text Encoder|Text Encoder}

One group of techniques targets the {Text Encoder|Text Encoder} part of {Stable Diffusion|Stable Diffusion}, including Textual Inversion and DreamArtist.

  • Textual Inversion works by learning a new keyword embedding for each new concept or style you want to generate. You can think of it as telling the translator: "remember, this new object is called 'dog and cat'; next time I say 'dog and cat', tell the artist to draw this object".
  • DreamArtist describes a reference image by learning positive and negative keywords. It is like telling the translator: "here is a picture, remember what it looks like, and call it whatever you think best describes it".

Optimizing the U-Net (the noise predictor)

Another group of techniques focuses on the U-Net, i.e. the image-generation component, and includes DreamBooth, LoRA, and Hypernetworks.

  • DreamBooth fine-tunes the diffusion model on a new image dataset until it understands the new concept.
  • LoRA adds a small set of extra weights to the cross-attention layers and trains only those extra weights.
  • Hypernetworks use an auxiliary network to predict new weights, inserting new styles through the cross-attention parts of the {Noise Predictor | Noise Predictor}.

These methods basically tell the artist to learn a new way of drawing, whether entirely on their own (DreamBooth), by tweaking an existing style (LoRA), or with outside help (Hypernetworks). DreamBooth is very effective but requires more storage space, while training LoRA and Hypernetworks is relatively fast because they do not retrain the entire Stable Diffusion model.
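To give a feel for why LoRA is cheap to train, here is a minimal sketch of the idea applied to a single linear layer: the original weights are frozen and only a small low-rank update is learned. The rank and dimensions are illustrative, not the values any particular implementation uses.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA wrapper: output = frozen_base(x) + scale * up(down(x))."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # original weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # "A"
        self.up = nn.Linear(rank, base.out_features, bias=False)    # "B"
        nn.init.zeros_(self.up.weight)               # start as a no-op
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

layer = LoRALinear(nn.Linear(768, 320))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the small A and B matrices are trained
```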


Controlling Noise to Improve Generated Images

Another group of techniques improves image generation by controlling noise, including DALL-E 2 and Diffusion Dallying.

  • DALL-E 2 improves on the original DALL-E model by controlling noise to generate better-guided images.
  • Diffusion Dallying adds extra iterative steps to the {Reverse Diffusion|Reverse Diffusion} process, giving the model more steps to generate higher-quality images.

These methods improve the results of stable diffusion by better controlling the introduction of noise and the iterative process of image generation.

In addition to the above methods, there are other techniques that can be used to improve the results of stable diffusion, such as using larger models, optimizing training strategies, tuning hyperparameters, etc. The goal of these technologies is to achieve higher quality and more predictable image generation results through continuous improvement of the individual components and processes of stable diffusion .

All in all, the results of Stable Diffusion can be improved through techniques targeting the text encoder, the U-Net, noise control, and more. These techniques can be chosen and combined according to specific needs to achieve better image generation.


How {Diffusion Model|Diffusion Model} works

We have just looked at the internal mechanisms of the {Diffusion model|Diffusion model} from a bird's-eye view; now let's walk through some concrete examples to understand how it works in practice.

Text-to-image

In text-to-image, you give the {diffusion model|Diffusion model} a text prompt and it returns an image.

Step 1: {Stable Diffusion|Stable Diffusion} generates a random {tensor|Tensor} in the latent space .

You can control this tensor by setting the seed of the random number generator. If you set the seed to a specific value, you will get the same random tensor every time. The tensor represents the image in the latent space, but for now it is just noise.

Step 2: The {noise predictor | Noise Predictor}, i.e. the U-Net, takes the latent noisy image and the text prompt as input and predicts the noise in the latent space (a 4x64x64 tensor).

Step 3: Subtract the latent noise from the latent image. This becomes your new latent image.

Steps 2 and 3 are repeated for a certain number of sampling steps, for example 20 times.

Step 4: Finally, the VAE decoder converts the latent image back into pixel space. This is the image you get after running {Stable Diffusion|Stable Diffusion}.
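Putting the four steps together, the sampling loop can be sketched roughly as follows; the `unet`, `scheduler`, and `vae_decode` names follow the common diffusers-style interface, but this is a simplified outline rather than a full implementation.

```python
import torch

def text_to_image(unet, scheduler, vae_decode, text_embeddings,
                  steps: int = 20, seed: int = 42):
    generator = torch.Generator().manual_seed(seed)   # step 1: seeded random latent tensor
    latents = torch.randn(1, 4, 64, 64, generator=generator)

    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:                     # steps 2-3, repeated `steps` times
        noise_pred = unet(latents, t, text_embeddings)                # predict latent noise
        latents = scheduler.step(noise_pred, t, latents).prev_sample  # subtract it

    return vae_decode(latents)                        # step 4: decode back to pixels
```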

Here is how the image evolves at each sampling step.


noise scheduling

The image goes from noisy to clear because we aim to hit an expected noise level at each sampling step. This is the noise schedule.

Below is an example.

The noise schedule is something we define ourselves. We can choose to subtract the same amount of noise at each step, or we can subtract more noise at the beginning, as in the example above. The sampler subtracts just enough noise at each step to reach the expected noise level of the next step. That is the process shown in the image above.
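As a toy illustration, here are two possible schedules over 20 sampling steps: one that removes an equal share of noise each step, and one that removes more noise early on. The exact curves are made up for illustration.

```python
steps = list(range(20))
linear = [1.0 - s / 19 for s in steps]               # equal noise reduction per step
front_loaded = [(1.0 - s / 19) ** 2 for s in steps]  # more noise removed in early steps

for s in steps:
    print(f"step {s:2d}: linear={linear[s]:.2f}  front-loaded={front_loaded[s]:.2f}")
```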


Image-to-image

Image-to-image is a method first proposed in SDEdit. SDEdit can be applied to any diffusion model, so we can also use it with Stable Diffusion.

The inputs of image-to-image are an image and a text prompt. The generated image is constrained by both the input image and the text prompt.

For example, using the simple drawing on the left and the {Prompt|Prompt} "a perfect green apple with stem, water droplets, and dramatic lighting" as input, image-to-image can turn it into a professional painting:

The following is the detailed process.

Step 1: The input image is encoded into the latent space.

Step 2: Noise is added to the latent image. Denoising strength controls how much noise is added. If the denoising strength is 0, no noise is added. If it is 1, the maximum amount of noise is added, turning the latent image into a completely random tensor.

Step 3: The {Noise Predictor | Noise Predictor}, i.e. the U-Net, takes the noisy latent image and the text prompt as input and predicts the noise in the latent space (a 4x64x64 tensor).

Step 4: Subtract the latent noise from the latent image. This becomes your new latent image.

The third and fourth steps are repeated for a certain number of sampling steps, for example, 20 times.

Step 5: Finally, the VAE decoder converts the latent image back into pixel space. This is the image you get when running image-to-image.

That is all image-to-image does: its only role is to set the initial latent image to a mixture of some noise and the input image. Setting the denoising strength to 1 is equivalent to text-to-image, since the initial latent image is then entirely random noise.
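A small sketch of how the denoising strength maps onto the sampling loop (the names and the rounding rule are illustrative; real implementations differ in minor details):

```python
def img2img_start(num_steps: int, strength: float):
    """How many denoising steps actually run for a given denoising strength."""
    init_steps = int(num_steps * strength)   # steps that will be run
    start_step = num_steps - init_steps      # earlier steps are skipped
    return start_step, init_steps

for strength in (0.0, 0.5, 0.75, 1.0):
    start, n = img2img_start(20, strength)
    # strength 0.0 -> no denoising (image unchanged); 1.0 -> full run, same as text-to-image
    print(f"strength={strength}: skip to step {start}, run {n} denoising steps")
```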


{Inpainting|Inpainting}

{Inpainting|Inpainting} is actually just a special case of image-to-image. Noise is added to the parts of the image that need to be repaired, and the amount of noise is again controlled by the denoising strength.


{Depth-to-image|Depth-to-image}

{Depth-to-image|Depth-to-image} is an enhancement of image-to-image: it uses a depth map as an additional conditioning constraint when generating the new image.

Step 1: Encode the input image into the latent space.

Step 2: MiDaS (an AI depth model) estimates a depth map from the input image.

Step 3: Add noise to the latent image. Denoising strength controls the amount of noise added. If the denoising strength is 0, no noise is added; if it is 1, the maximum noise is added, making the latent image a random tensor.

Step 4: The {Noise Predictor | Noise Predictor} estimates the noise in the latent space, conditioned on both the text prompt and the depth map.

Step 5. Subtract the latent noise from the latent image to get a new latent image. 

Steps 4 and 5 are repeated for a certain number of sampling steps.

Step 6: The VAE decoder decodes the latent image. Now you have the final depth-to-image result.


CFG value

CFG stands for {Classifier-Free Guidance|Classifier-Free Guidance}. To understand what CFG is, we first need to understand its predecessor, classifier guidance.

Classifier guidance

Classifier guidance is a way to incorporate image labels into the {diffusion model|Diffusion model}. You can use a label to guide the diffusion process. For example, the label "cat" steers the {Reverse Diffusion|Reverse Diffusion} process toward generating pictures of cats.

The classifier guidance scale is a parameter that controls how closely the diffusion process should follow the label.

Suppose there are 3 groups of images with the labels "cat", "dog", and "human". If the diffusion is unguided, the model will draw samples from the overall population of all groups, and may sometimes generate images that fit two labels at once, such as a boy petting a dog.

With high classifier guidance, the {Diffusion model|Diffusion model} generates images biased toward extreme or unambiguous examples. If you ask the model for a cat, it returns an image that is unmistakably a cat and nothing else.

The classifier guidance scale controls how tightly the guidance is followed. In the figure above, the samples on the right use a higher classifier guidance scale than those in the middle. In practice, this scale value is just a multiplicative factor on the drift term toward data with that label.


CFG scale

The classifier-free guidance (CFG) scale is a value that controls how strongly the text prompt influences the diffusion process. When the value is 0, image generation is unconditional (the prompt is ignored). Higher values steer the diffusion process more strongly toward the prompt.
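At each sampling step this amounts to combining two noise predictions, one made with the prompt and one made with an empty prompt. A minimal sketch of that combination:

```python
import torch

def apply_cfg(noise_cond: torch.Tensor,
              noise_uncond: torch.Tensor,
              cfg_scale: float) -> torch.Tensor:
    # cfg_scale = 0 ignores the prompt entirely; larger values push the
    # prediction further in the direction suggested by the prompt.
    return noise_uncond + cfg_scale * (noise_cond - noise_uncond)

noise_cond = torch.randn(1, 4, 64, 64)     # prediction conditioned on the prompt
noise_uncond = torch.randn(1, 4, 64, 64)   # prediction with an empty prompt
guided = apply_cfg(noise_cond, noise_uncond, cfg_scale=7.5)
```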


Stable Diffusion v1 vs v2

Model differences

Stable Diffusion v2 uses OpenCLIP for text embedding, while Stable Diffusion v1 uses OpenAI's CLIP ViT-L/14. The reasons for this change are as follows:

  1. The OpenCLIP model is five times larger than the CLIP model. A larger text encoder model can improve image quality.
  2. Although OpenAI's CLIP models are open source, they were trained on proprietary data. Switching to OpenCLIP gives researchers more transparency when studying and optimizing the model, which is better for long-term development.

Differences in training data:

  • Stable Diffusion v1.4 was trained on the following datasets:
    • 237,000 steps on the laion2B-en dataset at 256×256 resolution.
    • 194,000 steps on the laion-high-resolution dataset at 512×512 resolution.
    • 225,000 steps on the laion-aesthetics v2 5+ dataset at 512×512 resolution, with 10% dropping of the text conditioning.
  • Stable Diffusion v2 was trained on the following datasets:
    • 550,000 steps on a filtered subset of LAION-5B at 256x256 resolution, filtered with the LAION-NSFW classifier (punsafe=0.1) and aesthetic score >= 4.5.
    • 850,000 steps on the same dataset at 512x512 resolution, restricted to images with resolution >= 512x512.
    • 150,000 steps on the same dataset using the v-objective.
    • An additional 140,000 steps on 768x768 images.
  • Stable Diffusion v2.1 was fine-tuned from v2.0:
    • An additional 55,000 steps on the same dataset (punsafe=0.1).
    • Another 155,000 extra steps with punsafe=0.98, which effectively turns off the NSFW filter at the end.

Summary

In this article, I try to explain how stable diffusion works in simple terms. Here are some key points:

  • Stable Diffusion is a model that primarily generates images from text (conditioned on the text), but it can also be conditioned on other inputs such as images or other representations.
  • The training process of stable diffusion consists of gradually adding noise to the image (forward diffusion) and training {Noise Predictor | Noise Predictor} to gradually remove noise to produce a sharper image (backward diffusion).
  • The generative process (reverse diffusion) starts with a random noisy image (tensor) in the latent space and gradually denoises it into a clean image, conditioned on the prompt.
  • There are a number of techniques to improve the results of Stable Diffusion, including Textual Inversion and DreamArtist, which work on the embedding layer, and LoRA, DreamBooth, and Hypernetworks, which work on the diffusion model. And the list of these techniques keeps growing.

postscript

Sharing is an attitude .

