This article describes how Stable Diffusion works.
What is Stable Diffusion
Stable Diffusion is a series of image generation models developed by StabilityAI, CompVis, and RunwayML, originally published in 2022. Its main function is to generate aesthetically pleasing, detailed images from text input, but it can also perform other tasks such as filling in missing parts of images (inpainting), extending images (outpainting), and image-to-image generation.
In short, Stable Diffusion is a text-to-image model: give it a text prompt, and it returns an image that matches the text.
Diffusion model
Stable Diffusion belongs to a class of deep learning models called diffusion models. They are generative models, meaning they are designed to generate new data similar to what they saw during training. In the case of Stable Diffusion, the data are images.
Why is it called a diffusion model? Because its mathematical model is very similar to the diffusion process in physics.
Forward Diffusion
The forward diffusion process adds noise to a training image, gradually turning it into a featureless noise image. The forward process converts any image, say of a cat or a dog, into a noisy image. Ultimately, you won't be able to tell whether it was originally a dog or a cat.
It is like a drop of ink falling into a glass of water. The ink diffuses through the water, and after a few minutes it is randomly distributed throughout. You can no longer tell whether it originally fell in the center or near the edge.
Below is an example of an image going through forward diffusion: an image of a cat is turned into random noise.
Reverse Diffusion
Reverse diffusion is like playing a video backwards: going back in time, we see where the ink drop was initially added.
Starting from a noisy, meaningless image, the reverse diffusion process can recover an image of a cat or a dog.
Technically, every diffusion process has two parts:
- drift, or directed motion
- random motion
The reverse diffusion process drifts toward either a cat image or a dog image, with no intermediate states. That's why the result can only be a cat or a dog.
A quick history
Before we dive into the architecture and mechanics of Stable Diffusion, let's quickly review the history of image generation and the evolution of Stable Diffusion.
- 2015: The University of Toronto presented alignDRAW, a text-to-image model. It could only generate blurry images, but it demonstrated the possibility of generating images "unseen" by the model from text input.
- 2016: Reed, Scott et al. proposed generating images using generative adversarial networks (GANs, a neural network architecture). They successfully generated realistic bird and flower images from detailed text descriptions, and a series of GAN-based models followed this work.
- 2021: OpenAI published DALL-E, based on the Transformer architecture (another neural network architecture), and it attracted public attention.
- 2022: Google Brain released Imagen, competing with OpenAI's DALL-E.
- 2022: Stable Diffusion was announced as an improvement on the latent diffusion model. Thanks to its open-source nature, many variants and fine-tuned models were built on top of it, attracting wide attention and application.
- 2023: Many new models and applications have emerged, extending beyond text-to-image into fields such as text-to-video and text-to-3D.
As you can see from the timeline, text-to-image is actually a fairly young field. The emergence of Stable Diffusion is an important milestone: as an open-source model that requires fewer resources than previous models, it fueled the field's exponential growth.
Components of Stable Diffusion
Stable Diffusion is not a single AI model; it is a pipeline combining several different neural networks. We can decompose Stable Diffusion's entire text-to-image process into distinct steps and explain them one by one.
Let's start with an overview of the text-to-image generation process.
- Image Encoder: converts training images into vectors in a latent space for further processing. A latent space is a mathematical space in which image information can be represented as vectors (i.e. arrays of numbers).
- Text Encoder: converts text into a high-dimensional vector (which can be regarded as an array of numbers representing the meaning of the text) that machine learning models can understand.
- Diffusion model: generates a new image in the latent space, conditioned on the text (i.e. the input text guides image generation in the latent space).
- Image Decoder: converts the image information in the latent space into an actual image composed of pixels.
Each step is done using its own neural network.
The Text Encoder converts text input into embedding vectors
Imagine you want a foreign artist to paint a painting for you, but you don't speak their language. You might use Google Translate, or find someone to translate what you want to say to them.
The same is true for image generation models. Machine learning models cannot directly understand text, so they need a Text Encoder to convert your text instructions into numbers they can understand. These numbers are not random; they are called text embeddings: high-dimensional vectors that capture the semantic meaning of the text (i.e. the relationships between words and their context).
There are several ways to convert text into embedding vectors. Stable Diffusion uses a Transformer-based large language model for this task. If you know anything about language models, Transformer may be a familiar term: it is the basic architecture behind GPT language models. If you want to learn more about how ChatGPT works, you can refer to ChatGPT: What you should know.
In fact, Stable Diffusion v1 uses CLIP, a model developed by OpenAI based on GPT. Stable Diffusion v2 uses OpenClip, a larger version of CLIP.
These encoder models are trained on large datasets containing billions of text-image pairs, so as to learn the meanings of the words and phrases we use to describe images. The datasets come from images on the web, together with the labels (alt tags) we use to describe those images.
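To make this concrete, here is a minimal sketch of text encoding using the Hugging Face transformers library (the library choice is an assumption of this example; the article itself names no library). The checkpoint is OpenAI's ViT-L/14 CLIP model, the one used by Stable Diffusion v1:

```python
# A sketch of text encoding with the CLIP text model used by
# Stable Diffusion v1, via the Hugging Face transformers library.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Tokenize the prompt and pad to CLIP's fixed context length of 77 tokens.
tokens = tokenizer("a cat sitting on a beach", padding="max_length",
                   max_length=77, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768]): one 768-value vector per token
```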
The Diffusion model draws the image through the diffusion process
The diffusion model is the core component of Stable Diffusion; it is the component that actually generates the image.
To train the diffusion model, there are two procedures:
- the forward diffusion process prepares the training samples
- the reverse diffusion process generates the images
In Stable Diffusion, both processes are performed in the latent space for speed.
During forward diffusion, the model gradually adds Gaussian noise to the image, turning a clean image into a noisy one. At each step, a small amount of noise is added, and the process is repeated over many steps.
As the word "diffusion" suggests, the process is like dropping a drop of ink into water, where the ink gradually spreads until you can no longer see where the drop originally was. The noise pattern added to the image is random, just like ink particles diffusing randomly among water molecules, but the amount of noise can be controlled. This process is performed on many images from the LAION dataset, each with a different amount of noise added, to create a large number of noisy samples for training the reverse diffusion model.
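As a rough sketch of the idea, the forward process can be written in the standard DDPM closed form, which jumps straight to any noise level t. The schedule values below are illustrative assumptions, not Stable Diffusion's exact settings:

```python
# A minimal sketch of forward diffusion (DDPM-style): noising a clean
# latent x0 to an arbitrary step t in closed form.
import torch

T = 1000                                    # total diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (illustrative)
alpha_bars = torch.cumprod(1.0 - betas, 0)  # cumulative signal retention

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Return a noisy version of x0 at step t."""
    eps = torch.randn_like(x0)              # random Gaussian noise
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps

x0 = torch.randn(1, 4, 64, 64)              # stand-in for a latent image
x_light = add_noise(x0, t=50)               # lightly noised
x_heavy = add_noise(x0, t=900)              # nearly pure noise
```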
In the reverse diffusion process, a Noise Predictor is trained to predict the noise that was added to the original image, so that the model can subtract the predicted noise from the noisy image and obtain a clearer one (in the latent space). You can think of this process as looking at a partially diffused drop of ink in water and trying to predict where it was before.
The Noise Predictor is a U-Net (a neural network architecture) with a ResNet backbone.
The Noise Predictor training process is as follows:
- Choose a training image, say a picture of a cat.
- Generate a random noise image.
- Corrupt the training image by adding this noise over a certain number of steps.
- Teach the Noise Predictor how much noise was added. This is done by showing it the correct answer and adjusting its weights.
After training, we have a Noise Predictor capable of estimating the noise added to an image.
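A sketch of what one training step might look like; `unet` here is a placeholder for any noise-prediction model with this call signature, not Stable Diffusion's actual U-Net:

```python
# A sketch of one noise-predictor training step: corrupt a clean latent,
# ask the model to predict the added noise, and score it with MSE.
import torch
import torch.nn.functional as F

def training_step(unet, x0, alpha_bars, optimizer):
    t = torch.randint(0, len(alpha_bars), (1,)).item()  # random timestep
    eps = torch.randn_like(x0)                          # the "correct answer"
    noisy = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps
    eps_pred = unet(noisy, t)                           # predict the noise
    loss = F.mse_loss(eps_pred, eps)                    # compare to the answer
    loss.backward()                                     # adjust the weights
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```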
It is trained on the dataset prepared earlier via forward diffusion, with the goal of estimating the noise as accurately as possible, so that the denoised image is as close as possible to the original. Once trained, it "remembers" image representations in its weights and can be used to "generate" images from a random initial noise tensor. The resulting image and its quality depend heavily on the original training dataset, since the model tries to move back toward the original images. This reverse diffusion process removes noise step by step over multiple steps; with more denoising steps, the image becomes clearer and clearer.
In the reverse diffusion process that yields a sharper image, researchers wanted to control what the image looks like, through a process called conditioning. If text is used, the process is called text conditioning. It works by passing the text embeddings from the text encoding step into the U-Net, where they act through a cross-attention mechanism.
The cross-attention mechanism essentially merges the text embeddings with the result of each reverse diffusion step. For example, given the input prompt "cat", conditioning can be interpreted as telling the Noise Predictor: "for the next denoising step, the image should look more like a cat; now proceed to the next step." Conditioning can also be guided by modalities other than text, such as images, semantic maps, or other representations.
The Image Decoder translates the image from the latent space into pixels
Since the diffusion and conditioning we perform happen in the latent space, we cannot view the resulting images directly. We need to convert the latent image back into pixels we can see, and this conversion is done by the Image Decoder.
In Stable Diffusion, this converter is a variational autoencoder (VAE). In the forward diffusion process described earlier, we used the VAE's encoder part to convert the original training images from pixels into the latent space before adding noise. Now we use the VAE's decoder part to convert the latent image back into pixels.
What is a VAE?
A Variational Autoencoder (VAE) is a neural network consisting of two parts:
- Encoder: compresses an image into a lower-dimensional representation in the latent space
- Decoder: restores the image from the latent space
The latent space of the Stable Diffusion model is 4x64x64, 48 times smaller than the image pixel space. All forward and reverse diffusion actually takes place in the latent space.
So, during training, instead of generating a noisy image, Stable Diffusion generates a random tensor (latent noise) in the latent space. And instead of corrupting the image itself with noise, it corrupts the image's latent representation with latent noise. The reason is speed: the latent space is much smaller.
In deep learning, a tensor is simply a multidimensional array. Tensors make it possible to work with matrices and vectors of higher dimensions.
The reason Stable Diffusion does all of its processing, diffusion, and conditioning in the latent space rather than in pixel space is that the latent space is smaller. This makes the whole process much faster without consuming large amounts of computing resources.
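To make the compression concrete, here is a sketch of the VAE round trip using the diffusers library (the checkpoint name and the 0.18215 scaling factor are common SD v1 conventions, assumed here for illustration):

```python
# A sketch of the VAE round trip: a 512x512x3 pixel image becomes a
# 4x64x64 latent (48 times smaller), and back.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

pixels = torch.randn(1, 3, 512, 512)          # stand-in for a real image
with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample() * 0.18215  # SD v1 scaling
    print(latents.shape)                      # torch.Size([1, 4, 64, 64])
    decoded = vae.decode(latents / 0.18215).sample
    print(decoded.shape)                      # torch.Size([1, 3, 512, 512])
```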
Conditioning
Without text prompts, Stable Diffusion would not be a text-to-image model: you would just get an image of a cat or a dog, with no way to control what appears.
This is what conditioning does. The purpose of conditioning is to guide the noise predictor so that, after subtracting the predicted noise from the image, we get the result we want.
Text Conditioning (Text to Image)
Below is an overview of how the text prompt is processed and fed into the noise predictor.
- First, the tokenizer converts each word in the prompt into a number called a token.
- Each token is then converted into a 768-value vector called an embedding. These embedding vectors are then processed by the text transformer, ready for use by the noise predictor.
Tokenizer
The text prompt is first tokenized by a CLIP tokenizer. CLIP is a deep learning model developed by OpenAI to generate text descriptions of any image. Stable Diffusion v1 uses CLIP's tokenizer.
Tokenization is how computers read words. We humans can read words, but computers can only read numbers; that's why the words in the text prompt are first converted into numbers.
A tokenizer can only tokenize words it has seen during training. For example, the CLIP model contains "dream" and "beach", but not "dreambeach". The tokenizer will split the word "dreambeach" into two tokens, "dream" and "beach". So a word does not always correspond to a single token!
Another detail to note is that space characters are also part of a token. In the case above, the phrase "dream beach" produces two tokens, "dream" and "[space]beach". These are different from the tokens produced by "dreambeach", which are "dream" and "beach" (with no space before "beach").
The Stable Diffusion model limits the number of tokens in a text prompt to 75.
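You can try the tokenizer yourself with the transformers library; the exact splits depend on CLIP's learned vocabulary, so treat the outputs below as illustrative:

```python
# A sketch of CLIP tokenization. One word does not always equal one token.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

print(tokenizer.tokenize("dream beach"))  # two tokens
print(tokenizer.tokenize("dreambeach"))   # one word, still split into tokens
print(tokenizer("a cat").input_ids)       # token numbers, with start/end markers
```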
Embedding
Stable Diffusion v1 uses OpenAI's ViT-L/14 CLIP model. An embedding is a vector containing 768 values. Each token has its own unique embedding vector. Embeddings are fixed by the CLIP model and are learned during training.
Why do we need embeddings? Because some words are closely related to each other, and we want to use that information. For example, the embeddings for man, gentleman, and guy are almost identical, because these words can be used interchangeably.
Feed the embedding to the noise predictor
The embedding vectors need to be further processed by the text transformer before being fed into the noise predictor. The transformer acts like a universal adapter for conditioning. Here its input is a text embedding vector, but it could just as well be something else, such as class labels, images, or depth maps. The transformer not only processes the data further but also provides a mechanism for including different conditioning modalities.
Cross-attention
The output of the text transformer is used multiple times by the noise predictor throughout the U-Net, which consumes it through the cross-attention mechanism. This is where the prompt meets the image.
Take the prompt "a man with blue eyes" as an example. Stable Diffusion pairs the words "blue" and "eyes" together (self-attention within the prompt), so that it generates a man with blue eyes rather than a man in a blue shirt. It then uses this information to steer reverse diffusion toward images containing blue eyes (cross-attention between the prompt and the image).
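Here is a minimal sketch of a single cross-attention step: the queries come from the image features, while the keys and values come from the text embeddings. The dimensions are illustrative, not Stable Diffusion's exact ones:

```python
# A minimal sketch of cross-attention: the image side asks "what should
# I look like here?" (queries) and the text embeddings answer (keys, values).
import torch
import torch.nn.functional as F

d = 64                                      # attention dimension (illustrative)
image_feats = torch.randn(1, 4096, 320)     # flattened U-Net feature map
text_embeds = torch.randn(1, 77, 768)       # output of the text transformer

to_q = torch.nn.Linear(320, d, bias=False)  # projections learned in training
to_k = torch.nn.Linear(768, d, bias=False)
to_v = torch.nn.Linear(768, d, bias=False)

q, k, v = to_q(image_feats), to_k(text_embeds), to_v(text_embeds)
attn = F.softmax(q @ k.transpose(1, 2) / d**0.5, dim=-1)  # shape (1, 4096, 77)
out = attn @ v                              # text information merged per location
```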
Optimizing Stable Diffusion
So far, we have a good understanding of the training process (forward diffusion) and how images are generated from text input (reverse diffusion). But that's just the beginning; the more interesting part is how we can tune this process to our needs to produce higher-quality images. Researchers and hobbyists have proposed many different techniques to improve Stable Diffusion's results.
Most of these methods build on an already-trained Stable Diffusion model. A trained model has seen data and learned how to generate images, with that knowledge stored in its model weights (the numbers that guide the model's work).
Optimizing the Text Encoder
One set of techniques targets the Text Encoder part of Stable Diffusion, including Textual Inversion and DreamArtist.
- Textual Inversion works by learning a new keyword embedding for each new concept or style you want to generate. You can think of it as telling the translator: "remember, this new object is called 'dog-and-cat'; next time I say 'dog-and-cat', tell the artist to draw this object."
- DreamArtist describes a reference image by learning positive and negative keywords. It's like telling the translator: "here's a picture; remember what it looks like, and call it whatever you think best describes it."
Optimizing the U-Net (the noise predictor)
Another set of techniques focuses on the U-Net, i.e. the image generation component, including DreamBooth, LoRA, and Hypernetworks.
- DreamBooth fine-tunes the diffusion model on a new image dataset until it understands the new concept.
- LoRA adds a small set of extra weights to the cross-attention model and trains only those extra weights.
- Hypernetworks use an auxiliary network to predict new weights, inserting new styles via the cross-attention part of the Noise Predictor.
These methods basically tell the artist to learn a new way of drawing, either from scratch (DreamBooth), by tweaking an existing style (LoRA), or with outside help (Hypernetworks). DreamBooth is very effective but requires more storage space, while training LoRA and Hypernetworks is relatively fast, because they do not retrain the entire Stable Diffusion model.
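To illustrate the LoRA idea, here is a minimal sketch: the original weight is frozen and only a small low-rank correction is trained. This is a toy version under those assumptions, not an actual Stable Diffusion integration:

```python
# A sketch of the core LoRA idea: freeze the original weight W and learn
# only a small low-rank correction B @ A, so the effective weight is
# W + scale * (B @ A).
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # original weights stay frozen
        self.A = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = scale

    def forward(self, x):
        # Base output plus the trainable low-rank update.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because only A and B are trained, the extra weights are tiny compared to the full model, which is why LoRA files are small and fast to train.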
Controlling Noise to Improve Generated Images
Another set of techniques improves image generation by controlling the noise, including DALL-E 2 and diffusion dallying.
- DALL-E 2 improves on the original DALL-E model by controlling the noise to generate images that follow the instructions more closely.
- Diffusion dallying adds extra iterative steps in the reverse diffusion process, giving the model more time to generate higher-quality images.
These methods improve Stable Diffusion's results by better controlling how noise is introduced and how the image generation iterates.
In addition to the above, other techniques can improve Stable Diffusion's results, such as using larger models, optimizing training strategies, and tuning hyperparameters. The goal of all of them is higher-quality, more predictable image generation through continuous improvement of Stable Diffusion's individual components and processes.
All in all, Stable Diffusion's results can be improved through techniques targeting the text encoder, the U-Net, noise control, and more. These techniques can be selected and combined according to specific needs to achieve better image generation.
How the Diffusion Model works
We have just looked at the internal mechanisms of the diffusion model from a bird's-eye view; now let's walk through some concrete examples to see how it works.
Text-to-image
In text-to-image, you give the diffusion model a text prompt and it returns an image.
Step 1: Stable Diffusion generates a random tensor in the latent space.
You can control this tensor by setting the seed of the random number generator. If you set the seed to a specific value, you get the same random tensor every time. This tensor represents the image in the latent space, but for now it is just noise.
Step 2: The Noise Predictor (the U-Net) takes the latent noisy image and the text prompt as input and predicts the noise in the latent space (a 4x64x64 tensor).
Step 3: Subtract the latent noise from the latent image. This becomes your new latent image.
Steps 2 and 3 are repeated for a certain number of sampling steps, for example 20 times.
Step 4: Finally, the VAE decoder converts the latent image back to pixel space. This is the image you get after running Stable Diffusion.
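Putting the four steps together, here is a hedged sketch of the sampling loop; `unet`, `scheduler`, `text_embeddings`, and `vae` are placeholders for the components described above, and the scheduler calls follow the diffusers convention:

```python
# A sketch of the full text-to-image loop, not a definitive implementation.
import torch

generator = torch.manual_seed(42)                      # Step 1: fixed seed
latents = torch.randn(1, 4, 64, 64, generator=generator)

scheduler.set_timesteps(20)                            # 20 sampling steps
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t,                  # Step 2: predict the noise
                          encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample  # Step 3

image = vae.decode(latents / 0.18215).sample           # Step 4: decode to pixels
```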
Here is how the image evolves at each sampling step.
noise scheduling
The image goes from noisy to clear. The reason is that at each sampling step we try to reach an expected noise level. This is the noise schedule.
Below is an example.
The noise schedule is something we define ourselves. We can choose to subtract the same amount of noise at each step, or we can subtract more noise at the beginning, as in the example above. The sampler subtracts just enough noise at each step to reach the expected noise level of the next step, which is the process you see in the image above.
Image-to-image
Image-to-image is a method first proposed in SDEdit. SDEdit can be applied to any diffusion model, so we can use it with Stable Diffusion.
The inputs to image-to-image are an input image and a text prompt. The generated image is constrained by both the input image and the text prompt.
For example, using the simple drawing on the left and the prompt "photo of perfect green apple with stem, water droplets, dramatic lighting" as input, image-to-image can turn it into a professional painting:
The following is the detailed process.
Step 1: Encode the input image into the latent space.
Step 2: Add noise to the latent image. Denoising strength controls how much noise is added: at 0, no noise is added; at 1, the maximum amount of noise is added, turning the latent image into a completely random tensor.
Step 3: The Noise Predictor (the U-Net) takes the noisy latent image and the text prompt as input and predicts the noise in the latent space (a 4x64x64 tensor).
Step 4: Subtract the latent noise from the latent image. This becomes your new latent image.
The third and fourth steps are repeated for a certain number of sampling steps, for example, 20 times.
Step 5: Finally, the VAE decoder converts the latent image back to pixel space. This is the image you get from running image-to-image.
The essence of image-to-image: all it does is set the initial latent image using the input image plus some noise. Setting the denoising strength to 1 is equivalent to the text-to-image process, since the initial latent image is then entirely random noise.
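A sketch of how image-to-image could prepare its initial latents, reusing the placeholder names from the text-to-image sketch above; the step arithmetic mirrors the common diffusers convention and is assumed for illustration:

```python
# Only the initial latents differ from text-to-image. `vae`, `scheduler`,
# and `input_image` are placeholders.
import torch

strength = 0.75                              # denoising strength in [0, 1]
init_latents = vae.encode(input_image).latent_dist.sample() * 0.18215

scheduler.set_timesteps(20)
init_timestep = int(20 * strength)           # how many denoising steps to run
timesteps = scheduler.timesteps[20 - init_timestep:]

noise = torch.randn_like(init_latents)
latents = scheduler.add_noise(init_latents, noise, timesteps[:1])
# ...then run the same denoising loop as text-to-image over `timesteps`.
# With strength = 1.0 the latents are fully noised: exactly text-to-image.
```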
Inpainting
Inpainting is actually just a special case of image-to-image. Noise is added to the parts of the image that need to be repaired, and the amount of noise is again controlled by the denoising strength.
Depth-to-image
Depth-to-image is an enhancement of image-to-image: it uses a depth map as an additional condition to constrain the generated image.
Step 1: Encode the input image into the latent space.
Step 2: MiDaS (an AI depth model) estimates a depth map from the input image.
Step 3: Add noise to the latent image. Denoising strength controls how much noise is added: at 0, no noise is added; at 1, maximum noise is added, making the latent image a random tensor.
Step 4: The Noise Predictor estimates the noise in the latent space, conditioned on both the text prompt and the depth map.
Step 5: Subtract the latent noise from the latent image to get a new latent image.
Steps 4 and 5 are repeated for the number of sampling steps.
Step 6: The VAE decoder decodes the latent image. Now you have the final image of depth-to-image.
CFG value
CFG is short for Classifier-Free Guidance. To understand what CFG is, we first need to understand its predecessor, classifier guidance.
Classifier guidance
Classifier guidance is a way to incorporate image labels into the diffusion model. You can use a label to guide the diffusion process; for example, the label "cat" steers the reverse diffusion process toward generating pictures of cats.
The classifier guidance scale is a parameter that controls how closely the diffusion process should follow the label.
Suppose there are three groups of images with the labels "cat", "dog", and "human". If diffusion is unguided, the model draws samples from each group's population, but it may sometimes generate images that fit two labels at once, such as a boy petting a dog.
With high classifier guidance, the diffusion model generates images biased toward extreme or unambiguous examples. If you ask the model for a cat, it returns an image that is unambiguously a cat and nothing else.
The classifier guidance scale controls how tightly the guidance is followed. In the figure above, the samples on the right use a higher classifier guidance scale than those in the middle. In practice, this scale value is simply a multiplier on the drift term toward data with that label.
CFG value
The classifier-free guidance (CFG) scale is a value that controls how much the text prompt influences the diffusion process. When it is set to 0, image generation is unconditional (i.e. the prompt is ignored). Higher values steer the diffusion process toward the prompt.
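At each denoising step, CFG can be implemented as two noise predictions blended together. A minimal sketch, reusing the placeholder names from the earlier sampling loop:

```python
# A minimal sketch of classifier-free guidance at one denoising step.
# `unet`, `latents`, `t`, and the embedding tensors are placeholders.
cfg_scale = 7.5  # a commonly used default

noise_uncond = unet(latents, t, encoder_hidden_states=empty_embeds).sample
noise_text = unet(latents, t, encoder_hidden_states=text_embeds).sample

# cfg_scale = 0 ignores the prompt entirely; larger values follow it harder.
noise_pred = noise_uncond + cfg_scale * (noise_text - noise_uncond)
```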
Stable Diffusion v1 vs v2
Model differences
Stable Diffusion v2 uses OpenClip for text embedding, while Stable Diffusion v1 uses OpenAI's CLIP ViT-L/14. The reasons for this change are as follows:
- The OpenClip model is up to five times larger than the CLIP model. A larger text encoder model can improve image quality.
- Although OpenAI's CLIP models are open source, they were trained on proprietary data. Switching to OpenClip gives researchers more transparency when studying and optimizing the models, which is better for long-term development.
Differences in training data:
Stable Diffusion v1.4 was trained on the following datasets:
- 237,000 steps on the laion2B-en dataset at a resolution of 256×256.
- 194,000 steps on the laion-high-resolution dataset at a resolution of 512×512.
- 225,000 steps on the "laion-aesthetics v2 5+" dataset at a resolution of 512×512, with 10% dropping of the text conditioning.
Stable Diffusion v2 was trained on the following datasets:
- 550,000 steps at a resolution of 256×256 on a subset of LAION-5B, filtered with the LAION-NSFW classifier (punsafe=0.1) and an aesthetic score >= 4.5.
- 850,000 steps at a resolution of 512×512 on the same dataset, restricted to images with resolution >= 512×512.
- 150,000 steps on the same dataset using the v-objective.
- An additional 140,000 steps on 768×768 images.
Stable Diffusion v2.1 was fine-tuned on top of v2.0:
- An additional 55,000 steps on the same dataset (punsafe=0.1).
- Another 155,000 extra steps with punsafe=0.98, which effectively turns the NSFW filter off in the final stage.
Summary
In this article, I have tried to explain how Stable Diffusion works in simple terms. Here are some key points:
- Stable Diffusion is a model that generates images primarily from text (conditioned on the text), but it can also generate images from other inputs such as images or depth maps.
- The training process of Stable Diffusion consists of gradually adding noise to images (forward diffusion) and training a Noise Predictor to gradually remove the noise to produce a sharper image (reverse diffusion).
- The generative process (reverse diffusion) starts from a random noise tensor in the latent space and, conditioned on the prompt, gradually denoises it into a clean image.
- There are many techniques to improve Stable Diffusion's results, including Textual Inversion and DreamArtist, which operate on the embedding layer, and DreamBooth, LoRA, and Hypernetworks, which operate on the diffusion model. And the list of these techniques keeps growing.
Postscript
Sharing is an attitude .