An Easy-to-Understand Explanation of the Principles of Stable Diffusion (a long, in-depth write-up)

img


personal website

1. Preface (skippable)

Hello everyone, I am Tian-Feng. Today I will introduce the principles of stable diffusion in a way that is easy to understand. Since I often play with AI painting, I wrote this article to explain how it works. It took quite a while to write, so if it is useful to you, I hope you will give it a thumbs up. Thank you.

Stable diffusion, the open-source image generation model from Stability AI, is in no way inferior to ChatGPT in impact, and its momentum is no less than that of Midjourney. With the support of many plug-ins, its ceiling keeps rising, although the workflow is somewhat more involved than Midjourney's. (Paper, source code)

As for why it is open source, the founder put it this way: **"The reason I did it is because I think it's part of the shared narrative, and someone needs to publicly show what's going on. This should be open source by default, because value does not reside in any proprietary models or data; we will build open-source models that are auditable, even with permissioned data."** Enough talk; let's begin.

2. Stable diffusion

The figures in the original paper may be difficult to follow, but that is fine. I will break them into individual modules, interpret each one, and then put them back together. By the end you should understand what each step in the figure does.

First, I will draw a simplified model diagram alongside the original diagram for easier understanding, starting with the training phase. You may notice the VAE decoder is missing: that is because the training process happens entirely in the latent space, so the decoder only appears in the second phase, sampling. The stable diffusion webui we use for drawing works in the sampling phase. As for the training phase, most ordinary people simply cannot do it: the training time is measured in GPU-years (roughly one year on a single V100), so with 100 cards it might be finished in about a month. And ChatGPT's electricity bill alone runs to tens of millions of dollars, on clusters of tens of thousands of GPUs; it feels like AI today is a battle of computing power. But I digress; back to the topic.
img

1. CLIP

Let's start with the prompt. We input a prompt, "a black and white striped cat". CLIP maps the text onto a vocabulary in which every word and punctuation mark has a corresponding number; each such unit is called a token. Previously stable diffusion limited the input to 75 words (this limit is now gone), i.e. 75 tokens. Looking at the figure, you may wonder how 6 words turn into 8 tokens: that is because a start token and an end token are added. Each token ID then maps to a 768-dimensional vector, which you can think of as the word's ID card, and words with very similar meanings get nearly identical 768-dimensional vectors. After CLIP, we obtain an (8, 768) text vector that conditions the image.
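To make this concrete, here is a minimal sketch using the Hugging Face transformers implementation of OpenAI's CLIP. This is my own illustration (the webui bundles its own CLIP code), but the tokenize-then-encode idea is the same:

```python
# Minimal sketch of the CLIP text-encoding step, assuming the Hugging Face
# "transformers" port of OpenAI's CLIP (openai/clip-vit-large-patch14).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a black and white striped cat"
tokens = tokenizer(prompt, return_tensors="pt")
print(tokens.input_ids)          # start token + 6 word tokens + end token = 8 tokens

with torch.no_grad():
    text_embeddings = text_encoder(**tokens).last_hidden_state
print(text_embeddings.shape)     # (1, 8, 768): one 768-dim vector per token
```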

Stable diffusion uses OpenAI's pre-trained CLIP model, i.e. a model already trained by others. How is CLIP trained? How does it match images with text? (The following digression can be read or skipped; it does not affect understanding. You only need to know that CLIP converts the prompt into a text vector used to condition the generated image.)

CLIP needs images and their captions; the dataset contains about 400 million images with descriptions, presumably scraped from the web, with the caption text used directly as the label. The training process is as follows:

CLIP is a combination of an image encoder and a text encoder; the two encoders encode the data separately, and the resulting embeddings are compared using cosine similarity. At the start of training, even when a text description matches its image, the similarity between the two embeddings will be low.

img

As the model is updated, the embeddings that the two encoders produce for matching images and texts gradually become similar. Repeating this process over the whole dataset, with large batch sizes, the encoders eventually learn to produce embeddings in which an image of a dog and the sentence "a picture of a dog" are similar.

To classify an image, we provide several prompt texts, compute the similarity of the image with each prompt, and pick the one with the highest probability.

img
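For reference, here is a rough sketch of the contrastive objective described above (simplified, not OpenAI's actual training code): encode a batch of images and captions, build the cosine-similarity matrix, and push the matched pairs on the diagonal to score highest.

```python
# Rough sketch of CLIP's contrastive training objective (simplified illustration).
# The image/text features can come from any pair of encoders.
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    # Normalize so the dot product equals cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix: row i, column j compares image i with caption j.
    logits = image_features @ text_features.t() / temperature

    # Matching image/caption pairs lie on the diagonal, so the "correct class" of row i is i.
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, labels)       # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)   # text -> image direction
    return (loss_i + loss_t) / 2

# Toy usage with random embeddings standing in for encoder outputs.
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```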

2. Diffusion model

Above we obtained one input of the Unet; we still need the other input, a noised image. If we input a 3x512x512 cat image, we do not process it directly in pixel space. Instead, the VAE encoder compresses the 512x512 image from pixel space into a 4x64x64 latent space representation, roughly 48 times less data, which makes processing much cheaper.

img

A latent space is simply a representation of compressed data. Compression refers to encoding information with fewer bits than the original representation. Dimensionality reduction loses some information, but in some cases that is not a bad thing: by reducing dimensions we can filter out less important information and keep only what matters most.
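As an illustration of the pixel-space-to-latent-space step, here is a small sketch using the diffusers AutoencoderKL (the SD v1 VAE). The 0.18215 scaling factor is the constant used by SD 1.x checkpoints; the random tensor simply stands in for a preprocessed image.

```python
# Sketch of compressing an image into latent space with the SD VAE,
# assuming the Hugging Face "diffusers" AutoencoderKL implementation.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

image = torch.randn(1, 3, 512, 512)           # stand-in for a preprocessed image in [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215
print(latents.shape)                           # (1, 4, 64, 64)

with torch.no_grad():
    reconstruction = vae.decode(latents / 0.18215).sample
print(reconstruction.shape)                    # (1, 3, 512, 512)
```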

After getting the latent-space vector, we come to the diffusion model itself. Why can the image be recovered after noise is added? The secret is in the formulas. Here I use the DDPM paper as the theoretical basis; there are of course improved versions such as DDIM, which you can look into yourself if interested.

Forward diffusion

  • Forward diffusion is the process of adding noise, until the image finally becomes pure noise.

  • Gaussian noise is added at every timestep; each step is obtained by adding noise to the previous one.

img

So do we have to derive each noised step from the previous one, or can we jump straight from the original image? The answer is yes, we can jump, and that matters: during training, the noise level assigned to each image is random. If we happen to sample step 100 (say the number of timesteps is set to 200) and had to add noise step by step, going round and round, it would be far too time-consuming. In fact the added noise follows a pattern, and our goal now is to obtain the noised image at any timestep directly from the original image X0, without stepping through every intermediate noised image.

img

Let me explain the figure above; I have marked the important parts in it.

First, αt ranges from 0.9999 down to 0.998.

Second, the noise added to the image follows a Gaussian distribution, i.e. the noise added to the latent vector has mean 0 and variance 1. When we substitute Xt-1 into Xt, the two noise terms can be merged because Z1 and Z2 are both Gaussian: their combination Z2' is also Gaussian, and its variance is the sum of their variances (the quantity under the square root is the standard deviation). If you cannot follow this, just treat it as a theorem. One more remark: for the transformation Z → a + bZ, the Gaussian goes from N(0, σ) to N(a, bσ) in terms of standard deviation. We now have the relationship between Xt and Xt-2.

Third, substituting Xt-2 in again gives the relationship with Xt-3; spotting the pattern, the α's accumulate as a product, and we finally get the relationship between Xt and X0. Now we can directly obtain the noised image at any timestep.

Fourth, because the noise level is assigned randomly, suppose you set the number of timesteps to 200: the interval 0.9999-0.998 is divided into 200 equal parts, giving the α value at each timestep. From the formula relating Xt and X0, because the α's are multiplied together (each smaller than 1), the further you go the faster noise accumulates; the cumulative product spans roughly the interval 1 to 0.13. At time 0 it is 1, so Xt is the image itself; at step 200 the cumulative α is about 0.13, so the image contributes a weight of about 0.13 and the noise about 0.87. Because it is a product, the noise grows faster and faster; it is not a uniform process.

Fifth, one more note on the reparameterization trick: if X ~ N(μ, σ²), then X can be written in the form X = μ + σZ, where Z ~ N(0, 1). This is the reparameterization trick.

The reparameterization trick lets us sample from a parameterized distribution while preserving gradient information. If we sampled directly (sampling is a discrete, non-differentiable operation), there would be no gradient, and the parameters could not be updated during backpropagation. Rewriting the sample as μ + σZ keeps the gradient path through μ and σ.
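Since the formula images may not display here, a short recap of the standard DDPM forward-process formulas that the notes above describe (this is my restatement in the usual notation, not the original figure):

$$
x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,z_t,\qquad z_t \sim \mathcal{N}(0, I)
$$

$$
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\qquad \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s,\quad \epsilon \sim \mathcal{N}(0, I)
$$

The second line is the closed form that lets us jump from X0 to any timestep t, and the ε term is exactly the reparameterized Gaussian sample just discussed.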

Reverse diffusion

  • After forward diffusion comes reverse diffusion, which may be harder than the above. How to recover the original image step by step from a noisy image is the key.

img

  • Working backwards, our goal is to obtain a noise-free X0 from the noisy image Xt, and we start by finding Xt-1 from Xt. First assume X0 is known (ignore for now why we may assume this); we will replace it later. How? The relationship between Xt and X0 is known from forward diffusion, and Xt is known, so we can express X0 in terms of Xt, except for one unknown noise term Z. This is where the Unet comes in: its job is to predict that noise.
  • Here we use Bayes' formula (i.e. conditional probability). We use the result directly; I have written about it in an earlier document.

img

That is, we want Xt-1 given Xt. We don't know how to go in the reverse direction directly, but we do know the forward direction: if X0 is known, each of the terms on the right can be computed.
img (https://tianfeng.space/wp-content/uploads/2023/05/download-3.png)
Let's interpret this. Since all three terms are Gaussian, substitute in the Gaussian (normal) distribution density. Why does their multiplication become addition? Because e^a · e^b = e^(a+b) (they are exponentials of e), which should be easy to accept. Now we have one overall formula; next we continue to simplify.

First, expand the square. The only unknown is now Xt-1, so arrange the exponent in the form AX² + BX + C; remember that the sum is still Gaussian. Rewriting the standard Gaussian density into the same format, the red part gives the variance, and the blue part multiplied by the variance and divided by 2 gives the mean μ (the simplified result is shown below; work it out yourself if interested). Coming back to X0: earlier we assumed X0 was known, and now we express it in terms of Xt (which is known) and substitute into μ, so that only Zt remains unknown.

img

  • Zt is exactly the noise we want to estimate at each timestep.
    - We use the Unet model to predict it. The model has three inputs: the noisy latent Xt at the current timestep, the timestep t, and the text vector from earlier. It outputs the predicted noise. That is the whole process. (The resulting reverse-step formula is restated below.)
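Restating the reverse-step result derived above in standard DDPM notation (again my restatement of the figures; εθ is the Unet's predicted noise):

$$
x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z,\qquad z \sim \mathcal{N}(0, I)
$$

$$
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right),\qquad
\sigma_t^2 = \frac{(1-\alpha_t)\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}
$$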

img

  • Algorithm 1 above is the training process.

The second step is sampling the data. Generally it should be a coherent dataset, e.g. cats, dogs, or images in one particular style; you cannot throw in a random jumble of pictures, or the model will not learn well.

The third step says that each image is randomly assigned a timestep's worth of noise (as discussed above).

In the fourth step, the noise is sampled from a Gaussian distribution.

The fifth step computes the loss between the real noise and the predicted noise (the DDPM formulation has no text-vector input; just think of the text vector as an extra input here) and updates the parameters, repeating until the predicted noise is very close to the real noise. At that point the Unet model is trained. (A minimal training-loop sketch follows below.)
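Here is a minimal sketch of Algorithm 1 in PyTorch style. The names `unet`, `dataloader`, `optimizer`, and `alphas_cumprod` (the precomputed cumulative product of the α's) are hypothetical placeholders, not real library objects:

```python
# Minimal sketch of Algorithm 1 (training), assuming `unet`, `dataloader`,
# `optimizer` and `alphas_cumprod` (a tensor of cumulative alpha products) exist.
import torch
import torch.nn.functional as F

T = 1000  # number of timesteps

for latents, text_embeddings in dataloader:          # x0 latents + CLIP text vectors
    t = torch.randint(0, T, (latents.size(0),))      # random timestep per sample
    noise = torch.randn_like(latents)                # Gaussian noise epsilon

    # Closed-form forward process: x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy_latents = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * noise

    pred_noise = unet(noisy_latents, t, text_embeddings)   # predict the added noise
    loss = F.mse_loss(pred_noise, noise)                    # compare with the real noise

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```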

  • Next we come to Algorithm 2, the sampling process. (A minimal sketch follows after this list.)
  1. XT is initialized as pure Gaussian noise.
  2. Execute T times, finding Xt-1 from Xt in turn; that is T iterations.
  3. Xt-1 = μ + σZ is exactly the formula we derived in reverse diffusion: the mean and variance are known, and the only unknown, the noise Z, is predicted by the Unet; εθ refers to the trained Unet.
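A matching sketch of Algorithm 2; again `unet`, `alphas`, `alphas_cumprod`, `T`, and `text_embeddings` are the assumed names from the training sketch above, and σt is simplified to √βt (one of the choices used in DDPM):

```python
# Minimal sketch of Algorithm 2 (sampling), same assumed names as the training sketch.
import torch

x = torch.randn(1, 4, 64, 64)                  # x_T: pure Gaussian noise in latent space

for t in reversed(range(T)):
    z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    a, a_bar = alphas[t], alphas_cumprod[t]

    pred_noise = unet(x, torch.tensor([t]), text_embeddings)   # epsilon_theta
    mean = (x - (1 - a) / (1 - a_bar).sqrt() * pred_noise) / a.sqrt()
    sigma = (1 - a).sqrt()                     # simplified choice: sigma_t^2 = beta_t
    x = mean + sigma * z                       # x_{t-1} = mu + sigma * z

# x is now the denoised latent x_0; pass it through the VAE decoder to get the image.
```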

Sampling diagrams

  • For ease of understanding, I drew the text-to-image (txt2img) and image-to-image (img2img) processes separately. Anyone who draws with the stable diffusion webui will find them familiar. For txt2img, we simply initialize a noise latent and sample from it.
  • For img2img, noise is added on top of your original image, with the noise weight controlled by you; that is the "denoising strength" slider in the webui interface.
  • The number of iterations is the "sampling steps" setting in the webui.
  • The random seed determines the initial noise image, so to reproduce the same image you must use the same seed. (A small code sketch of these knobs follows after this list.)
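To connect these knobs to code, here is a small sketch using the diffusers StableDiffusionPipeline (the webui exposes the same ideas through its UI); the model name, step count, and seed below are just example values:

```python
# Sketch of txt2img with diffusers; steps and seed correspond to the webui's
# "sampling steps" and "seed" settings. Example values only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")

generator = torch.Generator("cuda").manual_seed(42)   # fixed seed -> reproducible noise
image = pipe(
    "a black and white striped cat",
    num_inference_steps=20,    # sampling steps
    generator=generator,
).images[0]
image.save("cat.png")
```

For img2img, the analogous StableDiffusionImg2ImgPipeline takes a `strength` argument, which plays the role of the webui's denoising strength.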

img

img

Stage summary

Now look at this figure again. Apart from the Unet (introduced separately below), we have covered everything; doesn't it look much simpler? The far left is the encoder-decoder in pixel space, the far right is CLIP turning text into a text vector, the upper middle is the noising process, and the lower part is the Unet predicting noise, with repeated sampling and a final decode producing the output image. This is the sampling diagram from the original paper; the training process is not drawn.

img

3. Unet model

Most readers probably know the Unet model to some extent: it is multi-scale feature fusion, similar in spirit to the FPN feature pyramid, PAN, and so on. It usually uses a ResNet as the downsampling backbone, acting as the encoder, so that we obtain feature maps at multiple scales; then, during upsampling, the upsampled features are concatenated with the feature maps obtained earlier during downsampling. That is an ordinary Unet.

img

What is different about stable diffusion's Unet? I found a diagram online; I admire the author's patience in drawing it, so let me borrow her picture.

img

Let me explain the ResBlock module and the SpatialTransformer module. The inputs are timestep_embedding, context, and input: the timestep, the text vector, and the noised latent. The timestep can be understood like positional encoding in a transformer, which in natural language processing tells the model the position of each word in a sentence, since different positions can carry very different meanings. Here, adding timestep information can be understood as telling the model which noising step this is (at least that is my understanding).

timestep_embedding uses sine and cosine encoding

img
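Here is a small sketch of a sinusoidal (sine/cosine) timestep embedding, the same idea as transformer positional encoding; the dimension and names are illustrative, not the exact SD implementation:

```python
# Sketch of a sinusoidal timestep embedding (transformer-style positional encoding).
import math
import torch

def timestep_embedding(t, dim=320):
    half = dim // 2
    # Frequencies spaced geometrically from 1 down to roughly 1/10000.
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]                     # (batch, half)
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)   # (batch, dim)

emb = timestep_embedding(torch.tensor([0, 100, 999]))
print(emb.shape)   # (3, 320): one embedding vector per timestep
```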

The ResBlock module takes the time encoding and the convolutional feature map as inputs and adds them together; that is its job. The details are just convolutions and fully connected layers, which are straightforward.

The SpatialTransformer module takes the text vector and the output of the preceding ResBlock as inputs.

img

Its main ingredient is cross attention; the rest are dimension reshaping, convolution operations, and various normalizations such as GroupNorm and LayerNorm.

Cross attention fuses the latent-space features with the features of another modality, the text vector, and injects them into the reverse process of the diffusion model. The Unet then predicts the noise to remove at each step, and the loss between the ground-truth noise and the predicted noise is used to compute gradients.

Looking at the picture in the lower right corner, Q comes from the latent-space features, while K and V are obtained from the text vector through two fully connected layers; the rest is the usual transformer operation. Multiply Q by K and apply softmax to get attention scores, multiply by V, transform the dimensions, and output. You can think of the transformer as a feature extractor that highlights the important information (this is only to aid understanding). The remaining operations are similar, and the predicted noise is finally output. (A stripped-down sketch of this cross-attention step follows.)
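A stripped-down sketch of that cross-attention step, with Q from the latent features and K, V from the text embedding; the dimensions here are illustrative, not the exact SD layer sizes:

```python
# Stripped-down cross attention: Q from image latents, K/V from text embeddings.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, latent_dim=320, text_dim=768, inner_dim=320):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(text_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(text_dim, inner_dim, bias=False)
        self.scale = inner_dim ** -0.5

    def forward(self, latent_tokens, text_tokens):
        q = self.to_q(latent_tokens)                  # (batch, h*w, inner_dim)
        k = self.to_k(text_tokens)                    # (batch, n_tokens, inner_dim)
        v = self.to_v(text_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                               # text-conditioned latent features

x = torch.randn(1, 64 * 64, 320)       # flattened latent feature map
ctx = torch.randn(1, 8, 768)           # CLIP text embedding for 8 tokens
print(CrossAttention()(x, ctx).shape)  # (1, 4096, 320)
```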

You need some familiarity with transformers here, i.e. know what self attention and cross attention are; if not, find an article on them, as I cannot explain it all clearly in this space.

That's it for the theory. To finish, here are some webui comparison images.

3. Stable diffusion webui extension

parameter clip

img

img

img


Origin blog.csdn.net/weixin_62403633/article/details/131022283