AI Painting Stable Diffusion Research (7) Understanding How Stable Diffusion Works


Hello everyone, I am rain or shine.


This article is for:

  • Friends who want to understand the basic principles of AI drawing.

  • Friends who are interested in Stable Diffusion AI drawing.


Contents of this issue:

  • What Stable Diffusion can do

  • What a diffusion model is

  • How the diffusion model works

  • Stable Diffusion's latent diffusion model

  • How Stable Diffusion text affects image generation

  • Stable Diffusion cross-attention technology

  • Stable Diffusion noise schedule technology

  • A walkthrough of Stable Diffusion's text-to-image pipeline


1. What can Stable Diffusion do?


After the previous articles on installing the Stable Diffusion integration package, introducing the ControlNet plug-in, installing and using SD models, and walking through the text-to-image feature, readers who have followed along should already have a clear idea of what Stable Diffusion does.


For new friends, if you want to learn more, please go to:

AI Painting Stable Diffusion Research (1) sd Integration Package v4.2 Version Installation Instructions
AI Painting Stable Diffusion Research (2) sd Model ControlNet1.1 Introduction and Installation
AI Painting Stable Diffusion Research (3) sd Model Types Introduction and Detailed Installation and Use
AI Painting Stable Diffusion Research (4) Detailed Explanation of the sd Text-to-Image Function (Part 1)
AI Painting Stable Diffusion Research (5) Detailed Explanation of the sd Text-to-Image Function (Part 2)
AI Painting Stable Diffusion Research (6) sd Prompt Word Plugin


To put it plainly: SD is a text-to-image model. Given a text prompt, it generates a picture that matches the text.


2. What is a diffusion model?


You will often hear that Stable Diffusion is a latent diffusion model.

So let's first figure out what a diffusion model is.

Why is it called a diffusion model? Because its mathematics closely resembles the physical phenomenon of diffusion.


1. Forward diffusion

Suppose we train a model as follows:

(Figure: forward diffusion gradually adds noise to a training image until it becomes pure noise)


As shown in the figure above, forward diffusion gradually adds noise to the training image until it becomes a completely random noise image, at which point the original image can no longer be recognized from the noise.


This process is like a drop of ink falling into a glass of clear water: it slowly spreads until it is evenly distributed, and you can no longer tell whether it was dropped at the center of the cup or at the edge. That is where the name "diffusion" comes from.


2. Reverse diffusion


The idea of reverse diffusion (Reverse Diffusion) is: starting from a noise map, run the above process backwards to recover a clear image from random noise.


Seen from the reverse direction, we need to know how much noise has been added to a given image.


The way to find out is to train a neural network to predict the added noise. In SD this network is called the noise predictor (Noise Predictor), and it is essentially a U-Net model.


The training process is:

(1) Select a training image (for example, a picture of a cat).

(2) Generate a random noise map.

(3) Add several rounds of this noise to the training image.

(4) Train the Noise Predictor to predict how much noise was added, using the known noise as the correct answer to adjust the network weights.


(Figure: training the noise predictor — a noised image and a timestep go in, the predicted noise comes out)


The focus of reverse-diffusion training is the noise predictor (Noise Predictor) shown in the figure above. Once trained, it predicts how much noise needs to be subtracted at each step, so that step by step a clear picture is restored.
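
To make this training objective concrete, here is a minimal PyTorch sketch of one noise-prediction training step (noise_predictor, images, alphas_cumprod and optimizer are illustrative placeholders, not code from this series):

import torch
import torch.nn.functional as F

def train_step(noise_predictor, images, alphas_cumprod, optimizer):
    # (1)/(2) pick a random timestep per image and sample random noise
    t = torch.randint(0, len(alphas_cumprod), (images.shape[0],))
    noise = torch.randn_like(images)
    # (3) mix the noise into the images according to the schedule
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy_images = a_bar.sqrt() * images + (1 - a_bar).sqrt() * noise
    # (4) the network predicts the noise; the known noise is the "correct answer"
    predicted_noise = noise_predictor(noisy_images, t)
    loss = F.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()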


3. How the diffusion model works


The success of diffusion models did not come out of nowhere. A similar idea was proposed as early as 2015, and the generation technique now known as the diffusion model was finally proposed in 2020.


The original post presents the derivation of the diffusion model as a series of formula images, which are not reproduced here.
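
For reference, the standard DDPM formulation that such derivations are based on can be written as follows (general background, not transcribed from the original images):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right), \qquad q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\mathbf{I}\right)$$

where $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, and the noise predictor $\epsilon_\theta$ is trained with the simplified objective

$$L_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right].$$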


More detailed principles:

Reference: Diffusion Model Detailed Principle + Code


The previous introduction gives a rough idea of what a diffusion model is, but this is not exactly how Stable Diffusion works.

This is because the diffusion process described above runs in image (pixel) space; both model training and image generation require enormous compute and memory.


Imagine a 512 x 512 picture with 3 color channels (red, green, blue): its space has 786,432 dimensions, which means we have to specify that many values for a single picture. Running diffusion in this space is essentially impractical on a single consumer GPU.


Stable Diffusion is designed to reduce these compute and memory requirements, which is what makes it possible to run on consumer-grade GPUs.


4. Stable Diffusion's latent diffusion model

Stable Diffusion is a latent diffusion model (Latent Diffusion Model). Instead of working in the high-dimensional image space, it compresses the image into a "latent space" (Latent Space). The latent space is 48 times smaller than the image space, which saves a great deal of computation and makes it run much faster.
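
The factor of 48 follows directly from the tensor sizes (assuming SD v1's defaults of 512 x 512 RGB images and 4 x 64 x 64 latents):

pixel_dims  = 3 * 512 * 512      # 786,432 values per image in pixel space
latent_dims = 4 * 64 * 64        # 16,384 values per image in latent space
print(pixel_dims / latent_dims)  # 48.0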


The diffusion process is divided into many steps. Each step works as shown in the figure below: the text description, the latent variables, and the step number are passed into the U-Net, which produces new latent variables. Several models are involved in this process.

(Figure: one denoising step — text embedding, latents, and timestep go into the U-Net, which outputs new latents)


After the last step of the loop, the latent features are decoded into an image by the Variational Autoencoder (VAE).

(Figure: the VAE decoder turns the final latents into an image)


The core idea of this process is compression: the Variational Autoencoder (VAE) compresses the image to the extreme. We call this kind of compression dimensionality reduction, and at this level of compression the important information is not lost.


After compression, the image is a low-dimensional latent (Latent) "image". It serves as the input to the U-Net and is denoised step by step in the latent space (Latent Space). The low-dimensional "picture" that comes out of reverse diffusion then has to be converted from the latent space back to pixel space (Pixel Space) by the VAE decoder.


The VAE consists of two parts: an Encoder and a Decoder.

  • The Encoder compresses a picture into a low-dimensional representation in the "latent space"

  • The Decoder restores a picture from that latent-space representation

(Figure: the VAE encoder and decoder map between pixel space and latent space)


The following code demonstrates how the VAE model is used: load_vae initializes the model from the configuration init_config and then loads the parameters from the pre-trained checkpoint model.ckpt, where first_stage_model in the checkpoint refers to the VAE.


import torch
from ldm.models.autoencoder import AutoencoderKL

# VAE model
def load_vae():
    # initialize the model from its configuration
    init_config = {
        "embed_dim": 4,
        "monitor": "val/rec_loss",
        "ddconfig":{
          "double_z": True,
          "z_channels": 4,
          "resolution": 256,
          "in_channels": 3,
          "out_ch": 3,
          "ch": 128,
          "ch_mult":[1,2,4,4],
          "num_res_blocks": 2,
          "attn_resolutions": [],
          "dropout": 0.0,
        },
        "lossconfig":{
          "target": "torch.nn.Identity"
        }
    }
    vae = AutoencoderKL(**init_config)
    # load the pre-trained parameters (first_stage_model.* holds the VAE weights)
    pl_sd = torch.load("model.ckpt", map_location="cpu")
    sd = pl_sd["state_dict"]
    model_dict = vae.state_dict()
    for k, v in model_dict.items():
        model_dict[k] = sd["first_stage_model."+k]
    vae.load_state_dict(model_dict, strict=False)

    vae.eval()
    return vae

# test the VAE model (load_image and save_image are helper functions from this series)
def test_vae():
    vae = load_vae()
    img = load_image("girl_and_horse.png")  #(1,3,512,512)
    latent = vae.encode(img).sample()       #(1,4,64,64)
    samples = vae.decode(latent)            #(1,3,512,512)
    save_image(samples,"vae.png")

test_vae()

5. How Stable Diffusion text affects image generation


In the Stable Diffusion model, the prompt controls the U-Net through a guidance vector. Specifically, the prompt is encoded into text embeddings, which are passed to the U-Net along with its other inputs.

In this way, the prompt influences the output of the U-Net and steers the generation toward the expected result; in other words, the prompt is how we get the image we want.


In the Stable Diffusion model, the prompt is limited to 75 tokens (CLIP's context window is 77 tokens, two of which are the start and end markers).
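
A quick way to see this fixed window is with OpenAI's open-source clip package (a small sketch, not part of the original post; installation of the package is shown in the CLIP example later in this article):

import clip

tokens = clip.tokenize("a photograph of an astronaut riding a horse")
print(tokens.shape)  # torch.Size([1, 77]) -- CLIP's fixed context window; ~75 positions are usable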


The figure below shows how the text prompt is processed and fed to the Noise Predictor:


(Figure: prompt → Tokenizer → token ids → Embedding → Text Transformer → Noise Predictor)


From the figure we can see the process:

First, the Tokenizer converts each input word into a number, which we call a token.

Then, each token is converted into a 768-dimensional vector called a word embedding (embedding).

Finally, the word embeddings are processed by the Text Transformer, and the result is consumed by the Noise Predictor.


1. Tokenizer

Humans can read words, but computers can only read numbers, which is why the text prompt must first be converted into numbers.

The text prompt is first segmented by the CLIP tokenizer.

CLIP is a deep learning model developed by OpenAI, trained to match pictures with their textual descriptions; Stable Diffusion uses its text encoder.


Here is a concrete CLIP example.

It shows how the text "apple" can be converted into tokens that a neural network can consume.

This is done with Python and OpenAI's open-source clip package (the original post used a different import path; the calls below follow the clip package's actual API).


(1) Install the dependencies

pip install torch torchvision ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git

(2) Import the libraries

import torch
import clip

(3) Load the CLIP model

model, preprocess = clip.load("ViT-B/32")

(4) Prepare the input text

text_description = "apple"

(5) Convert the text to tokens

Convert the text to tokens using the CLIP model's tokenize method:

text_tokens = clip.tokenize(text_description)

Here, text_tokens is a PyTorch tensor of shape (1, 77): clip.tokenize pads every prompt to CLIP's fixed context length of 77.

In this example only the first 3 positions are meaningful, because "apple" becomes 3 tokens: the start marker, the token for "apple", and the end marker.


(6) View the tokens

print(f"Tokens: {text_tokens}")

The output (showing only the non-padding entries) might look like:

Tokens: tensor([[49406, 3782, 49407]])

Here, 49406 is the start-of-sentence marker, 3782 represents "apple", and 49407 is the end-of-sentence marker.

Through the above steps, we converted the text "apple" into tokens.


PS:

  • Stable Diffusion v1 uses the CLIP model's tokenizer.

  • The tokenizer can only segment words it has seen during training.

    Example: suppose the CLIP vocabulary contains the words "dream" and "beach", but not the word "dreambeach".

    The tokenizer will then split "dreambeach" into the two tokens "dream" and "beach".

  • One word does not always correspond to one token; a word may be split further.

  • Spaces are also part of a token.

    For example, the phrase "dream beach" produces the two tokens "dream" and "[space]beach".

    These are different from the tokens produced by "dreambeach", which are "dream" and "beach" (no space before "beach"). A small check of this behavior is shown after this list.
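
As a quick check with the clip package from the example above (a sketch, not from the original post):

import clip

# The exact ids depend on CLIP's BPE vocabulary; the point is that the two
# inputs produce different token sequences.
print(clip.tokenize("dreambeach"))
print(clip.tokenize("dream beach"))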


2. Embedding


(1) Why do we need word embeddings (Embedding)?

Some words are very close to each other in meaning, and we want to take advantage of this semantic information.


For example:

The word embeddings of man, gentleman, and guy are very similar, so they can substitute for each other.

Monet, Manet, and Degas all painted in the Impressionist style, but each in his own way.

Their names look similar, but their word embeddings are not the same.


(2) How does Embedding work?


Embedding converts the input tokens into a continuous vector representation that captures the semantic information in the text. In our example, the tokens for "apple" are passed through the CLIP model's encode_text method to obtain a feature vector.
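
Continuing the earlier sketch (model and text_tokens come from the CLIP example above; this call is an assumption about how one would obtain the feature vector, not the original post's code):

import torch

with torch.no_grad():
    text_features = model.encode_text(text_tokens.to(next(model.parameters()).device))
print(text_features.shape)  # torch.Size([1, 512]) for ViT-B/32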


This feature vector is a point in a high-dimensional space with a fixed dimensionality (512 for the ViT-B/32 CLIP model used here). Note that the exact values depend on the model weights. Here is a sample output:


print(f"Text features: {
      
      text_features}")

The output might look like:

Text features: tensor([[-0.0123,  0.0345, -0.0678, ...,  0.0219, -0.0456,  0.0789]])

Here, text_features is a PyTorch tensor of shape (1, 512) containing a vector representation of the word "apple". Neural networks can use this vector representation for training and prediction tasks.


Stable Diffusion v1 uses OpenAI's ViT-L/14 CLIP model, whose word embeddings are 768-dimensional vectors.


3. Text transformer


(1) Why is a text transformer needed?


If the embedding can already be fed into a model directly, why does Stable Diffusion first pass it through a text transformer before using it as the model's conditioning input?


This is because the Stable Diffusion model is an image generation model, which needs to understand the semantic information of the input text to generate images related to it. Directly using basic text embeddings may not adequately capture complex semantic relationships in text. By using a text transformer, a richer and more expressive text representation can be obtained, which helps to improve the quality and relevance of the generated images to the input text.


A text transformer can take more context and more abstract concepts into account when capturing the semantic information of the text.

The transformer acts like a general conditioning adapter.


(2) A text transformer conversion example


Let's again take "apple" as the example.

Suppose we have already obtained the basic embedding of "apple" (a PyTorch tensor of shape (1, 512)):

text_features = tensor([[-0.0123,  0.0345, -0.0678, ...,  0.0219, -0.0456,  0.0789]])

Next, we feed this tensor into the text transformer:

transformed_text_features = text_transformer(text_features)

After text transformer processing, we may get a new tensor like:

print(f"Transformed text features: {
      
      transformed_text_features}")

The output might look like:

Transformed text features: tensor([[ 0.0234, -0.0567,  0.0890, ..., -0.0321,  0.0672, -0.0813]])

This new tensor (still of shape (1, 512)) contains richer semantic information such as contextual relations and abstract concepts.

This helps the Stable Diffusion model better understand the input text and generate images related to it.


Please note:

The resulting feature vectors may vary slightly from run to run due to model weights and randomness.

In addition, the specific transformation depends on the structure and parameters of the text transformer used.


6. Stable Diffusion Cross-attention technology


Cross-attention is the core technique that lets the prompt drive image generation.

The output of the text transformer is used multiple times by the noise predictor inside the U-Net.

The U-Net consumes it through a mechanism called cross-attention. Cross-attention lets the model focus on the relevant text at different feature levels, which improves the quality of the generated result. This is where the prompt meets the picture.


The following code is the transformer block used by Stable Diffusion, which implements cross-attention:

# Excerpt from ldm/modules/attention.py; Normalize, zero_module and
# BasicTransformerBlock are helpers defined in that same module.
import torch.nn as nn
from einops import rearrange

class SpatialTransformer(nn.Module):
    """
    Transformer block for image-like data.
    First, project the input (aka embedding)
    and reshape to b, t, d.
    Then apply standard transformer action.
    Finally, reshape to image
    """
    def __init__(self, in_channels, n_heads, d_head,
                 depth=1, dropout=0., context_dim=None):
        super().__init__()
        self.in_channels = in_channels
        inner_dim = n_heads * d_head
        self.norm = Normalize(in_channels)

        self.proj_in = nn.Conv2d(in_channels,
                                 inner_dim,
                                 kernel_size=1,
                                 stride=1,
                                 padding=0)

        self.transformer_blocks = nn.ModuleList(
            [BasicTransformerBlock(inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim)
                for d in range(depth)]
        )

        self.proj_out = zero_module(nn.Conv2d(inner_dim,
                                              in_channels,
                                              kernel_size=1,
                                              stride=1,
                                              padding=0))

    def forward(self, x, context=None):
        # note: if no context is given, cross-attention defaults to self-attention
        b, c, h, w = x.shape
        x_in = x
        x = self.norm(x)
        x = self.proj_in(x)
        x = rearrange(x, 'b c h w -> b (h w) c')
        for block in self.transformer_blocks:
            x = block(x, context=context)
        x = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w)
        x = self.proj_out(x)
        return x + x_in
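
To see conceptually what the cross-attention inside BasicTransformerBlock does, here is a minimal single-head sketch (illustrative names, not the ldm implementation): the queries come from the image latents, while the keys and values come from the text embeddings.

import torch

def cross_attention(x, context, W_q, W_k, W_v):
    # x:       (b, h*w, d_img)  flattened image/latent features
    # context: (b, 77, d_text)  text embeddings from the text transformer
    # W_q: (d_img, d_inner); W_k, W_v: (d_text, d_inner)
    Q = x @ W_q                               # queries from the image side
    K = context @ W_k                         # keys from the text side
    V = context @ W_v                         # values from the text side
    scores = Q @ K.transpose(-2, -1) / (Q.shape[-1] ** 0.5)
    attn = scores.softmax(dim=-1)             # each image location attends to the prompt tokens
    return attn @ V                           # image features become a text-weighted mixture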

7. Stable Diffusion noise schedule technology


1. What is a noise schedule?

The noise is processed by the U-Net multiple times before the final picture comes out.

Across these passes, the amount of noise removed at each step differs, so we use a scheduler to control how much noise is removed each time (the amount generally decreases). This technique is called the noise schedule.

As shown in the picture:


(Figure: the noise level decreases step by step over the sampling steps)


So why use a noise schedule?


In the generation model of Stable Diffusion, U-Net is a core component used to gradually restore the original image from the noisy image. The reason why the denoising amplitude gradually decreases during multiple iterations is to restore the details and structure of the image more finely.


The process of Stable Diffusion can be seen as a reverse diffusion process, which starts from a highly noisy image, and then gradually removes the noise through multiple steps to reconstruct the original image. In this process, U-Net is used to predict the noise reduction operation at each step.


In the first few iterations, the image is more noisy, so a larger denoising magnitude is needed to remove this noise. As the number of iterations increases, the noise in the image gradually decreases, so the noise reduction magnitude should also decrease accordingly. The purpose of this is to avoid over-smoothing or damage to already restored image details.


By gradually reducing the noise reduction magnitude, U-Net can better control the denoising process, making it effectively remove noise while preserving image details. This helps produce sharper, more realistic images.
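
As a concrete illustration, here is a sketch of how the per-step noise levels can be computed, assuming a DDPM-style linear beta schedule (an assumption for illustration, not code from this series). Samplers walk from the largest sigma down to the smallest, which is why the denoising magnitude shrinks step by step:

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # noise added at each forward step
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0) # fraction of the original signal left at step t
sigmas = ((1 - alphas_cumprod) / alphas_cumprod) ** 0.5

print(sigmas[0].item())   # ~0.01 : almost clean image
print(sigmas[-1].item())  # ~157  : almost pure noise (compare sigmas[0]=157.40723 in the code below)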


Here is text-to-image code that illustrates the noise schedule technique:

# load_unet, lms_scheduler, prompts_embedding, load_vae and save_image are
# helper functions defined elsewhere in this series.
import torch

def txt2img():
    # U-Net
    unet = load_unet()
    # scheduler
    scheduler = lms_scheduler()
    scheduler.set_timesteps(100)
    # text encoding
    prompts = ["a photograph of an astronaut riding a horse"]
    text_embeddings = prompts_embedding(prompts)
    text_embeddings = text_embeddings.cuda()     #(1, 77, 768)
    uncond_prompts = [""]
    uncond_embeddings = prompts_embedding(uncond_prompts)
    uncond_embeddings = uncond_embeddings.cuda() #(1, 77, 768)
    # initial latent variables
    latents = torch.randn( (1, 4, 64, 64))  #(1, 4, 64, 64)
    latents = latents * scheduler.sigmas[0]    # sigmas[0]=157.40723
    latents = latents.cuda()
    # sampling loop
    for i, t in enumerate(scheduler.timesteps):  # timesteps = [999, 988.9, 978.8, ...] (100 values)
        latent_model_input = latents  #(1, 4, 64, 64)
        sigma = scheduler.sigmas[i]
        latent_model_input = latent_model_input / ((sigma**2 + 1) ** 0.5)
        timestamp = torch.tensor([t]).cuda()

        with torch.no_grad():
            noise_pred_text = unet(latent_model_input, timestamp, text_embeddings)
            noise_pred_uncond = unet(latent_model_input, timestamp, uncond_embeddings)
            # classifier-free guidance
            guidance_scale = 7.5
            noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

            latents = scheduler.step(noise_pred, i, latents)

    vae = load_vae()
    latents = 1 / 0.18215 * latents
    image = vae.decode(latents.cpu())  #(1, 3, 512, 512)
    save_image(image,"txt2img.png")

txt2img()

8. A walkthrough of Stable Diffusion's text-to-image pipeline


In the text-to-image scenario, we give the SD model a text prompt and it returns a picture.


In the first step, Stable Diffusion generates a random tensor in the latent space.

We control this tensor with a random seed: if the seed is set to a fixed value, we always get the same random tensor. This tensor is our picture in latent space, but at this point it is still pure noise.

(Figure: the initial random tensor in latent space — pure noise)
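
A minimal sketch of this first step (4 x 64 x 64 is SD v1's latent shape for a 512 x 512 output):

import torch

torch.manual_seed(42)                 # a fixed seed gives a reproducible latent
latents = torch.randn(1, 4, 64, 64)   # the initial "picture" in latent space: pure noise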


In the second step, the noise predictor (U-Net) takes the latent noise map and the text prompt as input and predicts the noise.

This noise is also in the latent space (a 4 x 64 x 64 tensor).

(Figure: the U-Net predicts the noise from the latents and the text prompt)


In the third step, the predicted noise is subtracted from the latent image, producing a new, slightly cleaner latent image.

(Figure: subtracting the predicted noise yields a new latent image)


Steps two and three are repeated for a chosen number of sampling steps, for example 20.
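
A simplified sketch of this loop, omitting the sigma scaling and classifier-free guidance shown in the txt2img code above (names follow that example):

for i, t in enumerate(scheduler.timesteps):           # e.g. 20 sampling steps
    noise_pred = unet(latents, t, text_embeddings)    # step 2: predict the noise in the latent
    latents = scheduler.step(noise_pred, i, latents)  # step 3: remove it -> a cleaner latent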


In the fourth step, the VAE decoder converts the latent image back into pixel space.

This is the final picture we get from the SD model.

(Figure: the VAE decoder converts the final latents into the output image)


References:

1. How does Stable Diffusion work?

2. stable-diffusion

3. Diffusion model detailed principle + code

Original article: blog.csdn.net/lizhong2008/article/details/132257722