Ridiculously powerful! A hardcore interpretation of Stable Diffusion (full version)


2022 can be said to be the first year of AIGC (AI Generated Content). In the first half of the year came the large text-to-image models DALL-E 2 and Stable Diffusion, and in the second half OpenAI released its large conversational language model ChatGPT. This reignited enthusiasm for AI, because AIGC lets far more people directly experience its power. This article introduces the popular text-to-image model Stable Diffusion (SD for short). Stable Diffusion is not only fully open source (code, data, and model weights are all released), but its parameter count is also only about 1B, so most people can run inference, and even fine-tune the model, on an ordinary graphics card. It is no exaggeration to say that the release and open-sourcing of Stable Diffusion greatly promoted the popularity and development of AIGC, because it allows more people to quickly get started with AI painting. Here we give an in-depth explanation of SD's technical principles and some implementation details based on the Hugging Face diffusers library, and then introduce SD's common functions. Note that this article mainly takes SD v1.5 as an example and will briefly cover SD 2.0 and SD-based extended applications at the end.

SD model principle

SD is a text-to-image model developed jointly by CompVis, Stability AI, and LAION. Both its model and code are open source, and the training dataset LAION-5B is open as well. SD gained 33K stars on its GitHub repository within 90 days of being open-sourced, which shows how popular the model is.

SD is a latent-based diffusion model that introduces a text condition into the UNet to generate images from text. The core of SD comes from the Latent Diffusion work. A conventional diffusion model is a pixel-based generative model, while Latent Diffusion is latent-based: it first uses an autoencoder to compress the image into a latent space, then uses the diffusion model to generate the image's latent, and finally sends the latent to the autoencoder's decoder module to obtain the generated image. The advantage of a latent-based diffusion model is that computation is more efficient, because the latent space of an image is much smaller than its pixel space; this is also the core advantage of SD. Text-to-image models tend to have relatively large parameter counts, so pixel-based methods are often limited by compute to directly generating only 64x64 images, as with OpenAI's DALL-E 2 and Google's Imagen, which then use super-resolution models to increase the resolution to 256x256 and 1024x1024. Latent-based SD, by contrast, operates in latent space and can directly generate images at 256x256, 512x512, or even higher resolutions.
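To make the efficiency argument concrete, here is a quick back-of-the-envelope comparison (the 512x512 pixel size and the 64x64x4 latent size are the SD defaults discussed below; the actual speedup depends on the model, not just the tensor sizes):

# Number of values the diffusion model must denoise per image
pixel_space = 512 * 512 * 3   # RGB pixel space: 786,432 values
latent_space = 64 * 64 * 4    # SD latent space (f=8 downsampling, 4 channels): 16,384 values
print(pixel_space / latent_space)  # 48.0, i.e. ~48x fewer values per denoising step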

The main structure of the SD model is shown in the figure below, which mainly includes three models:

  • autoencoder : encoder compresses the image into latent space, and decoder decodes latent into an image;

  • CLIP text encoder : Extract the text embeddings of the input text and send them to the UNet of the diffusion model as a condition through cross attention;

  • UNet : The main body of the diffusion model, used to achieve text-guided latent generation.

For the SD model, the autoencoder model parameter size is 84M, the CLIP text encoder model size is 123M, and the UNet parameter size is 860M, so the total parameter size of the SD model is about 1B .
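As a rough sanity check, these parameter counts can be reproduced with the diffusers and transformers libraries (a minimal sketch; it assumes the runwayml/stable-diffusion-v1-5 weights used throughout this article and simply counts the parameters of each submodule):

from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel

model_id = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")

def count_params(m):
    # total number of parameters, in millions
    return sum(p.numel() for p in m.parameters()) / 1e6

print(f"autoencoder: {count_params(vae):.0f}M")           # ~84M
print(f"text encoder: {count_params(text_encoder):.0f}M") # ~123M
print(f"UNet: {count_params(unet):.0f}M")                 # ~860M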

autoencoder

The autoencoder is an image compression model based on an encoder-decoder architecture. For an input image of size HxWx3, the encoder module compresses it into a latent of size (H/f)x(W/f)xc, where f is the downsampling factor and c is the latent channel dimension. When training the autoencoder, in addition to an L1 reconstruction loss, a perceptual loss (LPIPS, see the paper The Unreasonable Effectiveness of Deep Features as a Perceptual Metric for details) and a patch-based adversarial loss are also added. These auxiliary losses mainly ensure the local realism of the reconstructed image and avoid blur; for the specific loss function, see the loss section of the latent diffusion paper. At the same time, to prevent the standard deviation of the resulting latent from being too large, two regularization methods are used. The first is KL-reg, which, like a VAE, adds a KL loss between the latent and a standard normal distribution, but with a relatively small weight (~10^-6) so as not to hurt reconstruction quality. The second is VQ-reg, which introduces a VQ (vector quantization) layer; the model can then be regarded as a VQ-GAN, except that the VQ layer sits inside the decoder module, and the codebook here is relatively large (8192) to reduce the impact of the regularization on reconstruction quality. The latent diffusion paper experiments with autoencoders under different settings, as shown in the table below: the smaller f and the larger c, the better the reconstruction (higher PSNR), which matches expectations, since the compression rate is then lower.
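Written out roughly (this is the general form given in the latent diffusion paper, not the exact weighting SD uses), the autoencoder objective combines the reconstruction terms, a patch-based discriminator D_psi, and the regularization term:

L_{\text{autoencoder}} = \min_{\mathcal{E},\,\mathcal{D}} \max_{\psi} \Big( L_{\text{rec}}\big(x, \mathcal{D}(\mathcal{E}(x))\big) - L_{\text{adv}}\big(\mathcal{D}(\mathcal{E}(x))\big) + \log D_{\psi}(x) + L_{\text{reg}}(x; \mathcal{E}, \mathcal{D}) \Big)

where L_rec covers the L1 and LPIPS terms, and L_reg is either the (lightly weighted) KL term or the VQ regularization.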

The paper further experiments with different autoencoders on the diffusion model, training for the same number of steps (2M steps) on the ImageNet dataset. The generation quality over training is shown below: when f is too small (such as 1 or 2), convergence is slow, because the image is barely compressed perceptually and the diffusion model needs to learn for longer; when f is too large, the generation quality is poor, because the compression loss is too large.

When f is between 4 and 16, relatively good results can be achieved. SD uses a KL-reg autoencoder with downsampling rate f=8 and latent channel dimension c=4, so a 512x512 input image yields a 64x64x4 latent. The autoencoder is trained on the OpenImages dataset at 256x256 resolution, but since the model is fully convolutional (based on ResNetBlocks), it generalizes to images larger than 256x256. Below we use the diffusers library to load the autoencoder and use it to compress and reconstruct an image. The code is as follows:

import torch
from diffusers import AutoencoderKL
import numpy as np
from PIL import Image

# Load the model: the autoencoder can be loaded separately from the SD weights by specifying a subfolder
autoencoder = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
autoencoder.to("cuda", dtype=torch.float16)

# Read the image and preprocess it (normalize to [-1, 1] and convert to NCHW)
raw_image = Image.open("boy.png").convert("RGB").resize((256, 256))
image = np.array(raw_image).astype(np.float32) / 127.5 - 1.0
image = image[None].transpose(0, 3, 1, 2)
image = torch.from_numpy(image)

# Compress the image to a latent and reconstruct it
with torch.inference_mode():
    latent = autoencoder.encode(image.to("cuda", dtype=torch.float16)).latent_dist.sample()
    rec_image = autoencoder.decode(latent).sample
    rec_image = (rec_image / 2 + 0.5).clamp(0, 1)
    rec_image = rec_image.cpu().permute(0, 2, 3, 1).numpy()
    rec_image = (rec_image * 255).round().astype("uint8")
    rec_image = Image.fromarray(rec_image[0])
rec_image
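For the 256x256 input used here, the encoded latent should be 1/8 of the spatial size with 4 channels. A quick check (continuing from the code above):

print(latent.shape)    # torch.Size([1, 4, 32, 32])
print(rec_image.size)  # (256, 256), the reconstructed PIL image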

Here we compare the reconstructions of two images at 256x256 and 512x512, as shown below: the first column is the original image, the second column is the reconstruction at 512x512, and the third column is the reconstruction at 256x256. From the comparison, it can be seen that compressing an image to a latent and then reconstructing it is lossy; for example, text and faces become distorted. This is more obvious at 256x256 resolution, while the result at 512x512 is much better.

This lossy compression certainly has some impact on the quality of images generated by SD, but fortunately SD is basically used at resolutions of 512x512 and above. To reduce this distortion, when Stability AI released SD 2.0 it also released two autoencoders fine-tuned on LAION sub-datasets. Note that only the decoder part of the autoencoder is fine-tuned: SD's UNet only needs the encoder part during training, so a fine-tuned autoencoder can be used directly with previously trained UNets (this trick is fairly common; for example, Google's Parti also expands and fine-tunes the decoder module of ViT-VQGAN after training the autoregressive generation model to improve generation quality). We can use these autoencoders directly in diffusers, such as the mse version (a model fine-tuned with MSE loss):

autoencoder = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse/")

For the same two images, the reconstructions from this mse version are shown below. It can be seen that, compared with the original autoencoder, the distortion is improved to some extent.

Since the autoencoder used by SD is based on KL-reg, encoding an image actually yields a Gaussian distribution, DiagonalGaussianDistribution (a mean and standard deviation); a concrete latent is then drawn by calling the sample method (calling the mode method returns the mean instead). Because the weight of the KL-reg term is very small, the actual standard deviation of the latent is still relatively large. The latent diffusion paper therefore proposes a rescaling method: first estimate the standard deviation of the latents on the first batch of data, then rescale the latents by a coefficient so that their standard deviation is close to 1 (otherwise the SNR of the diffusion process would be too high, which hurts generation quality; see the discussion in part D.1 of the latent diffusion paper). The diffusion model is then applied to the rescaled latents, and at decoding time the generated latent only needs to be divided by this coefficient before being sent to the autoencoder's decoder. For the autoencoder used by SD, this rescaling coefficient is 0.18215.
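A minimal sketch of how this coefficient is applied, reusing the autoencoder and image tensor from the code above (recent versions of diffusers also expose the coefficient as the config field scaling_factor; if your version does not have it, just use 0.18215 directly):

# 0.18215 for the SD v1 autoencoder; fall back to the constant if the config field is missing
scale = getattr(autoencoder.config, "scaling_factor", 0.18215)

with torch.inference_mode():
    # encode, then rescale so that the latent's standard deviation is close to 1
    latent = autoencoder.encode(image.to("cuda", dtype=torch.float16)).latent_dist.sample()
    latent = latent * scale

    # ... the diffusion model is trained and sampled in this rescaled latent space ...

    # undo the scaling before sending the latent to the decoder
    rec = autoencoder.decode(latent / scale).sample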

CLIP text encoder

SD uses a CLIP text encoder to extract text embeddings from the input text. Specifically, it uses the largest CLIP model open-sourced by OpenAI: clip-vit-large-patch14. The text encoder of this CLIP is a transformer model (encoder-only): 12 layers, a feature dimension of 768, and about 123M parameters. The input text is fed into the CLIP text encoder to obtain the last hidden states (the features from the last transformer block), with a feature size of 77x768 (77 is the number of tokens). These fine-grained text embeddings are injected into the UNet via cross attention. In the transformers library, the CLIP text encoder can be used as follows:

from transformers import CLIPTextModel, CLIPTokenizer

text_encoder = CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder").to("cuda")
# text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda")
tokenizer = CLIPTokenizer.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="tokenizer")
# tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Tokenize the input text to get the corresponding token ids
prompt = "a photograph of an astronaut riding a horse"
text_input_ids = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt"
).input_ids

# Feed the token ids into the text model to get the 77x768 features
text_embeddings = text_encoder(text_input_ids.to("cuda"))[0]

It is worth noting that the tokenizer's maximum length here is 77 (the setting used during CLIP training): if the input text has more than 77 tokens it is truncated, and if it has fewer it is padded, so any input text (even an empty one) yields a 77x768 feature. During SD training, the CLIP text encoder is frozen. Early work such as OpenAI's GLIDE and the LDM in the latent diffusion paper used a randomly initialized transformer to extract text features, but more recent work uses pre-trained text models: for example, Google's Imagen uses the text-only T5 encoder to extract text features, while SD uses the CLIP text encoder. Pre-trained models have been trained on large-scale datasets and generally work better than models trained from scratch.
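As a quick check of this fixed output shape, tokenizing even an empty prompt still yields 77x768 features (reusing the tokenizer and text_encoder loaded above):

empty_input_ids = tokenizer(
    "",  # empty prompt
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
).input_ids
empty_embeddings = text_encoder(empty_input_ids.to("cuda"))[0]
print(empty_embeddings.shape)  # torch.Size([1, 77, 768])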

UNet

The diffusion model of SD is an 860M UNet, whose main structure is shown in the figure below (taking a 64x64x4 input latent as an example). The encoder part consists of 3 CrossAttnDownBlock2D modules and 1 DownBlock2D module, and the decoder part consists of 1 UpBlock2D module and 3 CrossAttnUpBlock2D modules, with a UNetMidBlock2DCrossAttn module in the middle. The encoder and decoder parts correspond exactly and are connected by skip connections. Note that each of the three CrossAttnDownBlock2D modules ends with a 2x downsample operation, while the DownBlock2D module does not include downsampling.
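A small sketch that loads the pretrained UNet from diffusers and runs a dummy forward pass to confirm the block layout and the input/output shapes (the tensors here are random and only illustrate the expected shapes):

import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
).to("cuda", dtype=torch.float16)

print([b.__class__.__name__ for b in unet.down_blocks])
# ['CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D']

# dummy forward pass: latent + timestep + text embeddings -> predicted noise of the same shape
latent = torch.randn(1, 4, 64, 64, device="cuda", dtype=torch.float16)
timestep = torch.tensor([10], device="cuda")
text_emb = torch.randn(1, 77, 768, device="cuda", dtype=torch.float16)
with torch.inference_mode():
    noise_pred = unet(latent, timestep, encoder_hidden_states=text_emb).sample
print(noise_pred.shape)  # torch.Size([1, 4, 64, 64])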

The main structure of the CrossAttnDownBlock2D module is shown in the figure below. The text condition is injected through the CrossAttention module: the query of the attention is the intermediate UNet feature, while the key and value come from the text embeddings. SD trains the UNet to predict noise in the same way as DDPM, and its training loss has the same form:

L = \mathbb{E}_{z_0, \epsilon, c, t}\big[\lVert \epsilon - \epsilon_\theta(z_t, t, c)\rVert^2\big]

where c is the text embeddings, so the model here is a conditional diffusion model. Based on the diffusers library, we can quickly implement SD training. The core code is as follows (adapted from the finetune examples in the diffusers library):

import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
import torch.nn.functional as F

# Load the autoencoder
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
# Load the text encoder and tokenizer
text_encoder = CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="tokenizer")
# Initialize the UNet
unet = UNet2DConditionModel(**model_config) # model_config holds the model configuration
# Define the noise scheduler
noise_scheduler = DDPMScheduler(
    beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000
)

# Freeze the vae and the text_encoder
vae.requires_grad_(False)
text_encoder.requires_grad_(False)

opt = torch.optim.AdamW(unet.parameters(), lr=1e-4)

for step, batch in enumerate(train_dataloader):
    with torch.no_grad():
        # Encode the image into the latent space
        latents = vae.encode(batch["image"]).latent_dist.sample()
        latents = latents * vae.config.scaling_factor # rescale the latents
        # Extract the text embeddings
        text_input_ids = tokenizer(
            batch["text"],
            padding="max_length",
            max_length=tokenizer.model_max_length,
            truncation=True,
            return_tensors="pt"
        ).input_ids
        text_embeddings = text_encoder(text_input_ids)[0]

    # Sample random Gaussian noise
    noise = torch.randn_like(latents)
    # Sample a random timestep for each image in the batch
    bsz = latents.shape[0]
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps, (bsz,), device=latents.device
    ).long()
    # Forward diffusion: add noise to the latents according to the noise schedule
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    # Predict the noise with the UNet and regress it against the true noise
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
    loss = F.mse_loss(noise_pred.float(), noise.float(), reduction="mean")
    # Update the UNet parameters
    loss.backward()
    opt.step()
    opt.zero_grad()

Origin: blog.csdn.net/qq_41771998/article/details/130074864