SD core code analysis in stable-diffusion-webui

Stable Diffusion Principle Introduction and Source Code Analysis (1) - Zhihu. Preface (unrelated to the main text, can be skipped): Stable Diffusion is Stability AI's open-source text-to-image diffusion model. In the earlier article "Diffusion Model Brief Introduction and Source Code Analysis" I covered the principles of diffusion models and some of the algorithm code...
https://zhuanlan.zhihu.com/p/613337342

Stable Diffusion Principle Introduction and Source Code Analysis (2. DDPM, DDIM, PLMS) - Zhihu. This article continues the series (part 1 is the overview above) and walks through the Stable Diffusion framework implementation.
https://zhuanlan.zhihu.com/p/615310965

Combining the material above with stable-diffusion-webui, we can trace the img2img and txt2img pipelines. webui is built on Stability AI's stable-diffusion, which is in turn built on the ldm library and is essentially the same as CompVis's stable-diffusion; the same holds for ControlNet. Of course, webui has made many modifications, such as negative prompts and various prompt optimizations. diffusers is also well written, but since it wraps everything behind an API, the implementation details are largely hidden, so it is still worth reading the webui implementation. At present there are only two mainstream codebases for diffusion-model generation tasks: ldm and diffusers.

I recommend ldm for inference, because so many optimizations and third-party extensions are built on webui; for training, diffusers is convenient because it is well packaged. Of course, if you do not wrap the interface yourself, it does not matter much which you use.

The articles above are well written; the core steps are "Predict Noise with UNetModel" and "Denoise with UNetModel". Note that in DDPM, although training theoretically involves adding and predicting noise step by step over many timesteps, in practice noise is added (and predicted) in a single step per training sample; inference, by contrast, denoises iteratively.

1.UNetModel

The UNet is the core of SD. It contains both self-attention and cross-attention, and SD uses it to predict noise. The code is in stable-diffusion-stability-ai/ldm/modules/diffusionmodules/openaimodel.py.

The model downsamples and upsamples with Downsample and Upsample modules, plus ResBlock and SpatialTransformer blocks. Each ResBlock receives the output of the previous module together with the timestep embedding; each SpatialTransformer receives the output of the previous module together with the context (the prompt's text-embedding representation), and uses cross-attention with the context as the condition to learn the alignment between image and text. The UNet does not change the size between input and output.
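The encoder/decoder skeleton with skip connections can be sketched as follows. This is a toy NumPy illustration of the data flow only (the functions stand in for the real ldm modules and are not their implementations):

```python
import numpy as np

def resblock(h, t_emb):
    # stands in for ResBlock: mixes the timestep embedding into the features
    return h + t_emb.mean()

def spatial_transformer(h, context):
    # stands in for SpatialTransformer: cross-attends to the text context
    return h + context.mean()

def downsample(h):
    return h[:, :, ::2, ::2]                        # halve spatial size

def upsample(h):
    return h.repeat(2, axis=2).repeat(2, axis=3)    # double spatial size

def unet_forward(x, t_emb, context):
    hs, h = [], x
    for _ in range(2):                 # encoder: store skips, then downsample
        h = spatial_transformer(resblock(h, t_emb), context)
        hs.append(h)
        h = downsample(h)
    h = resblock(h, t_emb)             # middle block
    for _ in range(2):                 # decoder: upsample, then fuse the skip
        h = upsample(h)
        h = spatial_transformer(resblock(h + hs.pop(), t_emb), context)
    return h                           # same spatial size as the input

x = np.zeros((1, 4, 8, 8))             # latent input
out = unet_forward(x, np.zeros(320), np.zeros((77, 768)))
```

The point of the sketch is the last line of `unet_forward`: after the down/up passes and skip fusion, the output has exactly the input's shape.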

1.1 ResBlock

1.2 timestep_embedding

timestep_embedding is a sinusoidal embedding of the timestep; the same mechanism is also commonly used for conditioning inputs during training in SDXL.
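The sinusoidal embedding maps each timestep t to a vector of cosines and sines at geometrically spaced frequencies. A NumPy sketch of the idea (the ldm version is the equivalent in PyTorch):

```python
import numpy as np

def timestep_embedding(timesteps, dim, max_period=10000):
    """Map each timestep t to a dim-vector of cos/sin values at
    geometrically spaced frequencies, as in the Transformer paper."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half, dtype=np.float64) / half)
    args = np.asarray(timesteps, dtype=np.float64)[:, None] * freqs[None, :]
    emb = np.concatenate([np.cos(args), np.sin(args)], axis=-1)
    if dim % 2:  # pad with a zero column when dim is odd
        emb = np.concatenate([emb, np.zeros((emb.shape[0], 1))], axis=-1)
    return emb

emb = timestep_embedding([0, 500, 999], 320)   # (3, 320)
```

Each ResBlock then projects this embedding through a small MLP and adds it to its feature map, which is how the UNet knows which noise level it is denoising.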

1.3 Embedding of the prompt text
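In ldm the prompt is encoded with a frozen CLIP text encoder (FrozenCLIPEmbedder): the prompt is tokenized, padded to 77 tokens, and run through the transformer to produce a (77, 768) context tensor. A toy illustration of just the tokenize-and-lookup stage (the vocabulary and embedding table here are made up for the example; the real encoder also runs a transformer over these embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"<start>": 0, "<end>": 1, "a": 2, "cat": 3}   # toy vocabulary
E = rng.normal(size=(len(vocab), 768))                  # toy embedding table

def encode_prompt(prompt, max_len=77):
    ids = [vocab["<start>"]] + [vocab[w] for w in prompt.split()] + [vocab["<end>"]]
    ids += [vocab["<end>"]] * (max_len - len(ids))       # pad out to 77 tokens
    return E[np.array(ids)]                              # (77, 768) context

context = encode_prompt("a cat")
```

This context is what every SpatialTransformer in the UNet receives as its conditioning input.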

1.4 SpatialTransformer

In cross-attention the image features are the query and the text features are the key and value. Note that the first attention in the BasicTransformerBlock actually receives no text input; it is self-attention.
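A NumPy sketch of this attention pattern: when no context is passed, key and value come from x itself (self-attention); with a text context, the image tokens query the text tokens. Dimensions here are illustrative (64 image tokens from an 8x8 latent, 77 CLIP tokens), and the random projection matrices stand in for learned weights:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, context=None, d=64):
    rng = np.random.default_rng(0)
    context = x if context is None else context   # self-attention when no context
    Wq = rng.normal(size=(x.shape[-1], d))        # toy stand-ins for learned weights
    Wk = rng.normal(size=(context.shape[-1], d))
    Wv = rng.normal(size=(context.shape[-1], d))
    q, k, v = x @ Wq, context @ Wk, context @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))          # (img_tokens, ctx_tokens)
    return attn @ v

img_tokens = np.ones((64, 320))    # 8x8 latent flattened to 64 tokens
text_ctx = np.ones((77, 768))      # CLIP text context
out = attention(img_tokens, text_ctx)
```

The `q @ k.T` product is where image patches score their match against each prompt token; that matrix is what ControlNet-style attention visualizations display.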

2. DDPM

The DDPM code is in ldm/models/diffusion/ddpm.py

As you can see, DDPM adds noise in a single step, jumping directly from x_0 to x_t.
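The single-step noising (q_sample in ddpm.py) uses the closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. A NumPy sketch with a linear beta schedule (the schedule values here follow the common 1e-4 to 0.02 defaults, not necessarily the exact config in webui):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)           # linear beta schedule
alphas_cumprod = np.cumprod(1.0 - betas)     # alpha_bar_t

def q_sample(x0, t, noise):
    # jump straight from x_0 to x_t in one step
    return (np.sqrt(alphas_cumprod[t]) * x0
            + np.sqrt(1.0 - alphas_cumprod[t]) * noise)

x0 = np.ones((4, 64, 64))
noise = np.random.default_rng(0).normal(size=x0.shape)
x_t = q_sample(x0, 500, noise)
```

Training samples a random t, noises x_0 this way, and asks the UNet to predict `noise` back; that is the whole training loop.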

During inference:
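Inference denoises iteratively: each step computes x_{t-1} = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_t) + sigma_t * z. A sketch of the loop with a dummy noise predictor standing in for the UNet:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

def dummy_eps_model(x, t):
    return np.zeros_like(x)        # stands in for the UNet's predicted noise

def p_sample_loop(shape, rng=np.random.default_rng(0)):
    x = rng.normal(size=shape)     # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = dummy_eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1 - alphas_cumprod[t]) * eps) / np.sqrt(alphas[t])
        z = rng.normal(size=shape) if t > 0 else 0.0  # no noise at the final step
        x = mean + np.sqrt(betas[t]) * z
    return x

sample = p_sample_loop((1, 4, 8, 8))
```

With a real UNet this loop runs all 1000 steps, which is exactly the cost that DDIM and the later samplers reduce.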

3. Improvements over DDPM

The first row is DDIM and the last row is DDPM. FID measures image quality (lower is better); S is the number of sampling steps. With the same number of steps, DDIM performs better.


DDIM defines the noising process as a non-Markovian process.
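Because the non-Markovian formulation lets the sampler skip steps, DDIM updates with x_{t_prev} = sqrt(abar_{t_prev}) * x0_pred + sqrt(1 - abar_{t_prev} - sigma^2) * eps + sigma * z, and with eta = 0 it is fully deterministic. A sketch of the deterministic step over a strided timestep sequence (schedule values and stride are illustrative):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)               # alpha_bar_t

def ddim_step(x_t, eps, t, t_prev):
    # predict x_0 from the noise estimate, then move directly to t_prev (eta = 0)
    x0_pred = (x_t - np.sqrt(1 - abar[t]) * eps) / np.sqrt(abar[t])
    return np.sqrt(abar[t_prev]) * x0_pred + np.sqrt(1 - abar[t_prev]) * eps

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 8, 8))
ts = list(range(999, -1, -200)) + [0]        # strided: 999, 799, ..., 199, 0
for t, t_prev in zip(ts[:-1], ts[1:]):
    x = ddim_step(x, np.zeros_like(x), t, t_prev)  # dummy eps from the UNet
```

The stride is what makes 50-step sampling possible: the update jumps from t to t_prev directly instead of walking every intermediate timestep as DDPM must.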

The subsequent samplers (PLMS and others) are basically further improvements built on DDPM and DDIM.


Origin blog.csdn.net/u012193416/article/details/132696760