The principle behind Stable Diffusion (Latent Diffusion Models)

Foreword

The first blog of 2023. Happy New Year everyone~

This time, let's focus on the principle behind Stable Diffusion, that is, the paper High-Resolution Image Synthesis with Latent Diffusion Models.
Previous diffusion work operated only at resolutions up to $256 \times 256$ pixels (images are resized to this before being fed into the model), or even lower.
Latent Diffusion Models, however, can reach $512 \times 512$, and the generation quality is better.

Like the previous articles, this one will be analyzed from two perspectives: the paper and the code. This article will be continuously updated...

DDPM principle and code analysis
IDDPM principle and code analysis
DDIM principle and code (Denoising Diffusion Implicit Models)
Classifier Guided Diffusion



Theory

Abstract

(1) In the abstract, the authors note that previous diffusion models can also achieve SOTA, but they require huge amounts of compute.
“However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations.”

(2) The authors' idea, which is also the origin of the word latent in the model's name: instead of running the diffusion process on the original pixels, let the diffusion model learn in a latent space (which can be understood as a feature-map space).
"We apply them in the latent space of powerful pretrained autoencoders."
Specifically, the image first passes through an encoder (which can be a CNN) to obtain a feature map, then a standard diffusion process is performed on this feature map, and finally a decoder maps the result back to image pixel space.
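To make this pipeline concrete, here is a minimal runnable PyTorch sketch. The tiny convolutional encoder/decoder are illustrative stand-ins only; the actual LDM autoencoder is a pretrained KL- or VQ-regularized model.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the pretrained autoencoder (not the real LDM one):
# the encoder downsamples 512x512x3 to a 64x64x4 latent, the decoder inverts it.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(64, 4, 4, stride=2, padding=1),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(4, 64, 4, stride=2, padding=1), nn.SiLU(),
    nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.SiLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
)

x = torch.randn(1, 3, 512, 512)  # a 512x512 image batch
z = encoder(x)                   # perceptual compression: 1 x 4 x 64 x 64 latent
# ... a standard DDPM forward/reverse process runs on z, not on x ...
x_hat = decoder(z)               # map the latent back to pixel space
print(z.shape, x_hat.shape)      # (1, 4, 64, 64) and (1, 3, 512, 512)
```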

(3) The advantages are obvious:
"Our latent diffusion models (LDMs) achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including text-to-image synthesis, unconditional image generation and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs."


Introduction

(1) In the introduction, the authors analyze that learning in likelihood-based models can be divided into two stages: one is perceptual, covering image texture and details; the other is semantic, e.g., turning a handsome guy into a beautiful woman.
"As with any likelihood-based model, learning can be roughly divided into two stages: First is a perceptual compression stage which removes high-frequency details but still learns little semantic variation. In the second stage, the actual generative model learns the semantic and conceptual composition of the data (semantic compression)."

So the authors want to work after the perceptual compression stage, sacrificing a little texture fidelity in exchange for the ability to generate high-resolution ($512 \times 512$) images.

“Compared to pixel-based diffusion approaches, we also significantly decrease inference costs.”



Method

(1) The image passes through an encoder to get the feature $z$, i.e.
$$z = E(x)$$

In between runs a regular DDPM, except that the denoising is performed on $z$, not on $x$.


Finally, the predicted $\hat{x}$ is returned through the decoder:
$$\hat{x} = D(\hat{z})$$
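As a worked sketch, the training step below is an ordinary DDPM noise-prediction loss, just applied to latents $z_0 = E(x)$ instead of pixels. The `unet(z_t, t)` call signature and the linear beta schedule are assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # assumed linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{\alpha}_t

def latent_ddpm_loss(unet, z0):
    """One DDPM training step, run on latents z0 = E(x) rather than pixels."""
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps  # forward process q(z_t | z_0)
    return F.mse_loss(unet(z_t, t), eps)                # predict the added noise
```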


(2) If conditioning mechanisms are required, the features of the relevant condition can be fed to the noise predictor: $\epsilon_\theta(z_t, t, y)$, where $y = E_c(x_c)$.
For example, to condition on text, first pass the text through a text encoder to get text features, then feed them into the condition embedding of the U-Net, adding them to or concatenating them with the step embedding, etc. This is the normal conditional DDPM operation; see the sketch below.
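Here is a minimal sketch of this "add to the step embedding" style of conditioning, assuming class-label conditions and a standard sinusoidal timestep embedding (the dimensions and modules are illustrative):

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim=128):
    """Standard sinusoidal embedding of the diffusion step t."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([args.sin(), args.cos()], dim=-1)

cond_encoder = nn.Embedding(1000, 128)  # e.g. class labels -> condition embedding

t = torch.randint(0, 1000, (4,))        # diffusion steps for a batch of 4
y = torch.randint(0, 1000, (4,))        # class-label conditions
emb = timestep_embedding(t) + cond_encoder(y)  # added to the step embedding,
# then injected into every U-Net block, as in a normal conditional DDPM
```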

But the authors consider this approach insufficient: "however, combining the generative power of DMs with other types of conditionings beyond class-labels [15] or blurred variants of the input image [72] is so far an under-explored area of research."

This paper instead introduces a cross-attention mechanism:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right) \cdot V, \qquad Q = W_Q^{(i)} \varphi_i(z_t), \quad K = W_K^{(i)} \tau_\theta(y), \quad V = W_V^{(i)} \tau_\theta(y)$$

where $\varphi_i(z_t)$ is a (flattened) intermediate feature map of the U-Net. Here $\tau_\theta$ is the encoder that processes the prompt $y$; for example, for text $y$, the corresponding $\tau_\theta$ is a text encoder. Finally, $\epsilon_\theta$ and $\tau_\theta$ are jointly optimized with the following objective:

$$L_{LDM} := \mathbb{E}_{E(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \right\|_2^2\right]$$
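A compact single-head sketch of this cross-attention conditioning (dimensions are illustrative; the real model uses multi-head attention inside each U-Net block):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Q comes from the U-Net feature map phi_i(z_t); K and V come from tau_theta(y)."""
    def __init__(self, query_dim, context_dim, d=64):
        super().__init__()
        self.scale = d ** -0.5
        self.to_q = nn.Linear(query_dim, d, bias=False)
        self.to_k = nn.Linear(context_dim, d, bias=False)
        self.to_v = nn.Linear(context_dim, d, bias=False)
        self.to_out = nn.Linear(d, query_dim)

    def forward(self, phi, context):
        # phi:     (B, N, query_dim)   flattened U-Net feature map phi_i(z_t)
        # context: (B, M, context_dim) prompt features tau_theta(y)
        q, k, v = self.to_q(phi), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.to_out(attn @ v)

phi = torch.randn(2, 64 * 64, 320)        # flattened latent features
ctx = torch.randn(2, 77, 768)             # e.g. text features from a CLIP-like encoder
out = CrossAttention(320, 768)(phi, ctx)  # -> (2, 4096, 320)
```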

Origin: blog.csdn.net/weixin_43850253/article/details/128530913