AIGC-Stable Diffusion

Stable Diffusion is a large generative model that marks a new milestone in the field of AI and shows us that the future will be the era of AIGC. Traditional deep learning models are gradually giving way to AIGC, which also means that we need to learn more about it.

If you are a beginner in AIGC like me, then it is very important to learn the basics of AIGC models. As a powerful model, Stable Diffusion is highly applicable, especially to generative tasks. By learning its basic theory and applications, you can better understand the rules of information propagation in complex networks and master generation techniques in different scenarios.

In short, Stable Diffusion is a compelling model. Its emergence marks a new development direction in the field of AI, and the future trend will be dominated by AIGC models. If you are interested in this, studying AIGC in depth will be very beneficial. [The end of the article covers building and using SD]


Before learning Stable Diffusion, it is necessary to understand the content of DDPM.

In my previous article, I briefly introduced DDPM. If you are interested, you can take a look: AIGC-Understanding DDPM (Diffusion Model) from a code perspective

Because my local environment is limited (video memory, computing power), some of the analysis may be relatively simple. Please forgive me~


Stable Diffusion (SD) is a generative model jointly developed by Stability AI and LAION. The model can be applied to text-to-image and image-to-image tasks, and it also supports subsequent customized image-generation tasks, such as ControlNet.

As can be seen from the model name, the SD model contains the word "Diffusion", which means that, like DDPM, it has a denoising process. For the image-to-image task, a noise-adding process is also involved.

This article will mainly introduce the text-to-image task and explore the application of the SD model in this task.


Text-to-image means that the user inputs a piece of text and, after a certain number of iterations, the model outputs an image that matches the text description.

SD model composition

The SD model mainly includes the following parts:

1.CLIP Text Encoder (text encoder)

Function: Encode the text information into a corresponding feature matrix, so that it can be fed into the SD model.

2.VAE Encoder (variational autoencoder)

Function: Provide the Latent Feature (latent-space feature) that, together with the text feature, serves as the model input. For an image-to-image task, the image is encoded to generate a Latent Feature; for a text-to-image task, a randomly generated Gaussian noise matrix is used as the Latent Feature. [That is, there are two inputs before entering the SD model: text features and latent-space features]

3.U-Net network

Function: Used to iteratively predict the noise, with the text semantic features injected into each noise-prediction step.

4. Scheduler

Function: Optimize the noise predicted by the U-Net (dynamically adjust the predicted noise and control how strongly the U-Net's predicted noise is applied at each step).

5.VAE Decoder (decoder)

Function: Pass the final Latent Feature through the decoder to generate an image. (The sketch below shows how these components can be loaded individually with diffusers.)
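
As a concrete illustration, here is a minimal sketch of loading each of these components individually; it assumes the standard stable-diffusion-v1-5 checkpoint layout and reuses the local path from the code later in this article.

import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

model_path = "F:/BaiduNetdiskDownload/stable-diffusion-v1-5"  # local checkpoint path used later in this article

# 1. CLIP text encoder + tokenizer: turn the prompt into a text feature matrix
tokenizer = CLIPTokenizer.from_pretrained(model_path, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_path, subfolder="text_encoder")

# 2. VAE: encodes images into latent features / decodes latents back into images
vae = AutoencoderKL.from_pretrained(model_path, subfolder="vae")

# 3. U-Net: predicts the noise in the latent at each step, conditioned on the text features
unet = UNet2DConditionModel.from_pretrained(model_path, subfolder="unet")

# 4. Scheduler: controls how the predicted noise is used to update the latent at each step
scheduler = PNDMScheduler.from_pretrained(model_path, subfolder="scheduler")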

In the iterative process of SD (denoising process), the noise will continue to decrease, and the image information and text semantic information will continue to increase.

The general process is as follows:
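In code terms, this process can be sketched roughly as follows. This is a simplified sketch reusing the components loaded above; classifier-free guidance, device/dtype handling, and full image post-processing are omitted.

import torch

prompt = ["a photograph of an astronaut riding a horse"]
height, width, num_inference_steps = 512, 512, 50

# 1. Text features from the CLIP text encoder
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids)[0]

# 2. Text-to-image: start from a random Gaussian latent
latents = torch.randn((1, unet.config.in_channels, height // 8, width // 8))

# 3. Iterative denoising: the U-Net predicts the noise, the scheduler removes it
scheduler.set_timesteps(num_inference_steps)
latents = latents * scheduler.init_noise_sigma
for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 4. Decode the final latent into an image with the VAE decoder
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample  # then rescale to [0, 1] and convert to PIL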


SD basic principles

In fact, whether it is a GAN, DDPM, or the SD model, like other deep learning algorithms, they all learn the data distribution of the training set during training.

SD, like DDPM, has a diffusion process (noise-adding process) and a generation process (denoising process).

In the forward diffusion process, noise is added continuously until a random Gaussian noise distribution is obtained. In the generation process, the noisy image is continuously denoised to obtain the final image. The process is as follows. Both the noise-adding and the denoising processes are Markov chains.

 Forward diffusion process (noise addition):

The forward diffusion process continuously adds noise: we keep adding noise to a picture until a random noise matrix is obtained (we only need to control the number of noise-addition steps), which is handled by the Scheduler mentioned earlier.
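
As a rough sketch of this noise-adding step (assuming a DDPM-style scheduler from diffusers; clean_image here is just a stand-in tensor for a real normalized image):

import torch
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

clean_image = torch.randn(1, 3, 512, 512)   # stand-in for a real (normalized) image
noise = torch.randn_like(clean_image)       # Gaussian noise
t = torch.tensor([999])                     # a late timestep: the result is close to pure noise

# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
noisy_image = noise_scheduler.add_noise(clean_image, noise, t)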

Reverse generation process (denoising):

Reverse generation is the opposite of forward diffusion: starting from a known noise distribution, the model infers and predicts the noise at each step so that it can be removed.

The training process then builds a loss between the predicted noise and the noise that was actually added [this part has been discussed in my other article on DDPM].
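
A minimal sketch of that training step (for SD the noise is added in latent space, so latents below stands in for the output of the VAE encoder; unet, text_embeddings, and noise_scheduler are reused from the sketches above):

import torch
import torch.nn.functional as F

latents = torch.randn(1, 4, 64, 64)   # stand-in for vae.encode(image).latent_dist.sample() * 0.18215
noise = torch.randn_like(latents)
t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (1,))

# Add noise to the clean latents, then let the U-Net predict that noise
noisy_latents = noise_scheduler.add_noise(latents, noise, t)
noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample

# Loss between the predicted noise and the noise that was actually added
loss = F.mse_loss(noise_pred, noise)
loss.backward()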


Quickly build SD models

There are many ways to build SD. Here I will take building SD with diffusers as an example (only the inference part is included).

Install the diffusers library and dependencies:

pip install diffusers==0.18.0 -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install transformers==4.27.0 accelerate==0.12.0 safetensors==0.2.7 invisible_watermark -i https://pypi.tuna.tsinghua.edu.cn/simple

Then you can quickly call SD

import torch
from diffusers import StableDiffusionPipeline

# Initialize the SD model and load the pretrained weights
pipe = StableDiffusionPipeline.from_pretrained("F:/BaiduNetdiskDownload/stable-diffusion-v1-5")

# If GPU memory is not enough, load the float16 weights instead:
# pipe = StableDiffusionPipeline.from_pretrained("F:/BaiduNetdiskDownload/stable-diffusion-v1-5", revision="fp16", torch_dtype=torch.float16)

pipe.to("cuda")

# Input prompt
prompt = "a photograph of an astronaut riding a horse"
steps = 50
image = pipe(prompt, height=512, width=512, num_inference_steps=steps).images[0]
image.save('SD_image.png')

Among them, num_inference_steps is the number of denoising steps. The larger the value, the better the result generally is, but it also takes more time.

The model's default output size is 512x512; generation results at lower resolutions are not good.
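
For example, a small loop like this (reusing the pipe and prompt above) can be used to compare different step counts at the recommended 512x512 resolution:

# Generate the same prompt with different numbers of denoising steps
for steps in (20, 50):
    image = pipe(prompt, height=512, width=512, num_inference_steps=steps).images[0]
    image.save(f'SD_image_{steps}steps.png')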

If your computing power is low, running inference on the CPU is also possible, and the results are still good~
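
A minimal sketch for that case (same local checkpoint path as above; simply do not move the pipeline to the GPU and keep the default float32 weights):

# CPU inference: just skip pipe.to("cuda"); generation will be much slower
pipe_cpu = StableDiffusionPipeline.from_pretrained("F:/BaiduNetdiskDownload/stable-diffusion-v1-5")
image = pipe_cpu(prompt, height=512, width=512, num_inference_steps=50).images[0]
image.save('SD_image_cpu.png')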

For example, I use the CPU for inference on my computer [my 1650 4G graphics card is too weak], and the result is as follows:

Article reference

[1] Rocky Ding. Complete and in-depth analysis of the core basic knowledge of Stable Diffusion (SD)

[2] Bubbliying. AIGC Column 2 - Stable Diffusion Structure Analysis - Taking Text-to-Image (txt2img) as an Example
