Stable Diffusion: A New Type of Deep Learning AIGC Model

Latent Diffusion Model | AIGC | Diffusion Model

Perceptual Image Compression | GAN | Stable Diffusion

As the capabilities of generative AI continue to improve, more and more attention is being paid to using AI models to improve R&D efficiency. There are many popular AI models in the industry, such as the drawing powerhouse Midjourney, the versatile Stable Diffusion, and DALL-E 2, which OpenAI recently updated.

For an R&D team, although Midjourney is powerful and requires no local installation, it places high demands on hardware performance, and even the same prompt yields different results each time. By comparison, Stable Diffusion has become the more practical choice thanks to its versatility, open-source availability, fast runtime, low energy consumption, and small memory footprint.

 

The explosive rise of AIGC and GPT-4 technologies has given a major boost to text generation, audio generation, image generation, video generation, strategy generation, game AI, and virtual humans. These technologies not only improve creative quality but also reduce costs and increase efficiency. At the same time, the demand for GPUs and computing power keeps rising, so GPU server manufacturers have begun flocking to this track to provide better support for the field.

This article focuses on how to install Stable Diffusion, how Stable Diffusion works, and the advantages and disadvantages of diffusion models compared with GANs.

How to install Stable Diffusion

Stable Diffusion is a very useful tool that helps users generate the scenes and images they want quickly and accurately. Installation is also simple; just follow the steps below. If you need to generate images and scenes quickly, Stable Diffusion is a tool worth trying.

1. Environment preparation

1. Hardware

1) Video memory

Start with at least 4 GB of video memory: 4 GB supports generating 512*512 images; anything larger will run out of video memory and fail. An RTX 3090 is recommended here.

2) Hard disk

Reserve at least 10 GB; models are generally 5 GB or larger, so setting aside 30 GB of disk space is not excessive. Disk capacity should not be an issue nowadays.

2. Software

1) Git

https://git-scm.com/download/win

Just download the latest version; there is no specific version requirement.

2) Python

https://www.python.org/downloads/

3) Nvidia CUDA

https://developer.download.nvidia.cn/compute/cuda/11.7.1/local_installers/cuda_11.7.1_516.94_windows.exe

Version 11.7.1 comes bundled with Nvidia driver 516.94; the latest version can also be used.

4) stable-diffusion-webui

https://github.com/AUTOMATIC1111/stable-diffusion-webui

Use the latest version of this core component, but pay attention to its compatibility with the three components above.

5) Chinese language pack

https://github.com/VinsonLaro/stable-diffusion-webui-chinese

Download the chinese-all-0306.json and chinese-english-0306.json files.

6) Extension (optional)

https://github.com/Mikubill/sd-webui-controlnet

Download the entire sd-webui-controlnet archive.

https://huggingface.co/Hetaneko/Controlnet-models/tree/main/controlnet_safetensors

https://huggingface.co/lllyasviel/ControlNet/tree/main/models

https://huggingface.co/TencentARC/T2I-Adapter/tree/main

Download control_openpose.safetensors from the first link or control_sd15_openpose.pth from the second link to try it out.

7) Model

https://huggingface.co/models

https://civitai.com

You can find recommended models on the Internet. Common file extensions are .ckpt, .pt, .pth, and .safetensors; sometimes a VAE (.vae.pt) or a configuration file (.yaml) comes with them.

2. Installation process

1) Install Git

Install it normally; no special configuration is needed.

2) Install Python

It is recommended to install it in a directory outside Program Files and off the C drive to avoid directory permission problems.

Be sure to check "Add Python to PATH" during installation so that the Python path is automatically added to the Windows PATH environment variable.

3) Install Nvidia CUDA

Install it normally; no special configuration is needed.
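
As an optional sanity check (not part of the original steps), once webui.bat has installed PyTorch into the venv you can confirm that CUDA is visible from Python:

```python
# Optional sanity check: confirm the CUDA runtime and GPU are visible to PyTorch.
# Assumes PyTorch has already been installed (webui.bat installs it into \venv).
import torch

print(torch.version.cuda)          # CUDA version PyTorch was built against, e.g. "11.7"
print(torch.cuda.is_available())   # should print True if the driver/CUDA install succeeded
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce RTX 3090"
```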

4) Install stable-diffusion-webui

Within China a proxy and mirrors are needed; follow the steps below:

a) Edit the launch.py file in the root directory

Replace https://github.com with https://ghproxy.com/https://github.com, i.e. use the ghproxy proxy to accelerate Git access from within China. A sketch of this edit is shown below.
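
A minimal way to apply the replacement programmatically (a sketch, assuming launch.py is in the current working directory and that a plain string replacement is sufficient):

```python
# Hypothetical helper: prefix every GitHub URL in launch.py with the ghproxy mirror.
# This is just an automated version of the manual find-and-replace described above.
from pathlib import Path

launch = Path("launch.py")
text = launch.read_text(encoding="utf-8")
text = text.replace("https://github.com", "https://ghproxy.com/https://github.com")
launch.write_text(text, encoding="utf-8")
print("patched", launch.resolve())
```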

b) Execute the webui.bat file in the root directory

The tmp and venv directories will be generated under the root directory.

c) Edit the pyvenv.cfg file in the venv directory

Change include-system-site-packages=false to include-system-site-packages=true.

d) Configure the Python package manager pip

For convenience, open cmd under \venv\Scripts and run the pip commands there (a sketch is given below).

xformers will be installed into \venv\Lib\site-packages; if the installation fails, you can try the pip install -U xformers command.
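
A sketch of these commands, driven from Python for reproducibility (the Tsinghua PyPI mirror below is an assumption; any domestic mirror works the same way; run it with \venv\Scripts\python.exe):

```python
# Sketch of the pip setup described above, run with the venv's Python interpreter.
import subprocess
import sys

# Point pip at a domestic mirror to speed up downloads from within China (assumed mirror URL).
subprocess.check_call([sys.executable, "-m", "pip", "config", "set",
                       "global.index-url", "https://pypi.tuna.tsinghua.edu.cn/simple"])

# Install (or upgrade) xformers; this is the fallback command mentioned above.
subprocess.check_call([sys.executable, "-m", "pip", "install", "-U", "xformers"])
```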

e) Install language packs

Put the chinese-all-0306.json and chinese-english-0306.json files into the \localizations directory. Configure them after launching the webui; the procedure is described below.

f) Install extensions (optional)

Unzip sd-webui-controlnet into the \extensions directory, then copy the control_sd15_openpose.pth file into the \extensions\sd-webui-controlnet\models directory. Different extensions may also require additional system dependencies; for example, to use ControlNet properly you need to install ffmpeg.

g) Install the model

The downloaded models can be placed in the \models\Stable-diffusion directory.

h) Execute the webui.bat file in the root directory again

Open the URL provided by webui.bat in a browser; by default it is http://127.0.0.1:7860.

After opening the URL, set the language under Settings -> User interface -> Localization (requires restart), select the language pack placed earlier (for example chinese-all-0306) in the menu, click Apply Settings to confirm, then click Reload UI to restart the interface.

The principle behind Stable Diffusion

The overall framework of Latent Diffusion Models (LDM) is shown in the figure below. First, an autoencoder model is trained so that the encoder can compress images; the diffusion operation is then performed in the latent representation space, and finally the decoder restores the result back to pixel space. This approach is called perceptual compression. In my view, compressing high-dimensional features into a low-dimensional space and operating there is a general strategy that can easily be extended to text, audio, video, and other domains.

The diffusion process in the latent representation space is not much different from that of a standard diffusion model; the concrete implementation is a time-conditional UNet. However, the paper adds a conditioning mechanism (Conditioning Mechanisms) to the diffusion process and realizes multi-modal training through cross-attention, which also enables conditional image generation.
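
As a concrete illustration (an addition here, not from the original article), these three components correspond directly to the modules exposed by the diffusers implementation of Stable Diffusion; the sketch below assumes the diffusers library, the runwayml/stable-diffusion-v1-5 checkpoint, and a CUDA GPU:

```python
# Sketch: the three parts of the LDM framework as exposed by diffusers.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

print(type(pipe.vae).__name__)           # AutoencoderKL: encoder/decoder for perceptual compression
print(type(pipe.unet).__name__)          # UNet2DConditionModel: time-conditional UNet in latent space
print(type(pipe.text_encoder).__name__)  # CLIPTextModel: maps the prompt y to its embedding

image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("sample.png")
```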

Next, we expand on the details of perceptual compression, the diffusion model, and the conditioning mechanism.

1. Perceptual Image Compression

Perceptual compression is essentially a trade-off. Many earlier diffusion models worked without it, but diffusion models without perceptual compression have a big problem: when the model is trained in pixel space, generating high-resolution images means training in a very high-dimensional space. Perceptual compression uses an autoencoding model to discard high-frequency details and keep only the important basic features, greatly reducing the computational complexity of both training and sampling. This allows tasks such as text-to-image generation to produce images within about 10 seconds on consumer-grade GPUs, lowering the barrier to adoption.

Perceptual compression leverages a pretrained autoencoder model to learn a latent representation space that is perceptually equivalent to the image space. The advantage of this approach is that only one general autoencoder needs to be trained; it can then be reused to train different diffusion models for different tasks.

Therefore, training a diffusion model based on perceptual compression is essentially a two-stage process: in the first stage an autoencoder is trained, and in the second stage the diffusion model itself is trained. When training the autoencoder in the first stage, in order to avoid an arbitrarily high-variance latent representation space, the authors used two regularization methods, KL-reg and VQ-reg; this is why the officially released pretrained models come in both KL and VQ variants. Stable Diffusion mainly uses the AutoencoderKL implementation.
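
To make the first stage concrete, here is a minimal round trip through a pretrained AutoencoderKL (a sketch, assuming diffusers and Pillow are installed and a local image.png exists; the checkpoint name is an example):

```python
# Sketch of perceptual compression: encode an image to the latent space, then decode it back.
import numpy as np
import torch
from diffusers import AutoencoderKL
from PIL import Image

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

img = Image.open("image.png").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0   # scale pixels to [-1, 1]
x = x.permute(2, 0, 1).unsqueeze(0)                          # HWC -> NCHW

with torch.no_grad():
    z = vae.encode(x).latent_dist.sample()   # latent: (1, 4, 64, 64) for a 512x512 input (f = 8)
    x_rec = vae.decode(z).sample             # back to pixel space: (1, 3, 512, 512)

print(x.shape, z.shape, x_rec.shape)
```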

2. Latent Diffusion Models

First, a brief introduction to the ordinary diffusion model (DM), which can be interpreted as an equally weighted sequence of denoising autoencoders $\epsilon_\theta(x_t, t),\ t = 1, \dots, T$, whose goal is to predict a denoised variant of the input (equivalently, the added noise), where $x_t$ is a noisy version of the input $x$. The corresponding objective function can be written as:

$$L_{DM} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert_2^2\right],$$

where $t$ is uniformly sampled from $\{1, \dots, T\}$.

In the latent diffusion model, a pretrained perceptual compression model is introduced, consisting of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$. During training, the encoder is used to obtain $z = \mathcal{E}(x)$, so that the model learns in the latent representation space (with $z_t$ the noised version of $z$), and the corresponding objective function becomes:

$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t) \right\rVert_2^2\right].$$
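
A toy PyTorch rendering of this noise-prediction objective (not the official training code; the cosine schedule below is a placeholder assumption):

```python
# Toy sketch of the noise-prediction objective L_LDM written above.
import math
import torch
import torch.nn.functional as F

def ldm_loss(eps_model, z, T=1000):
    """eps_model(z_t, t) is a time-conditional network predicting the added noise."""
    t = torch.randint(0, T, (z.shape[0],), device=z.device)          # t sampled uniformly over timesteps
    eps = torch.randn_like(z)                                         # eps ~ N(0, I)
    alpha_bar = torch.cos(0.5 * math.pi * t.float() / T) ** 2         # placeholder noise schedule
    alpha_bar = alpha_bar.view(-1, 1, 1, 1)
    z_t = alpha_bar.sqrt() * z + (1.0 - alpha_bar).sqrt() * eps       # noisy latent z_t
    return F.mse_loss(eps_model(z_t, t), eps)                         # || eps - eps_theta(z_t, t) ||^2
```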

3. Conditioning mechanism

In addition to unconditional image generation, we can also perform conditional image generation. This is mainly achieved by extending the denoising autoencoder to a conditional one, $\epsilon_\theta(z_t, t, y)$, so that the image synthesis process can be controlled through the condition $y$. Concretely, the paper does this by adding a cross-attention mechanism to the UNet backbone. In order to preprocess $y$ coming from different modalities, the paper introduces a domain-specific encoder $\tau_\theta$ that maps $y$ to an intermediate representation $\tau_\theta(y)$, which makes it easy to introduce conditions of various forms (text, class labels, layouts, etc.). The final model fuses the control information into the intermediate layers of the UNet via a cross-attention layer:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right) V,$$

with

$$Q = W_Q^{(i)} \cdot \varphi_i(z_t), \quad K = W_K^{(i)} \cdot \tau_\theta(y), \quad V = W_V^{(i)} \cdot \tau_\theta(y),$$

where $\varphi_i(z_t)$ is an intermediate representation of the UNet. The corresponding objective function can be written as:

$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \right\rVert_2^2\right].$$
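
A minimal single-head PyTorch sketch of such a cross-attention layer, where queries come from the UNet features $\varphi_i(z_t)$ and keys/values from the condition embedding $\tau_\theta(y)$ (the dimensions used are illustrative):

```python
# Sketch of the cross-attention described above (single head, simplified).
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim_q, dim_cond, dim_head=64):
        super().__init__()
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(dim_q, dim_head, bias=False)     # Q = W_Q . phi_i(z_t)
        self.to_k = nn.Linear(dim_cond, dim_head, bias=False)  # K = W_K . tau_theta(y)
        self.to_v = nn.Linear(dim_cond, dim_head, bias=False)  # V = W_V . tau_theta(y)
        self.to_out = nn.Linear(dim_head, dim_q)

    def forward(self, phi_zt, tau_y):
        # phi_zt: (batch, n_image_tokens, dim_q), tau_y: (batch, n_cond_tokens, dim_cond)
        q, k, v = self.to_q(phi_zt), self.to_k(tau_y), self.to_v(tau_y)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.to_out(attn @ v)

# Example: a 64x64 latent flattened to 4096 tokens attends over 77 text-condition tokens.
layer = CrossAttention(dim_q=320, dim_cond=768)
out = layer(torch.randn(1, 4096, 320), torch.randn(1, 77, 768))
print(out.shape)  # torch.Size([1, 4096, 320])
```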

4. The trade-off between efficiency and effect

The paper analyzes the effect of different downsampling factors f ∈ {1, 2, 4, 8, 16, 32} (abbreviated LDM-f, where LDM-1 corresponds to the pixel-based DM). To obtain comparable results, the experiments were run on a single NVIDIA A100 and all models were trained with the same number of steps and parameters. The results show that small downsampling factors such as LDM-{1,2} train slowly, because most of the perceptual compression is left to the diffusion model. When f is too large, fidelity stagnates after relatively few training steps, because too much compression in the first stage loses information and limits the achievable quality. LDM-{4-16} strikes a good balance between efficiency and perceptual quality. Compared with the pixel-based LDM-1, LDM-{4-8} achieves lower FID scores while significantly increasing sample throughput. For complex datasets such as ImageNet, the compression rate needs to be reduced to avoid a loss of quality. In conclusion, LDM-4 and LDM-8 provide the highest-quality synthesis results.

Advantages and disadvantages of Diffusion model compared with GAN

1. Advantages

Compared with GANs, the diffusion model has the obvious advantage of avoiding troublesome adversarial training. In addition, there are several less obvious benefits. First, a diffusion model can "perfectly" use a latent to represent an image, because an ODE can map the latent to the image and the same ODE, run in reverse, can map the image back to a latent. GANs, by contrast, struggle to find the latent corresponding to a real image, so editing images that were not generated by the GAN may not be easy. Second, diffusion models can do color-block-based editing (SDEdit), a property GANs lack, so GANs perform much worse there. Third, because of the connection between diffusion models and the score function, a diffusion model can serve as a learned prior for inverse-problem solvers: for example, given a generative model of sharp images and a blurry input, the generative model can be used as a prior to make the image sharper. Finally, diffusion models can compute the model likelihood, which is hard for GANs. Part of the recent popularity of diffusion models may also be because progress on GANs has stalled. Strictly speaking, the diffusion model was first published by Jascha Sohl-Dickstein at ICML 2015, not much later than GAN's NeurIPS 2014; but DCGAN/WGAN, the work that made GANs take off, appeared in 2015-2017, while diffusion models only became widely popular around NeurIPS 2020, so their recent surge in popularity is not surprising.

2. Disadvantages

Compared with GANs, the diffusion model also has some shortcomings. First, the dimensionality of the latent space cannot be modified at will, which means image styles cannot be manipulated with AdaIN as in StyleGAN. Second, since there is no discriminator, it is difficult to handle supervision of the form "I want the network to output something that looks like a certain object, but I am not sure exactly what it is", whereas a GAN can do this easily, for example when generating images of giraffes. In addition, the iterative sampling process makes generation relatively slow, though this has largely been addressed for pure image generation; research on conditional image generation is still limited, and applying diffusion models to that area is worth exploring.

Origin blog.csdn.net/LANHYGPU/article/details/130009118