SDXL: An Improved Version of Stable Diffusion


Paper: "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis"
GitHub: https://github.com/Stability-AI/generative-models

1. Summary

SDXL is a text-to-image model. Compared with previous versions of Stable Diffusion (SD), SDXL uses a UNet backbone roughly three times larger, mainly due to more attention blocks and a larger cross-attention context. The authors design several novel conditioning mechanisms and introduce a refinement model to improve the fidelity of generated images. Compared with previous SD versions, SDXL's performance is greatly improved.

2. Algorithm

The overall structure of SDXL is shown in Figure 1.

2.1 Structure

Diffusion generative models mainly use a UNet backbone. As diffusion models have developed, the architecture has evolved: adding self-attention, improving the upsampling layers, adding cross-attention, and finally moving to fully transformer-based structures.
For efficiency, the authors remove the transformer blocks at the highest-resolution feature level, use 2 and 10 transformer blocks at the two middle levels, and remove the deepest feature level (the 8× downsampling level) entirely, as shown in Table 1, which compares SDXL with SD 1.x and SD 2.x.
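To make this layout concrete, here is a hypothetical Python sketch contrasting the two UNet configurations. The block counts and channel multipliers follow Table 1 of the paper, but the dictionary layout and names are illustrative, not the repository's actual config format:

```python
# Hypothetical configuration sketch (names are illustrative; values follow Table 1).
sd_1x_unet = {
    "channel_mult": [1, 2, 4, 4],        # four feature levels, deepest at 8x downsampling
    "transformer_blocks": [1, 1, 1, 1],  # one transformer block at every level
}
sdxl_unet = {
    "channel_mult": [1, 2, 4],           # deepest (8x downsampling) level removed
    "transformer_blocks": [0, 2, 10],    # none at the highest resolution, more in the middle
}
```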
The authors also use a more powerful pre-trained text encoder: the penultimate-layer outputs of OpenCLIP ViT-bigG and CLIP ViT-L are concatenated. Besides constraining the model on the input text through cross-attention layers, the pooled text embedding is also fed in as an additional conditioning input. This brings the model to 2.6B parameters, of which 817M belong to the text encoders.
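A minimal tensor-level sketch of this text conditioning follows; the shapes match those described in the paper, while the random tensors stand in for real encoder outputs:

```python
import torch

batch, seq_len = 2, 77
# Penultimate-layer token embeddings from the two text encoders
# (random stand-ins for actual encoder outputs).
clip_l_tokens = torch.randn(batch, seq_len, 768)     # CLIP ViT-L
openclip_tokens = torch.randn(batch, seq_len, 1280)  # OpenCLIP ViT-bigG
openclip_pooled = torch.randn(batch, 1280)           # pooled OpenCLIP embedding

# Concatenate along the channel axis -> (batch, 77, 2048) cross-attention context.
context = torch.cat([clip_l_tokens, openclip_tokens], dim=-1)

# The pooled embedding serves as an extra conditioning vector fed to the UNet
# alongside the timestep embedding.
print(context.shape)  # torch.Size([2, 77, 2048])
```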

2.2 Micro-Conditioning Mechanism

A significant drawback of the LDM paradigm is that, due to its two-stage architecture, training requires a minimum image size. There are two common workarounds: discard all training images below a resolution threshold (for example, images below 512 pixels in Stable Diffusion 1.4/1.5), or upscale images that are too small. However, the former discards a large fraction of the training data, while the latter introduces upscaling artifacts that can leak into the model's outputs and produce blurry samples.
The authors instead use the original image resolution $c_{size} = (h_{original}, w_{original})$ as an additional UNet conditioning input. Specifically, each component is encoded with Fourier features, the encodings are concatenated into a single vector, and this vector is added to the timestep embedding. The process is shown in Algorithm 1.
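Below is a minimal sketch of this size conditioning, assuming a standard sinusoidal Fourier encoding; the paper's Algorithm 1 is the authoritative version, and the embedding dimensions here are illustrative:

```python
import math
import torch

def fourier_features(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    # Sinusoidal (Fourier) encoding of a scalar condition, one row per batch element.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = x.float().unsqueeze(-1) * freqs
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# c_size = (h_original, w_original) for a batch of two training images.
h_orig = torch.tensor([1024.0, 512.0])
w_orig = torch.tensor([768.0, 512.0])

# Encode each component and concatenate into a single conditioning vector ...
size_cond = torch.cat([fourier_features(h_orig), fourier_features(w_orig)], dim=-1)

# ... which is added to the timestep embedding (here a random stand-in).
timestep_emb = torch.randn(2, 512)
conditioned_emb = timestep_emb + size_cond  # shapes: (2, 512) + (2, 512)
```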

During inference, the user can set the desired size condition, as shown in Figure 3: as the specified size increases, image quality improves.
The authors compare a model trained only on images of at least 512×512 resolution (CIN-512-only), a model trained on all data without size conditioning (CIN-nocond), and a model trained on all data with size conditioning (CIN-size-cond). The results are shown in Table 2.

Conditioning the model on cropping parameters
As shown in the first two rows of Figure 4, objects generated by previous models may be cropped, because training uses random cropping to align image sizes within a batch. To solve this, the authors uniformly sample the crop coordinates $c_{top}, c_{left}$ (offsets from the top-left corner) during data loading and feed them into the model as conditioning parameters via Fourier feature encoding, as in Algorithm 1. Setting $(c_{top}, c_{left}) = (0, 0)$ at inference time generates samples with the object centered in the image, as shown in Figure 5.
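A sketch of how crop coordinates could be sampled during data loading follows; this is an assumption about the details, not the authors' actual data loader:

```python
import torch

def crop_with_coords(image: torch.Tensor, target: int = 512):
    # image: (C, H, W). Uniformly sample the crop's top-left corner and keep
    # the coordinates so they can be Fourier-encoded as a condition (Algorithm 1).
    _, h, w = image.shape
    c_top = int(torch.randint(0, h - target + 1, (1,)))
    c_left = int(torch.randint(0, w - target + 1, (1,)))
    crop = image[:, c_top:c_top + target, c_left:c_left + target]
    return crop, (c_top, c_left)

img = torch.rand(3, 768, 1024)
crop, (c_top, c_left) = crop_with_coords(img)
# At inference time, conditioning on (c_top, c_left) = (0, 0) yields
# object-centered samples.
```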

2.3 Multi-Aspect Ratio Training

Current text-to-image models output square images (512×512 or 1024×1024), unlike real-world images, which come in many aspect ratios. To address this, the authors train on images of various aspect ratios, bucketed so that the total pixel count stays close to 1024×1024 while width and height remain multiples of 64, as sketched below.
A fixed aspect ratio and resolution are used during pre-training; multi-aspect training is applied only in the fine-tuning stage.
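Here is a sketch of how such aspect-ratio buckets could be enumerated; this is illustrative, since the paper does not publish its exact bucket list:

```python
TARGET_AREA = 1024 * 1024  # keep total pixel count near 1024x1024

def make_buckets(min_side: int = 512, max_side: int = 2048, step: int = 64):
    # Enumerate heights in multiples of 64 and pick the width (also a
    # multiple of 64) whose product is closest to the target area.
    buckets = set()
    for h in range(min_side, max_side + 1, step):
        w = round(TARGET_AREA / h / step) * step
        if min_side <= w <= max_side:
            buckets.add((h, w))
    return sorted(buckets)

print(make_buckets()[:4])  # e.g. [(512, 2048), (576, 1792), (640, 1664), ...]
```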

2.4 Improved Autoencoder

The authors retrain an autoencoder with the same network structure as the original Stable Diffusion, but with a larger batch size (256 vs. 9), and additionally track the weights with an exponential moving average (EMA). The experimental results are shown in Table 3: the improved SDXL-VAE outperforms the original SD-VAE 1.x and 2.x.
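A minimal sketch of EMA weight tracking, a standard recipe rather than the authors' exact training code:

```python
import copy
import torch

class EMA:
    # Keeps a shadow copy of the model whose weights are an exponential
    # moving average of the training weights.
    def __init__(self, model: torch.nn.Module, decay: float = 0.9999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Usage: after each optimizer step, call ema.update(autoencoder);
# evaluate and sample with ema.shadow.
```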

2.5 Putting It All Together

The authors train the final model, SDXL, using the improved autoencoder from Section 2.4.
The base model is first pre-trained on an internal dataset (whose height/width distribution is shown in Figure 2) at 256×256 resolution, using the size and crop conditioning described in Section 2.2; it is then trained further on 512×512 images; finally, it is trained at approximately 1024×1024 resolution using multi-aspect training.
Refinement stage
As shown in Figure 6, the authors found that some generated samples had low local quality. They therefore train a separate LDM in the same latent space, specialized on high-quality, high-resolution data, and apply the noising-denoising process introduced by SDEdit to the latents produced by the base model. During inference, as shown in Figure 1, the latent from the base SDXL model is passed to the refinement model, which performs further diffusion denoising on it using the same text input, before the result is decoded. Visualization results are shown in Figures 6 and 13.
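For reference, the released checkpoints expose this two-stage base-plus-refiner scheme through the Hugging Face diffusers library; the following sketch uses diffusers' public API, not code from the paper:

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"
# The base model outputs latents instead of decoded images ...
latents = base(prompt=prompt, output_type="latent").images
# ... which the refiner denoises further with the same text prompt.
image = refiner(prompt=prompt, image=latents).images[0]
image.save("sdxl_refined.png")
```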
User-study results are shown on the left of Figure 1; the refinement module has a clear positive effect. However, on FID and CLIP-score metrics, SDXL scores worse than SD 1.5 and SD 2.1, as shown in Figure 12. The authors note that Kirstain et al. showed the COCO zero-shot FID score is negatively correlated with human visual preference, so human evaluation should take precedence; their experiments are consistent with this finding.

2.6 Comparison with Mainstream Methods

Figure 8 compares results from SDXL and other mainstream generation methods.

3. Future Work

Single stage: SDXL is a two-stage approach that requires a separate refinement model, which increases memory requirements and reduces sampling speed; a single-stage solution will be studied in the future;
Text synthesis: the larger text encoder improves text rendering compared with previous SD models, but incorporating byte-level tokenizers or further scaling the model may also help;
Architecture: the authors experimented with transformer-based backbones such as UViT and DiT but saw no gains; further research on hyperparameters is needed;
Distillation: although SDXL's generation quality has improved, inference cost has also increased; distillation could reduce this cost in the future.

The model is currently trained in discrete time and requires offset noise for good results. The EDM framework proposed by Karras et al. may be a promising future training formulation: it offers continuous time, flexible sampling, and needs no noise-schedule correction.

4. Limitations

  1. It is challenging to generate intricate structures such as human hands, as shown in Figure 7. Although large amounts of training data are used, the complexity of human anatomy makes it hard to learn accurate, consistent representations; hands and similar objects appear with very high variance, which the model may find difficult to capture;
  2. Certain nuances, such as subtle lighting effects or minute texture variations, may be missing or rendered unrealistically in generated images;
  3. Model training relies on large-scale datasets, which may introduce social and racial biases that are then reproduced in generated images;
  4. When a prompt contains multiple objects, the model exhibits "concept bleeding": distinct elements merge or overlap, as shown in Figure 14. This may be caused by the text encoder, which compresses all information into a single embedding, making it hard to bind attributes to the correct objects; Feng et al. address this by encoding word relationships. The contrastive loss used to train the text encoders may also contribute, since negative samples with different bindings appear in the same batch;
  5. It is difficult to render long, legible text, as shown in Figure 8; overcoming this requires further strengthening the model's text-generation ability.

Origin: blog.csdn.net/qq_41994006/article/details/132152984