SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Following the earlier post on the text-to-image model Stable Diffusion (https://zhuanlan.zhihu.com/p/642496862), this article covers SDXL, the upgraded version recently released by Stability AI. The code, model weights, and technical report are all open source (official code: https://github.com/Stability-AI/generative-models). The report is worth reading beyond the model itself: when a model underperforms in a project, simply adding more data is not necessarily the right fix; a better approach is to analyze how the data is actually used and where the problems lie, and then refine from there. The SDXL authors do exactly this kind of analysis, and it reads very clearly. SDXL can already be tried in version 1.5 of stable_diffusion_webui, and it has also been integrated into diffusers.

1. Introduction

SDXL brings three major improvements: 1. the UNet is scaled up to 2.6B parameters, and two CLIP text encoders are used together to extract text features; 2. additional condition injection is used to fix data-processing problems during training (size and crop conditioning), followed by multi-aspect (multi-scale) fine-tuning; 3. a refinement model is cascaded after the base model to further improve image quality (the refiner can also be used on its own to enhance details).

2. Improving Stable Diffusion

The first figure shows human preference scores: SDXL is rated far ahead of previous Stable Diffusion versions.

2.1 Architecture & scale

SDXL uses a much larger UNet: 2.6B parameters, roughly three times the size of the UNet in previous SD versions.

The two figures above, taken together, explain the "transformer blocks" and "channel mult" rows of the table: the first is the UNet structure of SDXL, the second the UNet of SD. The first stage of SDXL is a plain DownBlock2D; CrossAttnDownBlock2D is not used there because SDXL directly generates 1024x1024 images, so the corresponding latent is 128x128x4, and running attention (even self-attention) at that resolution would be very expensive in memory and compute. SDXL also has only 3 stages: in the table, the SDXL column has 3 values while SD has 4, one per module. Three stages means only two 2x downsamplings are performed, whereas previous SD versions use 4 stages with three 2x downsamplings. The network width is unchanged: the feature channels of the three stages are 320, 640, and 1280. The bulk of SDXL's extra parameters comes from using more transformer blocks: in SD, every attention-bearing block uses a single transformer block (self-attention -> cross-attention -> FFN), while in SDXL the CrossAttnDownBlock2D modules of stage 2 and stage 3 use 2 and 10 transformer blocks respectively, and the middle MidBlock2DCrossAttn also uses 10, matching the last stage. This corresponds to the second row of the table (0, 2, 10 for SDXL versus 1 everywhere for SD), and the 1, 2, 4 in the third row are channel multipliers of 320.
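These numbers can be checked by inspecting the UNet config in diffusers. A minimal sketch, assuming the public stabilityai/stable-diffusion-xl-base-1.0 repository on the Hugging Face Hub (only the config file is fetched, not the weights):

```python
from diffusers import UNet2DConditionModel

# Read the SDXL base UNet config and print the rows discussed above.
cfg = UNet2DConditionModel.load_config(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
print(cfg["down_block_types"])              # first stage is a plain DownBlock2D (no attention)
print(cfg["block_out_channels"])            # expected: [320, 640, 1280]
print(cfg["transformer_layers_per_block"])  # deeper stages stack more transformer blocks
print(cfg["cross_attention_dim"])           # expected: 2048 (concatenated CLIP features)
```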

The other change in SDXL is the text encoder. SD 1.x uses OpenAI's CLIP ViT-L/14 with a 123M-parameter text encoder; SD 2.x upgraded to the 354M OpenCLIP ViT-H/14 (OpenCLIP is trained by LAION). SDXL uses both OpenCLIP ViT-bigG (694M parameters) and OpenAI CLIP ViT-L/14, extracting the penultimate-layer features of the two text encoders. The OpenCLIP ViT-bigG features are 1280-dimensional and the CLIP ViT-L/14 features are 768-dimensional; concatenated, they form 2048 dimensions, which is SDXL's cross-attention context dim.
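As a rough sketch of how the 2048-dim context is assembled, following the component layout diffusers uses for SDXL and again assuming the public base checkpoint, the penultimate hidden states of the two text encoders are simply concatenated along the feature dimension:

```python
import torch
from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer

repo = "stabilityai/stable-diffusion-xl-base-1.0"  # assumed public checkpoint
tok1 = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")    # for CLIP ViT-L/14
tok2 = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer_2")  # for OpenCLIP ViT-bigG
enc1 = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
enc2 = CLIPTextModelWithProjection.from_pretrained(repo, subfolder="text_encoder_2")

prompt = "a photo of an astronaut riding a horse"
with torch.no_grad():
    ids1 = tok1(prompt, padding="max_length", max_length=77, return_tensors="pt").input_ids
    ids2 = tok2(prompt, padding="max_length", max_length=77, return_tensors="pt").input_ids
    out1 = enc1(ids1, output_hidden_states=True)
    out2 = enc2(ids2, output_hidden_states=True)
    h1 = out1.hidden_states[-2]            # (1, 77, 768)  penultimate layer of ViT-L/14
    h2 = out2.hidden_states[-2]            # (1, 77, 1280) penultimate layer of ViT-bigG
    context = torch.cat([h1, h2], dim=-1)  # (1, 77, 2048) -> UNet cross-attention context
    pooled = out2.text_embeds              # pooled ViT-bigG embedding, also used as a condition
print(context.shape, pooled.shape)
```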

After these adjustments, the total parameter count of the SDXL UNet is 2.6B. Although the UNet changed, the diffusion setup is the same as the original SD: a 1000-step DDPM, with the noise schedule unchanged.
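A quick way to verify that the diffusion setup is unchanged is to inspect the scheduler config shipped with the checkpoint (again assuming the public base repository; only the config is read):

```python
from diffusers import EulerDiscreteScheduler

sched = EulerDiscreteScheduler.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler"
)
print(sched.config.num_train_timesteps)  # expected: 1000, as in SD
print(sched.config.beta_schedule)        # same beta schedule family as SD
```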

2.2 Micro-conditioning

The second optimization in SDXL is to inject additional conditions to solve data-processing problems during training, namely data-utilization efficiency and image cropping.

The first problem concerns training resolution. SD is typically pre-trained at 256x256 and then trained further at 512x512. Training at 256x256 requires filtering out images whose width or height is below 256, and training at 512x512 can only use images at or above that size, so filtering reduces the number of usable training samples. As the figure above shows, images smaller than 256 account for 39% of the data. A direct workaround is to super-resolve the original data first, but super-resolution can introduce its own problems (when the original image is blurry, this effectively amounts to repairing the training data). SDXL instead embeds the original width and height of each image as a condition in the UNet, which lets the model learn the notion of image resolution. During training, images can simply be resized without filtering; at inference, you only need to pass the target resolution to get high-quality outputs. The original size is embedded the same way as the timestep: width and height are encoded with Fourier feature encoding, and the features are concatenated and added onto the time embedding (a minimal sketch follows below). In the figure below, for a model trained at 512x512, the output is blurry when a low original resolution is given as the condition, and the image quality improves as the conditioned resolution increases.
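The mechanism itself is simple. Here is an illustrative sketch (the dimensions are assumptions, not the exact SDXL values) of turning the original size into Fourier features and adding them onto the time embedding, reusing the sinusoidal embedding module from diffusers:

```python
import torch
from diffusers.models.embeddings import Timesteps

# Encode (orig_height, orig_width) with the same sinusoidal "Fourier" embedding used
# for timesteps, concatenate, project, and add onto the time embedding.
fourier = Timesteps(num_channels=256, flip_sin_to_cos=True, downscale_freq_shift=0)
proj = torch.nn.Linear(2 * 256, 1280)          # project to the time-embedding width

orig_size = torch.tensor([512.0, 704.0])       # original H, W of a training image
size_emb = proj(fourier(orig_size).flatten())  # (1280,) size-conditioning vector
time_emb = torch.randn(1280)                   # stand-in for the usual timestep embedding
cond_emb = time_emb + size_emb                 # added onto the timestep signal
print(cond_emb.shape)
```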

The second problem is image cropping during training. Text-to-image models are usually pre-trained at a fixed image size, so the original images must be pre-processed: typically the shorter side is resized to the target size and the image is then cropped along the longer side (random crop or center crop). Cropping, however, often cuts off parts of the subject.

SDXL injects the coordinates of the top-left corner of the training crop into the UNet as a condition, again via Fourier encoding added onto the time embedding. At inference, simply setting the coordinates to (0, 0) yields a centered image. The intuition is straightforward: during training the model learns to associate the crop's top-left coordinates with how incomplete the picture is.

In SDXL training, the two condition injections are used together; only the original width/height and the top-left crop coordinates need to be stored per image. With these conditions, SDXL is trained at 256x256 for 600,000 steps (batch size 2048), then at 512x512 for another 200,000 steps, which amounts to roughly 1.6 billion samples. Finally, multi-aspect (multi-scale) fine-tuning is performed around 1024x1024.
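At inference time, both conditions are exposed directly as pipeline arguments in diffusers. A usage sketch, assuming the public SDXL base checkpoint and a CUDA GPU:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# original_size / target_size correspond to the size conditioning, and
# crops_coords_top_left=(0, 0) asks for an uncropped, centered composition.
image = pipe(
    "a cat sitting on a wooden bench, photo",
    original_size=(1024, 1024),
    crops_coords_top_left=(0, 0),
    target_size=(1024, 1024),
).images[0]
image.save("cat.png")
```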

2.3 Multi-aspect training

After pre-training, SDXL is fine-tuned at multiple aspect ratios, following NovelAI's scheme: the images in the dataset are divided into buckets by aspect ratio. During training, each step switches between buckets, and all samples in a batch are drawn from the same bucket. In addition, SDXL injects the bucket size, i.e. the target size, into the UNet as a condition, handled in the same way as the size and crop conditions above; a bucketing sketch follows below.
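For illustration only, here is a minimal bucketing sketch (the bucket list is an assumption, not the exact set used for SDXL): each bucket holds roughly 1024x1024 pixels, and every image is assigned to the bucket whose aspect ratio is closest to its own.

```python
import math

# Example buckets of roughly constant pixel count (illustrative, not SDXL's exact list).
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1216, 832), (832, 1216),
           (1344, 768), (768, 1344), (1536, 640), (640, 1536)]

def assign_bucket(width: int, height: int) -> tuple[int, int]:
    """Return the (bucket_w, bucket_h) whose aspect ratio is closest in log space."""
    log_ratio = math.log(width / height)
    return min(BUCKETS, key=lambda wh: abs(log_ratio - math.log(wh[0] / wh[1])))

print(assign_bucket(1920, 1080))  # a 16:9 image lands in a wide bucket such as (1344, 768)
```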

2.4 Improved autoencoder

Like SD, SDXL is based on the latent diffusion architecture: an autoencoder first compresses the image into a latent, the diffusion model then generates latents, and the decoder of the autoencoder reconstructs the image from the generated latent. In other words, generation happens in latent space, not pixel space. SDXL's autoencoder uses the KL variant, not VQ; keeping the same architecture, it is retrained with a much larger batch size (256 vs 9). The VAE architecture in the table below is the same across models, but sd-vae 2.x only fine-tunes the decoder on top of sd-vae 1.x, leaving the encoder unchanged, so their latent distributions are identical and the two can be used interchangeably. SDXL's VAE, however, is retrained, so its latent distribution has changed and must not be mixed with the old ones. Before latents are fed into the diffusion model, they are scaled so that their standard deviation is as close to 1 as possible; because the weights changed, SDXL's scaling factor differs from SD's: 0.18215 for SD versus 0.13025 for SDXL. Also, the SDXL VAE overflows when run in float16, so it should be run in float32 (in webui, enable --no-half-vae); see the snippet below.
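For completeness, a minimal sketch of the scaling factor in practice, assuming the public SDXL base checkpoint (the VAE is loaded in float32 precisely because of the float16 overflow issue mentioned above):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae", torch_dtype=torch.float32
)
print(vae.config.scaling_factor)  # expected: 0.13025 (SD 1.x/2.x VAEs use 0.18215)

image = torch.randn(1, 3, 1024, 1024)  # stand-in for a preprocessed image in [-1, 1]
with torch.no_grad():
    # Scale on the way into the diffusion model, un-scale before decoding.
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    decoded = vae.decode(latents / vae.config.scaling_factor).sample
print(latents.shape)  # (1, 4, 128, 128) for a 1024x1024 input
```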

2.5 Putting everything together

SDXL also has a refinement stage: the first model is the SDXL base model, and the second is the refiner model, which further improves image details on top of the base model's output. The refiner shares the same VAE as the base model but is trained only on low noise levels (the first 200 timesteps), and at inference only its image-to-image capability is used. The refiner's UNet differs somewhat from the base model's (see the figure below): it uses 4 stages, with the first stage again a DownBlock2D without attention; its base feature width is 384 instead of the base model's 320; and the number of transformer blocks in its attention modules is set to 4. The refiner has 2.3B parameters, slightly fewer than the base model.

In addition, the refiner's text encoder uses only OpenCLIP ViT-bigG, again taking the penultimate-layer features plus the pooled text embedding. Like the base model, the refiner uses size and crop conditioning; it additionally conditions on the image's aesthetic score, processed the same way as the other conditions. The refiner does not go through multi-aspect fine-tuning, so the target size is not introduced as a condition (the refiner is only used for image-to-image refinement and can adapt to various sizes directly). A usage sketch of the base + refiner combination follows below.
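A sketch of the base + refiner combination in diffusers, assuming the public base and refiner checkpoints and a CUDA GPU: the base model handles the high-noise portion of sampling and hands latents over to the refiner, which, being trained only on low noise levels, finishes the denoising in image-to-image mode.

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,  # the refiner only uses OpenCLIP ViT-bigG
    vae=base.vae,                        # both models share the same VAE
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a majestic lion jumping from a big stone at night"
# Run the first 80% of the denoising steps with the base model and output latents...
latents = base(prompt, denoising_end=0.8, output_type="latent").images
# ...then let the refiner handle the remaining low-noise 20%.
image = refiner(prompt, image=latents, denoising_start=0.8).images[0]
image.save("lion.png")
```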

3. Future work

Known limitations: generated hands often have poor structure; lighting and texture can deviate from reality; and when an image contains multiple entities, their attributes get confused or bleed into each other.

Single stage: move to a single-stage model;

Text synthesis: use a better text encoder; Imagen also showed that the text encoder matters a great deal;

Architecture: pure transformer architectures such as DiT were tried, but without clear gains;

Distillation: reduce the number of sampling steps;

Diffusion model: adopt a better diffusion formulation.

Source: https://blog.csdn.net/u012193416/article/details/132390358