"Perfect Chinese couple", realistically reproduced! The "enhanced" Stable Diffusion is free to try, and its latest technical report has been released

Source | Xinzhiyuan

The latest technical report on the "enhanced" Stable Diffusion of the AI era has been released!

Report address: https://github.com/Stability-AI/generative-models/blob/main/assets/sdxl_report.pdf

Since its public beta launched in April, Stable Diffusion XL has won over many users and is often called "the open-source Midjourney". SDXL holds its own on details such as hands and written text, and, most importantly, it does so without requiring extremely long prompts. On top of that, unlike Midjourney, which you have to pay for, SDXL 0.9 can be tried for free!

Interestingly, the research team even thanked "ChatGPT for writing assistance" in the final appendix.

"Small victory" Midjourney So, compared with Midjourney, how good is SDXL? In the report, the researchers randomly selected 5 cues from each category and generated four 1024×1024 images for each cue using Midjourney (v5.1, seed set to 2) and SDXL. These images were then submitted to the AWS GroundTruth task force, which voted on following prompts. Overall, SDXL is slightly better than Midjourney when it comes to following prompts.

The feedback, from 17,153 user comparisons, covers all "categories" and "challenges" of the PartiPrompts (P2) benchmark. Notably, SDXL was preferred over Midjourney V5.1 in 54.9% of cases. Preliminary tests suggest that the recently released Midjourney V5.2 actually understands prompts less well, although the cumbersome process of generating samples for many prompts has so far prevented more extensive testing.

Each prompt in the P2 benchmark is organized into a category and a challenge, each focusing on a different pain point of the generation process. The comparison results for each category (Fig. 10) and each challenge (Fig. 11) of the P2 benchmark are shown below. In 4 of 6 categories SDXL outperformed Midjourney, and in 7 of 10 challenges there was either no significant difference between the two models or SDXL came out ahead.

In the group of pictures below, you can also try to guess which images were generated by SDXL and which by Midjourney. (The answer is revealed below.)

SDXL: the strongest open-source text-to-image model

Last year, Stable Diffusion, widely regarded as the strongest text-to-image model, was open-sourced and lit the torch of generative AI worldwide. Unlike OpenAI's DALL-E, Stable Diffusion lets people run text-to-image generation on consumer graphics cards. Stable Diffusion is a latent text-to-image diffusion model (DM), and DMs are now used far beyond image generation: recent work on reconstructing images from functional magnetic resonance imaging (fMRI) signals and on music generation builds on them. In April this year, Stability AI, the start-up behind this breakout tool, launched an improved version of Stable Diffusion: SDXL.

According to user studies, SDXL consistently outperforms all previous versions of Stable Diffusion, such as SD 1.5 and SD 2.1. In the report, the researchers describe the design choices that lead to this improvement, including: 1) a UNet backbone three times larger than in previous Stable Diffusion models; 2) two simple and effective additional conditioning techniques that require no extra supervision of any kind; 3) a separate diffusion-based refinement model that applies a denoising step to the latents produced by SDXL to improve the visual quality of the samples.

Improving Stable Diffusion

The researchers improved the Stable Diffusion architecture. The improvements are modular and can be used individually or combined to extend any model. Although the strategies below are implemented as extensions of latent diffusion models, most of them also apply to their pixel-space counterparts, the report says.

DMs have proven to be powerful generative models for image synthesis, and the convolutional UNet has become the dominant architecture for diffusion-based image synthesis. As DMs developed, the underlying architecture kept evolving: from adding self-attention and improved upscaling layers, to cross-attention for text-to-image synthesis, to purely Transformer-based architectures. In improving Stable Diffusion, the researchers follow this trend and shift most of the Transformer computation to the lower-resolution features in the UNet. In particular, compared to the original SD architecture, they use a heterogeneous distribution of Transformer blocks within the UNet: for efficiency, they omit the Transformer block at the highest feature level, use 2 and 10 blocks at the lower levels, and remove the lowest level (8x downsampling) from the UNet entirely, as shown in the figure below.
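To make that block distribution concrete, here is a toy PyTorch sketch (not the actual SDXL implementation; the simplified attention block, channel widths, and layer names are illustrative only) in which the three resolution levels carry 0, 2 and 10 transformer blocks respectively:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Stand-in for a cross-attention transformer block (illustrative only)."""
    def __init__(self, channels: int, context_dim: int = 2048):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads=8, batch_first=True)
        self.proj = nn.Linear(context_dim, channels)

    def forward(self, x, context):
        b, c, h, w = x.shape
        tokens = self.norm(x).flatten(2).transpose(1, 2)   # (b, h*w, c)
        ctx = self.proj(context)                           # (b, seq, c)
        attn_out, _ = self.attn(tokens, ctx, ctx)
        return x + attn_out.transpose(1, 2).reshape(b, c, h, w)

class ToyEncoder(nn.Module):
    """Three resolution levels with 0, 2 and 10 transformer blocks,
    mirroring the heterogeneous distribution described in the report."""
    def __init__(self, channels=(320, 640, 1280), blocks_per_level=(0, 2, 10)):
        super().__init__()
        self.levels = nn.ModuleList()
        in_ch = 4  # latent channels
        for ch, n_blocks in zip(channels, blocks_per_level):
            self.levels.append(nn.ModuleDict({
                "conv": nn.Conv2d(in_ch, ch, 3, stride=2, padding=1),
                "transformers": nn.ModuleList(
                    TransformerBlock(ch) for _ in range(n_blocks)
                ),
            }))
            in_ch = ch

    def forward(self, x, context):
        for level in self.levels:
            x = level["conv"](x)
            for blk in level["transformers"]:
                x = blk(x, context)
        return x

# Usage: a 64x64 latent and a 77-token, 2048-dim text context.
out = ToyEncoder()(torch.randn(1, 4, 64, 64), torch.randn(1, 77, 2048))
print(out.shape)  # torch.Size([1, 1280, 8, 8])
```

The point of pushing the transformer blocks to the lower-resolution levels is simply that attention over fewer spatial tokens is much cheaper, so the model can afford many more blocks where they matter most.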

Comparison of different versions of SDXL and Stable Diffusion models

For text conditioning, the researchers chose a more powerful pretrained text encoder. Specifically, they use OpenCLIP ViT-bigG together with CLIP ViT-L, concatenating the penultimate text-encoder outputs along the channel axis. In addition to conditioning the model on the text input via cross-attention layers, they follow prior work and additionally condition the model on the pooled text embedding from the OpenCLIP model (a short shape sketch follows below). Together, these changes give a UNet with 2.6B parameters and text encoders totaling 817M parameters.

The biggest drawback of training a latent diffusion model (LDM) is that, because of its two-stage architecture, training requires a minimum image size. There are essentially two ways around this: either discard all training images below a certain minimum resolution (SD 1.4/1.5, for example, discards all images below 512 pixels), or upscale images that are too small. The first approach, however, can throw away a large amount of training data and hurt performance: in a visualization of the SDXL pretraining dataset, the researchers show that discarding all samples below a 256×256-pixel pretraining resolution would remove 39% of the data. The second approach usually introduces upscaling artifacts that can leak into the final model outputs and produce blurry samples.
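Going back to the dual text-encoder conditioning above, here is a rough sketch of the tensor shapes involved. The encoders themselves are replaced by random placeholders; the 768/1280 feature widths are the standard CLIP ViT-L / OpenCLIP ViT-bigG hidden sizes, and nothing here is the real OpenCLIP or CLIP API.

```python
import torch

# Hypothetical per-token outputs of the two frozen text encoders
# (penultimate layers), for a prompt tokenized to 77 tokens.
batch, seq_len = 1, 77
clip_l_hidden = torch.randn(batch, seq_len, 768)      # CLIP ViT-L penultimate layer
open_bigg_hidden = torch.randn(batch, seq_len, 1280)  # OpenCLIP ViT-bigG penultimate layer

# Concatenate along the channel (feature) axis -> a 2048-dim context
# that the UNet cross-attention layers attend to.
context = torch.cat([clip_l_hidden, open_bigg_hidden], dim=-1)  # (1, 77, 2048)

# The pooled OpenCLIP embedding serves as an extra global conditioning vector,
# combined with the time-step embedding alongside the size/crop conditionings.
pooled_bigg = torch.randn(batch, 1280)

print(context.shape, pooled_bigg.shape)
```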

In response, the researchers propose to condition the UNet model on the original image resolution, which is trivially available during training. In particular, the original height and width of the image, c_size = (h_original, w_original), are provided to the model as additional conditioning. Each component is embedded independently with a Fourier feature encoding, and the embeddings are concatenated into a single vector that is added to the time-step embedding and fed into the model. At inference time, this size conditioning lets the user set the desired apparent resolution of the image. Evidently, the model has learned to associate c_size with resolution-dependent image characteristics: the researchers show 4 samples drawn from SDXL with the same random seed and varying size conditioning, and image quality clearly improves as the conditioned size increases.
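A minimal sketch of how such size conditioning could be wired up, assuming a standard sinusoidal (Fourier) embedding of the same kind used for diffusion time steps; the function names and dimensions are illustrative, not taken from the SDXL codebase.

```python
import math
import torch

def fourier_embedding(values: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Sinusoidal embedding of scalar conditioning values (same recipe as
    time-step embeddings): values has shape (batch,), output (batch, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = values.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

# Original (pre-rescaling) height and width of each training image.
h_original = torch.tensor([512, 256])
w_original = torch.tensor([384, 256])

# Embed each component independently and concatenate into one vector ...
c_size = torch.cat([fourier_embedding(h_original), fourier_embedding(w_original)], dim=-1)

# ... which is then added to the time-step embedding (after projecting to the same width).
timestep_emb = torch.randn(2, 1280)
proj = torch.nn.Linear(c_size.shape[-1], 1280)
conditioned_emb = timestep_emb + proj(c_size)
print(conditioned_emb.shape)  # torch.Size([2, 1280])
```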

Below are SDXL outputs compared with previous SD versions. For each prompt, the researchers drew 3 random samples using 50 DDIM sampling steps and a classifier-free guidance scale of 8.0. In earlier SD models, generated subjects could end up awkwardly cropped, such as the cut-off cat heads produced by SD 1.5 and SD 2.1 in the example on the left. As the comparisons below show, SDXL has largely solved this problem.

Such a marked improvement is possible because the researchers propose another simple yet effective conditioning method: during data loading, the crop coordinates c_top and c_left (integers specifying how many pixels are cropped from the top-left corner along the height and width axes, respectively) are sampled uniformly and fed into the model as conditioning parameters via Fourier feature embeddings, analogous to the size conditioning described above. The concatenated embedding c_crop = (c_top, c_left) is then used as an additional conditioning parameter. The research team stresses that this technique is not limited to LDMs, and that crop and size conditioning can easily be combined. In that case, the embedded features are concatenated along the channel dimension and added to the time-step embedding of the UNet. As the figure shows, by tuning (c_top, c_left) at inference time, the apparent amount of cropping can be simulated.
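Continuing the hypothetical sketch above, crop conditioning can be handled the same way. The sampling and names here are illustrative; the report only specifies that c_top and c_left are drawn uniformly and embedded with Fourier features.

```python
import math
import torch

def fourier_embedding(values: torch.Tensor, dim: int = 256) -> torch.Tensor:
    # Same sinusoidal recipe as in the size-conditioning sketch above.
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = values.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

def sample_crop_coords(orig_h: int, orig_w: int, target_h: int, target_w: int):
    """Uniformly sample how many pixels to crop from the top-left corner."""
    c_top = torch.randint(0, max(orig_h - target_h, 0) + 1, (1,))
    c_left = torch.randint(0, max(orig_w - target_w, 0) + 1, (1,))
    return c_top, c_left

c_top, c_left = sample_crop_coords(768, 1024, 512, 512)

# Embed both coordinates and concatenate; together with the size embedding this
# is added to the time-step embedding of the UNet (after a linear projection).
c_crop = torch.cat([fourier_embedding(c_top), fourier_embedding(c_left)], dim=-1)

# At inference, conditioning on (c_top, c_left) = (0, 0) requests an
# object-centered, "uncropped"-looking image.
print(c_crop.shape)  # torch.Size([1, 512])
```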

Multi-aspect training

Inspired by the techniques above, the researchers also fine-tune the model to handle multiple aspect ratios simultaneously: the data are partitioned into buckets of different aspect ratios, keeping the pixel count as close to 1024×1024 as possible while varying height and width in multiples of 64 (a small bucketing sketch follows below).

Improved autoencoder

Although most of the semantic composition is handled by the LDM, the local, high-frequency details of generated images can be improved by improving the autoencoder. To this end, the researchers train the same autoencoder architecture used for the original SD with a much larger batch size (256 vs. 9) and track the weights with an exponential moving average. The resulting autoencoder outperforms the original model on all evaluated reconstruction metrics.
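Returning to the multi-aspect bucketing above, here is a small sketch of the construction rule: heights and widths are multiples of 64 and the pixel count stays close to 1024×1024. The exact bucket list used for SDXL is given in the report's appendix; the 10% tolerance below is an arbitrary choice for illustration.

```python
# Generate aspect-ratio buckets: height and width are multiples of 64, and the
# pixel count stays within ~10% of 1024*1024 (tolerance chosen arbitrarily here).
TARGET_PIXELS = 1024 * 1024
STEP = 64

buckets = []
for h in range(512, 2048 + STEP, STEP):
    # Pick the width (a multiple of 64) whose pixel count is closest to the target.
    w = round(TARGET_PIXELS / h / STEP) * STEP
    if w < 512:
        continue
    if abs(h * w - TARGET_PIXELS) / TARGET_PIXELS <= 0.1:
        buckets.append((h, w))

for h, w in buckets:
    print(f"{h:4d} x {w:4d}  (pixels = {h * w / TARGET_PIXELS:.2f} of 1024^2, ratio = {h / w:.2f})")
```

During training, each batch is drawn from a single bucket, and the bucket size is fed to the model as yet another conditioning, in the same way as the size and crop parameters.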

The birth of SDXL

The researchers train the final SDXL model in a multi-stage procedure. SDXL uses the autoencoder and a discrete-time diffusion schedule with 1000 steps. First, a base model is pretrained on an internal dataset for 600,000 optimization steps at a resolution of 256×256 with a batch size of 2048, using the size and crop conditioning described above. Training then continues on 512×512 images for another 200,000 optimization steps, and finally the model is trained with multi-aspect training, combined with an offset-noise level of 0.05 (sketched below), on regions of roughly 1024×1024 pixels with different aspect ratios.

Refinement stage. The researchers found that the resulting model sometimes produces samples with poor local quality, as shown in the figure below. To improve sample quality, they train a separate LDM in the same latent space that is specialized for high-quality, high-resolution data, and apply the noising-denoising process introduced by SDEdit to the samples of the base model. At inference time, base SDXL renders latents from the text input, and the refinement model then diffuses and denoises those latents directly in latent space, using the same text input.
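The offset-noise level of 0.05 mentioned in the training recipe above is typically implemented as a per-channel constant shift mixed into the Gaussian noise. A minimal sketch of that idea, with the training loop itself omitted; only the 0.05 factor comes from the report.

```python
import torch

def offset_noise(latents: torch.Tensor, offset_strength: float = 0.05) -> torch.Tensor:
    """Gaussian noise plus a per-sample, per-channel constant offset.

    The offset term shifts the mean of the noise, which is commonly reported to
    help diffusion models produce very dark and very bright images.
    """
    noise = torch.randn_like(latents)
    offset = torch.randn(latents.shape[0], latents.shape[1], 1, 1, device=latents.device)
    return noise + offset_strength * offset

# Example: a batch of 4 latents at 128x128 (roughly 1024x1024 pixels after decoding).
latents = torch.randn(4, 4, 128, 128)
noise = offset_noise(latents)  # used in place of plain randn noise in the training step
print(noise.shape)             # torch.Size([4, 4, 128, 128])
```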

1024×1024 samples from SDXL without (left) and with (right) the refinement model

It is worth noting that this refinement step, which mainly improves the sample quality of backgrounds and faces, is optional.
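In practice, the optional refinement step can be run as a second pass over the base model's latents. A minimal usage sketch, assuming the Hugging Face diffusers library and the publicly released SDXL 1.0 weights (the report itself describes SDXL 0.9); the model IDs and arguments follow diffusers conventions rather than anything in the report.

```python
import torch
from diffusers import DiffusionPipeline

# The base model produces latents; the refiner denoises them further in latent space.
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,  # share the big text encoder and the VAE
    vae=base.vae,
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

prompt = "a photo of an astronaut riding a horse, highly detailed"

# Stage 1: render latents with the base model.
latents = base(prompt=prompt, output_type="latent").images

# Stage 2 (optional): refine the same latents with the same prompt.
image = refiner(prompt=prompt, image=latents).images[0]
image.save("sdxl_refined.png")
```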

Limitations

Although SDXL's performance has improved substantially, the model still has clear limitations.

First, SDXL still struggles with complex structures such as human hands. The researchers speculate that this is because hands and other intricately structured objects vary enormously across photos, making it hard for the model to extract knowledge of the true 3D shape. Second, the images generated by SDXL still fall short of perfect photorealism: subtle details such as low-light effects or texture variations may be missing or rendered inaccurately. In addition, when an image contains multiple objects or subjects, the model can suffer from so-called "concept bleeding", where distinct visual elements unintentionally merge or overlap. For example, the orange sunglasses in the picture below are the result of the concept of the "orange sweater" bleeding over.

In Figure 8, the penguin that should be wearing a "blue hat" and "red gloves" ends up wearing blue gloves and a red hat in the generated image.

Meanwhile, text rendering, which has always plagued text-to-image models, remains a big problem. As Figure 8 shows, the text generated by the model may contain random characters or be inconsistent with the given prompt. For future work, the researchers say further improvements to the model will focus mainly on the following directions:

• Single-stage generation. Currently the team uses an additional refinement model in a two-stage setup to obtain the best SDXL samples. This requires keeping two large models in memory, which hurts accessibility and sampling speed.

• Text synthesis. Scale and the larger text encoder (OpenCLIP ViT-bigG) already help text rendering, but introducing a byte-level tokenizer or scaling the model further may improve the quality of text synthesis even more.

• Architecture. During an exploration stage, the team experimented with Transformer-based architectures such as UViT and DiT but saw no significant improvement. They still believe, however, that with more careful hyperparameter studies, scaling to larger Transformer-dominated architectures will eventually pay off.

• Distillation. Although the model improves significantly on the original Stable Diffusion, it comes at the cost of higher inference cost (in both video memory and sampling speed). Future work will therefore focus on reducing the compute needed for inference and increasing sampling speed, for example through guidance distillation, knowledge distillation and progressive distillation.

Currently the latest report is only available on GitHub; the CEO of Stability AI says it will be uploaded to arXiv soon.

References:
https://twitter.com/emostaque/status/1676315243478220800
https://github.com/Stability-AI/generative-models/blob/main/assets/sdxl_report.pdf
