【Paper】2307.SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (open source, with UI)

Paper: 2307. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Related links and commentary

Code: https://github.com/Stability-AI/generative-models
Official model cards: https://huggingface.co/stabilityai/
【SDXL 0.9 local installation and deployment tutorial】https://www.bilibili.com/video/BV1oV4y18791
【Model download】https://pan.baidu.com/s/1wuOibq3dYW_e_LrIgnr2Jg?pwd=0710 (extraction code: 0710)

1. Overview

Comparison: SDXL, DeepFloyd IF, DALLE-2, Bing, Midjourney v5.2

[Figure: sample grid; each column, from left to right, corresponds to one of the generative models or services above]

1.1 Performance improvements

1. In user-preference evaluations, SDXL appears to have greatly surpassed v1.5 and v2.1, and is roughly tied with Midjourney v5.1.
2. SDXL is much larger (a 2.6B-parameter UNet), so it is slower than previous SD versions and needs more VRAM.
3. Two CLIP text encoders whose outputs are concatenated, rather than a single conditioning vector, give better text-image alignment.
4. A slightly improved VAE.
5. Handles low-resolution training images (the model is conditioned on image size), random crops (conditioned on crop coordinates), and non-square images (conditioned on aspect ratio); a sketch of this micro-conditioning follows below.
6. SDXL has an optional refinement stage trained specifically to denoise small amounts of noise (when most of the image information is already present), for higher-quality images.

(Summary from: Qinglong Saint on Bilibili)
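As a rough illustration of point 5, here is a minimal sketch of SDXL-style size/crop micro-conditioning in PyTorch: each conditioning scalar (original height/width, crop top/left, target height/width) is Fourier-embedded, the embeddings are concatenated, and the result is added to the UNet's timestep embedding. The helper names and the 256-dim embedding width are illustrative assumptions, not SDXL's actual code.

```python
import math
import torch

def sinusoidal_embedding(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    # Transformer-style frequency embedding of a batch of scalars.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

def micro_conditioning(orig_size, crop_topleft, target_size):
    # Embed each of the six scalars and concatenate; inside the UNet this
    # vector is added to the timestep embedding as extra conditioning.
    scalars = torch.tensor([[*orig_size, *crop_topleft, *target_size]])
    embs = [sinusoidal_embedding(scalars[:, i]) for i in range(scalars.shape[1])]
    return torch.cat(embs, dim=-1)

cond = micro_conditioning(orig_size=(512, 768), crop_topleft=(0, 64), target_size=(1024, 1024))
print(cond.shape)  # torch.Size([1, 1536])
```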

1.2 Released open-source models

SD-XL 0.9-base: the base model is trained on 1024x1024 images at various aspect ratios. It uses OpenCLIP-ViT/G and CLIP-ViT/L for text encoding, while the refiner uses only the OpenCLIP model.

SD-XL 0.9-refiner (refiner model): the refiner is trained to denoise small noise levels in high-quality data, so it is not suitable as a standalone text-to-image model; it is intended for image-to-image use only.
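As a usage reference, a minimal sketch of the two-stage base + refiner workflow with Hugging Face's diffusers library. The 0.9 checkpoint IDs below assume you have access to the gated weights; the prompt and step defaults are placeholders:

```python
import torch
from diffusers import DiffusionPipeline

# Load the base (text-to-image) and refiner (image-to-image) pipelines.
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"

# Stage 1: the base model produces a latent image.
latents = base(prompt=prompt, output_type="latent").images
# Stage 2: the refiner denoises the remaining low-noise steps, same prompt.
image = refiner(prompt=prompt, image=latents).images[0]
image.save("sdxl_sample.png")
```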

2. The original paper

2.1 Summary

The paper proposes SDXL, a latent diffusion model for text-to-image synthesis. Compared with previous versions of Stable Diffusion, SDXL uses a three-times-larger UNet backbone; the growth in model parameters, and the paper's other main contributions, come from:

  1. More attention blocks.
  2. A second text encoder, yielding a larger cross-attention context (cross-attention: the attention computed between two different inputs, here image latents attending to text embeddings); see the sketch after this list.
  3. Several novel conditioning schemes, with SDXL trained on multiple aspect ratios.
  4. A refinement model that improves the visual fidelity of SDXL-generated samples via a post-hoc image-to-image step.
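To make point 2 concrete, here is a shapes-only sketch of how the two encoders' token embeddings are concatenated channel-wise into the enlarged cross-attention context. The 768/1280 dimensions match CLIP ViT-L and OpenCLIP ViT-bigG; the tensors here are random stand-ins rather than real encoder outputs:

```python
import torch

# CLIP ViT-L produces 768-dim token embeddings; OpenCLIP ViT-bigG, 1280-dim.
tokens_l = torch.randn(1, 77, 768)    # stand-in for CLIP-ViT/L hidden states
tokens_g = torch.randn(1, 77, 1280)   # stand-in for OpenCLIP-ViT/G hidden states

# Concatenating along the channel axis gives the 2048-dim cross-attention context.
context = torch.cat([tokens_l, tokens_g], dim=-1)
print(context.shape)  # torch.Size([1, 77, 2048])
```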

We demonstrate that SDXL shows dramatically improved performance compared with previous Stable Diffusion versions and achieves results competitive with state-of-the-art black-box image generators. To drive open research and promote transparency in large-model training and evaluation, we provide access to the code and model weights.

2.2 Model structure

[Figure: the two-stage SDXL pipeline (base model + refiner)]

We use SDXL to generate initial latents of size 128x128. We then apply a specialized high-resolution refinement model, using SDEdit [28] on the latents produced in the first step with the same prompt. SDXL and the refinement model use the same autoencoder.

2108.SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations.
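Conceptually, the refiner's SDEdit step partially re-noises the base model's latent and then denoises only the remaining low-noise timesteps. A heavily simplified, diffusers-style sketch (the function and its signature are hypothetical; the SDXL UNet additionally needs its micro-conditioning inputs, which real pipelines handle internally):

```python
import torch

def sdedit_refine(latent, unet, scheduler, text_context, strength=0.3):
    """Hypothetical helper: re-noise a latent to a moderate noise level,
    then run only the final (low-noise) denoising steps."""
    scheduler.set_timesteps(50)
    n_refine = int(len(scheduler.timesteps) * strength)  # final steps to run
    t_start = scheduler.timesteps[-n_refine]             # a low-noise timestep
    sample = scheduler.add_noise(latent, torch.randn_like(latent), t_start)
    for t in scheduler.timesteps[-n_refine:]:
        noise_pred = unet(sample, t, encoder_hidden_states=text_context).sample
        sample = scheduler.step(noise_pred, t, sample).prev_sample
    return sample
```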

2.3 Comparison between SDXL and SD1.5 / SD2.0

[Figure: model components and parameters]

[Figure: samples generated from the same prompt]

3. Future work (areas to optimize)

• Single stage: Currently we generate the best SDXL samples with a two-stage approach that uses an additional refinement model. This requires loading two large models into memory, which hurts accessibility and sampling speed. Future work should explore single-stage approaches of equal or better sample quality.

• Text synthesis: Although scale and the larger text encoder (OpenCLIP ViT-bigG [19]) help improve text rendering, incorporating byte-level tokenizers [52, 27] or simply scaling the model further [53, 40] may improve text synthesis even more.

• Architecture: During the exploration phase we briefly tried Transformer-based architectures such as UViT [16] and DiT [33], but found no immediate benefit. We remain optimistic, however, that a careful hyperparameter study will eventually enable scaling to larger Transformer-dominated architectures.

• Distillation: Although we significantly improve on the original Stable Diffusion model, this comes at the cost of increased inference overhead (both VRAM and sampling speed). Future work will therefore focus on reducing the compute required for inference and increasing sampling speed, e.g. through guidance distillation [29], knowledge distillation [6, 22, 24], and progressive distillation [41, 2, 29].

• Discrete-time formulation: Our model is trained in the discrete-time formulation of 2006.Denoising Diffusion Probabilistic Models and requires offset noise for aesthetically pleasing results. The EDM framework of Karras et al. (2206.Elucidating the Design Space of Diffusion-Based Generative Models) is a promising candidate for future model training, because its continuous-time formulation allows increased sampling flexibility and does not require noise-schedule corrections.
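For context, the offset-noise trick adds a small per-channel constant shift to the training noise so the model can learn very dark or very bright images. A minimal sketch (the 0.1 weight is a commonly used community value, not taken from the paper):

```python
import torch

latents = torch.randn(4, 4, 128, 128)  # stand-in batch of training latents
noise = torch.randn_like(latents)

# Per-sample, per-channel constant offset, broadcast over height and width.
offset = torch.randn(latents.shape[0], latents.shape[1], 1, 1)
noise = noise + 0.1 * offset            # assumed offset weight of 0.1
```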



Source: blog.csdn.net/imwaters/article/details/131633950