AIGC Series: An Introduction to SDXL, the Upgraded Stable Diffusion

Table of contents

AIGC tool comparison

DALL-E

MidJourney

Stable Diffusion

Relevant information

Introduction to SDXL

SDXL generation results

SDXL LoRA training process

AIGC tool comparison

        Among the three leading text-to-image models, Stable Diffusion arrived last, but thanks to its thriving open-source community, it now surpasses Midjourney and DALL-E in both user attention and breadth of application.

DALL-E

        In January 2021, OpenAI launched the DALL-E model, which uses a 12-billion-parameter version of the GPT-3 Transformer to understand natural-language input and generate corresponding images. It was rolled out primarily for research, however, so access was limited to a small group of beta users. The model was unstable, handled details imperfectly, and could produce serious logical or factual errors, but as a pioneer it still deserves special mention.

        When DALL-E was released, CLIP (Contrastive Language-Image Pre-training) was also released. CLIP is a neural network that returns optimal captions for input images. It does the opposite of what DALL-E does - it converts images to text, whereas DALL-E converts text to images. CLIP was introduced to learn the connection between visual and textual representations of objects.

        In April 2022, OpenAI released DALL-E 2, an upgraded version of DALL-E that can also perform follow-up editing of the generated images. Today, even new users have to purchase credits before they can generate images.

        On September 21, 2023, OpenAI released DALL-E 3, the latest generation in the DALL-E series. Compared with DALL-E 2, DALL-E 3 is a comprehensive upgrade: it can generate images from nothing but a text description and allows the picture to be controlled entirely through text. Users no longer need to learn how to construct keyword prompts; a single paragraph of natural-language description is enough to produce an image that fully matches it. This will have a huge impact on current AI painting and points to the future direction of the field.

MidJourney

        MidJourney's v1 was released in February 2022, and it took off with the v3 release in July 2022. It is characterized by well-rounded capabilities and strong artistic quality; its output closely resembles work produced by human artists, and image generation is fast. In the early days, many artists used Midjourney mainly as a source of creative inspiration. In addition, because Midjourney is hosted on Discord, it has a very good community discussion environment and user base.

        Midjourney's second surge in popularity came with the release of V5 in March 2023. According to the official announcement, this version significantly improved the realism of characters and of finger details in the generated images, as well as the accuracy of prompt understanding, aesthetic diversity, and language understanding.

Stable Diffusion

        The advent of Stable Diffusion in July 2022 shocked the world. Compared with its predecessors, Stable Diffusion successfully addressed both detail and efficiency: through algorithmic iteration it raised the precision of AI drawing to an artistic level, brought generation time down to seconds, and lowered the hardware threshold for creation to consumer-grade equipment.

        August 2022 was a revolutionary moment for AI drawing. Thanks to the open-source nature of Stable Diffusion, AI drawing products worldwide entered a period of rapid development. The ensuing public debate about AI creation let ordinary people feel the impact of this technological wave first-hand: AI drawing began entering thousands of households, and a wave of public opinion followed.

        In April 2023, Stability AI released the beta version of Stable Diffusion XL and said it would be open-sourced once training was complete and the parameters had stabilized. The release targeted two long-standing pain points: the need to enter very long prompts, and flawed handling of human anatomy, which often produced abnormal limbs, poses, and body structures.

        On July 27, 2023, Stability AI officially released its next-generation text-to-image model, SDXL 1.0. SDXL 1.0 has one of the largest parameter counts of any open image model and adopts an innovative new architecture: a base model with 3.5 billion parameters plus a refiner that brings the full pipeline to 6.6 billion parameters. This is also the focus of this article, so let's take a look together~

Relevant information

Paper: "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis"

Organization: Stability AI, Applied Research

Paper link: https://arxiv.org/pdf/2307.01952.pdf

Code: https://github.com/Stability-AI/generative-models

Model weights: https://huggingface.co/stabilit

Demo: https://huggingface.co/spaces/google/sdxl

Introduction to SDXL

        On July 27, 2023, Stability AI officially released its next-generation text-to-image model, SDXL 1.0. SDXL 1.0 has one of the largest parameter counts of any open image model and adopts an innovative new architecture: a base model with 3.5 billion parameters plus a refiner that brings the full pipeline to 6.6 billion parameters.

SDXL 1.0 includes two different models:

  • sdxl-base-1.0: a base text-to-image model that generates 1024 x 1024 images. The base model uses both OpenCLIP-ViT/G and CLIP-ViT/L for text encoding.

  • sdxl-refiner-1.0: an image-to-image model that refines the latent output of the base model to produce higher-fidelity images. The refiner uses only the OpenCLIP-ViT/G text encoder. Together, the base model and refiner form the roughly 6.6-billion-parameter SDXL 1.0 pipeline, one of the most powerful open-access image models currently available.
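
        For readers who want to reproduce this two-stage setup in code, here is a minimal sketch using the Hugging Face diffusers library. It is an assumption about tooling rather than this article's own workflow (the example images below were generated with the Stable Diffusion web UI); it assumes diffusers >= 0.19, a CUDA GPU with enough VRAM to hold both pipelines in fp16, and the official model IDs on the Hub. The 0.8 switch point mirrors the "Refiner switch at: 0.8" setting in the generation parameters shown later in this article.

```python
# Minimal sketch: the SDXL base + refiner pipeline with Hugging Face diffusers.
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

# Base text-to-image model (uses both CLIP-ViT/L and OpenCLIP-ViT/G text encoders).
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

# Refiner image-to-image model; it only needs the OpenCLIP-ViT/G text encoder,
# so that encoder and the VAE are reused from the base pipeline.
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

prompt = "a ragdoll cat wearing sunglasses, reading a newspaper, studio photo"

# Stage 1: the base model denoises the first 80% of the schedule and
# returns latents instead of a decoded image.
latents = base(
    prompt=prompt,
    num_inference_steps=35,
    denoising_end=0.8,
    output_type="latent",
).images

# Stage 2: the refiner takes over in latent space for the remaining 20%
# and decodes the final 1024 x 1024 image.
image = refiner(
    prompt=prompt,
    num_inference_steps=35,
    denoising_start=0.8,
    image=latents,
).images[0]
image.save("sdxl_base_plus_refiner.png")
```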

        Improvements have been made to the three major components of Stable Diffusion: U-Net, VAE, and CLIP Text Encoder.

  • U-Net adds Transformer Blocks (self-attention + cross-attention) to enhance feature extraction and fusion capabilities;

  • The VAE is retrained (with a larger batch size and EMA weight tracking) to improve the expressiveness of the latent space;

  • The CLIP Text Encoder uses two encoders of different sizes (CLIP-ViT/L and OpenCLIP-ViT/G) to improve text understanding and prompt matching (a small illustrative sketch of how their outputs are combined follows this list).
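
        The sketch below is purely illustrative (not SDXL's actual source code). It shows the idea described in the SDXL paper: the penultimate hidden states of the two text encoders are concatenated along the channel dimension into a 2048-dimensional context for the U-Net's cross-attention, and OpenCLIP's pooled embedding is injected as an additional condition. The tensors here are random placeholders.

```python
# Illustrative only: how the two text encoders' outputs are combined in SDXL.
import torch

batch, seq_len = 1, 77
clip_l_hidden = torch.randn(batch, seq_len, 768)       # CLIP-ViT/L penultimate hidden states
openclip_g_hidden = torch.randn(batch, seq_len, 1280)  # OpenCLIP-ViT/G penultimate hidden states

# Concatenate along the channel axis -> a 2048-dim cross-attention context.
context = torch.cat([clip_l_hidden, openclip_g_hidden], dim=-1)

# OpenCLIP's pooled text embedding is additionally fed to the U-Net through
# the added-conditioning path alongside the timestep embedding.
pooled = torch.randn(batch, 1280)
print(context.shape, pooled.shape)  # torch.Size([1, 77, 2048]) torch.Size([1, 1280])
```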

        In addition, a Refiner model that works purely in latent space is added to improve image refinement. The Refiner is itself a latent diffusion model: it takes the latent features produced by the base model as input and further denoises and optimizes them, so the final output image is clearer and sharper (this hand-off is what the two-stage code sketch above illustrates).

        SDXL also introduces a number of training tricks, including an image-size conditioning strategy, image-crop-parameter conditioning, and multi-scale training. These tricks improve the model's generalization and stability, allowing it to adapt to different resolutions and aspect ratios as well as different image content and styles.
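
        At inference time, diffusers exposes these size and crop conditions as optional call arguments on the SDXL pipeline. The snippet below (reusing the `base` pipeline from the earlier sketch, with example values) is only a usage illustration, not the training procedure itself.

```python
# Size/crop micro-conditioning at inference time. An original_size equal to the
# target resolution and a (0, 0) crop ask for an image that looks like it came
# from a large, uncropped, well-centered training photo.
image = base(
    prompt="a ragdoll cat wearing sunglasses",
    height=1024, width=1024,
    original_size=(1024, 1024),
    crops_coords_top_left=(0, 0),
    target_size=(1024, 1024),
).images[0]
image.save("sdxl_size_conditioning.png")
```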

        An SDXL 0.9 test version was pre-released first; based on user feedback and the generated images, the dataset was expanded in a targeted way, and the official SDXL 1.0 release was iteratively optimized using RLHF. RLHF (reinforcement learning from human feedback) is used here as an image-quality optimization technique: the model's parameters are adjusted according to human preferences, so that the color, contrast, and lighting of generated images better match human aesthetics.

SDXL generation results

SDXL's image generation is more stable, its details are richer and more realistic, and its controllability is greatly improved compared with SD 1.5.

Example 1:

<lora:AP-xl:1>, AP, no humans, cat, realistic, animal focus, animal, blurry, simple background, whiskers, newspaper, gray background, ragdoll, wear sunglasses,

Negative prompt: (worst quality, low quality:1.4), (malformed hands:1.4),(poorly drawn hands:1.4),(mutated fingers:1.4),(extra limbs:1.35),(poorly drawn face:1.4), missing legs,(extra legs:1.4),missing arms, extra arm,ugly, huge eyes, fat, worst face,(close shot:1.1), text, watermark, blurry eyes,

Steps: 35, Sampler: DPM++ 2M Karras, CFG scale: 7, Seed: 3539483990, Size: 512x512, Model hash: 31e35c80fc, Model: sd_xl_base_1.0, VAE hash: 63aeecb90f, VAE: sdxl_vae.safetensors, Lora hashes: "AP-xl: f5f7e8a091b0", Refiner: sd_xl_refiner_1.0_0.9vae [8d0ce6c016], Refiner switch at: 0.8, Version: v1.6.0-2-g4afaaf8a

Time taken: 1 min. 0.6 sec.

Example 2:

<lora:AP-xl:1>, AP, no humans, dog, (sit on the toilet:1.4), (smoking in mouse and watch newspaper:1.5), realistic, animal focus, animal, blurry, simple background, whiskers, gray background, ragdoll, wear sunglasses,

Negative prompt: (worst quality, low quality:1.4), (malformed hands:1.4),(poorly drawn hands:1.4),(mutated fingers:1.4),(extra limbs:1.35),(poorly drawn face:1.4), missing legs,(extra legs:1.4),missing arms, extra arm,ugly, huge eyes, fat, worst face,(close shot:1.1), text, watermark, blurry eyes,

Steps: 36, Sampler: DPM++ 2M Karras, CFG scale: 7, Seed: 1930821284, Size: 512x512, Model hash: 31e35c80fc, Model: sd_xl_base_1.0, VAE hash: 63aeecb90f, VAE: sdxl_vae.safetensors, Lora hashes: "AP-xl: f5f7e8a091b0", Refiner: sd_xl_refiner_1.0_0.9vae [8d0ce6c016], Refiner switch at: 0.8, Version: v1.6.0-2-g4afaaf8a

Time taken: 57.6 sec.

SDXL LoRA training process


        We will also share SDXL+LoRA image generation results in a future update. From the examples above, SDXL's output is more refined than SD's, the overall quality is higher, and adherence to the prompt text is more stable. The trade-off is longer generation time: SDXL needs more sampling steps, generally 30 or more to produce a good-looking image, whereas SD usually needs only about 20. If you have been using SD 1.5 or 2.0 to generate images, give SDXL a try; I believe you will have a different experience.
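
        As a preview, the sketch below shows one common way to apply a trained SDXL LoRA at inference time with diffusers. The folder ./lora and the file name AP-xl.safetensors are hypothetical placeholders for whatever checkpoint your own training run produces (for example with kohya-ss sd-scripts or diffusers' SDXL LoRA training script); this is a hedged illustration, not the exact workflow used for the images above.

```python
# Hedged sketch: loading and using a custom SDXL LoRA with diffusers.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

# "./lora/AP-xl.safetensors" is a placeholder path for your trained LoRA file.
pipe.load_lora_weights("./lora", weight_name="AP-xl.safetensors")

image = pipe(
    "AP, no humans, cat, realistic, ragdoll, wearing sunglasses, newspaper",
    num_inference_steps=35,
    cross_attention_kwargs={"scale": 1.0},  # LoRA strength, analogous to <lora:AP-xl:1>
).images[0]
image.save("sdxl_lora_cat.png")
```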


Source: https://blog.csdn.net/xs1997/article/details/134663343