Running Models on Firefly | High-Performance Stable Diffusion for High-Quality AI Image Generation

Stable Diffusion

AI painting has recently broken into the mainstream and become a hot topic. Diffusion-based generative models such as DALL·E, GLIDE, and Stable Diffusion have brought a qualitative leap in AI image generation, giving people a glimpse of AI turning into real productivity.

Among these diffusion models, Stable Diffusion has become one of the most prominent thanks to its excellent results and openly released weights, attracting wide attention and experimentation. It was trained on the ultra-large-scale LAION-5B text-image dataset, and Stability AI reportedly used 5,000 A100 GPUs over several months for training. High-Flyer AI (幻方 AI) recently reproduced and optimized Stable Diffusion training on the Firefly II cluster using the Google Conceptual Captions dataset. Through the hfai.pl plugin developed by High-Flyer, the PyTorch Lightning code of the original repository integrates easily with the Firefly cluster's features, and training can be accelerated with optimization tools such as 3FS, hfreduce, and optimized operators.

This article shares our experience optimizing Stable Diffusion training, with the goal of lowering the barrier to entry for researchers and developers.

Paper title: High-Resolution Image Synthesis with Latent Diffusion Models

Paper link: https://arxiv.org/abs/2112.10752

Source code: https://github.com/CompVis/stable-diffusion

Model repository: https://github.com/HFAiLab/stable-diffusion

Model Introduction

Stable Diffusion extends Latent Diffusion with larger-scale training and replaces the BERT text encoder with the CLIP text encoder. Let's first look at the model design of Latent Diffusion.

Although diffusion models have shown very strong generative ability and reached SOTA on many different types of generation tasks, their iterative sampling process means that training and inference often require large amounts of GPU resources. Latent Diffusion improves on this by moving the diffusion process from the image's pixel space into an encoded latent space, which greatly reduces the computational cost of running the diffusion model while preserving fine detail and image quality. The overall structure is shown in the figure below:

Latent Encoding

Latent Diffusion converts between pixel space and latent space by adding a variational autoencoder (VAE) to an otherwise standard diffusion model. Before training the DDPM, a VAE is first trained on ImageNet to learn an encoder and decoder that compress images into latent codes. The VAE downsamples images to 1/4 to 1/8 of their original resolution, which greatly reduces the computational cost of operating in latent space. At this compression ratio, images reconstructed from the latent codes retain most of the information in the originals, with essentially no excessive loss from compression.
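To make the pixel-to-latent round trip concrete, here is a minimal sketch using the Hugging Face diffusers AutoencoderKL rather than the repository's own ldm code; the checkpoint name and the 8x downsampling factor are assumptions based on the publicly released Stable Diffusion VAE.

import torch
from diffusers import AutoencoderKL  # illustrative: the diffusers VAE, not the CompVis ldm implementation

# Assumed public checkpoint; any Stable-Diffusion-compatible VAE behaves the same way.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

with torch.no_grad():
    images = torch.randn(1, 3, 256, 256)               # a batch of RGB images scaled to [-1, 1]
    latents = vae.encode(images).latent_dist.sample()  # -> (1, 4, 32, 32): 8x spatial downsampling
    recons = vae.decode(latents).sample                # decoded back to (1, 3, 256, 256)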

Cross-Attention

Latent Diffusion uses an attention mechanism to fuse conditioning information into the image generation process. The authors inject the conditioning signal into every layer of the U-Net via cross-attention, steering the direction of image generation. This attention-based fusion makes it easy for the model to use different kinds of conditioning, such as text-to-image, image-to-image, or semantic-map-to-image generation.
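A minimal cross-attention block in plain PyTorch illustrates the mechanism; the dimensions and names below are illustrative and not taken from the repository.

import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries come from U-Net feature maps, keys/values from the conditioning tokens."""
    def __init__(self, dim, cond_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(cond_dim, dim, bias=False)
        self.to_v = nn.Linear(cond_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, cond):  # x: (B, N, dim) image tokens, cond: (B, M, cond_dim) text tokens
        B, N, _ = x.shape
        M = cond.shape[1]
        q = self.to_q(x).view(B, N, self.heads, -1).transpose(1, 2)
        k = self.to_k(cond).view(B, M, self.heads, -1).transpose(1, 2)
        v = self.to_v(cond).view(B, M, self.heads, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.to_out(out)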

Text-based Generation

Unlike Latent Diffusion, Stable Diffusion focuses on text-to-image generation. It was trained on about 2.5 billion image-text pairs from the LAION-5B dataset, far more data than Latent Diffusion. In addition, inspired by other generative work such as Imagen, the BERT text encoder used in Latent Diffusion was replaced with the stronger pre-trained CLIP ViT-L/14 text encoder.
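For reference, prompts can be encoded with the public CLIP ViT-L/14 text encoder roughly as follows, using the Hugging Face transformers library; this is a sketch of the conditioning pathway, not the repository's exact code.

import torch
from transformers import CLIPTokenizer, CLIPTextModel  # illustrative: HF transformers, not the repo's ldm modules

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

prompts = ["a painting of a squirrel eating a burger"]
tokens = tokenizer(prompts, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    # (1, 77, 768) per-token embeddings, used as keys/values for cross-attention in the U-Net
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state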

Model Practice

Training Dataset

To verify the training performance of Stable Diffusion, we reproduced its training on the Google Conceptual Captions dataset, a relatively small multimodal dataset with 2.85 million image-text pairs. The dataset has been integrated into High-Flyer AI's dataset warehouse, converted to the ffrecord training format, and stored on the 3FS high-speed file system. Users can load the training data at high speed as follows:

from torchvision import transforms
from hfai.datasets import GoogleConceptualCaption

transform = transforms.ToTensor()  # illustrative preprocessing; add resize/crop/normalize as needed
dataset = GoogleConceptualCaption(split="train", transform=transform)
dataloader = dataset.loader(**args)  # args: batch size, workers, etc.; data is read from 3FS

hfai.pl

PyTorch Lightning (PL) is a wrapper on top of PyTorch with its own parallel-training interface, and the Stable Diffusion source code is built on it. To let it take advantage of the Firefly cluster's optimization features, we adapt it with the hfai.pl plugin developed by High-Flyer, which provides:

  • hfai.pl.HFAIEnvironment: automatically adapts training to the Firefly cluster's multi-GPU parallel environment; simply add it as a plugin;

  • hfreduce_bind_numa: uses hfreduce to accelerate communication and binds NUMA nodes to avoid extra network overhead between GPUs;

  • hfai.pl.nn_to_hfai: replaces basic operators in the model with optimized hfai operators to speed up training.

The concrete steps are as follows:

1. Set the trainer's strategy to hfreduce_bind_numa in the configuration file:

trainer:
    max_epochs: 300
    strategy: hfreduce_bind_numa
    ...

2. In the training code, apply nn_to_hfai for operator acceleration and add the HFAIEnvironment plugin:

from pytorch_lightning import Trainer
from hfai.pl import HFAIEnvironment, nn_to_hfai

# Replace supported submodules with hfai's optimized operators
diffusionModelModule = nn_to_hfai(diffusionModelModule)

...

# Bind the trainer to the Firefly cluster's multi-node environment
trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)
trainer.plugins = [HFAIEnvironment()]

With these simple steps, the Firefly cluster's acceleration features are integrated into the Stable Diffusion training code. Comparing training speed before and after applying hfai.pl, the time for a single forward pass drops from 0.787 s to 0.758 s, a speedup of about 3.8%.

Training Optimization

We train on Google Conceptual Captions at 256x256 resolution, initializing the VQ-VAE used for latent-space mapping with weights pretrained on ImageNet.

During training, we ran Stable Diffusion on 4, 8, 16, and 32 nodes to test its sensitivity to different degrees of parallelism. As we scaled up training, we found that Stable Diffusion is very sensitive to the learning rate: it cannot simply be raised in proportion to the batch size, which easily leads to gradient explosion. We therefore used learning-rate warmup and gradient clipping during training to help the model converge faster and avoid divergence.
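A minimal sketch of these two measures in PyTorch Lightning: gradient clipping via the Trainer's gradient_clip_val, and a linear warmup schedule via LambdaLR. The clipping threshold of 1.0 and the 10,000 warmup steps are illustrative assumptions, not the exact settings of our runs.

import torch
import pytorch_lightning as pl

# Gradient clipping: Lightning clips the gradient norm every step (threshold is illustrative).
trainer = pl.Trainer(gradient_clip_val=1.0)

class DiffusionModule(pl.LightningModule):
    # Only the optimizer/scheduler hook is shown; the rest of the LightningModule is omitted.
    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-4)
        warmup_steps = 10_000  # ramp the learning rate linearly from 0 to its target
        scheduler = torch.optim.lr_scheduler.LambdaLR(
            optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]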

The orange and blue curves in the figure above show the training loss without and with warmup and gradient clipping, respectively. As the orange curve shows, without clipping and warmup the gradients explode around step 400 and the model can no longer converge. With both enabled, the learning rate ramps up more gradually, which effectively prevents gradient explosion.

Training Results

We evaluated the model after 240K steps of training on the Google Conceptual Captions dataset. On COCO FID-30K (a randomly sampled subset of the COCO Captions dataset consisting of 30,000 images), the model achieved an FID of 16.5, indicating that the generated images reflect the content of the prompts reasonably well.
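For reference, FID between real and generated images can be computed roughly as sketched below with torchmetrics; this is an assumed evaluation setup with dummy tensors, not the exact script we used, and in practice the full 30K real and generated images are accumulated before calling compute().

import torch
from torchmetrics.image.fid import FrechetInceptionDistance  # illustrative metric implementation

fid = FrechetInceptionDistance(feature=2048)

# uint8 image tensors of shape (N, 3, H, W); dummy data here, real/generated COCO images in practice
real_images = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (32, 3, 256, 256), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(fid.compute())  # lower is better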

Below are some example prompts and the images generated by feeding them to the model:

(a) A photo of a woman skiing on a white mountain.

(b) A painting of a squirrel eating a burger.

(c) A photo of a red train being operated on a train track.

(d) A photo of a dog playing in a green field next to a lake.

Although the training dataset is relatively small, the model still achieves fairly good generation quality.

Summary

As a flagship model for AI image generation, Stable Diffusion has attracted wide attention, and it can produce impressive results even when trained on a smaller dataset. With High-Flyer's Firefly cluster, we were able to accelerate Stable Diffusion training fairly easily through a few simple modifications, demonstrating the cluster's ease of use and capability.

Our overall ratings are as follows:

01: Research Novelty  ★★★★

The authors propose a generative architecture that performs diffusion in a latent space, reducing the runtime cost of diffusion models while preserving generation quality. The model also uses a cross-attention mechanism for conditional generation and supports image generation conditioned on multiple different modalities.

02: Open-Source Index  ★★★★★

As the first large pre-trained AI painting model to fully open-source its code, training data, and pre-trained weights, stable-diffusion has had an enormous impact in academia and related fields.

03: Compute Barrier  ★★

Because the model's resource usage is well optimized and the open-source tooling is mature, inference can run on a single ordinary GPU; training, however, is still expensive.

04: Generality  ★★★★

The latent-space diffusion approach applies to diffusion models in general, and the cross-attention-based conditioning allows the model to be used for many different task types, making it broadly instructive for research on generative models.

05: Model Adaptability  ★★★

The project depends on pytorch-lightning and needs some adaptation for the Firefly cluster, but with the hfai.pl tool it can be run in the High-Flyer AI environment fairly easily and benefit from its acceleration.


We hope to help more "imagination" and "creativity" grow, and we look forward to building the AI era together with scientists and developers everywhere.
