A Diffusion Model and Toolchain for Chinese Domain-Specific Text-to-Image Generation with Fast Inference Speed

Recently, Alibaba Cloud's machine learning platform PAI and South China University of Technology (through a joint training project between Alibaba Cloud and South China University of Technology) published Rapid Diffusion, a diffusion model and toolchain for Chinese domain-specific text-to-image generation with fast inference speed, at ACL 2023, a top natural language processing conference. Rapid Diffusion is a text-to-image generation model for specific Chinese domains. It adopts the same model structure as Stable Diffusion and can rapidly generate images from Chinese text. In addition, we provide one-click deployment: model fine-tuning and inference can be run on personal data with a single click.

Paper:

Bingyan Liu*, Weifeng Lin*, Zhongjie Duan, Chengyu Wang, Ziheng Wu, Zipeng Zhang, Kui Jia, Lianwen Jin, Cen Chen, Jun Huang. Rapid Diffusion: Building Domain-Specific Text-to-Image Synthesizers with Fast Inference Speed. ACL 2023 (Industry Track)

Background

Text-to-Image Synthesis (TIS) is the task of generating an image that matches a given text description. In recent years, with the rapid progress of research on large pre-trained models and diffusion models, text-to-image models that combine a pre-trained text encoder with a diffusion-based image generator have been able to produce images comparable to the work of human painters. However, because pre-trained language models lack domain-specific entity knowledge and diffusion models are slow at inference, the text-to-image models currently popular in the open source community are difficult to apply in specific industrial domains. The main problem is that diffusion-based methods rely on a pre-trained text encoder to encode the input text, which then serves as the conditional input to the U-Net of the diffusion model. Text encoders pre-trained on image-text pairs collected from the Internet struggle to understand domain-specific entity concepts and to capture entity knowledge, which is crucial for generating realistic images of entity objects. At the same time, the inference speed and computational cost of the diffusion model are also important factors, and the costly iterative denoising of the reverse diffusion process has always been the bottleneck of inference speed.

To solve these problems, we need diffusion models that can understand domain-specific entities and generate high-resolution images matching the text description, together with an optimized framework for text-to-image generation models that supports fast online inference.

Algorithm Overview

We propose Rapid Diffusion, a new framework for training and deploying text-to-image diffusion models. The model architecture is shown in Figure 1. Rapid Diffusion builds on the Stable Diffusion model. To improve its understanding of domain-specific entities, we inject rich entity knowledge into the CLIP text encoder, using knowledge graphs for knowledge enhancement. As in Stable Diffusion, the latent-space noise predictor of the text-to-image model is a U-Net with a cross-attention mechanism. Unlike the open source Stable Diffusion, which directly uses a large-scale hierarchical diffusion model, we integrate an ESRGAN-based network after the image diffusion module to raise the resolution of the generated image while avoiding the explosion in parameter count and computation time. For online deployment, we design an efficient inference pipeline based on a neural architecture optimized with FlashAttention. The Intermediate Representation (IR) of the model's computation graph is further processed by the end-to-end AI compiler BladeDISC to improve inference speed.

Knowledge Augmented Text Encoder

For the knowledge-enhanced text encoder, we focus on text-to-image generation in Chinese scenarios. To obtain a text encoder that better understands Chinese text and Chinese entity knowledge, we use the Wukong dataset of 100 million image-text pairs as pre-training data for the text encoder. For entity knowledge, we use the latest Chinese knowledge graph dataset from OpenKG, which contains 16 million entities and 140 million relation triples, to train our Chinese CLIP model. In the Chinese CLIP pre-training stage, each entity token in a Wukong corpus sentence is enhanced by combining the text embedding of the entity with its knowledge graph embedding, which is obtained with the TransE algorithm. Although we enhance the CLIP model for Chinese, the enhancement method also applies to other languages. When training a domain-specific text-to-image generation model, we keep the text encoder of the Chinese CLIP model trainable so that it can align with domain knowledge.
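
As a rough illustration of the knowledge enhancement described above, the sketch below scores a knowledge graph triple in the TransE style and fuses an entity's text embedding with its knowledge graph embedding. The element-wise addition used for fusion, the function names, and the toy vectors are assumptions made for illustration; the text here does not state the exact fusion operator.

```python
import numpy as np

def transe_score(head, relation, tail):
    """TransE models a true triple as head + relation ≈ tail.

    Returns the L2 distance; a smaller value means a more plausible triple.
    """
    return float(np.linalg.norm(head + relation - tail))

def enhance_entity_embedding(text_emb, kg_emb):
    """Fuse the text embedding of an entity token with its KG embedding.

    Assumption: element-wise addition (both vectors share one dimension).
    """
    return text_emb + kg_emb

# Toy 2-d embeddings (hypothetical values).
head = np.array([1.0, 0.0])
relation = np.array([0.0, 1.0])
true_tail = np.array([1.0, 1.0])
wrong_tail = np.array([0.0, 0.0])

print(transe_score(head, relation, true_tail))   # distance for the true triple
print(transe_score(head, relation, wrong_tail))  # distance for a corrupted triple
```

A true triple should receive a smaller distance than a corrupted one, which is the property TransE training enforces.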

Latent Space Noise Predictor

After obtaining the text embedding, we use a latent diffusion model to generate the latent encoding of the image in the latent space. The latent diffusion model is a U-Net with a cross-attention mechanism that captures the text conditioning information. During training, it is optimized with an image reconstruction loss.
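
The loss formula itself is not reproduced in this text. For orientation, the standard training objective of a text-conditioned latent diffusion model (Rombach et al., 2022) has the following form. The symbols are the usual ones and are assumptions here, not notation confirmed by this article: z_t is the noised latent at step t, c is the text embedding, and ε_θ is the U-Net noise predictor.

```latex
L_{\mathrm{LDM}} = \mathbb{E}_{z,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}
\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 \right]
```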

Image generation is the reverse of the diffusion process: images are generated from randomly sampled Gaussian noise, conditioned on the text information. To strengthen the correlation between generated images and the text, we train with classifier-free guidance. To reduce the time overhead of a large number of sampling steps, we use the PNDM algorithm to reduce the number of steps. In our framework, we pre-train the latent diffusion model on the Wukong dataset and then fine-tune it on domain-specific data.
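
Classifier-free guidance can be sketched in a few lines. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: during training the text condition is randomly replaced with a "null" embedding, and at sampling time the conditional and unconditional noise predictions are combined. The function names, the drop probability, and the guidance scale are all hypothetical.

```python
import numpy as np

def maybe_drop_condition(text_emb, null_emb, drop_prob, rng):
    """Training side: with probability drop_prob, train unconditionally."""
    return null_emb if rng.random() < drop_prob else text_emb

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Sampling side: push the prediction toward the text-conditioned one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
text_emb = np.ones(4)    # stand-in for a real text embedding
null_emb = np.zeros(4)   # stand-in for the empty-prompt embedding

cond_for_step = maybe_drop_condition(text_emb, null_emb, drop_prob=0.1, rng=rng)

eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])
print(guided_noise(eps_uncond, eps_cond, guidance_scale=7.5))
```

With a guidance scale of 1 the result equals the conditional prediction; larger values weight the text condition more heavily.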

Super-Resolution Network

The image generated by our latent diffusion model has a resolution of 256×256. To obtain a higher-resolution image, we directly apply a trained ESRGAN model to upscale it rather than adding further diffusion stages, which greatly improves the speed of image generation.

Inference Acceleration Design

For inference acceleration, we profiled the original PyTorch model and observed that the inference bottleneck lies mainly in the U-Net, where the cross-attention computation dominates inference time. The analysis results are shown in Figure 2. To address this, we combine automatic slicing and compilation optimization to optimize the entire pipeline, and introduce an IO-aware attention implementation to further improve inference performance.

Our inference acceleration algorithm augments a set of intermediate labels to build a complete dynamic graph representation. For memory-intensive operations, we make full use of shared memory to design a coarser-grained kernel fusion strategy, which effectively reduces switching between CPU and GPU. We then perform graph partitioning and kernel implementation selection to obtain the best inference speed.

On top of automatic compilation optimization, we use FlashAttention as the cross-attention operator of the U-Net, which is the core inference bottleneck of the network. Exploiting the IO characteristics of attention, this technique tiles the attention computation to reduce the volume of memory reads and writes. We introduce different FlashAttention kernel implementations for various combinations of computing devices and hardware architectures, as well as for dynamic inputs. Accelerating the cross-attention computation yields a 1.9x speedup for the U-Net module.
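
The tiling idea behind FlashAttention can be illustrated in plain NumPy. The sketch below is not the fused GPU kernel and makes no claim about the authors' implementation; it only shows, under simplified assumptions (single head, no masking, hypothetical block size), how attention can be computed over blocks of keys and values with a running softmax so that the full score matrix is never materialized.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference attention that builds the full (n_q, n_k) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block_size=4):
    """Attention over key/value blocks using an online (running) softmax."""
    n_q, d = q.shape
    out = np.zeros((n_q, v.shape[-1]))
    row_max = np.full((n_q, 1), -np.inf)  # running max of scores per query
    row_sum = np.zeros((n_q, 1))          # running softmax denominator
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        s = q @ k_blk.T / np.sqrt(d)
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        scale = np.exp(row_max - new_max)  # rescale earlier partial results
        p = np.exp(s - new_max)
        row_sum = row_sum * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ v_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))
k = rng.normal(size=(10, 8))
v = rng.normal(size=(10, 8))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))
```

Both functions compute the same result; the tiled version only ever holds one block of scores at a time, which is what lets a real kernel keep the working set in fast on-chip memory.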

Algorithm Accuracy Evaluation

To evaluate Rapid Diffusion, we tested it on three Chinese image-text datasets (e-commerce, traditional Chinese painting, and food). The results are shown in Table 1:

Table 1 Performance comparison between Rapid Diffusion and baseline models (FID score).

As Table 1 shows, Rapid Diffusion outperforms all baselines on the three datasets, with an average FID score of 21.90. The results indicate that our knowledge-enhanced model for domain-specific scenarios better understands domain knowledge and generates more realistic and diverse images.
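
For reference, FID (Fréchet Inception Distance) compares the Gaussian statistics of Inception features extracted from real and generated images, and a lower score is better. The symbols below follow the standard definition and are not taken from this article: μ_r, Σ_r are the mean and covariance of the real-image features, and μ_g, Σ_g those of the generated-image features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
+ \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```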

Table 2 Performance of knowledge-enhanced CLIP for text-image retrieval.

Since the CLIP model aims to learn cross-modal representations, we first evaluate our model intrinsically via text-image retrieval. We compare the Chinese CLIP model with our Chinese Knowledge-enhanced CLIP (CKCLIP) model, both pre-trained on the same image-text corpus. Table 2 reports the text-to-image and image-to-text retrieval results on the test set. Our CKCLIP model significantly improves retrieval performance (especially on the R@1 metric), showing its ability to learn cross-modal representations.
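
The R@K (Recall at K) metric used in retrieval evaluation can be computed as in the sketch below. It assumes a square similarity matrix in which query i matches candidate i, which is a common setup but an assumption here; the function name and the toy scores are hypothetical.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose true match is among the top-k candidates.

    similarity[i, j] is the score between query i and candidate j;
    the ground-truth match of query i is assumed to be candidate i.
    """
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    return float((top_k == targets).any(axis=1).mean())

# Toy text-to-image similarity scores for three queries.
similarity = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.7, 0.1],
    [0.2, 0.3, 0.1],
])

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(similarity, k):.3f}")
```

Image-to-text retrieval uses the same function on the transposed similarity matrix.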

Table 3 The inference acceleration results of Rapid Diffusion.

For inference speed, we use the end-to-end AI compiler BladeDISC and FlashAttention to accelerate the model. As the experimental results in Table 3 show, our method increases inference speed by 1.73 times. Although our experiments use Rapid Diffusion, the proposed acceleration method is general and also applies to other diffusion models, such as Stable Diffusion and Taiyi Diffusion. We also integrated Rapid Diffusion with the Alibaba Cloud machine learning platform PAI to demonstrate its value in real applications. On PAI, users can train, fine-tune, and run inference with their own models on their own tasks and data with one click.

In the future, we will extend the functionality of Rapid Diffusion and further improve inference speed with more advanced compilation optimization techniques. To better serve the open source community, our model and source code will be contributed to EasyNLP, a natural language processing algorithm framework. NLP practitioners and researchers are welcome to use it.

EasyNLP open source framework: https://github.com/alibaba/EasyNLP

References

  • Chengyu Wang, Minghui Qiu, Taolin Zhang, Tingting Liu, Lei Li, Jianing Wang, Ming Wang, Jun Huang, Wei Lin. EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing. EMNLP 2022
  • Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695.
  • Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851.
  • Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. Denoising diffusion implicit models. In International Conference on Learning Representations.
  • Kai Zhu, WY Zhao, Zhen Zheng, TY Guo, PZ Zhao, JJ Bai, Jun Yang, XY Liu, LS Diao, and Wei Lin. 2021. DISC: A dynamic shape compiler for machine learning workloads. In Proceedings of the 1st Workshop on Machine Learning and Systems, pages 89–95.
  • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. CoRR, abs/2205.14135.

Paper Information

Paper title: Rapid Diffusion: Building Domain-Specific Text-to-Image Synthesizers with Fast Inference Speed
Paper authors: Liu Bingyan, Lin Weifeng, Duan Zhongjie, Wang Chengyu, Wu Ziheng, Zhang Zipeng, Jia Kui, Jin Lianwen, Chen Cen, Huang Jun
Paper PDF link: https://aclanthology.org/2023.acl-industry.28.pdf

This article is the original content of Alibaba Cloud and may not be reproduced without permission.