"Zero" code changes, static compilation doubles the reasoning speed of Taiyi Stable Diffusion


Author|Liang Depeng
 

Tools for AI image generation had long been unsatisfactory, until Stable Diffusion was open sourced last August and became an indisputably epoch-making model in the field of AI image generation.

To improve its inference efficiency, OneFlow was the first to accelerate the Stable Diffusion model into the era of "one-second image generation", greatly increasing the speed of text-to-image generation. This drew a strong response in the AIGC field and earned official support from Stability.ai. To date, OneFlow is still refreshing the SOTA record.

However, since most teams currently build on a translation API plus the English Stable Diffusion model, the English model struggles to produce correctly matching images for uniquely Chinese narratives and expressions, which is inconvenient for some domestic users.


To solve this problem, the Cognitive Computing and Natural Language Research Center of the IDEA Institute (IDEA CCNL) open sourced the first Chinese Stable Diffusion, "Taiyi Stable Diffusion", trained on 20 million filtered Chinese image-text pairs. Last month, Taiyi Stable Diffusion reached nearly 150,000 downloads on HuggingFace, making it the most downloaded Chinese Stable Diffusion model.

Recently, the OneFlow team adapted the OneFlow backend for Taiyi Stable Diffusion, greatly improving its inference performance and likewise achieving one-second image generation. Many developers are curious about the optimization "secrets" behind OneFlow; these will be briefly explained below.

Welcome Star, run the OneFlow version of Taiyi Stable Diffusion:
 

https://github.com/Oneflow-Inc/diffusers/wiki/How-to-Run-OneFlow-Stable-Diffusion#without-docker

1

Compared with PyTorch, OneFlow more than doubles the inference speed of Taiyi Stable Diffusion
 

The following charts show the inference performance of Taiyi Stable Diffusion with PyTorch and OneFlow on different GPUs: A100 (PCIe 40GB / SXM 80GB), V100 (SXM2 32GB), RTX 2080, RTX 3080 Ti, RTX 3090, and T4.


On the A100, whether in the PCIe 40GB or the SXM 80GB configuration, OneFlow more than doubles performance compared with PyTorch, reaching an inference speed of over 50 it/s and generating an image in under one second.

Data on the other hardware (V100, RTX 2080, RTX 3080 Ti, RTX 3090, T4) shows the same trend.

Note: AIT data on the RTX 3090 was provided by the IDEA Institute.

In summary, across all the hardware compared, OneFlow more than doubles the inference performance of Taiyi Stable Diffusion relative to PyTorch.

2

Generated image showcase

The following prompts were each used to generate an image with the OneFlow version of Taiyi Stable Diffusion:

surging river, continuous, beautiful, illustration

Great Wall, morning, hazy, beautiful, illustration

Dreaming back to Jiangnan, an ancient Chinese town, beautiful, illustration

Future city in China, sci-fi illustration

ancient buildings, snowy

Snail noodles
 

3

Seamlessly compatible with the PyTorch ecosystem

Want to experience the OneFlow version of Taiyi Stable Diffusion? Only two lines of code need to be modified:

The two changes: replace "import torch" with "import oneflow as torch", and use OneFlowStableDiffusionPipeline in place of StableDiffusionPipeline.

Models can be migrated this easily because OneFlow Stable Diffusion has two outstanding features:

  1. OneFlowStableDiffusionPipeline.from_pretrained can directly load PyTorch weights.

  2. OneFlow's own API is aligned with PyTorch's, so after "import oneflow as torch", expressions such as torch.autocast and torch.float16 need no modification at all.
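Putting the two features together, usage can be sketched as follows. This is only a sketch: it assumes the OneFlow fork of diffusers is installed and a CUDA GPU is available, and the exact import path and model ID may differ by version.

```python
import oneflow as torch  # changed line 1: alias OneFlow as torch
from diffusers import OneFlowStableDiffusionPipeline  # changed line 2: OneFlow pipeline

# from_pretrained loads the original PyTorch weights directly (feature 1);
# torch.float16 and torch.autocast work unchanged under the alias (feature 2).
pipe = OneFlowStableDiffusionPipeline.from_pretrained(
    "IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

with torch.autocast("cuda"):
    # prompt: "a waterfall plunging three thousand feet, oil painting"
    image = pipe("飞流直下三千尺, 油画").images[0]
image.save("image.png")
```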

These features make OneFlow compatible with the PyTorch ecosystem. They not only eased the migration of Taiyi Stable Diffusion to OneFlow, but have also greatly accelerated the migration of many other models by OneFlow users. For example, in flowvision, which is benchmarked against torchvision, many models were obtained simply by adding "import oneflow as torch" to the corresponding torchvision model file.

In addition, OneFlow provides a global "mock torch" feature: after running eval $(oneflow-mock-torch) in the command line, "import torch" in all subsequently run Python scripts automatically points to oneflow.
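The principle behind such import redirection can be illustrated in plain Python. This is only an illustration of the mechanism, not OneFlow's actual implementation; the backend_name attribute is made up for the demo.

```python
import sys
import types

# Python resolves `import torch` through the sys.modules cache first, so
# registering a substitute module object under the key "torch" redirects
# every later import of that name to the substitute.
stand_in = types.ModuleType("torch")
stand_in.backend_name = "oneflow"  # hypothetical attribute, only for this demo
sys.modules["torch"] = stand_in

import torch  # resolves to the stand-in registered above, not real PyTorch

print(torch.backend_name)  # -> oneflow
```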
 

4

An integrated dynamic and static programming experience

The prototyping stage of a deep learning algorithm requires rapid modification and debugging, for which dynamic graph execution (eager mode, define-by-run) is optimal. In the deployment stage, however, the model is fixed and computational efficiency matters more; static graph execution (lazy mode, define-and-run) can be statically optimized by a compiler for better performance. The inference stage therefore mainly uses static graph mode.

Recently, PyTorch was upgraded to 2.0 and introduced the compile() API, which switches a model or module from dynamic graph execution to static graph execution. OneFlow has a similar mechanism, but the interface is nn.Graph, which converts the Module passed to it into static graph execution mode.
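Wrapping an eager Module in nn.Graph can be sketched as follows, following OneFlow's build() pattern. This assumes oneflow is installed; the module and shapes are made up for illustration.

```python
import oneflow as flow
import oneflow.nn as nn

# An ordinary eager-mode Module, unchanged from how it would be written normally.
class LinearModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)

# A static graph wrapper: build() plays the role of forward(), and the graph
# is compiled on the first call, so later calls reuse the optimized plan.
class LinearGraph(nn.Graph):
    def __init__(self, model):
        super().__init__()
        self.model = model  # reuse the eager Module as-is

    def build(self, x):
        return self.model(x)

model = LinearModule()
graph = LinearGraph(model)
y = graph(flow.randn(1, 4))  # first call triggers compilation
```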

Moreover, OneFlow's nn.Graph mode implements a series of graph-level optimizations based on MLIR, such as memory layout optimization and operator fusion.
 

This not only lets deep learning models expressed as computation graphs achieve top performance on various hardware; more importantly, it makes those graphs easier to port between different hardware, helping to overcome the weak software ecosystems of domestic hardware. In the future, we will publish more articles revealing the design and implementation of the OneFlow deep learning compiler.

Welcome Star, run the OneFlow version of Taiyi Stable Diffusion:
 

https://github.com/Oneflow-Inc/diffusers/wiki/How-to-Run-OneFlow-Stable-Diffusion#without-docker
 

OneFlow Address: https://github.com/Oneflow-Inc/oneflow/
 


Source: https://blog.csdn.net/OneFlow_Official/article/details/128731928