We tested Stable Diffusion acceleration on various graphics cards and measured speedups of up to 211.2%.

Stable Diffusion is a diffusion-model-based image generation technology that can produce high-quality images from text, making it suitable for fields such as CG, illustration, and high-resolution wallpapers.

However, its computation is complex and image generation is slow, so researchers have created various ways to speed it up, such as Xformers, Aitemplate, TensorRT, and OneFlow. In this article we run a series of comparative tests on these acceleration methods.

In this article, we will introduce the principles and performance test results of these acceleration methods, and provide a cost-benefit summary for different graphics cards. Our goal is to generate high-quality images within 2 seconds.

In our tests, compared with Xformers on the RTX 3090, OneFlow achieves a 211.2% relative speedup, and a 205.6% speedup on the RTX 4090. So a high-end GPU is still worthwhile.

Principles and Features of the Acceleration Solutions

The following table summarizes the currently available acceleration solutions:

This article tests Xformers, Aitemplate, TensorRT, and OneFlow. NvFuser is similar to Xformers in principle (both use FlashAttention technology), DeepSpeed and ColossalAI are mainly designed for training acceleration, and OpenAI Triton is a model deployment engine suited to batch-size (throughput) optimization rather than latency-sensitive scenarios, so these are not included in this article.

We use VoltaML to evaluate the acceleration effect of Aitemplate, Stable Diffusion WebUI to evaluate Xformers, the official TensorRT example to evaluate TensorRT, and integrate OneFlow into Diffusers to test its acceleration.
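As a concrete illustration of what such an integration looks like, below is a minimal sketch of enabling Xformers memory-efficient attention in a Hugging Face Diffusers pipeline. The model id, dtype, and prompt are illustrative assumptions; the Xformers numbers in this article come from Stable Diffusion WebUI, not this snippet.

```python
# Minimal sketch: enabling Xformers memory-efficient attention in a
# Hugging Face Diffusers pipeline. Model id and dtype are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Requires the `xformers` package to be installed.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("A beautiful girl, best quality", num_inference_steps=20).images[0]
image.save("sample.png")
```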

Testing the Acceleration Solutions

Next, we introduce the test configuration.

1. Test settings

Our performance metric is iterations per second (it/s). The image size is 512×512 and the step count is 100 (a minimal measurement sketch follows these settings).

The prompt is: A beautiful girl, best quality, ultra-detailed, extremely detailed CG unity 8k wallpaper, best illustration, an extremely delicate and beautiful, floating, high resolution.

Negative prompt: Low resolution, bad anatomy, bad hands, text error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, bad feet, fused body.

Sampler: Euler a

Model: Stable Diffusion 1.5
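Under these settings, it/s can be approximated by timing the pipeline call. Below is a minimal measurement sketch using the Diffusers library (Euler a corresponds to the EulerAncestralDiscreteScheduler there); the numbers reported in this article come from the tools listed earlier, not from this script.

```python
# Minimal sketch of measuring iterations/second with Diffusers under the
# settings above (512x512, 100 steps, Euler a, Stable Diffusion 1.5).
import time
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

prompt = ("A beautiful girl, best quality, ultra-detailed, extremely detailed "
          "CG unity 8k wallpaper, best illustration, an extremely delicate and "
          "beautiful, floating, high resolution.")
negative = ("Low resolution, bad anatomy, bad hands, text error, missing fingers, "
            "extra digit, fewer digits, cropped, worst quality, low quality, "
            "normal quality, jpeg artifacts, signature, watermark, username, "
            "blurry, bad feet, fused body.")

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

steps = 100
start = time.perf_counter()
pipe(prompt, negative_prompt=negative, width=512, height=512,
     num_inference_steps=steps)
elapsed = time.perf_counter() - start
print(f"{steps / elapsed:.2f} it/s")
```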

2. Test results

The performance test results on various GPUs are shown in the figure below (the first row is Xformers, the third row is Aitemplate, and the fourth row is OneFlow):

The acceleration comparison is as follows: OneFlow > TensorRT > Aitemplate > Xformers.

Compared to Xformers on the RTX 3090, OneFlow achieves a 211.2% relative speedup, and a 205.6% speedup on the RTX 4090.
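The speedup percentages are relative figures derived from the measured it/s values. A minimal sketch of the calculation, using placeholder it/s numbers rather than the measured results:

```python
# Relative speedup in percent, derived from it/s measurements.
# Convention: a 100% speedup means twice the baseline it/s.
# The it/s values below are hypothetical placeholders, not measured results.
def speedup_pct(accelerated_its: float, baseline_its: float) -> float:
    return (accelerated_its / baseline_its - 1.0) * 100.0

baseline_its = 10.0     # e.g. Xformers baseline (placeholder)
accelerated_its = 31.1  # e.g. OneFlow on the same GPU (placeholder)
print(f"{speedup_pct(accelerated_its, baseline_its):.1f}% speedup")  # 211.0% with these placeholders
```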

The detailed results are shown below:

GPU cost performance

We performed a cost-benefit analysis of the different GPUs and reached the following conclusions:

From a cost-performance perspective, the RTX 4090 offers the best value among high-end GPUs, and the RTX 2080 Ti is currently the most cost-effective overall; very low-end GPUs increase the overall cost, so entry-level GPUs are not recommended.
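One way to read the cost-performance comparison is cost per generated image, combining a GPU's rental price with its generation speed. A minimal sketch, with hypothetical hourly prices rather than the quotes used in the chart:

```python
# Cost per image = hourly rental price / images generated per hour.
# The hourly prices below are hypothetical placeholders, not the quotes used in the chart.
def cost_per_image(price_per_hour: float, seconds_per_image: float) -> float:
    images_per_hour = 3600.0 / seconds_per_image
    return price_per_hour / images_per_hour

print(cost_per_image(0.50, 2.0))    # fast high-end GPU: $0.50/h, ~2 s per image (placeholder)
print(cost_per_image(0.20, 15.74))  # slow low-end GPU: $0.20/h, ~15.7 s per image (placeholder)
```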

We also selected several low-end GPUs, including the M60, GTX 1660S, and GTX 1080; their problems are as follows:

1. GPUs such as the GTX 1660 and GTX 1080 do not support acceleration solutions such as TensorRT, Aitemplate, and OneFlow, possibly due to insufficient VRAM or GPU incompatibility.

2. The GTX 1660S (GTX 1080) takes 7.66 s (7.57 s) to generate a 512×512 image in 20 steps, i.e. 2.61 it/s (2.64 it/s); see the timing sketch after this list. Although slow, it is usable; if you are not in a hurry or are just experimenting, it can be considered.

3. The M60 reaches 1.27 it/s, taking 15.74 s to generate a 512×512 image in 20 steps, roughly half the speed of the GTX 1660S.
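These step times follow directly from the it/s figures: time per image ≈ steps / (it/s). A quick check with the numbers quoted in this list:

```python
# Time per image follows from the step count and the measured it/s.
def seconds_per_image(steps: int, its_per_sec: float) -> float:
    return steps / its_per_sec

print(seconds_per_image(20, 2.61))  # ≈ 7.66 s (GTX 1660S)
print(seconds_per_image(20, 2.64))  # ≈ 7.58 s (GTX 1080)
print(seconds_per_image(20, 1.27))  # ≈ 15.75 s (M60)
```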

Suggestions for choosing a GPU

1. Although the RTX 4090 is the fastest, the RTX 3090 is also worth considering: its performance is better than other GPUs of the same tier, such as the A5000 and A4000. (The prices below are based on cloud-provider GPU prices; if you are buying your own hardware, go straight for the RTX 4090, since its price is not much higher than the RTX 3090's.)

2. Larger VRAM allows more models to be cached, reducing model loading time and significantly speeding up the image generation process (see the caching sketch after this list).

Both the RTX 3090 and RTX 4090 have 24 GB of VRAM, but if Stable Diffusion WebUI is optimized around VRAM usage, the RTX 3090 may have an advantage in VRAM cost. If inference speed is the priority, the RTX 4090 is the best choice, as its inference time is about half that of the RTX 3090.

3. For more details on different GPUs, please refer to the chart below.
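On the VRAM point in item 2, here is a minimal sketch of the kind of caching that larger VRAM enables: keeping several loaded pipelines resident on the GPU so that switching models does not require reloading from disk. The model ids and the cache structure are illustrative assumptions, not the WebUI's actual mechanism.

```python
# Minimal sketch: keep loaded pipelines resident in VRAM so that switching
# models avoids a reload from disk. Model ids are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

_cache: dict[str, StableDiffusionPipeline] = {}

def get_pipeline(model_id: str) -> StableDiffusionPipeline:
    # Load the model only the first time it is requested; later calls reuse
    # the copy already resident in VRAM (this is what larger VRAM buys you).
    if model_id not in _cache:
        _cache[model_id] = StableDiffusionPipeline.from_pretrained(
            model_id, torch_dtype=torch.float16
        ).to("cuda")
    return _cache[model_id]

pipe = get_pipeline("runwayml/stable-diffusion-v1-5")
image = pipe("A beautiful girl, best quality", num_inference_steps=20).images[0]
```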

That concludes the full set of tests; we hope you find it helpful.

https://avoid.overfit.cn/post/4d41ab2ecdce462786892e315dc49ecc

Author: Omniinfer


Origin: blog.csdn.net/m0_46510245/article/details/131876256