Stable Diffusion on a Raspberry Pi

Stable Diffusion works on a Raspberry Pi: 260 MB of RAM is enough to hold a model with 1 billion parameters.

Stable Diffusion was released about 11 months ago, and the news that it could run on consumer-grade GPUs encouraged many researchers. Not only that, Apple quickly stepped in officially and brought Stable Diffusion to the iPhone, iPad and Mac. This greatly lowered Stable Diffusion's hardware requirements, gradually turning it into a piece of "black technology" that anyone can use.

Now, it even runs on the Raspberry Pi Zero 2.

Raspberry Pi Zero 2: "Just as small. Five times as fast."

To put that in perspective: running Stable Diffusion is not trivial. It contains a large Transformer model with about 1 billion parameters, and the usual recommended minimum is 8 GB of RAM/VRAM, while the RPI Zero 2 is a microcomputer with just 512 MB of memory.

This makes running Stable Diffusion on the RPI Zero 2 a huge challenge. Moreover, the author did not add swap space or offload intermediate results to disk during the run.

In general, major machine learning frameworks and libraries focus on minimizing inference latency and/or maximizing throughput, but always at the cost of memory usage. The author therefore decided to write an ultra-small, hackable inference library dedicated to minimizing memory consumption.

That library is OnnxStream.

Project address: https://github.com/vitoplantamura/OnnxStream

OnnxStream is based on the idea of decoupling the inference engine from the component responsible for providing the model weights, which is a class derived from WeightsProvider. A WeightsProvider specialization can implement any kind of model-parameter loading, caching, and prefetching. For example, a custom WeightsProvider could download data directly from an HTTP server without loading or writing anything to disk (which is why OnnxStream has "Stream" in its name). Two default WeightsProviders are available: DiskNoCache and DiskPrefetch.
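To make this decoupling concrete, here is a minimal sketch in C++. Apart from the WeightsProvider, DiskNoCache, and DiskPrefetch names mentioned above, the interface and class below are hypothetical and do not reproduce OnnxStream's actual API; they only illustrate how an engine can ask a pluggable provider for each tensor's bytes.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical provider interface: the engine asks for a named weight
// tensor and receives its raw bytes, without caring where they come from.
class WeightsProvider {
public:
    virtual ~WeightsProvider() = default;
    virtual std::vector<uint8_t> get_weights(const std::string& tensor_name) = 0;
};

// Example specialization: read each tensor from its own file on disk,
// keeping nothing cached in RAM (analogous in spirit to DiskNoCache).
class DiskNoCacheProvider : public WeightsProvider {
public:
    explicit DiskNoCacheProvider(std::string dir) : m_dir(std::move(dir)) {}

    std::vector<uint8_t> get_weights(const std::string& tensor_name) override {
        std::ifstream f(m_dir + "/" + tensor_name, std::ios::binary);
        return std::vector<uint8_t>(std::istreambuf_iterator<char>(f), {});
    }

private:
    std::string m_dir;
};

// A custom provider could just as well stream weights over HTTP,
// decrypt them on the fly, or prefetch the next layer in a worker thread.
```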

Compared with OnnxRuntime, Microsoft's inference framework, OnnxStream needs only about 1/55 of the memory to produce the same results, at the cost of being only 0.5-2x slower (on CPU).

Below are the results of running Stable Diffusion on the RPI Zero 2, along with the method that makes it possible. Note that although generation is slow, this is a fresh attempt at running large models on smaller, more limited devices.

Commenters online found the project very cool.

Running Stable Diffusion on a Raspberry Pi Zero 2

The VAE decoder is the only model in Stable Diffusion that cannot fit in the RPI Zero 2's RAM at single or half precision, due to the residual connections, very large tensors, and convolutions in the model. The only solution is static quantization (8-bit).
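For readers unfamiliar with the scheme, the sketch below shows the arithmetic behind unsigned, asymmetric 8-bit quantization (the kind listed in the feature section further down). It uses simple min/max range selection rather than the percentile-based calibration OnnxStream implements, so treat it as an illustration of the idea, not the library's exact procedure.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Unsigned, asymmetric 8-bit quantization: map a float range [lo, hi]
// onto [0, 255] with a scale and a zero point.
struct QuantParams {
    float scale;
    uint8_t zero_point;
};

QuantParams calibrate(const std::vector<float>& x) {
    // Simple min/max calibration (OnnxStream uses a percentile-based
    // variant, which is more robust against outliers).
    float lo = *std::min_element(x.begin(), x.end());
    float hi = *std::max_element(x.begin(), x.end());
    lo = std::min(lo, 0.0f);  // the range must contain zero
    hi = std::max(hi, 0.0f);
    float scale = (hi - lo) / 255.0f;
    if (scale == 0.0f) scale = 1.0f;  // avoid division by zero for constant inputs
    uint8_t zp = static_cast<uint8_t>(std::round(-lo / scale));
    return {scale, zp};
}

uint8_t quantize(float v, const QuantParams& p) {
    float q = std::round(v / p.scale) + p.zero_point;
    return static_cast<uint8_t>(std::clamp(q, 0.0f, 255.0f));
}

float dequantize(uint8_t q, const QuantParams& p) {
    return (static_cast<float>(q) - p.zero_point) * p.scale;
}
```

Storing weights and activations at 8 bits quarters their size relative to FP32, which is what shrinks the VAE decoder's very large intermediate tensors enough to fit in 512 MB.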

The images below were generated with the Stable Diffusion example implementation included in the author's repo, using OnnxStream with the VAE decoder at different precisions.

The first image was generated on the author's PC, using the same latents produced by the RPI Zero 2.

Image 1: VAE decoder at W16A16 precision

Image 2: VAE decoder at W8A32 precision

The third image was generated by the RPI Zero 2 in about 3 hours.

Image 3: VAE decoder at W8A8 precision

Features of OnnxStream

  • Inference engine is decoupled from WeightsProvider

  • WeightsProvider can be DiskNoCache, DiskPrefetch or custom

  • Attention slicing

  • Dynamic quantization (8 bit unsigned, asymmetric, percentile)

  • Static quantization (W8A8 unsigned, asymmetric, percentile)

  • Easy calibration of quantized models

  • FP16 support (with or without FP16 arithmetic)

  • Implemented 24 ONNX operators (the most commonly used operators)

  • Operations are performed sequentially, but all operators are multi-threaded

  • Single implementation file + header file

  • XNNPACK calls are wrapped in XnnPack class (for future replacement)

Note that OnnxStream relies on XNNPACK to accelerate certain primitives: MatMul, Convolution, element-wise Add/Sub/Mul/Div, Sigmoid, and Softmax.

Performance Comparison

Stable Diffusion consists of three models: a text encoder (672 operations and 123 million parameters), a UNET model (2050 operations and 854 million parameters), and a VAE decoder (276 operations and 49 million parameters).

Assuming a batch size of 1, generating a full image with good results takes 10 steps (using the Euler Ancestral scheduler), which requires 2 runs of the text encoder, 20 (i.e. 2×10) runs of the UNET model, and 1 run of the VAE decoder, as sketched below.
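The schematic below makes this operation count explicit for a 10-step run with classifier-free guidance; the function names are placeholders, not OnnxStream API.

```cpp
#include <vector>

using Tensor = std::vector<float>;  // placeholder tensor type

// Placeholder model invocations; in a real pipeline each of these would
// execute the corresponding ONNX graph through the inference engine.
Tensor run_text_encoder(const Tensor& tokens) { return tokens; }
Tensor run_unet(const Tensor& latent, const Tensor& /*text_embedding*/, int /*step*/) { return latent; }
Tensor run_vae_decoder(const Tensor& latent) { return latent; }
Tensor scheduler_step(const Tensor& latent, const Tensor& /*eps_cond*/,
                      const Tensor& /*eps_uncond*/, int /*step*/) { return latent; }

Tensor generate(const Tensor& prompt_tokens, const Tensor& empty_tokens,
                Tensor latent, int steps = 10) {
    // 2 text-encoder runs: one for the prompt, one for the empty prompt
    // (classifier-free guidance).
    Tensor cond = run_text_encoder(prompt_tokens);
    Tensor uncond = run_text_encoder(empty_tokens);

    // 2 UNET runs per denoising step -> 20 runs for 10 steps; since
    // batch size > 1 is not supported, the conditional and unconditional
    // passes cannot be fused into one batched call.
    for (int i = 0; i < steps; ++i) {
        Tensor eps_cond = run_unet(latent, cond, i);
        Tensor eps_uncond = run_unet(latent, uncond, i);
        latent = scheduler_step(latent, eps_cond, eps_uncond, i);  // Euler Ancestral update
    }

    // 1 VAE decoder run turns the final latent into the output image.
    return run_vae_decoder(latent);
}
```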

The table shows the different inference times for the three models of Stable Diffusion, as well as memory consumption (i.e. Peak Working Set Size in Windows or Maximum Resident Set Size in Linux).

For the UNET model (run at FP16 precision, with FP16 arithmetic enabled in OnnxStream), OnnxStream's memory consumption is only 1/55 that of OnnxRuntime, while it is only 0.5-2x slower.

A few things to note about this test are:

  • The first run of OnnxRuntime is a warm-up inference, since its InferenceSession is created before the first run and reused for all subsequent runs. OnnxStream has no warm-up inference because its design is purely "eager" (however, subsequent runs can benefit from the operating system's caching of the weight files).

  • Currently OnnxStream does not support inputs with batch size != 1, unlike OnnxRuntime, which can greatly speed up the whole diffusion process by using batch size = 2 when running the UNET model.

  • In testing, changing OnnxRuntime's SessionOptions (such as EnableCpuMemArena and ExecutionMode) had no noticeable effect on the results.

  • The performance of OnnxRuntime is very similar to NCNN (another framework) in terms of memory consumption and inference time.

  • Test environment: Windows Server 2019, 16 GB RAM, 8750H CPU (AVX2), 970 EVO Plus SSD, 8 virtual cores on VMware.
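For reference, the peak-memory metric used above can be read programmatically; on Linux, the Maximum Resident Set Size is available via getrusage, as in this minimal sketch:

```cpp
#include <cstdio>
#include <sys/resource.h>

int main() {
    // ... run the models here ...

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    // On Linux, ru_maxrss reports the peak resident set size in kilobytes.
    std::printf("Maximum Resident Set Size: %ld KB\n", ru.ru_maxrss);
    return 0;
}
```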

Attention Slicing and Quantization

Employing "attention slicing" and using W8A8 quantization for the VAE decoder when running the UNET model was critical to reducing model memory consumption to a level suitable for running on RPI Zero 2.

While there is a lot of information on the Internet about quantized neural networks, there is very little about "attention slicing".

The idea here is simple: the goal is to avoid materializing the full Q@K^T matrix when computing the scaled dot-product attention of the various multi-head attention blocks in the UNET model. In the UNET model, with 8 attention heads, Q has shape (8, 4096, 40) and K^T has shape (8, 40, 4096), so the result of the first MatMul has shape (8, 4096, 4096), which is a 512 MB tensor at FP32 precision.

The solution is to split Q vertically and then perform the attention operations normally on each chunk of Q. Q_sliced has shape (1, x, 40), where x is 4096 (in this case) divided by onnxstream::Model::m_attention_fused_ops_parts (which defaults to 2, but can be customized).

This simple trick reduces the overall memory consumption of the UNET model from 1.1 GB to 300 MB when running at FP32 precision. A more efficient alternative would be FlashAttention, but FlashAttention requires a custom kernel for each supported architecture (AVX, NEON, etc.), bypassing XnnPack in the author's example.
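Below is a minimal sketch of the slicing idea, written against plain float buffers rather than OnnxStream's internals; the function and parameter names are illustrative. Instead of materializing the full (4096, 4096) score matrix for a head, Q is processed in blocks of rows, so only a (block, seq) slice of the scores exists at any time.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sliced scaled-dot-product attention for one head.
// Q: (seq, d), K: (seq, d), V: (seq, d), row-major float buffers.
// Only `block` rows of the (seq, seq) score matrix are alive at a time.
std::vector<float> sliced_attention(const std::vector<float>& Q,
                                    const std::vector<float>& K,
                                    const std::vector<float>& V,
                                    int seq, int d, int block) {
    std::vector<float> out(static_cast<size_t>(seq) * d, 0.0f);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));

    for (int q0 = 0; q0 < seq; q0 += block) {
        int rows = std::min(block, seq - q0);
        // Scores for this Q slice only: (rows, seq) instead of (seq, seq).
        std::vector<float> scores(static_cast<size_t>(rows) * seq);
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < seq; ++j) {
                float s = 0.0f;
                for (int k = 0; k < d; ++k)
                    s += Q[(q0 + i) * d + k] * K[j * d + k];  // Q_slice @ K^T
                scores[i * seq + j] = s * scale;
            }
        // Row-wise softmax, then multiply by V into the output slice.
        for (int i = 0; i < rows; ++i) {
            float mx = scores[i * seq], sum = 0.0f;
            for (int j = 1; j < seq; ++j) mx = std::max(mx, scores[i * seq + j]);
            for (int j = 0; j < seq; ++j) {
                scores[i * seq + j] = std::exp(scores[i * seq + j] - mx);
                sum += scores[i * seq + j];
            }
            for (int j = 0; j < seq; ++j) {
                float w = scores[i * seq + j] / sum;
                for (int k = 0; k < d; ++k)
                    out[(q0 + i) * d + k] += w * V[j * d + k];
            }
        }
    }
    return out;
}
```

With seq = 4096, d = 40, the heads processed one at a time, and Q split into two blocks, the peak score buffer shrinks from 8 × 4096 × 4096 floats (512 MB) to roughly 2048 × 4096 floats (32 MB).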

Reference links:

https://www.reddit.com/r/MachineLearning/comments/152ago3/p_onnxstream_running_stable_diffusion_in_260mb_of/

https://github.com/vitoplantamura/OnnxStream

Origin: blog.csdn.net/qq_29788741/article/details/131879338