ByteDance proposes a high-performance transformer inference library, winner of the Best Paper Award at IPDPS 2023

The paper "ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs", jointly published by ByteDance, NVIDIA and the University of California, Riverside, was presented at the 37th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2023), where it stood out from 396 submissions and won the Best Paper Award. The paper proposes ByteTransformer, ByteDance's GPU transformer inference library. Targeting the variable-length inputs that are common in natural language processing, it proposes a set of optimization algorithms that avoid the redundant computation of traditional implementations while guaranteeing correctness, delivering a substantial improvement in end-to-end inference performance. In addition, the paper hand-tunes core transformer operators such as multi-head attention, layer normalization and activation, bringing ByteTransformer's inference performance to an industry-leading level. Compared with well-known deep learning libraries such as PyTorch, TensorFlow, NVIDIA FasterTransformer and Microsoft DeepSpeed-Inference, ByteTransformer achieves up to a 131% speedup on variable-length inputs. The paper's code has been open-sourced.

ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs (https://arxiv.org/abs/2210.03052)

IPDPS: a flagship conference in parallel and distributed computing within the computer systems field. The conference focuses on sharing and discussing the latest research progress in areas such as parallel computing, distributed computing, large-scale data processing and high-performance computing. Participating experts and scholars come from top research institutions and companies around the world to discuss innovations and cutting-edge technologies in the field.

code: https://github.com/bytedance/ByteTransformer

Padding-free processing of variable-length text in transformers

Transformers are widely used in natural language processing (NLP). With the emergence and development of large models such as BERT and GPT-3, the importance of the transformer model has become increasingly prominent. These large models typically have over 100 million parameters and require significant computing resources and time for training and inference. Optimizing transformer performance has therefore become very important.

Some existing deep learning frameworks, such as TensorFlow, PyTorch, TVM, and NVIDIA TensorRT, require all input sequences in a batch to have the same length in order to accelerate transformer computation with batching. In practical scenarios, however, input sequences are usually of variable length, and zero padding introduces a lot of additional computational overhead. Some approaches group inputs of similar seqlen before kernel launch to minimize padding, but they cannot achieve fully padding-free execution. The "effective transformer" [4] previously proposed by the ByteDance AML team achieves padding-free QKV projection and MLP by rearranging the input, but the self-attention part still requires padding.

To solve this problem, the ByteDance AML team proposed ByteTransformer, which implements padding-free computation for variable-length inputs and applies comprehensive kernel fusion to further improve performance.

11a5b32b65f477ccbe44b1ddea4fd407.png

Figure 1: Feature comparison between ByteTransformer and other work

Remove padding algorithm

This algorithm originates from the ByteDance AML team's earlier work "effective transformer" [4] and has also been integrated into NVIDIA's open-source FasterTransformer. ByteTransformer uses it to remove redundant computation from the matrix multiplications outside of attention.

Algorithm steps:

1) Calculate the prefix sum of the attention mask as offsets

2) Using the offsets, rearrange the input tensor from [batch_size, seqlen, hidden_size] to [valid_seqlen, hidden_size], which then feeds the subsequent matrix multiplications with no padding (see the sketch after Figure 2)

905d1ba2cd3db4ddd309119aad3d0c8e.png

Figure 2: The remove-padding algorithm workflow
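To make the rearrangement concrete, here is a minimal CUDA sketch of the remove-padding step. It assumes fp32 data and that the exclusive prefix sum of the per-sample valid lengths (offsets) has already been computed; all names are illustrative and not ByteTransformer's actual API.

```cuda
// Minimal remove-padding sketch (assumption: fp32, offsets = exclusive prefix sum
// of valid_len, computed beforehand). Each valid token row is copied from its
// padded position [b, s, :] to its packed position [offsets[b] + s, :].
__global__ void remove_padding(const float* padded,   // [batch_size, max_seqlen, hidden_size]
                               float* packed,         // [total_valid_tokens, hidden_size]
                               const int* valid_len,  // [batch_size]
                               const int* offsets,    // [batch_size]
                               int max_seqlen, int hidden_size) {
    int b = blockIdx.y;                                // sample index
    int s = blockIdx.x;                                // token position within the sample
    if (s >= valid_len[b]) return;                     // padded positions are skipped entirely

    const float* src = padded + ((size_t)b * max_seqlen + s) * hidden_size;
    float* dst = packed + (size_t)(offsets[b] + s) * hidden_size;
    for (int i = threadIdx.x; i < hidden_size; i += blockDim.x)
        dst[i] = src[i];
}
// Launch sketch: grid(max_seqlen, batch_size), block(256). The packed tensor then
// feeds the QKV projection and MLP GEMMs with no wasted work on padded tokens.
```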

FMHA (Fused Multi-Head Attention)

To optimize the attention part, ByteTransformer implements a fused multi-head attention (FMHA) operator. Depending on seqlen, two implementations are used, with 384 as the boundary:

  • For short seqlen, the entire row of QK^T fits in shared memory for the softmax operation, so the whole computation can be done in a handwritten kernel, with the matrix multiplications calling the wmma interface to use Tensor Cores for high performance (see the simplified sketch after this list).

  • For long seqlen, shared memory is too small to complete all operations in a single handwritten kernel. Instead, the computation is split into two GEMM kernels based on the high-performance CUTLASS [5] grouped GEMM, with operations such as add_bias and softmax fused into the GEMM kernels.
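As a rough illustration of the short-seqlen case, below is a heavily simplified CUDA sketch: one thread block per query row of one head, fp32, no wmma/Tensor Cores and no attention mask. It only shows why a score row that fits in shared memory lets both matrix multiplications and the softmax live in one handwritten kernel; it is not ByteTransformer's actual kernel.

```cuda
// Simplified fused MHA sketch for short seqlen (assumptions: fp32, Q/K/V laid out as
// [batch * head_num, seqlen, head_dim], one block per (batch*head, query row)).
// The full score row stays in shared memory, so softmax never touches global memory.
__global__ void naive_fused_mha_row(const float* Q, const float* K, const float* V,
                                    float* O, int seqlen, int head_dim, float scale) {
    extern __shared__ float scores[];                  // one full row of QK^T, length = seqlen
    int bh  = blockIdx.y;                              // batch * head index
    int row = blockIdx.x;                              // query row index
    const float* q      = Q + ((size_t)bh * seqlen + row) * head_dim;
    const float* k_base = K + (size_t)bh * seqlen * head_dim;
    const float* v_base = V + (size_t)bh * seqlen * head_dim;

    // 1) scores[j] = scale * dot(q, K[j])
    for (int j = threadIdx.x; j < seqlen; j += blockDim.x) {
        float acc = 0.f;
        for (int d = 0; d < head_dim; ++d) acc += q[d] * k_base[(size_t)j * head_dim + d];
        scores[j] = acc * scale;
    }
    __syncthreads();

    // 2) softmax statistics over the whole row, entirely in shared memory
    //    (serial reduction on thread 0 for clarity only)
    __shared__ float row_max, row_sum;
    if (threadIdx.x == 0) {
        float m = -1e30f;
        for (int j = 0; j < seqlen; ++j) m = fmaxf(m, scores[j]);
        float s = 0.f;
        for (int j = 0; j < seqlen; ++j) s += expf(scores[j] - m);
        row_max = m; row_sum = s;
    }
    __syncthreads();

    // 3) O[row, d] = sum_j softmax(scores)[j] * V[j, d]
    for (int d = threadIdx.x; d < head_dim; d += blockDim.x) {
        float acc = 0.f;
        for (int j = 0; j < seqlen; ++j)
            acc += (expf(scores[j] - row_max) / row_sum) * v_base[(size_t)j * head_dim + d];
        O[((size_t)bh * seqlen + row) * head_dim + d] = acc;
    }
}
// Launch sketch: grid(seqlen, batch_size * head_num), block(128),
// dynamic shared memory = seqlen * sizeof(float).
```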

CUTLASS grouped GEMM

The grouped GEMM developed by NVIDIA can compute multiple independent matrix multiplication problems in a single kernel. This property makes padding-free attention possible.

  • The two matrix multiplications in attention can each be decomposed into batch_size x head_num independent matrix multiplication sub-problems.

  • For each sub-problem, its problem size is passed to the grouped GEMM, with the real valid seqlen used as the sequence-length dimension.

Grouped GEMM principle: each threadblock (CTA) in the kernel uses a fixed tiling size. Each matrix multiplication sub-problem is split into a different number of tiles according to its problem size and the tiling size, and these tiles are then evenly distributed across the threadblocks for computation.

4a6e3a05baf9be0c30e117b5899628ce.png

Figure 3: Schematic of the grouped GEMM principle. Each sub-problem is split into a different number of tiles, which are then evenly distributed, so a single kernel efficiently computes multiple independent GEMM problems.
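As a hedged host-side sketch, the per-(batch, head) sub-problem sizes for the Q x K^T step might be assembled along these lines, using CUTLASS's GemmCoord. The function name, variable names, and header path are assumptions for illustration, not ByteTransformer's actual interface.

```cuda
// Host-side sketch: build the problem-size list that a CUTLASS grouped GEMM consumes
// for Q x K^T, one sub-problem per (batch, head), using the real valid seqlen.
#include <vector>
#include <cutlass/gemm_coord.h>   // assumed header path for cutlass::gemm::GemmCoord

std::vector<cutlass::gemm::GemmCoord> build_qk_problem_sizes(
    const std::vector<int>& valid_seqlen,   // per-sample valid lengths (no padding)
    int head_num, int head_dim) {
  std::vector<cutlass::gemm::GemmCoord> problems;
  problems.reserve(valid_seqlen.size() * head_num);
  for (size_t b = 0; b < valid_seqlen.size(); ++b) {
    int s = valid_seqlen[b];
    for (int h = 0; h < head_num; ++h) {
      // Q x K^T for one (batch, head): [s, head_dim] x [head_dim, s] -> [s, s]
      problems.emplace_back(s, s, head_dim);   // GemmCoord(M, N, K)
    }
  }
  return problems;
}
```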

When grouped GEMM is used to implement attention, the number of sub-problems (batch_size x head_num) is usually large, so reading the sub-problem parameters incurs significant overhead: from a thread's perspective, every thread must traverse and read all the sub-problem sizes.

To solve this, ByteTransformer optimizes the reading of sub-problem parameters in grouped GEMM, making its overhead negligible:

1) Share sub-problem parameters. For the same input, different heads share the same valid seqlen and therefore the same problem size. By sharing these parameters, the parameter storage is reduced from batch_size x head_num entries to batch_size entries.

2) Warp prefetch. In the original implementation, each CUDA thread reads all sub-problem sizes in turn, which is very inefficient. Instead, the 32 threads of a warp each read one of 32 consecutive sub-problem parameters and then exchange the data through warp-level communication, reducing the number of reads per thread to 1/32 (see the sketch after Figure 4).

25f6411721fbf570c56e61f739d7a83f.png

Figure 4: Schematic of warp prefetch. In each iteration, one warp reads the sizes of 32 sub-problems.
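Below is a hypothetical CUDA sketch of the warp-prefetch idea: each of the 32 lanes of a warp loads one of 32 consecutive sub-problem sizes, then the warp exchanges the values with __shfl_sync. The ProblemSize struct and function are illustrative only (CUTLASS stores problem sizes as GemmCoord); this is not the actual CUTLASS/ByteTransformer code.

```cuda
// Warp-prefetch sketch: one global load per lane per group of 32 sub-problems,
// then register shuffles broadcast every entry to all lanes of the warp.
struct ProblemSize { int m, n, k; };          // illustrative stand-in for GemmCoord

__device__ void warp_prefetch_problem_sizes(const ProblemSize* all_problems,
                                            int problem_count,
                                            int group_start,         // first problem of this group of 32
                                            ProblemSize* local_copy  // per-thread buffer of 32 entries
) {
    int lane = threadIdx.x & 31;
    int idx  = group_start + lane;

    // Each lane issues at most one global load for the whole group of 32 problems.
    ProblemSize mine = (idx < problem_count) ? all_problems[idx] : ProblemSize{0, 0, 0};

    // Exchange entries through warp shuffles so every lane ends up with all 32 sizes.
    for (int src = 0; src < 32; ++src) {
        local_copy[src].m = __shfl_sync(0xffffffff, mine.m, src);
        local_copy[src].n = __shfl_sync(0xffffffff, mine.n, src);
        local_copy[src].k = __shfl_sync(0xffffffff, mine.k, src);
    }
}
```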

Softmax fusion

To further improve performance, the softmax following Q x K is also fused into the matrix multiplication operator, which saves the memory accesses for the intermediate matrix compared to a separate softmax kernel.

Softmax needs to reduce over an entire row of data, but due to the shared memory size limit a single threadblock cannot hold the whole row, and communication between threadblocks is very inefficient, so the full softmax cannot be completed in the Q x K epilogue alone. The softmax calculation is therefore divided into three steps: part is fused into the epilogue of Q x K, part into the prologue of QK x V, and a lightweight reduction kernel is added in between.

60f468f885b94fcb9f0647a293c57897.png

Figure 5: Schematic of the softmax fusion process. The calculation is divided into three steps, with most of the work fused into the preceding and following GEMM kernels.

Algorithm steps:

1) Partial reduction: in the epilogue of Q x K, each threadblock performs an internal reduction to compute its partial max and sum values

2) Full reduction: a lightweight kernel further reduces each row's partial results into whole-row results (see the sketch after this list)

3) Element-wise op: CUTLASS is modified to support prologue fusion, i.e. fusing element-wise operations right after the input matrix is loaded. In the prologue of QK x V, the reduction result of the current row is read and the final softmax value is computed before it participates in the subsequent matrix multiplication
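For step 2, here is a minimal sketch of what the lightweight full-reduction kernel could look like, assuming the Q x K epilogue wrote one partial (max, sum) pair per row per tile. Merging partials requires rescaling each partial sum by exp(partial_max - row_max). The layout and names are assumptions for illustration, not the actual kernel.

```cuda
// Full-reduction sketch: combine per-tile partial (max, sum) pairs into per-row values.
// partial_sum[t] was computed as sum_j exp(x_j - partial_max[t]) within tile t, so it
// must be rescaled to the common row max before accumulation.
__global__ void softmax_full_reduce(const float* partial_max,  // [rows, num_tiles]
                                    const float* partial_sum,  // [rows, num_tiles]
                                    float* row_max,            // [rows]
                                    float* row_sum,            // [rows]
                                    int rows, int num_tiles) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row, for clarity
    if (row >= rows) return;

    // 1) global max of this row across all tiles
    float m = -1e30f;
    for (int t = 0; t < num_tiles; ++t)
        m = fmaxf(m, partial_max[row * num_tiles + t]);

    // 2) rescale each partial sum to the common max and accumulate
    float s = 0.f;
    for (int t = 0; t < num_tiles; ++t)
        s += partial_sum[row * num_tiles + t] * expf(partial_max[row * num_tiles + t] - m);

    row_max[row] = m;
    row_sum[row] = s;
}
```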

Performance data

Short seqlen handwritten kernel performance

For short seqlen (<= 384), cuBLAS batched GEMM is 5x faster than PyTorch MHA, and enabling the zero-padding algorithm for softmax improves performance by a further 9%. After fusing everything into a single kernel, the fused MHA achieves average speedups of 617%, 42% and 30%, respectively, over these three variant implementations.

e0a8e2b19631e41babdb3d3cd7b1ff2b.png

Figure 6: Performance comparison of the handwritten attention kernel. Note: "cuBLAS + zero padding" refers to zero padding applied to softmax.

Performance of long seqlen CUTLASS kernel

For seqlen between 448 and 1024, cuBLAS batched GEMM is 3x faster than PyTorch MHA, and softmax zero padding further improves performance by 17%. By introducing high-performance CUTLASS grouped GEMM and softmax fusion, ByteTransformer's fused MHA achieves performance improvements of 451%, 110% and 79% over these variant MHA implementations.

2fba6c0752271ea98e66f986f9996fc2.png

Figure 7: Long seqlen CUTLASS FMHA performance comparison

Comprehensive kernel fusion

In addition to optimizing matrix multiplication and attention, ByteTransformer applies comprehensive kernel fusion to smaller operations, squeezing out further performance by reducing memory access and kernel launch overhead.

add-bias & LayerNorm fusion

The add-bias and LayerNorm operations after matrix multiplication are fused in a handwritten kernel. These operations account for 10% and 6% of the latency at seqlen 256 and 1024, respectively. The fused kernel improves this part's performance by 61%, which translates into a 3.2% performance improvement for a single-layer BERT transformer (averaged over seqlen 128 to 1024).
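A simplified sketch of a fused add-bias + LayerNorm kernel is shown below (fp32, one block per token row, serial reductions for clarity; the actual hand-tuned kernel is far more optimized). It illustrates why fusion helps: the bias is added, the row is normalized, and the result is written back in a single pass, with no intermediate tensor round-tripping through global memory.

```cuda
// Fused add-bias + LayerNorm sketch (assumptions: fp32, one block per token row,
// in-place update, serial mean/variance reduction on thread 0 for readability).
__global__ void add_bias_layernorm(float* inout, const float* bias,
                                   const float* gamma, const float* beta,
                                   int hidden_size, float eps) {
    float* row = inout + (size_t)blockIdx.x * hidden_size;

    // 1) add bias and compute mean/variance of the biased row
    __shared__ float s_mean, s_var;
    if (threadIdx.x == 0) {
        float mean = 0.f;
        for (int i = 0; i < hidden_size; ++i) { row[i] += bias[i]; mean += row[i]; }
        mean /= hidden_size;
        float var = 0.f;
        for (int i = 0; i < hidden_size; ++i) { float d = row[i] - mean; var += d * d; }
        s_mean = mean;
        s_var  = var / hidden_size;
    }
    __syncthreads();

    // 2) normalize and apply the affine parameters, all threads cooperating
    float inv_std = rsqrtf(s_var + eps);
    for (int i = threadIdx.x; i < hidden_size; i += blockDim.x)
        row[i] = (row[i] - s_mean) * inv_std * gamma[i] + beta[i];
}
```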

GEMM & add-bias & GELU fusion

Through CUTLASS epilogue fusion, the add-bias operation after matrix multiplication and the GELU activation are fused into the matrix multiplication kernel. Add-bias and GELU account for 7% and 5% of the time at seqlen 256 and 1024, respectively. Fusing them into the GEMM fully hides this part's memory access latency, further improving single-layer transformer performance by 3.8%.
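For illustration, the element-wise work being fused into the epilogue amounts to a bias add followed by GELU (tanh approximation). It is shown here as standalone device functions; in ByteTransformer this logic runs inside the CUTLASS epilogue rather than as separate functions.

```cuda
// Element-wise epilogue sketch: bias add + tanh-approximation GELU applied to each
// GEMM output element before it is stored. Function names are illustrative.
__device__ __forceinline__ float gelu_tanh(float x) {
    const float k0 = 0.7978845608028654f;   // sqrt(2/pi)
    const float k1 = 0.044715f;
    return 0.5f * x * (1.f + tanhf(k0 * (x + k1 * x * x * x)));
}

__device__ __forceinline__ float epilogue_bias_gelu(float gemm_out, float bias) {
    return gelu_tanh(gemm_out + bias);
}
```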

Variant transformer support

Currently, the ByteDance AML team has open-sourced ByteTransformer's standard BERT implementation on GitHub (https://github.com/bytedance/ByteTransformer). In addition, the ByteDance-internal version supports many transformer variants, such as DeBERTa, RoFormer, T5, and so on. The code is easy to extend, and the optimization methods described above can also be applied to the variant transformers with little effort.

More performance data

End-to-end performance comparison with other transformer implementations

Experimental configuration: standard BERT transformer, head size = 64, head number = 12, 12 layers, average valid seqlen = 0.6 x maximum seqlen, tested on an A100 GPU. Performance is compared for seqlen = 64 to 1024 and batch_size = 1, 8, 16. ByteTransformer achieves average speedups of 87%, 131%, 138%, 74% and 55% over PyTorch JIT, TensorFlow XLA, Tencent TurboTransformer, Microsoft DeepSpeed-Inference and NVIDIA FasterTransformer, respectively.

  • Note: at the time of paper submission, PyTorch and NVIDIA FasterTransformer had not yet integrated FlashAttention [6]

a830c0b910bdcb12a566f56cd10c7951.png

Figure 8: End-to-end performance comparison of each transformer implementation

Breakdown of the impact of each optimization

Compared with ByteTransformer's own baseline version, turning on all optimizations improves overall performance by 60% relative to the baseline. The contribution of each optimization breaks down as follows:

  • add-bias & LayerNorm fusion improves performance by 3.2%

  • fusing add-bias & GELU into the GEMM epilogue adds a further 3.8%

  • the remove-padding algorithm improves performance by 24%

  • FMHA adds another 20%

2b97dcab91e0c7458fe7f3403c6ba805.png

Figure 9: Breakdown of the performance improvement from each optimization

Performance comparison of BERT-like variants

ByteTransformer is compared with state-of-the-art DL frameworks on the ALBERT, DistilBERT and DeBERTa model structures. The experimental configuration is the same as for standard BERT, with average seqlen = 0.6 x max seqlen.

For ALBERT and DistilBERT, ByteTransformer is on average 98%, 158%, 256%, 93%, and 53% faster than PyTorch, TensorFlow, Tencent TurboTransformer, DeepSpeed-Inference, and NVIDIA FasterTransformer, respectively. For the DeBERTa model, ByteTransformer is 44%, 243% and 74% faster than PyTorch, TensorFlow and DeepSpeed, respectively.

51c3dd75a6399931fdcbbced08528df8.png

Figure 10: Performance comparison on BERT-like variants. Some data points are missing because the corresponding framework does not support the model or fails to run.

Conclusion

ByteTransformer is an efficient transformer implementation that achieves high performance on the BERT transformer through a series of optimizations. For variable-length text input, it has clear advantages over other transformer implementations, with average speedups of more than 50% in the experiments. It is well suited to accelerating natural language processing tasks and improving the efficiency of model training and inference. ByteTransformer also provides other researchers with an efficient reference implementation; its optimization techniques and performance are of practical significance for real-world applications.

ByteDance AML Team Introduction

AML is ByteDance's machine learning center, providing recommendation, advertising, CV, speech and NLP training and inference systems for businesses such as Douyin, Toutiao and Xigua Video. It supplies powerful machine learning compute to the company's business units and researches general and innovative algorithms on these business problems. Some of its core machine learning and recommendation system capabilities are also offered to external enterprise customers through Volcano Engine.


References:

[1] J. Fang, Y. Yu, C. Zhao, and J. Zhou, “Turbotransformers: an efficient gpu serving system for transformer models,” in Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 389–402.

[2] R. Y. Aminabadi, S. Rajbhandari, M. Zhang, A. A. Awan, C. Li, D. Li, E. Zheng, J. Rasley, S. Smith, O. Ruwase et al., “Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale,” arXiv preprint arXiv:2207.00032, 2022.

[3] NVIDIA, "FasterTransformer," https://github.com/NVIDIA/FasterTransformer

[4] ByteDance, "effective_transformer," https://github.com/bytedance/effective_transformer

[5] NVIDIA, "CUTLASS," https://github.com/NVIDIA/cutlass

[6] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," arXiv preprint arXiv:2205.14135, 2022.
