Ascend CANN 7.0 Black Technology: Demystifying Large Model Inference Deployment

This article is shared from the Huawei Cloud Community article "Ascend CANN 7.0 Black Technology: Demystifying Large Model Inference Deployment" by Ascend CANN.

Recently, as generative AI and large models have entered the public eye, more and more people have realized that seizing this wave of AI means seizing the opportunity for future intelligent transformation. How to quickly deploy and use AI infrastructure, and how to improve inference performance, have gradually become the focus of many enterprises.

As the software layer closest to the Ascend AI hardware, CANN is co-designed with the hardware to build a software architecture tailored to Ascend AI processors, fully unleashing their computing power. For large model inference scenarios, the latest CANN 7.0 release organically integrates its internal components: it supports quantized compression, distributed partitioning and compilation, and distributed loading and deployment of large models, and it brings together basic acceleration libraries, graph compilation and optimization, and model execution scheduling to deliver deep performance optimization for large models.

Automatic parallel partitioning enables distributed deployment of large models:

Given the enormous compute and memory overhead of LLMs, CANN provides automatic parallel partitioning to enable distributed deployment of large models on an Ascend cluster. The automatic parallel partitioning process can be divided into five steps:

[Figure: the five steps of the automatic parallel partitioning process]

The automatic partitioning strategy takes the physical cluster information and the model structure as input, models the search space for load-partitioning optimization, and then searches for an optimized partitioning and deployment strategy through multiple rounds of strategy generation, strategy application, and performance simulation.
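The search loop can be pictured as a small optimization program. Below is a hypothetical, simplified Python sketch of the generate-apply-simulate iteration; the names (ClusterSpec, candidate_partitions, simulate_latency) and the toy cost model are illustrative assumptions, not CANN APIs.

# Illustrative sketch of the strategy search loop; not CANN's implementation.
from dataclasses import dataclass

@dataclass
class ClusterSpec:
    num_devices: int           # physical cluster information
    interconnect_gbps: float

def candidate_partitions(num_layers, cluster):
    # Strategy generation: enumerate tensor-parallel / pipeline-parallel splits.
    for tp in (1, 2, 4, 8):
        for pp in (1, 2, 4):
            if tp * pp <= cluster.num_devices and num_layers % pp == 0:
                yield {"tensor_parallel": tp, "pipeline_parallel": pp}

def simulate_latency(strategy, num_layers, cluster):
    # Performance simulation with a toy cost model: compute time shrinks with
    # the degree of parallelism while communication cost grows with it.
    parallelism = strategy["tensor_parallel"] * strategy["pipeline_parallel"]
    return num_layers / parallelism + 0.5 * parallelism / cluster.interconnect_gbps

def search_strategy(num_layers, cluster):
    # Strategy application and evaluation: keep the best candidate found.
    return min(candidate_partitions(num_layers, cluster),
               key=lambda s: simulate_latency(s, num_layers, cluster))

print(search_strategy(80, ClusterSpec(num_devices=8, interconnect_gbps=392.0)))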

The KV Cache mechanism reduces repeated inference computation:

LLM inference can be divided into two phases: prompt processing and the autoregressive computation of subsequent output tokens. The former involves matrix multiplication over large volumes of data and is a typical compute-intensive process, while the latter accumulates more and more dialogue content as generation proceeds and computes each new output token from the historical output. Taking "Pangu is a language model" as an example, after the text is fed in, each token produces its corresponding Q, K, and V vectors, which then go through matrix multiplication, softmax, and other computations in the attention block. In this process, the user prompt plus the tokens already generated must be fed in again as the input to the next iteration, and their Q, K, and V vectors recomputed, which leads to a large amount of repeated computation.

To address this, the industry proposed the KV Cache technique: the K and V vectors already computed for previous tokens are kept in memory, so only the Q, K, and V of the newest token need to be computed before the matrix multiplication and softmax are performed. In essence, it trades space for time.
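As a concrete illustration, here is a minimal PyTorch sketch of autoregressive decoding with a KV Cache. It only demonstrates the caching idea; it is not CANN's distributed implementation, and the single-head, batch-free shapes are simplifying assumptions.

# Minimal KV Cache sketch (single head, no batching) -- for illustration only.
import torch

def attention_step(x_new, w_q, w_k, w_v, cache):
    """Process only the newest token; reuse cached K/V of all previous tokens."""
    q = x_new @ w_q                     # Q for the new token only
    k_new = x_new @ w_k
    v_new = x_new @ w_v
    # Append the new K/V to the cache instead of recomputing them for history.
    cache["k"] = k_new if cache["k"] is None else torch.cat([cache["k"], k_new], dim=0)
    cache["v"] = v_new if cache["v"] is None else torch.cat([cache["v"], v_new], dim=0)
    scores = (q @ cache["k"].T) / cache["k"].shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ cache["v"]

d = 16
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
cache = {"k": None, "v": None}
for token_embedding in torch.randn(5, 1, d):    # 5 decoding steps
    out = attention_step(token_embedding, w_q, w_k, w_v, cache)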

[Figure: attention computation with the KV Cache]

At present, CANN fully supports the KV Cache and implements distributed storage, update, and reset of the cache, effectively accelerating computation in the autoregressive phase.

Quantization technology effectively reduces memory usage:

Quantization is a common technique in the AI field, and in the era of large models it has new characteristics and requirements. The weight distribution of an LLM is relatively uniform, while the activation (feature map) data contains many outliers. With traditional quantization algorithms, either discarding the outliers or stretching the quantization range to cover them all leads to a loss of accuracy. For this reason, CANN supports weight-only quantization; in the INT8 scenario this reduces the memory occupied by weights by 50% compared with FP16.
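The following is a minimal sketch of per-channel, symmetric weight-only INT8 quantization over FP16 weights. The scheme shown here is a generic one; the exact quantization algorithm CANN uses is not reproduced.

# Generic weight-only INT8 quantization sketch; activations stay in FP16.
import torch

def quantize_weight_int8(w_fp16: torch.Tensor):
    """Quantize an FP16 weight matrix to INT8 plus one scale per output channel."""
    scale = w_fp16.abs().amax(dim=1, keepdim=True) / 127.0
    w_int8 = torch.clamp((w_fp16 / scale).round(), -128, 127).to(torch.int8)
    return w_int8, scale

def linear_weight_only(x_fp16, w_int8, scale):
    """Dequantize weights on the fly; activations remain FP16."""
    return x_fp16 @ (w_int8.to(torch.float16) * scale).T

w = torch.randn(4096, 4096, dtype=torch.float16)
w_int8, scale = quantize_weight_int8(w)
# FP16 weights take 2 bytes/element, INT8 takes 1 byte/element: 50% smaller.
print(w.element_size() * w.nelement(), "->", w_int8.element_size() * w_int8.nelement())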

[Figure: weight-only quantization]

CANN also supports KV Cache quantization. Since the KV Cache trades space for time and grows linearly with the number of model layers and the sequence length, quantizing it can cut its storage in half.
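A similar idea applies to the cache itself. The short sketch below quantizes cached K/V tensors to INT8 with a single scale, halving the FP16 footprint; CANN's actual scheme (scale granularity, calibration) may differ, so treat this as an assumption-laden illustration.

# Halve the KV Cache footprint by storing it in INT8 instead of FP16.
import torch

def quantize_cache(kv_fp16: torch.Tensor):
    scale = kv_fp16.abs().amax() / 127.0                       # single per-tensor scale
    return torch.clamp((kv_fp16 / scale).round(), -128, 127).to(torch.int8), scale

def dequantize_cache(kv_int8: torch.Tensor, scale):
    return kv_int8.to(torch.float16) * scale                   # restore FP16 for attention

k_cache = torch.randn(32, 2048, 128, dtype=torch.float16)      # layers x seq x head_dim
k_q, k_scale = quantize_cache(k_cache)                         # 2 bytes/elem -> 1 byte/elem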

FlashAttention fusion operator reduces memory access overhead:

The Multi-Head Attention structure is used extensively in LLMs. It not only brings a huge amount of computation; the memory capacity required to hold its intermediate data is also a key bottleneck of the computing system. To address this, the industry proposed the FlashAttention fused operator. Its principle is to tile the attention computation and rewrite it into an equivalent form, so that the multiple steps of attention can be completed inside a single operator: across multiple loop iterations, each pass processes a small tile of data and accesses HBM in a near-streaming fashion, reducing the total volume of HBM traffic and better overlapping computation with data movement.

[Figure: FlashAttention tiled computation]

Source: https://arxiv.org/pdf/2205.14135.pdf

CANN tunes the FlashAttention fused operator for the HBM and cache sizes of the Ascend AI processor and for its data transfer paths, making full use of the on-chip cache to improve attention processing performance by up to 50%.
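To make the tiling idea concrete, here is a simplified, self-contained PyTorch sketch of attention computed block by block with an online softmax, in the spirit of FlashAttention. It is an illustration of the algorithm, not the CANN fused operator, and the block size and shapes are arbitrary assumptions.

# Tiled attention with an online softmax: only one K/V block is "resident" at a time.
import torch

def tiled_attention(q, k, v, block_size=128):
    """q: (n, d), k/v: (m, d)."""
    n, d = q.shape
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)
    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]             # load one K block
        vb = v[start:start + block_size]             # load one V block
        scores = (q @ kb.T) / d ** 0.5               # (n, block)
        block_max = scores.max(dim=1, keepdim=True).values
        new_max = torch.maximum(row_max, block_max)
        correction = torch.exp(row_max - new_max)    # rescale previous partial results
        p = torch.exp(scores - new_max)
        out = out * correction + p @ vb
        row_sum = row_sum * correction + p.sum(dim=1, keepdim=True)
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(256, 64) for _ in range(3))
reference = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), reference, atol=1e-4)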

Auto Batching scheduling improves computing power utilization:

LLM serving is compute-bound in the input (prefill) phase and memory-bound in the output (decode) phase, and it must also meet service latency requirements. To cope with this, CANN supports heterogeneous deployment of separate input and output computing clusters and auto-batching scheduling of LLM computing tasks, improving AI compute utilization. The principle is to aggregate different service requests as much as possible: in the input phase, it runs single-batch inference with several preset sequence lengths to minimize the per-request startup overhead; in the output phase, it schedules multiple requests at iteration granularity and batches them as much as possible, increasing computational density and balancing computation against memory access.
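The output-phase scheduling idea can be illustrated with a toy iteration-level (continuous) batching loop: requests join and leave the batch between decode iterations instead of waiting for the whole batch to finish. The Request/serve code below is purely illustrative and does not reflect CANN's actual scheduler.

# Toy iteration-level batching loop; prefill and the real model call are stubbed out.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

def decode_one_iteration(batch):
    """Stand-in for one batched forward step that emits one token per request."""
    for r in batch:
        r.generated += 1

def serve(pending: deque, max_batch: int = 8):
    active = []
    while pending or active:
        # Admit new requests up to the batch limit.
        while pending and len(active) < max_batch:
            active.append(pending.popleft())
        decode_one_iteration(active)      # one token for every active request
        # Retire finished requests immediately so new ones can take their slot.
        active = [r for r in active if r.generated < r.max_new_tokens]

requests = deque(Request(i, prompt_len=32, max_new_tokens=4 + i) for i in range(20))
serve(requests)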

Support for torch.compile computation graphs improves programming efficiency:

To make it easier for developers to run LLM inference on the Ascend platform, CANN implements computation graph support for PyTorch. Developers only need to call PyTorch's native torch.compile interface; the NPU backend enabled by CANN takes over the FX Graph generated by PyTorch, converts Aten IR to AIR following the trace logic, and then performs end-to-end graph compilation and deep optimization. This reduces memory requirements in the inference phase, improves computing performance, and minimizes the modifications developers have to make.

[Figure: the torch.compile workflow]

Source: https://pytorch.org/get-started/pytorch-2.0/
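A minimal usage sketch is shown below. It assumes the torch_npu adapter is installed and registers a compile backend for the NPU; the backend string "npu" and the tiny stand-in model are assumptions that may need adjusting for your CANN / torch_npu version.

# Sketch of torch.compile on Ascend; requires NPU hardware and the torch_npu plugin.
import torch
import torch.nn as nn
import torch_npu  # Ascend PyTorch adapter (assumed to be installed)

# Tiny stand-in model; replace with your actual LLM.
model = nn.Sequential(nn.Embedding(32000, 512), nn.Linear(512, 32000)).to("npu")

# Hand the FX Graph over to the NPU backend; the backend name is an assumption.
compiled_model = torch.compile(model, backend="npu")

with torch.no_grad():
    input_ids = torch.randint(0, 32000, (1, 128), device="npu")
    logits = compiled_model(input_ids)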

Here is a getting-started example of CANN large model inference. In the compilation stage, the ATC tool is used to compile the pb or onnx model. The command-line parameters are similar to those for classic AI models such as CV models, except that cluster information and partitioning information are added as inputs: turn on the cluster switch and the parallel partitioning switch, and pass in both the cluster configuration file and the partitioning configuration file. ATC then automatically performs model partitioning and inserts communication operators during compilation.

atc --model=./matmul2.pb \
    --soc_version=Ascend910 \
    --output=test910_parallel \
    --distributed_cluster_build=1 \
    --cluster_config=./numa_config_910_2p.json \
    --enable_graph_parallel="1" \
    --graph_parallel_option_path=./parallel_option.json


In the execution phase, the OM offline model is loaded through the LoadGraph interface. CANN loads each model slice onto the corresponding Ascend AI processor, after which inference is performed through the existing RunGraph interface.

With optimizations such as compute/communication parallelism, graph optimization, and operator tuning, the inference performance of LLAMA 65B more than doubles compared with the unoptimized baseline, and end-to-end latency reaches about 8 seconds, with room for further improvement.

[Figure: LLAMA 65B inference performance before and after optimization]

All in all, as large model technology keeps changing rapidly and iterating continuously, Ascend CANN will keep digging into large model optimization and acceleration: continuing to explore scheduling optimization for online services to shorten service latency; improving memory access performance through weight prefetching and graph-based cache residency optimization; keeping pace with the industry's latest fused operators such as FlashAttention to improve computing performance; and supporting richer quantized computation combinations and model sparsification to reduce memory usage. As large models move into large-scale commercial deployment, the Ascend AI software and hardware platform, with CANN at its core, will continue to strengthen its competitiveness in large model inference deployment scenarios and provide customers with the best choice!

Click to follow and learn about Huawei Cloud’s new technologies as soon as possible~


Reprinted from: my.oschina.net/u/4526289/blog/10142003