PAI BladeLLM inference engine: ultra-long context, higher performance

BladeLLM is a large model inference engine provided by the Alibaba Cloud PAI platform, designed to let users easily deploy high-performance, low-cost large language model services. BladeLLM applies deep performance and engineering optimizations across the entire LLM inference and serving pipeline, so that different models achieve the best cost-effectiveness on different devices.

Beyond aggressive performance optimization at conventional context lengths, BladeLLM also breaks through the context length limits of existing LLM inference systems, supporting longer input and generation lengths and unlocking more application scenarios for LLMs. Moreover, BladeLLM maintains top performance at ultra-long context lengths and shows significant performance advantages over other LLM inference and serving systems.

This article introduces BladeLLM's advantages with ultra-long contexts, including the maximum supported context length and inference performance at those lengths.

Background

Ultra-long context is an inevitable trend in the development of LLMs

Ultra-long context reasoning is one of the important emergent capabilities of LLMs. It has spawned a series of application scenarios with huge potential value, including personalized chatbots (Character.AI), literary creation tools (Jasper), and article summarization tools (ChatPaper). A personalized chatbot interacts with users continuously, helping with work, emotions, study, and more; the LLM memorizes the full chat history, so the model input grows with each interaction and forms an ultra-long input sequence after many rounds. Literary creation tools use LLMs to generate long texts in batches, such as novels, stories, and scripts; compared with the traditional manual process, they can produce large amounts of background, plot, and dialogue in a short time, greatly improving the efficiency of writers and screenwriters while giving readers richer and more diverse material. Ultra-long context reasoning is widely considered a necessary step on the road to AGI. Its significance is mainly reflected in the following aspects:

  1. Unlocking more application scenarios: support for ultra-long text generation allows LLMs to be applied to more scenarios, such as personalized chatbots and the generation of novels, technical documents, and academic papers, which usually require longer generated text.
  2. Generating more contextually coherent text: the goal of an LLM is to generate natural-language text relevant to the given context. If the generation length is tightly limited, the output may lose coherence with the context, hurting quality. Supporting ultra-long generation lets the LLM better preserve the full context, making the generated text more coherent and of higher quality.
  3. Improving the diversity of generated text: longer generated sequences provide more room to explore different textual possibilities. With ultra-long generation, the LLM can better capture subtle changes in context and produce more diverse and richer content.

As these application scenarios spread, models supporting ultra-long contexts keep emerging, including MPT StoryWriter with 84K context, Claude 2 with 200K context, LongLLaMA with 256K context, and so on (see the figure below). At the system level, although some frameworks (such as DeepSpeed) already support and optimize for ultra-long context, they focus on the training phase; in the inference phase, mainstream frameworks either fail to run ultra-long inputs and outputs at all or run them inefficiently. In short, ultra-long input and output pose new challenges for large model inference engines.

The challenges of ultra-long contexts

First, existing LLM inference engines struggle to meet the needs of large models processing ultra-long context: the configuration of memory resources and the design of compute kernels in these systems greatly limit the maximum input and output lengths, so large-scale context support requires more efficient storage and computation strategies. In addition, longer context dramatically increases inference time, raising costs and degrading user experience; this problem is especially obvious in existing LLM inference engines. The main cause of the increase is the attention mechanism of the LLM, which computes the relative importance between each token and all other tokens. As the context length grows, attention must process more tokens and takes longer to compute, so a faster and more efficient attention implementation is key to accelerating ultra-long text generation with LLMs.
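To make the scaling concrete, here is a rough back-of-envelope sketch (not taken from BladeLLM; the model dimensions are assumptions roughly matching a Llama2-13B-class model) of how the attention work per generated token and the KV-cache footprint both grow linearly with the context length:

```python
# Back-of-envelope estimate of how per-token attention work and KV-cache size
# grow with context length. The model dimensions below are assumptions roughly
# matching a Llama2-13B-class model (40 layers, 40 heads, head_dim 128), not
# numbers taken from BladeLLM.

def decode_attention_flops(context_len: int, num_layers: int = 40,
                           num_heads: int = 40, head_dim: int = 128) -> float:
    """Approximate FLOPs spent in attention to generate ONE new token."""
    hidden = num_heads * head_dim
    # Per layer: Q.K^T over `context_len` cached keys plus softmax(.)V,
    # each roughly 2 * hidden * context_len FLOPs.
    return num_layers * 2 * (2 * hidden * context_len)

def kv_cache_bytes(context_len: int, num_layers: int = 40, num_heads: int = 40,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """fp16 KV-cache footprint of a single sequence."""
    # K and V tensors per layer, each of shape [context_len, num_heads, head_dim].
    return 2 * num_layers * context_len * num_heads * head_dim * bytes_per_elem

for ctx in (1_000, 34_000, 70_000):
    print(f"ctx={ctx:>6}: ~{decode_attention_flops(ctx) / 1e9:6.1f} GFLOPs/token, "
          f"KV cache ~{kv_cache_bytes(ctx) / 2**30:5.1f} GiB")
```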

Taking the HuggingFace Llama2-13B model as an example, the time to generate a token increases significantly as the context length grows; the trend is shown in the figure below. At a context length of 34K, the HuggingFace open-source model takes 3.5 times as long to generate a token as at a context length of 1K.
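For reference, below is a minimal sketch of how such a per-token latency curve might be measured with the Hugging Face transformers API. The model id, the dummy prompt, and the context lengths are illustrative assumptions (not the article's benchmark script), and running at 34K context also requires a long-context setup (e.g., RoPE scaling) and enough GPU memory:

```python
# Sketch of a per-token latency measurement with Hugging Face transformers.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def ms_per_token(context_len: int, new_tokens: int = 32) -> float:
    # Dummy prompt of `context_len` tokens; a real benchmark would use task inputs.
    input_ids = torch.full((1, context_len), tokenizer.eos_token_id,
                           dtype=torch.long, device=model.device)

    def run(n: int) -> float:
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(input_ids, max_new_tokens=n,
                       min_new_tokens=n, do_sample=False)
        torch.cuda.synchronize()
        return time.perf_counter() - start

    # Subtract a 1-token run so the prompt prefill cost is (mostly) excluded
    # and the result approximates the per-decode-token latency.
    return (run(new_tokens) - run(1)) * 1000 / (new_tokens - 1)

for ctx in (1_000, 8_000, 16_000, 34_000):
    print(f"context {ctx:>6}: {ms_per_token(ctx):.1f} ms/token")
```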

Technical solutions

The figure below shows the technical architecture of the BladeLLM inference engine, which contains many core components. This article focuses on two of them: RaggedAttention and the DNN-based AutoTuner.

RaggedAttention

Recently, two influential works on Transformer multi-head attention computation, FlashAttention and PagedAttention, have had a profound impact on the design paradigms of LLM training and inference systems.

PagedAttention is inspired by virtual memory and paging in operating systems: it stores logically contiguous keys and values in non-contiguous GPU memory. PagedAttention partitions the KV cache of each sequence into blocks, each holding the keys and values of a fixed number of tokens. Because these blocks need not be contiguous in GPU memory, memory fragmentation is greatly reduced and there is no need to reserve a large amount of memory for each sequence in advance, so precious GPU memory is fully utilized. This extreme memory utilization, combined with Contiguous Batching, greatly improves the throughput of LLM inference services. The corresponding downside is that non-contiguous memory blocks reduce kernel memory-access efficiency to some extent, which affects performance.
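The following is an illustrative simplification (not vLLM source code) of the block-table idea behind PagedAttention: the KV cache is carved into fixed-size blocks drawn from a shared free pool, so a sequence's blocks need not be physically contiguous.

```python
# Illustrative sketch of block-based KV-cache management; the block size and
# data structures are assumptions, not vLLM's actual implementation.

BLOCK_SIZE = 16  # tokens stored per physical block (an assumed value)

class PagedKVAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # ids of free physical blocks
        self.block_tables = {}                      # seq_id -> list of block ids
        self.seq_lens = {}                          # seq_id -> tokens stored so far

    def append_token(self, seq_id: int) -> None:
        """Reserve KV-cache space for one more token of `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        used = self.seq_lens.get(seq_id, 0)
        if used % BLOCK_SIZE == 0:                  # current block full, or first token
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())    # any free block will do: no contiguity
        self.seq_lens[seq_id] = used + 1

    def release(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```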

BladeLLM's self-developed RaggedAttention, built during the same period, addresses a problem similar to PagedAttention's, but the implementations differ: the two make different tradeoffs between kernel performance and GPU memory utilization.

The name RaggedAttention is inspired by RaggedTensor in the TensorFlow framework. "Ragged" means irregular: the KV cache in RaggedAttention is not a regular tensor, but allows each sequence to have a different length, so it can cooperate efficiently with Contiguous Batching to improve system throughput. Unlike PagedAttention, however, RaggedAttention keeps the key and value cache of the same sequence stored contiguously, which improves the kernel's memory-access efficiency and therefore performance. Correspondingly, contiguous storage introduces some memory fragmentation and memory-reservation overhead, which affects memory utilization. This is a typical engineering tradeoff with no single right answer: different compute-to-memory ratios, different input and output lengths, and even different latency requirements shift the system bottleneck. As an AI platform, BladeLLM is committed to automatically finding the most suitable configuration for different models, devices, workloads, and business scenarios.
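As a contrast to the paged layout above, here is a minimal sketch of the ragged, per-sequence-contiguous idea as I read the description (assumed shapes and growth policy; not BladeLLM source code): each sequence's K/V live in one contiguous buffer, sequences may have different lengths, and the reserved headroom is exactly the fragmentation/reservation cost mentioned in the text.

```python
# Illustrative ragged KV cache: contiguous per sequence, variable length per batch.
import torch

class RaggedKVCache:
    def __init__(self, num_heads: int, head_dim: int,
                 dtype=torch.float16, device="cpu"):
        self.num_heads, self.head_dim = num_heads, head_dim
        self.dtype, self.device = dtype, device
        self.k, self.v = {}, {}   # seq_id -> contiguous [capacity, num_heads, head_dim]
        self.lens = {}            # seq_id -> tokens currently stored

    def append(self, seq_id: int, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
        """Append one token's K/V, growing the contiguous buffer when it fills up."""
        n = self.lens.get(seq_id, 0)
        if seq_id not in self.k or n == self.k[seq_id].shape[0]:
            capacity = max(256, 2 * n)  # contiguous over-reservation (memory/perf tradeoff)
            new_k = torch.empty(capacity, self.num_heads, self.head_dim,
                                dtype=self.dtype, device=self.device)
            new_v = torch.empty_like(new_k)
            if n:
                new_k[:n].copy_(self.k[seq_id][:n])
                new_v[:n].copy_(self.v[seq_id][:n])
            self.k[seq_id], self.v[seq_id] = new_k, new_v
        self.k[seq_id][n].copy_(k_new)
        self.v[seq_id][n].copy_(v_new)
        self.lens[seq_id] = n + 1
```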

For example, when context lengths vary widely, RaggedAttention, with the help of the AutoTuner introduced in the next section, maintains efficient computation and memory access at different context lengths. In our measurements it sustains top performance as the context length varies from 1 to 512,000.

DNN-based AutoTuner

LLM inference is a typical strong dynamic-shape scenario: not only does the batch size change dynamically, the sequence length changes even more dramatically. One of the main ways to pursue peak kernel performance under dynamic shapes is to tune based on the actual runtime sizes, i.e., for each concrete set of input sizes, the best schedule is selected by actually running and measuring candidates; works using this approach include AutoTVM and Ansor. Although this approach can achieve peak performance, it suffers from high tuning overhead, and in particular the tuning results only apply to a specific shape, which is very unfriendly to dynamic-shape scenarios: tuning all possible shapes offline in advance requires enormous time and compute, while tuning each new shape online in real time causes serious disturbance to online performance.
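A schematic sketch of this measurement-based tuning loop is shown below (the schedule callables are placeholders, not a real AutoTVM/Ansor API). Note that the cached result is keyed by the exact shape, which is why dynamic shapes force either an enormous offline sweep or disruptive online tuning.

```python
# Measurement-based tuning sketch: run every candidate schedule for one concrete
# shape, keep the fastest; the result is valid only for that exact shape.
import time
from typing import Callable, Dict, Tuple

Shape = Tuple[int, int]                      # (batch_size, seq_len)
_best_schedule: Dict[Shape, str] = {}        # tuning results, per exact shape only

def tune_for_shape(shape: Shape,
                   candidates: Dict[str, Callable[[Shape], None]],
                   repeats: int = 10) -> str:
    """Pick the fastest candidate schedule for one concrete shape by running it."""
    if shape in _best_schedule:
        return _best_schedule[shape]
    timings = {}
    for name, run_kernel in candidates.items():
        start = time.perf_counter()
        for _ in range(repeats):
            run_kernel(shape)                # a real system launches the compiled kernel
        timings[name] = (time.perf_counter() - start) / repeats
    best = min(timings, key=timings.get)
    _best_schedule[shape] = best
    return best
```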

To address these pain points, BladeLLM uses a DNN-based AutoTuner that relies entirely on the predictions of a DNN model, with no actual measurement runs, to select the best schedule. We have explored and iterated extensively on training-data collection, model structure, feature extraction, and loss-function design to continuously improve the DNN model's prediction accuracy. Currently, for GPU compute-intensive operators, the average performance achieved with the DNN-based AutoTuner reaches 99.39% of that achieved by tuning based on actual measurement.
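Conceptually, a DNN-based cost model replaces the measurement loop with model forwards. The sketch below is an assumed toy architecture (feature dimensions, MLP layout, and the latency target are illustrative, not the actual BladeLLM AutoTuner): the predictor scores every candidate schedule for the current shape in one batched forward pass, and the predicted-fastest schedule is chosen without any real runs.

```python
# Toy DNN cost model: predict kernel latency from shape + schedule features.
import torch
import torch.nn as nn

class SchedulePredictor(nn.Module):
    def __init__(self, feature_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),            # predicted latency (e.g., log-microseconds)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

def pick_schedule(model: SchedulePredictor,
                  shape_feat: torch.Tensor,              # [shape_feature_dim]
                  schedule_feats: torch.Tensor) -> int:  # [num_candidates, sched_feature_dim]
    # Pair the (broadcast) shape features with every candidate's schedule features.
    x = torch.cat([shape_feat.expand(schedule_feats.shape[0], -1),
                   schedule_feats], dim=-1)
    with torch.no_grad():
        predicted_latency = model(x)
    return int(predicted_latency.argmin())   # index of the predicted-fastest schedule

# Toy usage: 4 shape features + 12 schedule features = 16 model inputs, 8 candidates.
predictor = SchedulePredictor(feature_dim=16)
best_idx = pick_schedule(predictor, torch.randn(4), torch.randn(8, 12))
```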

With prediction accuracy solved, reducing the runtime and compute resources consumed by the DNN prediction model becomes the key to applying this technique in latency-sensitive online inference. Building the prediction model directly with existing frameworks and engines (such as PyTorch, TorchScript, or ONNX Runtime) cannot meet the service's real-time requirements. Through joint model-system optimization, we reduced the AutoTuner's DNN prediction latency to 2 us: compared with models built with PyTorch, TorchScript, and ONNX Runtime, the prediction model is 36x, 19.5x, and 4.3x faster respectively (see the figure below), and the prediction process consumes very few system resources, using only one CPU core and no GPU resources, so it does not interfere with the GPU performance of the served model itself. Thanks to microsecond-level prediction latency and over 99% prediction accuracy, AutoTuner is used not only in LLM online inference services but also successfully serves other dynamic-shape business scenarios, including search, recommendation, and advertising, speech recognition, and Stable Diffusion.

Comparative Results

Using the maximum text generation length and the corresponding generation time as examples, we compare the maximum supported context length and performance of different LLM inference systems. The results are as follows:

  • LMDeploy (based on FasterTransformer) hangs once the generated length exceeds 10K
  • vLLM hits an illegal address error once the generated length exceeds 12K
  • HuggingFace's original Llama model runs out of memory (OOM) once the generated length exceeds 34K
  • LightLLM's maximum generation length (67K) is close to BladeLLM's (70K), but it takes 3 times as long as BladeLLM

Note: For a fair comparison, all the results above were measured with fp16 weights and an fp16 KV cache. BladeLLM now supports KV-cache quantization, which further raises the maximum supported context length on a single card to 280K. None of the measurements above use speculative sampling, and all were completed in August. LLM inference engines across the industry are still evolving rapidly, and we look forward to updated comparisons; meanwhile, a new version of BladeLLM supporting even longer context and higher performance is nearing completion, and we will share new results as they become available.

Summary

Ultra-long context is an inevitable trend in the development of LLMs, yet the context lengths supported by mainstream LLM inference and serving engines, and their inference performance at ultra-long context, are far from sufficient. The above shares some of BladeLLM's work on ultra-long context support and ultra-long context inference performance; everyone is welcome to communicate and discuss. Beyond ultra-long context scenarios, BladeLLM will continue to focus on multiple inference technology directions, including low-bit quantization and compression, multi-round dialogue, extreme kernel optimization, and compilation optimization. We will share more of these technologies publicly in the future; welcome to stay tuned!
