Breaking! Microsoft's new work LongNet: Scaling Transformers to 1 billion tokens


Reprinted from: Heart of the Machine

Sequence length has been extended to 1 billion tokens. Could the entire Internet one day be processed as a single sequence?

As everyone keeps upgrading and iterating their own large models, the context window an LLM (Large Language Model) can handle has become an important evaluation metric.

For example, the star model GPT-4 supports 32k tokens, equivalent to about 50 pages of text; Anthropic, founded by former OpenAI members, has pushed Claude's context to 100k tokens, about 75,000 words, roughly enough to summarize the first Harry Potter book in one pass.

In a recent study, Microsoft extended the Transformer directly to 1 billion tokens. This opens up new possibilities for modeling very long sequences, such as treating an entire corpus, or even the entire Internet, as one sequence.

For comparison, an average person can read 100,000 tokens in about 5 hours, and may take even longer to digest, memorize, and analyze that information; Claude does it in under a minute. Scaled up to the sequence lengths in Microsoft's work, the numbers are staggering.


  • Paper address: https://arxiv.org/pdf/2307.02486.pdf

  • Project address: https://github.com/microsoft/unilm/tree/master

Specifically, the study proposes LONGNET, a Transformer variant that can scale sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences. The paper also proposes dilated attention, whose attentive field expands exponentially as the distance between tokens grows.

LONGNET has the following advantages:

1) It has linear computational complexity;

2) It can serve as a distributed trainer for extremely long sequences;

3) Dilated attention is a drop-in replacement for standard attention and integrates seamlessly with existing Transformer-based optimization methods.

Experimental results show that LONGNET exhibits strong performance on both long sequence modeling and general language tasks.

On the research motivation, the paper notes that scaling neural networks has become a trend in recent years, and many well-performing networks have been studied. Ideally, sequence length, as one axis of a neural network, would be unlimited; in reality it is usually the opposite, so breaking the sequence-length limit brings significant advantages:

  • First, it provides the model with a large memory capacity and receptive field, enabling it to interact effectively with humans and the world.

  • Second, longer contexts contain more complex causal relationships and reasoning paths in the training data that models can exploit. In contrast, shorter dependencies introduce more spurious correlations, which hurts generalization.

  • Third, longer sequence lengths can help models explore longer contexts, and extremely long contexts can also help models mitigate catastrophic forgetting.

However, the main challenge in extending sequence length is finding the right balance between computational complexity and model expressive power.

For example, RNN-style models have been used to increase sequence length, but their sequential nature limits parallelization during training, which is crucial for modeling long sequences.

More recently, state-space models have become attractive for sequence modeling: they can run as CNNs during training and convert to efficient RNNs at test time. However, such models do not perform as well as Transformers at regular lengths.

Another way to extend sequence length is to reduce the Transformer's complexity, namely the quadratic complexity of self-attention. Several efficient Transformer variants have been proposed, including low-rank attention, kernel-based methods, downsampling methods, and retrieval-based methods. However, none of these has scaled the Transformer to 1 billion tokens (see Figure 1).

[Figure 1: trend of Transformer sequence lengths over time]

The following table compares the computational complexity of different methods, where N is the sequence length and d is the hidden dimension.

Recurrent: O(Nd^2)
Vanilla attention: O(N^2 d)
Sparse attention: O(N√N d)
Dilated attention (LONGNET): O(Nd)
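To get a feel for the gap, here is a small back-of-envelope script (an illustration written for this post, not taken from the paper) comparing the leading terms N^2 d and Nd at a few sequence lengths:

```python
# Rough comparison of the leading cost terms from the table above.
# Order-of-magnitude illustration only, not measured FLOPs.

def vanilla_attention_cost(n, d):
    """O(N^2 d): every token attends to every other token."""
    return n * n * d

def dilated_attention_cost(n, d):
    """O(N d): cost grows linearly with sequence length."""
    return n * d

d = 4096  # hidden dimension (illustrative value)
for n in [32_000, 1_000_000, 1_000_000_000]:
    ratio = vanilla_attention_cost(n, d) / dilated_attention_cost(n, d)
    print(f"N = {n:>13,}: quadratic cost is about {ratio:,.0f}x the linear cost")
```

At 1 billion tokens the quadratic term is a billion times larger than the linear one, which is why reducing attention complexity is the crux of the method.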

Method

The study's solution, LONGNET, successfully extends the sequence length to 1 billion tokens. Specifically, it proposes a new component called dilated attention and uses it to replace the attention mechanism of the vanilla Transformer. The general design principle is that the attention allocation decreases exponentially as the distance between tokens grows. The study shows that this design achieves linear computational complexity and a logarithmic dependency between tokens, resolving the tension between limited attention resources and access to every token.

[Figure: building blocks of dilated attention]
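Following that design principle, a single dilated-attention pattern can be sketched as below. This is a minimal, non-causal sketch written for this post: the names dilated_attention, segment_length, and dilation_rate are illustrative, and the actual LONGNET implementation mixes several segment-length/dilation pairs, applies a causal mask, weights the patterns when combining them, and processes segments in parallel rather than in a Python loop.

```python
import torch

def dilated_attention(q, k, v, segment_length, dilation_rate):
    """Sketch of one dilated-attention pattern: split the sequence into
    segments, keep every r-th position inside each segment, run standard
    attention on the sparsified segment, then scatter the results back.
    q, k, v: (batch, seq_len, dim); seq_len assumed divisible by segment_length.
    """
    _, n, d = q.shape
    w, r = segment_length, dilation_rate
    out = torch.zeros_like(q)

    for start in range(0, n, w):                  # one segment at a time
        idx = torch.arange(start, start + w, r)   # dilated positions in this segment
        qs, ks, vs = q[:, idx], k[:, idx], v[:, idx]
        attn = torch.softmax(qs @ ks.transpose(-2, -1) / d ** 0.5, dim=-1)
        out[:, idx] = attn @ vs                   # scatter back to original positions
    return out
```

Because each position only attends within its sparsified segment, the total cost grows with the sequence length rather than with its square.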

In implementation, LONGNET can be converted into a dense Transformer, so it seamlessly supports existing optimizations for Transformers (such as kernel fusion, quantization, and distributed training). Taking advantage of the linear complexity, LONGNET can be trained in parallel across nodes, using distributed algorithms to break the compute and memory constraints.

In the end, the study efficiently scales the sequence length to 1B tokens with almost constant runtime, as shown in the figure below, whereas the runtime of the vanilla Transformer suffers from quadratic complexity.

[Figure: runtime of dilated attention vs. vanilla attention as the sequence length grows]

The study further introduces a multi-head dilated attention mechanism. As shown in Figure 3 below, the computation differs across heads because different parts of the query-key-value pairs are sparsified.

[Figure 3: dilated attention with multiple heads]
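A minimal sketch of the idea, written for this post: each head gets a different offset into the dilated pattern (here simply head % dilation_rate, which follows the spirit of the paper's shifting scheme but may differ in detail), so together the heads cover complementary subsets of positions.

```python
import torch

def multihead_dilated_indices(segment_length, dilation_rate, num_heads):
    """Sketch: give each head a different offset into the dilated pattern of a
    single segment, so the heads jointly cover different subsets of positions."""
    indices = []
    for head in range(num_heads):
        offset = head % dilation_rate              # heads cycle through offsets
        idx = torch.arange(offset, segment_length, dilation_rate)
        indices.append(idx)
    return indices

# Example: a 16-token segment, dilation 4, 4 heads -> disjoint position sets.
for h, idx in enumerate(multihead_dilated_indices(16, 4, 4)):
    print(f"head {h}: {idx.tolist()}")
```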

Distributed training

Although the computational complexity of dilated attention has been greatly reduced to O(Nd), it is still infeasible to scale the sequence length to the millions on a single GPU because of compute and memory constraints. There are distributed training algorithms for large-scale models, such as model parallelism [SPP+19], sequence parallelism [LXLY21, KCL+22], and pipeline parallelism [HCB+19], but these are not sufficient for LONGNET, especially when the sequence dimension is extremely large.

This research exploits LONGNET's linear computational complexity to distribute training along the sequence dimension. Figure 4 below shows the distributed algorithm on two GPUs, which can be further extended to any number of devices.

[Figure 4: distributed training of LONGNET across two GPU devices]
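Conceptually, the algorithm splits the input along the sequence dimension, keeps queries local to each device, and only communicates keys and values when an attention segment spans more than one device. The sketch below illustrates this with torch.distributed; it is a conceptual sketch written for this post, assumes an already-initialized process group, and glosses over the fact that the paper gathers the already-sparsified keys/values, which is what keeps the communication cost constant as more devices are added.

```python
import torch
import torch.distributed as dist

def sequence_parallel_attention(q_local, k_local, v_local, segment_length):
    """Conceptual sketch of sequence parallelism: each rank holds a contiguous
    chunk of the sequence. If the attention segment fits inside the local chunk,
    attention is purely local; otherwise keys/values are all-gathered across
    ranks while queries (and outputs) stay local.
    Shapes: (batch, local_len, dim)."""
    _, local_len, d = q_local.shape

    if segment_length <= local_len:
        k, v = k_local, v_local                       # no communication needed
    else:
        world = dist.get_world_size()
        k_list = [torch.empty_like(k_local) for _ in range(world)]
        v_list = [torch.empty_like(v_local) for _ in range(world)]
        dist.all_gather(k_list, k_local)              # collect K from every rank
        dist.all_gather(v_list, v_local)              # collect V from every rank
        k, v = torch.cat(k_list, dim=1), torch.cat(v_list, dim=1)

    attn = torch.softmax(q_local @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v                                   # output stays on this rank
```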

Experiments

The study compares LONGNET with the vanilla Transformer and the sparse Transformer. The architectures differ only in the attention layer; the other layers are the same. The researchers scale the sequence length of these models from 2K to 32K while reducing the batch size to keep the number of tokens per batch constant.
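As a concrete illustration of that batching rule (the 2K-token, batch-size-256 baseline below is an assumed example, not a number from the paper):

```python
# Keep the number of tokens per batch fixed while the sequence length grows,
# so every setting sees the same amount of data per optimization step.
tokens_per_batch = 2_048 * 256   # assumed baseline: 2K-token sequences, batch size 256
for seq_len in [2_048, 8_192, 32_768]:
    batch_size = tokens_per_batch // seq_len
    print(f"sequence length {seq_len:>6}: batch size {batch_size}")
```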

Table 2 summarizes the results of these models on the Stack dataset, using perplexity as the evaluation metric. The models are tested with sequence lengths ranging from 2k to 32k. When the input length exceeds the maximum length the model supports, the study applies blockwise causal attention (BCA) [SDP+22], a state-of-the-art extrapolation method for language model inference.

In addition, the study removes absolute position encoding. First, the results show that increasing the sequence length during training generally leads to a better language model. Second, extrapolating the sequence length at inference does not work when the length is much larger than what the model supports. Finally, LONGNET consistently outperforms the baselines, demonstrating its effectiveness in language modeling.

[Table 2: perplexity of LONGNET and baselines on the Stack dataset at different sequence lengths]
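As a rough illustration of this kind of evaluation (a simplification written for this post, not the exact BCA procedure of [SDP+22]), the sketch below scores a long token sequence block by block against a hypothetical model(x) that returns next-token logits:

```python
import math
import torch

@torch.no_grad()
def blockwise_perplexity(model, token_ids, max_len):
    """Score a 1-D tensor of token ids in blocks of at most `max_len` inputs,
    so sequences longer than the model's supported length can still be
    evaluated. `model(x)` is assumed to return logits of shape (1, len, vocab)."""
    total_nll, total_tokens = 0.0, 0
    for start in range(0, token_ids.size(0) - 1, max_len):
        block = token_ids[start:start + max_len + 1].unsqueeze(0)  # inputs + shifted targets
        logits = model(block[:, :-1])
        log_probs = torch.log_softmax(logits, dim=-1)
        targets = block[:, 1:]
        nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        total_nll += nll.sum().item()
        total_tokens += targets.numel()
    return math.exp(total_nll / total_tokens)
```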

Sequence length scaling curves

Figure 6 plots the sequence-length scaling curves of the vanilla Transformer and LONGNET. Compute is estimated by counting the total FLOPs of matrix multiplications. The results show that both the vanilla Transformer and LONGNET benefit from a larger context length during training, but LONGNET scales the context length more efficiently, reaching a lower test loss with less compute. This demonstrates the advantage of longer training inputs over extrapolation and shows that LONGNET is a more effective way to extend the context length of language models, because it learns longer dependencies more efficiently.

[Figure 6: sequence length scaling curves of the vanilla Transformer and LONGNET]

Scaling up model size

An important property of large language models is that the loss scales as a power law as compute increases. To verify whether LONGNET still follows a similar scaling law, the study trains a series of models with different sizes, from 125 million to 2.7 billion parameters. The 2.7 billion model is trained with 300B tokens, while the rest use about 400B tokens. Figure 7(a) plots LONGNET's scaling curve with respect to compute, with perplexity measured on the same test set. This shows that LONGNET still follows the power law, which means a dense Transformer is not a prerequisite for scaling language models, and that LONGNET achieves both scalability and efficiency.

[Figure 7: (a) scaling curve of LONGNET with respect to compute; (b) test loss as the context window in the prompt grows]
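As an illustration of what "following a power law" means here, the sketch below fits L(C) = a · C^(-b) in log-log space; the (compute, loss) pairs are made-up placeholders, not values from the paper.

```python
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs (made-up values)
loss = np.array([3.2, 2.9, 2.6, 2.35])          # test loss (made-up values)

# A power law is a straight line in log-log space: log L = log a - b * log C.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope
print(f"fitted power law: L(C) ~ {a:.2f} * C^(-{b:.3f})")
```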

Long-context prompting

Prompting is an important way to guide a language model and provide it with additional information. This study experimentally verifies whether LONGNET benefits from a longer context window in the prompt.

The study reserves a prefix as the prompt and tests the perplexity of the suffix, gradually extending the prompt from 2K to 32K. For a fair comparison, the length of the suffix is kept constant while the length of the prefix is increased up to the model's maximum length. Figure 7(b) reports the results on the test set: the test loss of LONGNET gradually decreases as the context window grows, demonstrating LONGNET's ability to make full use of long context to improve a language model.
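A minimal sketch of this prefix/suffix evaluation, assuming the same hypothetical model(x) interface as above (next-token logits of shape (1, len, vocab)); only the suffix tokens are scored, so a longer prefix can only help through the extra context it provides:

```python
import math
import torch

@torch.no_grad()
def suffix_perplexity(model, prefix_ids, suffix_ids):
    """Feed prefix + suffix through the model and compute perplexity over the
    suffix tokens only. prefix_ids and suffix_ids are 1-D tensors of token ids."""
    ids = torch.cat([prefix_ids, suffix_ids]).unsqueeze(0)
    logits = model(ids[:, :-1])
    log_probs = torch.log_softmax(logits, dim=-1)
    targets = ids[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # (1, len-1)
    suffix_nll = nll[:, prefix_ids.size(0) - 1:]   # positions that predict suffix tokens
    return math.exp(suffix_nll.mean().item())
```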


Origin blog.csdn.net/amusi1994/article/details/131606913