Challenging the Transformer in large language models! Microsoft proposes the new RetNet architecture, with 8x faster inference!


By Yuyang, from Aofeisi
Reproduced from: Qubit (QbitAI)

Microsoft's new large-model architecture officially challenges the Transformer!

The paper's title states it plainly:

Retentive Network: A Successor to Transformer for Large Language Models

Code: https://github.com/microsoft/unilm

Paper: https://arxiv.org/abs/2307.08621

The paper proposes a new retention mechanism to replace attention. The researchers, from Microsoft Research Asia and Tsinghua University, make no secret of their ambition, stating boldly:

RetNet achieves good scaling results, parallel training, low-cost deployment, and efficient inference.

These properties make the architecture a strong successor to the Transformer for large language models.

The experimental data also shows that on language modeling tasks:

  • RetNet can achieve perplexity comparable to Transformer

  • 8.4 times faster inference

  • 70% reduction in memory usage

  • Good scalability

And once the model exceeds a certain size, RetNet performs better than the Transformer.


Does the Transformer really have a worthy successor? Let's look at the details.

Solving the "impossible triangle"

The Transformer's importance in large language models is beyond doubt. OpenAI's GPT series, Google's PaLM, and Meta's LLaMA are all built on the Transformer.

But the Transformer is not perfect: its parallel processing comes at the cost of inefficient inference, with O(N) complexity per decoding step; it is also memory-intensive, so the longer the sequence, the more memory it consumes.
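To make the memory point concrete, here is a rough back-of-the-envelope sketch of how a decoder-only Transformer's key/value cache grows linearly with sequence length during generation. The model dimensions below are illustrative assumptions, not figures from the paper:

```python
def kv_cache_bytes(seq_len, n_layers=32, d_model=4096, bytes_per_value=2):
    """Approximate KV-cache size for one sequence: one (seq_len, d_model)
    key tensor and one value tensor per layer, stored in fp16.
    The layer count and hidden size are illustrative assumptions."""
    return 2 * n_layers * seq_len * d_model * bytes_per_value

for seq_len in (1024, 8192, 65536):
    print(f"{seq_len:>6} tokens -> {kv_cache_bytes(seq_len) / 2**30:.1f} GiB")
#   1024 tokens -> 0.5 GiB
#   8192 tokens -> 4.0 GiB
#  65536 tokens -> 32.0 GiB
```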

It is not that no one has tried to improve the Transformer before. But the main research directions all sacrifice one thing for another:

Linear attention can reduce inference cost, but its performance is poor;

Recurrent neural networks cannot be trained in parallel.

In other words, these neural network architectures face an "impossible triangle", whose three corners represent parallel training, low-cost inference, and good scalability.


What the RetNet researchers set out to do is make the impossible possible.

Specifically, RetNet builds on the Transformer and replaces the standard self-attention mechanism with a multi-scale retention mechanism.

Compared with the standard self-attention mechanism, the retention mechanism has several characteristics:

A position-dependent exponential decay term is introduced in place of softmax; this simplifies the computation while preserving information from previous steps in decayed form.

Position information is expressed in the complex domain, replacing absolute or relative positional encodings and converting easily to a recurrent form.

In addition, the retention mechanism uses multi-scale decay rates, which increases the model's expressiveness, and exploits the scale invariance of GroupNorm to improve the numerical precision of the retention layers.
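As a rough illustration of the decay-instead-of-softmax idea, here is a minimal single-head sketch of the parallel retention form in PyTorch. It is our own simplification, not the official microsoft/unilm implementation: it omits the complex-valued (xPos-style) position embedding, the per-head decay rates, and GroupNorm.

```python
import torch

def parallel_retention(q, k, v, gamma):
    """Simplified single-head parallel retention.

    q, k, v: (batch, seq_len, d) tensors; gamma: scalar decay in (0, 1).
    Instead of softmax, the QK^T scores are weighted by a causal,
    position-dependent decay matrix D[n, m] = gamma**(n - m) for n >= m.
    """
    seq_len = q.shape[1]
    pos = torch.arange(seq_len)
    exponent = (pos[:, None] - pos[None, :]).float()
    decay = torch.where(exponent >= 0, gamma ** exponent, torch.zeros_like(exponent))
    scores = (q @ k.transpose(-1, -2)) * decay   # no softmax normalization
    return scores @ v

# Toy usage
q, k, v = (torch.randn(2, 16, 64) for _ in range(3))
out = parallel_retention(q, k, v, gamma=0.97)
print(out.shape)  # torch.Size([2, 16, 64])
```

Multi-scale retention assigns a different gamma to each head, which is where the extra expressiveness mentioned above comes from.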

[Figure: Dual representation of RetNet]

Each RetNet block contains two modules: a multi-scale retention (MSR) module and a feed-forward network (FFN) module.

The retention mechanism supports computing over sequences in three forms:

  • Parallel

  • Recurrent

  • Chunkwise recurrent, a hybrid of the parallel and recurrent representations: the input sequence is divided into chunks, computation within each chunk follows the parallel form, and computation across chunks follows the recurrent form.

Among them, the parallel representation lets RetNet use GPUs for efficient parallel training, just like the Transformer.

The recurrent representation achieves O(1) per-step inference complexity, reducing memory usage and latency.

The chunkwise recurrent representation handles long sequences more efficiently.
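For intuition on why the recurrent form needs only O(1) work per generated token, here is a minimal sketch of one decoding step under the same simplified single-head setting as above (again our own illustration, not the official implementation): the entire history is folded into a fixed-size state matrix that is decayed and updated at every step.

```python
import torch

def recurrent_retention_step(q_n, k_n, v_n, state, gamma):
    """One decoding step of the simplified recurrent retention form.

    q_n, k_n, v_n: (d,) vectors for the current token.
    state:         (d, d) matrix summarizing all previous tokens.
    The update S_n = gamma * S_{n-1} + outer(k_n, v_n) keeps the cost and
    memory per step constant, independent of the sequence length.
    """
    state = gamma * state + torch.outer(k_n, v_n)
    out = q_n @ state
    return out, state

# Toy usage: decode a short sequence token by token
d, gamma = 64, 0.97
state = torch.zeros(d, d)
for _ in range(8):
    q_n, k_n, v_n = (torch.randn(d) for _ in range(3))
    out, state = recurrent_retention_step(q_n, k_n, v_n, state, gamma)
print(out.shape)  # torch.Size([64])
```

The chunkwise form interleaves the two: the parallel computation runs inside each chunk, while a state like this one is carried across chunk boundaries.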

In this way, RetNet makes the "impossible triangle" possible. Below is a comparison of RetNet with other foundational architectures:


Experimental results on language modeling tasks further prove the effectiveness of RetNet.

The results show that RetNet achieves perplexity comparable to the Transformer's (PPL is a metric for evaluating language model quality; lower is better).

Meanwhile, at 7 billion parameters and an input sequence length of 8k, RetNet's inference is 8.4 times faster than the Transformer's, and its memory usage is 70% lower.

During training, RetNet also beats the standard Transformer with FlashAttention in memory savings and speedup, saving 25-50% of memory and running 7 times faster.

It is worth mentioning that the inference cost of RetNet is independent of the sequence length, and the inference latency is insensitive to the batch size, allowing high throughput.


In addition, when the model has more than 2 billion parameters, RetNet performs better than the Transformer.


The research team

RetNet's research team is from Microsoft Research Asia and Tsinghua University.

The co-first authors are Sun Yutao and Dong Li.

Sun Yutao is an undergraduate in the Department of Computer Science and Technology at Tsinghua University and is currently an intern at Microsoft Research Asia.

Dong Li is a researcher at Microsoft Research Asia. He is also one of the authors of the much-discussed paper on a "Transformer that can remember 1 billion tokens".


The corresponding author of the RetNet paper is Wei Furu, a global research partner at Microsoft Research Asia; the 1-billion-token Transformer also came from his research team.


 
  
