40% faster than Transformer! Meta releases Megabyte, a new model that tackles Transformer's compute bottleneck

Reprinted from: Xinzhiyuan | Editor: Joey

[Introduction] Transformer has become the standard backbone of large models in recent years. Megabyte, a model developed by the Meta team, claims to fix Transformer's flaws while generating 40% faster.

Transformer has undoubtedly been the most popular model in machine learning over the past few years.

Since it was proposed in the 2017 paper "Attention Is All You Need," this architecture has swept major translation benchmarks and set a string of new records.

However, Transformer has a flaw when dealing with long byte sequences: its compute cost grows prohibitively. The latest results from Meta researchers address this defect well.

They introduce a new model architecture that can generate sequences of more than one million tokens across multiple formats, going beyond what the existing Transformer architecture behind models such as GPT-4 can handle.

This model, called Megabyte, is a multi-scale decoder architecture that can perform end-to-end differentiable modeling of sequences of more than one million bytes.

Paper link: https://arxiv.org/abs/2305.07185

To understand why Megabyte is stronger than Transformer, you first have to look at Transformer's shortcomings.

Where Transformer falls short

To date, the leading high-performance generative AI models, such as OpenAI's GPT-4 and Google's Bard, have all been based on the Transformer architecture.

But Meta's research team believes that the popular Transformer architecture may be reaching its limits, citing two important flaws inherent in Transformer's design:

- The cost of self-attention grows rapidly as input and output byte lengths increase. Music, image, or video files often span several megabytes, while large decoders (LLMs) typically use contexts of only a few thousand tokens.

- Feed-forward networks help language models understand and process words through a series of mathematical operations and transformations, but they are hard to scale on a per-position basis: these networks operate on each group of characters or positions independently, which results in heavy computational overhead.

What makes Megabyte stronger?

Compared with the Transformer, the Megabyte model takes a distinctly different approach: it partitions input and output sequences into patches rather than individual tokens.

As shown in the figure below, a local model generates results within each patch, while a global model manages and coordinates the final output across all patches.

[Figure: Megabyte architecture overview — a global model over patch representations and small local models within each patch]

First, the byte sequence is split into fixed-size patches, roughly analogous to tokens. The model consists of three parts (a minimal code sketch follows the list):

(1) A patch embedder: encodes a patch simply by losslessly concatenating the embeddings of its bytes

(2) A global model: a large autoregressive Transformer whose inputs and outputs are patch representations

(3) A local model: a small autoregressive model that predicts the bytes in the patch
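To make the division of labor concrete, here is a minimal PyTorch sketch of the three parts. It only illustrates the data flow and is not the authors' implementation: the dimensions, layer counts, the `MegabyteSketch` name, and the use of non-causal `nn.TransformerEncoder` stacks as stand-ins for the causal global and local decoders are all simplifying assumptions.

```python
import torch
import torch.nn as nn

VOCAB = 256                  # byte-level vocabulary
PATCH = 8                    # bytes per patch (P)
D_LOCAL = 64                 # per-byte embedding size
D_GLOBAL = PATCH * D_LOCAL   # patch embedding = concatenation of its byte embeddings

class MegabyteSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # (1) patch embedder: embed each byte; a patch is the lossless concatenation
        self.byte_embed = nn.Embedding(VOCAB, D_LOCAL)
        # (2) global model: operates on one vector per patch
        self.global_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_GLOBAL, nhead=8, batch_first=True), num_layers=2)
        # (3) local model: predicts the bytes inside each patch
        self.local_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_LOCAL, nhead=4, batch_first=True), num_layers=2)
        self.to_logits = nn.Linear(D_LOCAL, VOCAB)

    def forward(self, byte_ids):                     # byte_ids: (B, T), T divisible by PATCH
        B, T = byte_ids.shape
        n = T // PATCH
        x = self.byte_embed(byte_ids)                # (B, T, D_LOCAL)
        patches = x.reshape(B, n, D_GLOBAL)          # concatenate byte embeddings per patch
        g = self.global_model(patches)               # (B, n, D_GLOBAL): patch-level context
        # hand each patch's slice of global context to the small local model
        local_in = g.reshape(B * n, PATCH, D_LOCAL) + x.reshape(B * n, PATCH, D_LOCAL)
        h = self.local_model(local_in)               # (B * n, PATCH, D_LOCAL)
        return self.to_logits(h).reshape(B, T, VOCAB)  # per-byte logits

logits = MegabyteSketch()(torch.randint(0, VOCAB, (2, 64)))
print(logits.shape)                                  # torch.Size([2, 64, 256])
```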

The researchers observe that byte prediction is relatively easy for most tasks (such as completing a word given its first few characters), which means a large network per byte is unnecessary: a much smaller model can handle the predictions within a patch.

This approach tackles the scalability challenges prevalent in today's AI models. Megabyte's patch system lets a single feed-forward pass handle a patch containing multiple tokens, while attention over the much shorter patch-level sequence eases the self-attention scaling problem.

Specifically, the Megabyte architecture makes three major improvements over the Transformer for long-sequence modeling:

- Sub-quadratic self-attention

Most work on long-sequence models focuses on mitigating the quadratic cost of self-attention. Megabyte instead decomposes a long sequence into two much shorter sequences (one over patches, one over the bytes within each patch), which remains tractable even for very long inputs.
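A rough cost accounting makes this concrete. This is only a sketch that ignores hidden sizes and constant factors; the choice P ∝ T^(1/3) is the patch size the paper identifies as roughly optimal:

```latex
% Rough self-attention cost for a length-T byte sequence, constants omitted:
%   vanilla Transformer  vs.  Megabyte with patch size P
T^{2}
\;\longrightarrow\;
\underbrace{\left(\tfrac{T}{P}\right)^{2}}_{\text{global model over } T/P \text{ patches}}
\;+\;
\underbrace{\tfrac{T}{P}\cdot P^{2}}_{\text{local models over } P \text{ bytes each}}
\;=\;
\frac{T^{2}}{P^{2}} + T P,
\qquad
P \propto T^{1/3} \;\Rightarrow\; O\!\left(T^{4/3}\right).
```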

- Per-patch feed-forward layers

In GPT-3-sized models, more than 98% of FLOPS go into computing the position-wise feed-forward layers. Megabyte instead uses one large feed-forward layer per patch, which buys a larger, more capable model at the same cost: with patch size P, a baseline Transformer applies the same m-parameter feed-forward layer P times, whereas Megabyte can apply a layer with mP parameters once for the same cost.
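As a quick sanity check of that arithmetic (the concrete numbers below are arbitrary, chosen only for illustration, and FLOPs of a dense layer are approximated as its parameter count per position it is applied to):

```python
P = 8            # patch size in bytes
m = 1_000_000    # parameters of the baseline's position-wise feed-forward layer
T = 1024         # sequence length in bytes

baseline_flops = T * m               # m-parameter layer applied at every byte position
megabyte_flops = (T // P) * (m * P)  # (m * P)-parameter layer applied once per patch
assert baseline_flops == megabyte_flops   # same cost, but a P-times-larger layer
print(baseline_flops, megabyte_flops)
```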

- Parallelism in Decoding

Transformers must perform all computations serially during generation, because the input at each time step is the output of the previous one. By generating the representations of patches in parallel, Megabyte allows greater parallelism during generation.

For example, a Megabyte model with 1.5B parameters generates sequences 40% faster than a standard 350M-parameter Transformer, while also achieving better perplexity when trained with the same amount of compute.

In evaluations, Megabyte far outperforms other byte-level models and delivers results competitive with state-of-the-art models trained on subwords.

In comparison, OpenAI's GPT-4 has a limit of 32,000 tokens, and Anthropic's Claude has a limit of 100,000 tokens.

In addition, in terms of computational efficiency, at a fixed model size and sequence length Megabyte requires less compute than Transformers and Linear Transformers of the same size, allowing larger models to be used for the same computational cost.

Together, these improvements allow us to train larger, better-performing models on the same computational budget, scale to very long sequences, and increase generation speed during deployment.

What comes next?

With the AI arms race in full swing, models keep getting more capable and parameter counts keep climbing.

GPT-3.5 was trained with 175B parameters, and there is speculation that the more powerful GPT-4 was trained with around 1 trillion parameters.

OpenAI CEO Sam Altman also recently suggested a change in strategy, saying the company is considering moving away from training ever-larger models and focusing on other performance optimizations.

He compared the future of AI models to iPhone chips: most consumers have no idea about the raw technical specifications.

Meta's researchers believe their innovative architecture has come at just the right time, but acknowledge that there are other avenues for optimization.

For example, more efficient encoder models that use patching techniques, decoder models that decompose sequences into smaller blocks, and preprocessing of sequences into compressed tokens could all extend the capabilities of the existing Transformer architecture and underpin a new generation of models.

Former Tesla AI director Andrej Karpathy also weighed in on the paper, tweeting:

This is very promising. Everyone should hope that we can throw away tokenization in large models; doing that naively is what leads to those overly long byte sequences.

References:

https://www.artisana.ai/articles/meta-ai-unleashes-megabyte-a-revolutionary-scalable-model-architecture
