The AI world is buzzing! Microsoft unshackles the Transformer, extending sequence length to 1 billion+ tokens

The AI community is buzzing: LONGNET, launched by Microsoft, has successfully expanded the Transformer's token-processing capacity to more than 1 billion.

Until now, the Transformer has been praised for its comprehension and short-sequence generation, while remaining largely "powerless" on long sequences.

Microsoft's move is like giving a sprint champion the ability to run a marathon at top speed: while handling long sequences, the model still performs excellently on short-sequence tasks.

LONGNET is a Transformer variant that can scale sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences.


Online commenters have called it a revolution!

The work opens up new ideas and possibilities for modeling long sequences; in the future it may even be possible to treat the entire Internet corpus as a single sequence. It also means that far more complex AI interactions become possible.

How LONGNET unlocks sequence length

The Transformer is the core architecture behind many AI systems; it works by processing sequences of tokens to understand or generate text.

Note: a token can be as short as a single word or as long as a whole sentence.

The global attention mechanism

Global attention is the key to the Transformer's understanding ability: it lets every token "interact" with every other token. But as the sequence grows longer, the number of interactions grows quadratically, which drives up the computational cost sharply.

If that sounds abstract, imagine trying to hold a separate conversation with everyone in a room. With only a few people this is manageable, but as the crowd grows it quickly becomes overwhelming.
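To make the quadratic blow-up concrete, here is a minimal NumPy sketch of vanilla scaled dot-product attention (a generic illustration, not Microsoft's code): each token's query is compared against every token's key, so an N × N score matrix has to be built.

```python
import numpy as np

def vanilla_attention(Q, K, V):
    """Full self-attention over Q, K, V of shape (N, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)             # (N, N): every token vs. every token
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # (N, d)

# Toy sizes; in practice N can be huge, and the (N, N) matrix is the bottleneck.
N, d = 8, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(vanilla_attention(Q, K, V).shape)        # (8, 16)
```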

ChatGPT, developed by OpenAI on top of the Transformer, often "forgets" what you told it earlier when you hold a long contextual dialogue with it.

In the future, LONGNET could unlock much longer dialogue in models like ChatGPT, letting them remember your very first question.

The Heart of LONGNET: The Power of Dilated Attention

In LONGNET, Microsoft researchers introduce a novel concept called "dilated attention" into the Transformer model, which fundamentally changes how the model processes sequences.

The magic of dilated attention is that the attentive field expands as the distance between tokens grows, so a token can still reach faraway tokens without having to interact with every other token.

It's like standing in a crowd: you can pay attention both to people nearby and to people far away, but you don't need to talk to everyone individually.

Figure: the building blocks of dilated attention in LONGNET, a family of attention patterns for modeling both short-range and long-range dependencies. The number of attention patterns can be scaled according to the sequence length.

This is similar to sparse attention, but borrows the idea of segment trees: interactions that would otherwise grow quadratically with sequence length now grow only linearly. In other words, as sequences get longer, the extra computation stays manageable.
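As a rough, illustrative sketch of the idea (not the paper's actual implementation), you can split the sequence into segments, keep only every r-th token within each segment, and run ordinary attention on that sparse subset. LONGNET mixes several such segment-length/dilation patterns so that every position is covered; the toy sketch below shows a single pattern with made-up sizes.

```python
import numpy as np

def softmax_attention(Q, K, V):
    d = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d)
    s -= s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def dilated_attention_single_pattern(Q, K, V, segment=4, dilation=2):
    """Attend only within each segment, and only over every `dilation`-th token.

    Positions skipped by this pattern stay zero; the real LONGNET combines
    multiple (segment, dilation) patterns so that all positions are covered.
    """
    N, _ = Q.shape
    out = np.zeros_like(V)
    for start in range(0, N, segment):
        idx = np.arange(start, min(start + segment, N))[::dilation]
        out[idx] = softmax_attention(Q[idx], K[idx], V[idx])
    return out

N, d = 16, 8
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(dilated_attention_single_pattern(Q, K, V).shape)   # (16, 8)
```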

Dilated attention not only makes LONGNET more efficient, it also makes it more flexible. Because every token no longer has to interact with every other token, the model can adjust its focus of attention according to the task, which makes it effective on both short and long sequences.

LONGNET also performs well on general language tasks. This means that it is not only a specialized tool for long sequences, but a robust and flexible model capable of handling many tasks.

Figure: comparison of computational complexity between different methods, where N is the sequence length and d is the hidden dimension.
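As a rough summary of that comparison (the exact expressions are in the paper; treat these as approximate), the per-layer attention cost scales along these lines:

```latex
% Approximate per-layer cost in sequence length N and hidden size d
% (as summarized in the LONGNET paper's comparison):
\begin{align*}
\text{vanilla attention}           &: \; O(N^{2} d) \\
\text{sparse attention}            &: \; O(N \sqrt{N}\, d) \\
\text{dilated attention (LONGNET)} &: \; O(N d)
\end{align*}
```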

In addition, the researchers compared LONGNET with the vanilla Transformer and sparse Transformers, scaling the sequence length from 2,000 tokens (2K) to 32,000 tokens (32K) and adjusting each model's parameters to keep the comparison fair. Despite certain computational limitations, LONGNET's results are still excellent.

They also scaled the model parameters from 120 million to 2.7 billion: as LongNet's compute increases, the perplexity (PPL) on the test set decreases. This shows that LongNet follows the scaling law, and better performance may be achieved by training larger language models.
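For context, perplexity is simply the exponential of the average per-token cross-entropy loss (a standard definition, not specific to LongNet), so lower is better:

```latex
% Perplexity over a test set of T tokens: exponential of the mean
% negative log-likelihood assigned by the model p_theta.
\mathrm{PPL} = \exp\!\left( -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right) \right)
```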

LONGNET is not without limitations. Although dilated attention reduces the computational complexity below that of the standard Transformer, processing sequences of more than 1 billion tokens still requires substantial resources. And while the model is powerful, it may still need further testing and validation.

Microsoft also lays out future research directions for LONGNET: How can the dilated attention mechanism be further optimized? Are there other sequence-processing techniques that could complement dilated attention? And how can LONGNET be effectively integrated into existing AI systems such as ChatGPT?

Paper address:

https://arxiv.org/abs/2307.02486

Reference source:

https://thetechpencil.com/revolutionizing-ai-with-longnet-microsofts-breakthrough-in-handling-billion-token-sequences-59b05ef7d6e8


Original post: https://blog.csdn.net/2301_78285120/article/details/131622908