Cracking the mystery of self-attention's extrapolation defects: Ant develops a new-generation Transformer that may achieve lossless length extrapolation


With the rapid development of large language models, their ability to extrapolate to longer sequence lengths is drawing increasing attention from researchers. Although this was regarded as a natural ability when the Transformer was born, deeper research has shown that the reality is quite different: the traditional Transformer architecture invariably performs poorly once inference exceeds the training length.

Researchers gradually realized that this defect might be related to position encoding, prompting a transition from absolute to relative position encoding and a series of related optimization works. Representative examples include Rotary Position Embedding (RoPE) (Su et al., 2021), ALiBi (Press et al., 2021), and XPOS (Sun et al., 2022), as well as the more recent Position Interpolation (PI) from Meta (Chen et al., 2023) and the NTK-aware Scaled RoPE proposed by a Reddit user (bloc97, 2023), all of which attempt to give models truly ideal extrapolation capability.

However, while researchers focused their attention on positional encoding, the other heavyweight component of the Transformer, self-attention itself, was overlooked. The latest research from the Ant AI team shows that this neglected component is likely the key to reversing the situation: beyond position encoding, self-attention itself still harbors unsolved problems behind the Transformer's poor extrapolation performance.

Based on this discovery, the Ant AI team developed a new generation of attention mechanism that achieves length extrapolation while performing equally well on concrete tasks.
 

  • Paper address: https://arxiv.org/abs/2309.08646
  • Github repository: https://github.com/codefuse-ai/Collinear-Constrained-Attention
  • ModelScope: https://modelscope.cn/models/codefuse-ai/Collinear-Constrained-Attention/summary
  • HuggingFace: Stay tuned


 

Background Knowledge

Before we dive in, let’s quickly review some core background knowledge.

 

Length Extrapolation

Length extrapolation refers to the ability of a large language model to handle text longer than what appeared in its training data. When training large language models, there is usually a maximum sequence length, and text exceeding this length must be truncated or split. In actual applications, however, users may feed the model input longer than anything seen during training. If the model lacks length extrapolation capability, or its extrapolation capability is poor, it will produce unpredictable output, degrading its practical performance.

 

Self-Attention

Multi-head self-attention, proposed by Vaswani et al. (2017), is the core of today's large language models and has played a decisive role in advancing the field of artificial intelligence. A visual description is given in Figure 1 below. Since this work is widely known, it will not be recapped here; readers who are new to large language models can consult the original paper for details (Vaswani et al., 2017).

Figure 1. Schematic diagram of the multi-head attention mechanism, quoted from (Vaswani et al., 2017).

 

Position Encoding

Since the self-attention mechanism does not itself process the position information in a sequence, position encoding must be introduced. Because the positional encoding of the original Transformer has poor extrapolation ability and is rarely used today, this article will not discuss it in depth; readers who want the details can consult the original paper (Vaswani et al., 2017). Here we focus on the currently very popular Rotary Position Embedding (RoPE) (Su et al., 2021). It is worth noting that Meta's LLaMA series of models (Touvron et al., 2023a) all adopt this encoding.

 

From the perspective of modeling aesthetics, RoPE is a very elegant structure: it expresses relative position by folding position information into a rotation of the query and key vectors.
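To make this concrete, here is a minimal NumPy sketch of the rotary mechanism; the pair-wise layout and the base value of 10000 follow common RoPE implementations rather than any particular codebase:

```python
import numpy as np

def rope_rotate(x, position, base=10000):
    """Apply rotary position embedding to a single query/key vector.

    x: 1-D array of even dimension d; position: integer token index.
    Each consecutive pair (x[2i], x[2i+1]) is rotated by position * theta_i,
    where theta_i = base^(-2i/d).
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)      # per-pair frequencies
    angle = position * theta                       # rotation angle per pair
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    rotated = np.empty_like(x)
    rotated[0::2] = x_even * cos - x_odd * sin
    rotated[1::2] = x_even * sin + x_odd * cos
    return rotated

# Because only the angle difference survives the dot product, the score
# <rope(q, m), rope(k, n)> depends on the relative offset m - n.
q, k = np.random.randn(64), np.random.randn(64)
score = rope_rotate(q, 10) @ rope_rotate(k, 7)
```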

Figure 2. Rotary position embedding structure, quoted from (Su et al., 2021).

 

Position Interpolation

Although RoPE extrapolates far better than absolute position encoding, it still cannot keep up with ever-growing application demands. Researchers have therefore proposed a series of improvements, with PI (Chen et al., 2023) and NTK-aware Scaled RoPE (bloc97, 2023) as typical representatives. However, position interpolation still requires fine-tuning to achieve ideal results, and experiments show that even NTK-aware Scaled RoPE, which claims to extrapolate without fine-tuning, reaches at most 4 to 8 times the training length under the traditional attention architecture, and struggles to maintain good language modeling performance and long-range dependency capability at the same time.
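The two ideas can be contrasted in a few lines. The sketch below is only illustrative (the function names and the fixed scaling factor are ours): position interpolation compresses position indices back into the trained range, while NTK-aware scaling enlarges the RoPE base so that high-frequency dimensions are barely touched and low-frequency dimensions are stretched.

```python
import numpy as np

def rope_angles(position, d, base=10000.0):
    """Per-pair rotation angles for a token at `position` (plain RoPE)."""
    theta = base ** (-np.arange(0, d, 2) / d)
    return position * theta

def pi_angles(position, d, scale, base=10000.0):
    """Position Interpolation: shrink position indices by `scale` (needs fine-tuning)."""
    return rope_angles(position / scale, d, base)

def ntk_angles(position, d, scale, base=10000.0):
    """NTK-aware scaling: enlarge the RoPE base instead of shrinking positions."""
    new_base = base * scale ** (d / (d - 2))
    return rope_angles(position, d, new_base)

# Extrapolating 4x beyond a 512-token training window:
d, pos, scale = 64, 2048, 4
print(pi_angles(pos, d, scale)[:3], ntk_angles(pos, d, scale)[:3])
```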

Figure 3. Schematic diagram of position interpolation, quoted from (Chen et al., 2023).

 

CoCA

Past research has focused mainly on positional encoding, with all related work assuming that the self-attention mechanism itself is already perfectly implemented. However, the Ant AI team recently identified a long-ignored key point: to fundamentally solve the Transformer's extrapolation problem, the self-attention mechanism itself must also be reconsidered.

Figure 4. CoCA model architecture, quoted from (Zhu et al., 2023).

 

Abnormal behavior of RoPE and self-attention

In the Transformer model, the core idea of self-attention is to compute the relationships between queries (q) and keys (k). The attention mechanism uses these relationships to decide which parts of the input sequence the model should focus on.
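In code, this reduces to a scaled dot product between queries and keys followed by a softmax; the following single-head causal sketch is generic (it is not the paper's implementation and omits multi-head projections):

```python
import numpy as np

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape (seq_len, d). Each query attends only to
    keys at the same or earlier positions.
    """
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                          # pairwise q-k relationships
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), 1)
    scores = np.where(mask, -np.inf, scores)               # hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
    return weights @ v                                     # weighted sum of values
```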

Figure 5. The order in the bidirectional model is destroyed, quoted from (Zhu et al., 2023).

Figure 6. The order in the causal model is destroyed, quoted from (Zhu et al., 2023).

 

Collinear Constrained Attention (CoCA)

The above analysis of the abnormal behavior of RoPE and self-attention makes clear that tinkering with position encoding alone cannot fundamentally solve this problem. The fundamental solution is to force the initial angle between the query and the key in self-attention to be zero, which is the origin of the Collinear Constrained Attention in the paper.
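To make the constraint concrete, here is a heavily simplified sketch of one way such a collinear constraint could be realized; the function name, the per-pair ReLU magnitude, and the normalization are our own illustrative choices and not the paper's exact formulation, which readers should take from the original text:

```python
import numpy as np

def collinear_keys(q, t, eps=1e-6):
    """Conceptual sketch of a collinear constraint (NOT the paper's exact code).

    q: queries, shape (n, d); t: key-side projections, shape (n, d), d even.
    For every (query i, key position j) and every rotary 2-D pair, the key
    direction is copied from query i and only a non-negative magnitude is
    taken from t_j, so the initial q-k angle is zero before rotation.
    Returns keys of shape (n, n, d): keys[i, j] is the key of token j as
    seen by query i.
    """
    n, d = q.shape
    q_pairs = q.reshape(n, d // 2, 2)                              # (i, pair, 2)
    q_dir = q_pairs / (np.linalg.norm(q_pairs, axis=-1, keepdims=True) + eps)
    t_mag = np.maximum(t.reshape(n, d // 2, 2), 0.0)               # non-negative components
    t_mag = np.linalg.norm(t_mag, axis=-1)                         # one magnitude per pair: (j, pair)
    # keys[i, j] = direction of q_i's pairs scaled by magnitudes from t_j
    keys = q_dir[:, None, :, :] * t_mag[None, :, :, None]
    return keys.reshape(n, n, d)
```

Note that materializing a separate key for every (query, key) pair as in this naive sketch costs O(n²·d) memory, which is exactly the GPU memory bottleneck discussed below that the paper resolves with a more efficient equivalent formulation.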

The detailed derivations and formulas are not reproduced here; readers can consult the original paper for an in-depth treatment.

It is worth highlighting several points from the theoretical analysis in the paper:
 

  • Stable long-range decay: CoCA exhibits more stable long-range decay characteristics than RoPE.
  • GPU memory bottleneck and its solution: a naive implementation of CoCA risks introducing a GPU memory bottleneck, but the paper provides a highly efficient solution that keeps CoCA's computational and space complexity almost identical to original self-attention, which makes CoCA very practical.
  • Seamless integration: CoCA integrates seamlessly with known interpolation methods (the paper experiments with NTK-aware Scaled RoPE) and, without any fine-tuning, far outperforms the original attention structure. This means a model trained with CoCA naturally acquires near-unlimited extrapolation capability, a property large language models have long sought.


 

Experimental results

The paper compares the extrapolation performance of CoCA, RoPE (Su et al., 2021), and ALiBi (Press et al., 2021), with encouraging results. The compared models are denoted as follows:

  • Origin: original attention structure with RoPE position encoding
  • ALiBi: original attention structure with ALiBi position encoding
  • CoCA: the paper's model structure with RoPE position encoding

For the detailed experimental setup, please refer to the original paper.

 

Long text modeling capabilities


The paper evaluates the long-text language modeling capability of the CoCA, Origin, and ALiBi models. The evaluation uses 100 documents, each with at least 8,192 tokens, randomly sampled from the PG-19 dataset (Rae et al., 2019). All three models have a training length of 512 and a model size of 350M parameters.
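For readers who want to reproduce this kind of curve, the measurement is typically a sliding-window perplexity over each document. The following rough sketch assumes a Hugging Face style causal language model interface; the window and stride values are placeholders, not the paper's evaluation settings:

```python
import math
import torch

@torch.no_grad()
def sliding_window_ppl(model, token_ids, window=512, stride=256, device="cpu"):
    """Sliding-window perplexity over one long token sequence (rough sketch)."""
    nll_sum, token_count, prev_end = 0.0, 0, 0
    for start in range(0, len(token_ids), stride):
        end = min(start + window, len(token_ids))
        target_len = end - prev_end                      # tokens scored in this window
        ids = torch.tensor(token_ids[start:end], device=device).unsqueeze(0)
        labels = ids.clone()
        labels[:, :-target_len] = -100                   # ignore the overlapping prefix
        out = model(input_ids=ids, labels=labels)        # HF-style causal LM interface
        nll_sum += out.loss.item() * target_len
        token_count += target_len
        prev_end = end
        if end == len(token_ids):
            break
    return math.exp(nll_sum / token_count)
```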

 

Figure 7 shows a noteworthy trend: once the inference length exceeds the training length, the Origin model's perplexity quickly diverges (>1000). In contrast, the CoCA model maintains low perplexity even at 16 times its training length, with no divergence.

 

Since NTK-aware Scaled RoPE (bloc97, 2023) is an extrapolation method that requires no fine-tuning, the paper also applies it in the experiments; yet even with the dynamic NTK method, the Origin model's perplexity remains far higher than CoCA's.
 

ALiBi achieves the best perplexity, and CoCA reaches results similar to ALiBi once the dynamic NTK method is applied.
 

Figure 7. Sliding-window perplexity test results, quoted from (Zhu et al., 2023).

 

Capturing long-range dependencies


Perplexity measures how well a language model predicts the next token, but it does not by itself characterize an ideal model: local attention, for example, achieves good perplexity yet often performs poorly at capturing long-range dependencies.
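For reference, the perplexity of a sequence of N tokens is the exponentiated average negative log-likelihood of each token given its prefix:

$$\mathrm{PPL}(x_{1:N}) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)$$

Lower is better, and a diverging value means the model has effectively stopped predicting the next token.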

 

To probe this issue, the paper uses the synthetic passkey retrieval task proposed by Mohtashami & Jaggi (2023) to evaluate the CoCA, Origin, and ALiBi models. In this task, a random passkey is hidden somewhere in a long document and must be identified and retrieved.
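A typical prompt for this kind of task can be sketched as follows; the filler sentence, key format, and prompt wording are illustrative stand-ins rather than the exact prompt used in the paper or by Mohtashami & Jaggi (2023):

```python
import random

def build_passkey_prompt(total_tokens_approx=4096):
    """Build a passkey-retrieval prompt: a random key buried in filler text."""
    passkey = str(random.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    n_filler = total_tokens_approx // 16                 # rough filler sentence count
    insert_at = random.randint(0, n_filler)              # random hiding position
    lines = [filler] * n_filler
    lines.insert(insert_at, f"The pass key is {passkey}. Remember it. ")
    prompt = (
        "There is important info hidden in a lot of irrelevant text. "
        "Find it and memorize it.\n"
        + "".join(lines)
        + "\nWhat is the pass key? The pass key is"
    )
    return prompt, passkey

prompt, answer = build_passkey_prompt()
# A model with real long-range retrieval should complete the prompt with `answer`.
```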
 

As shown in Figure 8, although models with strong locality assumptions such as ALiBi perform well on perplexity, they have an irreparable disadvantage in capturing long-range dependencies: once extrapolating beyond 1x the training length, their accuracy declines rapidly and eventually falls below 10%. In contrast, even when the test sequence is extended to 16 times the training length, CoCA maintains high accuracy, still exceeding 60% at 16x extrapolation, roughly 20% higher than the Origin model and more than 50% higher than ALiBi overall.
 

Figure 8. Random key identification retrieval performance curve, quoted from (Zhu et al., 2023).

 

Hyperparameter stability


Since the experiments apply the dynamic NTK method, the paper also examines how stable the Origin and CoCA models are with respect to its scaling factor hyperparameter.

 

As shown in Figure 9, the Origin model's perplexity fluctuates violently across different scaling factors (roughly 200 to 800), while the CoCA model stays in a relatively stable range (roughly 60 to 70). Moreover, the detailed data in Table 4 show that even CoCA's worst perplexity is still more than 50% better than the Origin model's best.

 

In the passkey experiment, the Origin and CoCA models show behavior similar to the perplexity experiment: CoCA achieves higher accuracy across scaling factors, while the Origin model's accuracy drops below 20%. The detailed data in Table 5 further show that even at its best setting, scaling factor = 2, the Origin model still trails CoCA by 5% to 10% in accuracy.

 

Meanwhile, the Origin model's perplexity at scaling factor = 2 is poor, which indirectly confirms that the original attention structure struggles to maintain good perplexity and long-range dependency capture simultaneously during length extrapolation, whereas CoCA manages both.


Figure 9. Perplexity of Origin model and CoCA at different scaling factors, quoted from (Zhu et al., 2023)

 

Figure 10. Pass key accuracy of Origin model and CoCA at different scaling factors, quoted from (Zhu et al., 2023)

 

Attention score in length extrapolation

As explored in the PI paper (Chen et al., 2023), the failure of large language models at length extrapolation is directly related to outliers (usually very large values) in the attention scores. The CoCA paper explores this phenomenon further, which also explains why the CoCA model extrapolates better than the traditional attention structure.
 

The experiment uses a random fragment of 1,951 tokens from the PG-19 dataset (Rae et al., 2019), roughly 4 times the models' training length.

As shown in Figure 11, panel (a1) shows the per-layer attention scores of the Origin and CoCA models without the dynamic NTK method, and (b1) shows the scores with it. "Low layers" refers to the 6th, 12th, and 18th layers of the model, and "last layer" refers to the 24th layer. Panel (a2) zooms in on the last 500 tokens of (a1), and (b2) does the same for (b1).
 

  • From (a1) and (b1), the Origin model's attention scores contain a small number of outliers whose values are 10 to 20 times larger than the CoCA model's attention scores.
  • Since these outliers obscure the plot, (a2) zooms in on the last 500 tokens, where the Origin model's last-layer attention scores are almost 0, showing that the Origin model fails to attend to neighboring tokens during length extrapolation.
  • From (b2), once the dynamic NTK method is applied, the Origin model's attention scores on adjacent tokens become abnormally large. This anomaly is closely related to the abnormal behavior of RoPE and self-attention demonstrated earlier; the Origin model may be severely overfitting on nearby tokens.

Figure 11. Attention score in extrapolation, quoted from (Zhu et al., 2023)
 

Human Eval

Beyond the paper, we further evaluated HumanEval performance on the CoCA and Origin models using the same data (120B tokens), the same model size (1.3B), and the same training configuration. The comparison is as follows:
 

  • The two models are at the same overall level; CoCA does not sacrifice expressive power for its extrapolation ability.
  • The Origin model performs much better in Python and Java than in other languages but does worse in Go, while CoCA's performance is relatively balanced. Given that the training corpus contains little Go code, this suggests that CoCA may have some potential for small-sample learning. The per-language results are shown in the table below.


 

Model     python    java     cpp      js       go       AVG
CoCA      6.71%     6.1%     3.66%    4.27%    6.1%     5.37%
Origin    7.32%     5.49%    5.49%    5.49%    1.83%    5.12%

 

Summary

In this work, the Ant AI team identified anomalous behavior in the interaction between RoPE and the attention matrices, which disrupts the way the attention mechanism and the position encoding work together, especially at the closest positions, which are the tokens that carry key information.

 

To fundamentally solve this problem, the paper introduces a new self-attention framework called Collinear Constrained Attention (CoCA). The paper provides mathematical evidence for the method's superior properties, such as a stronger form of long-range decay, as well as computational and space efficiency suitable for practical applications.

 

Experimental results confirm that CoCA performs excellently in both long-text language modeling and long-range dependency capture. Moreover, CoCA can be seamlessly combined with existing extrapolation and interpolation techniques and with other optimization methods designed for the traditional Transformer. This adaptability suggests that CoCA has the potential to evolve into an enhanced version of the Transformer.
 

References

Shiyi Zhu, Jing Ye, Wei Jiang, Qi Zhang, Yifan Wu, and Jianguo Li. Cure the headache of transformers via collinear constrained attention, 2023.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. ArXiv, abs/2104.09864, 2021. URL https://api.semanticscholar.org/CorpusID:233307138.

Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. ArXiv, abs/2108.12409, 2021. URL https://api.semanticscholar.org/CorpusID:237347130.

Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. ArXiv, abs/2212.10554, 2022. URL https://api.semanticscholar.org/CorpusID:254877252.

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. ArXiv, abs/2306.15595, 2023. URL https://api.semanticscholar.org/CorpusID:259262376.

bloc97. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023. URL https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_modes_to_have/.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023a. URL https://api.semanticscholar.org/CorpusID:257219404.

Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023b. URL https://api.semanticscholar.org/CorpusID:259950998.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805, 2019. URL https://api.semanticscholar.org/CorpusID:52967399.

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. GLM: General language model pretraining with autoregressive blank infilling. In Annual Meeting of the Association for Computational Linguistics, 2021. URL https://api.semanticscholar.org/CorpusID:247519241.

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. ArXiv, abs/1911.05507, 2019. URL https://api.semanticscholar.org/CorpusID:207930593.

Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. ArXiv, abs/2305.16300, 2023. URL https://api.semanticscholar.org/CorpusID:258887482.
 

About DevOpsGPT
 

DevOpsGPT is an open-source project we initiated that applies large models to the DevOps field. It consists of three main modules.

 

DevOps-Eval is the evaluation module, whose goal is to build an industry-standard evaluation of LLMs in the DevOps field. The other two modules, DevOps-Model and DevOps-ChatBot, are a large model dedicated to the DevOps field and an intelligent assistant for the DevOps field, respectively.

 

Our goal is to genuinely use large models to improve efficiency and reduce costs in the DevOps field, covering development, testing, operations and maintenance, monitoring, and other scenarios.

 

We hope practitioners in the field will contribute their talents to make "being a coder no longer difficult", and we will regularly share our experiences and attempts in LLM4DevOps.
 

You are welcome to use, discuss, and build together:

(1) ChatBot - out-of-the-box DevOps intelligent assistant: https://github.com/codefuse-ai/codefuse-chatbot

(2) Eval - LLM industry standard evaluation in the DevOps field: https://github.com/codefuse-ai/codefuse-devops-eval

(3) Model - a large model exclusive to the DevOps field: https://github.com/codefuse-ai/CodeFuse-DevOps-Model

(4) CoCA - Ant’s self-developed new generation transformer: https://github.com/codefuse-ai/Collinear-Constrained-Attention

(5) CodeFuse official website: https://codefuse.alipay.com/welcome/product
