ALiBi: Attention With Linear Biases Enables Input Length Extrapolation

Introduction

Suppose a model is trained on sequences of 512 tokens; how well it performs at inference time on longer sequences is called the model's extrapolation ability. The authors show that the extrapolation of earlier positional encodings such as sinusoidal, rotary (RoPE), and T5 bias all degrades as the inference length increases. Based on this, the authors propose ALiBi (Attention with Linear Biases), as shown in the figure below:
[Figure: perplexity as a function of inference sequence length for sinusoidal, rotary, T5 bias, and ALiBi]
Compared with the other positional encodings, ALiBi's per-token perplexity stays essentially flat as the inference sequence length increases.
At the same time, ALiBi is faster than T5 bias and rotary in both training and inference speed, comparable to sinusoidal, and uses about 11% less memory than the sinusoidal baseline.
[Figure: training speed, inference speed, and memory usage comparison across positional methods]

Method

[Figure: ALiBi adds a linearly decreasing bias, scaled by a head-specific slope m, to the query-key attention scores]

ALiBi's method is very simple. As shown in the figure above, when computing attention scores, earlier positions are penalized in proportion to their distance from the current position. For example, when computing the attention of q3, which also attends to k1 and k2, the bias is -2 for q3·k1, -1 for q3·k2, and 0 for q3·k3. These distances are then multiplied by a slope m. The authors found that m does not need to be tuned for different datasets; a fixed set of values can be used unchanged. The slopes for the different heads are set as follows:
For a model with n attention heads, the slopes form a geometric sequence starting at 2^(-8/n) with ratio 2^(-8/n); for 8 heads this gives 1/2, 1/4, 1/8, ..., 1/256.
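
To make this concrete, here is a minimal PyTorch sketch (not the authors' released code; the function names alibi_slopes and alibi_bias are illustrative) of how the head-specific slopes and the distance-based bias matrix could be built:

```python
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    """Head-specific slopes: a geometric sequence starting at 2^(-8/n).

    This matches the paper's recipe when n_heads is a power of two.
    """
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head bias matrix: 0 for the current token, -m, -2m, ... going back in time."""
    positions = torch.arange(seq_len)
    # distances[i, j] = j - i: 0 on the diagonal, -1 one step back, -2 two steps back, ...
    distances = positions[None, :] - positions[:, None]
    distances = torch.tril(distances)  # keep only the causal (past) part; future positions are masked anyway
    slopes = alibi_slopes(n_heads)                         # shape (n_heads,)
    return slopes[:, None, None] * distances[None, :, :]   # shape (n_heads, seq_len, seq_len)

if __name__ == "__main__":
    bias = alibi_bias(n_heads=8, seq_len=4)
    print(bias[0])  # first head (slope 1/2): last row is [-1.5, -1.0, -0.5, 0.0]
```

In a transformer layer, this bias would be added to the q·kᵀ scores before the softmax (together with the usual causal mask); since the bias depends only on relative distance, no positional embeddings need to be added to the token embeddings.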

Result

[Figure: extrapolation results of ALiBi compared with other positional encodings]

Reference

Press, O., Smith, N. A., & Lewis, M. (2021). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. https://arxiv.org/pdf/2108.12409.pdf
