Understanding the Transformer's PE (positional embedding), i.e., positional encoding

Background:

Recently I have had to get back into theoretical study. After working on projects for more than half a year, I need to pick the theoretical principles back up, so I am revisiting the Transformer and trying to understand it thoroughly.
There are already plenty of classic introductions and materials on the Transformer, so I won't repeat them in detail. Where I run into points that I didn't quite understand before, I will simply note them down here as a way of re-understanding them.

The Transformer's input is either word vectors or an image split into patches, used in natural language processing and computer vision respectively.
In natural language processing, the original input is text in some language, but before it is fed to the machine for processing it must first be encoded; generally, methods such as word2vec are used to convert it into word vectors.

The word vectors need to carry their relative positional relationships: if they were all fed in without order information, processing would clearly be awkward, and different orderings of the same words change the meaning, so position information must be added to the word vectors.

Simply appending a raw position index does not lend itself to computation when the number of words gets large, so the Transformer uses a positional encoding scheme: the position information is fed into the network as part of each word vector's input, and this works well in practice. I didn't quite understand positional encoding at first, but after reading more material I got a general understanding.

In the Transformer, positional embedding is used to encode the relative or absolute position of each word in the sequence.
The position vectors are produced by the positional encoding function defined in section 2 below. After reading several interpretations, my current understanding is:

1. Representation:

Assume that each word of some language can be embedded into a vector of length 254, i.e., each word is represented by 254 numbers.

Assume that after word embedding and batching, the shape is (b, N, 254): N is the maximum sequence length, i.e., the number of words in the longest sentence; 254 is the length of each word's embedding vector; and b is the batch size. To make batching convenient (different training sentences will of course contain different numbers of words), you can simply count the number of words in every training sentence and take the maximum. Suppose that after counting, the longest sentence to be translated turns out to be 10 words. Then the encoder input is 10×254, the batch size is 1, and the input dimension is (1, 10, 254).

Because the position vector has to be added to the word vector, their lengths must match, so the position tensor has dimension (1, N, 254): N stands for the N positions corresponding to the N words, and each position is represented by a vector of length 254.
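As a rough sketch of these shapes (in PyTorch; the vocabulary size and token ids below are made up purely for illustration, while 254 and 10 are the example numbers from above):

```python
import torch
import torch.nn as nn

d_model = 254      # embedding length of each word (the "254" above)
max_len = 10       # N: number of words in the longest training sentence
vocab_size = 1000  # assumed vocabulary size, for illustration only

embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)

# One sentence of 7 words, padded with index 0 up to max_len = 10.
token_ids = torch.tensor([[5, 42, 7, 99, 3, 18, 64, 0, 0, 0]])  # shape (1, 10)

word_vectors = embedding(token_ids)   # shape (b, N, d) = (1, 10, 254)
print(word_vectors.shape)             # torch.Size([1, 10, 254])
```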

2. The Transformer represents positions with sine and cosine functions:

PE(pos, 2i)   = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )
Here pos is the position of a word in the sequence, corresponding to N above: it ranges from 0 to N-1 and indicates which of the N words the vector belongs to.
2i and 2i+1 index the components of the word vector, so the component index ranges from 0 to 253; d is the length of the word vector, i.e., 254.

Within the vector for each position, the components are assigned according to their index: even-indexed components use sin and odd-indexed components use cos, and they are interleaved in index order to form the full position vector.
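Below is a minimal sketch of this sin/cos construction in PyTorch. It is not taken from any particular implementation; it just assumes an even vector length d and reuses the N = 10, d = 254 example from above:

```python
import torch

def sinusoidal_positional_encoding(n_positions: int, d_model: int) -> torch.Tensor:
    """Return an (n_positions, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even, so sin/cos components pair up exactly.
    """
    position = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)  # (N, 1)
    # 10000^(2i/d) for the even component indices 0, 2, 4, ...
    div_term = torch.pow(
        10000.0, torch.arange(0, d_model, 2, dtype=torch.float32) / d_model
    )
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)  # even-indexed components: sin
    pe[:, 1::2] = torch.cos(position / div_term)  # odd-indexed components: cos
    return pe

pe = sinusoidal_positional_encoding(10, 254)  # shape (N, d) = (10, 254)
pos_vectors = pe.unsqueeze(0)                 # (1, N, d), ready to add to the word vectors
```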

[Figure: heat map of the sinusoidal positional encodings, 50 positions × 128 dimensions]
As shown in the figure above, the vertical axis is the position, i.e., the length of the sentence; here it is 50, meaning the sentence contains 50 words (the N above). The horizontal axis is the length of each word's vector; here it is 128 (playing the role of the 254 above). Each of the 50 horizontal bars represents one word's position vector; three bars are marked in the picture, corresponding to 3 words.
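A heat map like this can be reproduced with a short script such as the following sketch (50 positions and 128 dimensions match the figure; the colormap choice is arbitrary):

```python
import torch
import matplotlib.pyplot as plt

n_positions, d_model = 50, 128  # as in the figure above
position = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)
div_term = torch.pow(10000.0, torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
pe = torch.zeros(n_positions, d_model)
pe[:, 0::2] = torch.sin(position / div_term)
pe[:, 1::2] = torch.cos(position / div_term)

plt.figure(figsize=(10, 4))
plt.imshow(pe.numpy(), aspect="auto", cmap="viridis")  # each row is one position's vector
plt.xlabel("embedding dimension (0-127)")
plt.ylabel("position in sentence (0-49)")
plt.colorbar()
plt.show()
```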

Other visualizations of the positional encodings in the reference links below look similar.

Finally, here are several reference links that I found helpful while searching:
https://zhuanlan.zhihu.com/p/338817680
https://zhuanlan.zhihu.com/p/308301901
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
http://jalammar.github.io/illustrated-transformer/


Reprinted from: blog.csdn.net/qq_44442727/article/details/126505076