[Transformer Series] Understanding Positional Encoding in simple terms

1. Reference materials

This article teaches you a thorough understanding of Positional Encoding in Transformer
Transformer Architecture: The Positional Encoding
The Annotated Transformer
Master Positional Encoding: Part I
How to understand positional encoding in the Transformer paper, and what is its relationship with trigonometric functions?
Illustrated Transformer Series 1: Positional Encoding
Detailed explanation of Transformer’s positional encoding

2. Positional Encoding

1. Introduction

In any language, the position and order of words are crucial to the meaning of a sentence. A traditional RNN is inherently sequential: it processes the words of a sentence one by one, so word-order information is naturally preserved and needs no extra handling.

Since the Transformer model has neither an RNN (Recurrent Neural Network) nor a CNN (Convolutional Neural Network) structure, all the words of a sentence enter the network at the same time, so there is no explicit relative or absolute information about each word's position in the source sentence. To let the model understand the position (order) of each word in the sequence, the Transformer paper proposes a technique called Positional Encoding: an extra encoding is added to each word to represent its position in the sequence, so that the model can understand the relative positions of words.

2. The concept of Positional Encoding

In models such as the Transformer, the input sequence is usually a series of embedding vectors, which contain only the semantic information of the words or tokens and lack positional information. To solve this problem, Positional Encoding adds to each input vector a vector representing its position, thus retaining the semantic information of the word/token while also providing positional information.

In one sentence: Positional Encoding adds (embeds) positional information into the embedding word vector, allowing the Transformer to retain the positional information of the word vectors, which improves the model's ability to understand sequences.

3. Position encoding

An earlier idea was to encode position as a fraction of the sentence length: if the whole sentence is scaled to length 1, then in "Attention is all you need" the distance between "is" and "you" is 0.5. But in a longer text, a distance of 0.5 spans many more words, so the same value means different things in sentences of different lengths, which is clearly inappropriate, as the sketch below illustrates.
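A minimal sketch of the problem (assuming we simply divide each word index by the last index to normalize positions to [0, 1]):

# Naive scheme: normalize word positions to [0, 1] by dividing by the last index.
def normalized_positions(n_words):
    return [i / (n_words - 1) for i in range(n_words)]

short = normalized_positions(5)    # "Attention is all you need"
long = normalized_positions(21)    # a 21-word sentence

print(short[3] - short[1])   # 0.5 -> only 2 words apart ("is" to "you")
print(long[11] - long[1])    # 0.5 -> 10 words apart, same value, different meaning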

So, to summarize, the ideal positional encoding should satisfy:

  1. Output a unique encoding for each position;
  2. The distance between any two positions should remain consistent across sentences of different lengths;
  3. The encoded values should be bounded.

4. Characteristics of Positional Encoding

  1. Each position has a unique Positional Encoding;
  2. The relationship between two positions can be modeled (obtained) by a linear (affine) transformation between their encodings.

5. Analysis of Positional Encoding Principle

Commonly used positional encoding methods include Sinusoidal Positional Encoding and Learned Positional Encoding. Sinusoidal Positional Encoding computes the encoding by applying sine and cosine functions of different frequencies to the positions in the input sequence, while Learned Positional Encoding obtains it by training a set of learnable parameters. This section takes Sinusoidal Positional Encoding as an example to introduce the principle of Positional Encoding.

The Transformer paper uses trigonometric functions to implement Positional Encoding, because trigonometric functions are periodic and can represent the relative positions of words in a sequence well. The authors use sine and cosine functions of different frequencies as the positional encoding:
$$
\begin{cases}
PE(pos, 2i) = \sin\left(pos / 10000^{2i/d_{model}}\right) \\
PE(pos, 2i+1) = \cos\left(pos / 10000^{2i/d_{model}}\right)
\end{cases} \tag{1}
$$
Here $pos$ is the position of the token in the sequence; if the sentence length is $L$, then $pos = 0, 1, \ldots, L-1$. $PE$ is the position vector of the token, and $PE(pos, 2i)$ is the $2i$-th element of that vector. $d_{model}$ is the dimension of the token embedding (512 in the original paper); $2i$ indexes the even dimensions and $2i+1$ the odd dimensions, with $i = 0, 1, \ldots, d_{model}/2 - 1$.

It can be seen from the formula that a word's position encoding is composed of sine and cosine functions of different frequencies. Going from the low dimensions to the high dimensions, the frequency decreases from $1$ to $\frac{1}{10000}$, and the corresponding wavelength increases from $2\pi$ to $10000 \cdot 2\pi$.

For example, the position of the first token is $pos = 0$. $2i$ and $2i+1$ index the dimensions of the positional encoding, where $i$ ranges over $[0, \ldots, d_{model}/2)$ and $d_{model} = 512$. Therefore, when $pos = 1$, the corresponding positional encoding can be written as:
$$
PE(1) = \left[\sin\left(1/10000^{0/512}\right), \cos\left(1/10000^{0/512}\right), \sin\left(1/10000^{2/512}\right), \cos\left(1/10000^{2/512}\right), \ldots\right]
$$
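As a sketch (not the original authors' code), formula (1) can be implemented directly with NumPy; printing the row for pos = 1 reproduces the values written out above:

import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)                   # 2i = 0, 2, ..., d_model - 2
    angles = pos / np.power(10000.0, two_i / d_model)  # (seq_len, d_model / 2)
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=512)
print(pe[1, :4])   # sin(1/10000^0), cos(1/10000^0), sin(1/10000^(2/512)), cos(1/10000^(2/512))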
With the above formula we can obtain a $d_{model}$-dimensional position vector for any specific position. Then, using the trigonometric identities:
$$
\begin{cases}
\sin(\alpha + \beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta \\
\cos(\alpha + \beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta
\end{cases} \tag{2}
$$
we can derive:
$$
\begin{cases}
PE(pos+k, 2i) = PE(pos, 2i) \times PE(k, 2i+1) + PE(pos, 2i+1) \times PE(k, 2i) \\
PE(pos+k, 2i+1) = PE(pos, 2i+1) \times PE(k, 2i+1) - PE(pos, 2i) \times PE(k, 2i)
\end{cases} \tag{3}
$$
It can be seen that, for the position vector at $pos + k$, the elements in dimensions $2i$ and $2i+1$ can each be expressed as a linear combination of the elements in dimensions $2i$ and $2i+1$ of the position vectors at $pos$ and at $k$. This linear combination means that the position vector contains relative position information.
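A quick numerical check of equation (3), reusing the NumPy construction above (the values of pos, k, and i are arbitrary choices):

import numpy as np

d_model, seq_len = 512, 128
pe = np.zeros((seq_len, d_model))
position = np.arange(seq_len)[:, None]
two_i = np.arange(0, d_model, 2)
angles = position / np.power(10000.0, two_i / d_model)
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

pos, k, i = 17, 23, 40   # arbitrary position, offset, and dimension index
lhs_sin = pe[pos + k, 2 * i]
rhs_sin = pe[pos, 2 * i] * pe[k, 2 * i + 1] + pe[pos, 2 * i + 1] * pe[k, 2 * i]
lhs_cos = pe[pos + k, 2 * i + 1]
rhs_cos = pe[pos, 2 * i + 1] * pe[k, 2 * i + 1] - pe[pos, 2 * i] * pe[k, 2 * i]
print(np.allclose(lhs_sin, rhs_sin), np.allclose(lhs_cos, rhs_cos))   # True True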

BERT also uses the Transformer, but its position information is learned during training rather than given by sine and cosine functions. The sine/cosine design reflects the observation that the semantics of language depend mainly on relative position and very little on absolute position: whether a sentence appears at the beginning, in the middle, or at the end of a text, its meaning (special cases aside) should be roughly the same. Therefore, as long as the design is reasonable, other periodic functions could also be used.
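For comparison, a BERT-style learned positional embedding can be sketched roughly as follows (a minimal illustration, not BERT's actual code; the class and parameter names are placeholders):

import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """One trainable d_model-dimensional vector per position, learned with the rest of the model."""
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model) token embeddings
        positions = torch.arange(x.size(1), device=x.device)    # 0, 1, ..., seq_len - 1
        return x + self.pos_emb(positions).unsqueeze(0)          # broadcast over the batch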

6. An intuitive understanding of Positional Encoding

The simplest and most intuitive way to add positional information is to encode positions directly as 1, 2, 3, 4, .... Take the binary representation of these numbers as an example:
[Figure: table of positions and their 4-bit binary representations, from dimension 3 (highest bit) to dimension 0 (lowest bit)]

In the table above, the bits in dimension 0 through dimension 3 form the binary representation of each position. Each dimension (each column) is periodic, and the periods differ: the lower the bit, the faster it changes (the further to the right, the higher the change frequency). The lowest (red) bit flips between 0 and 1 at every position, while the highest (yellow) bit flips only once every 8 positions. This shows that encoding positions with several periodic functions of different periods is equivalent to encoding them with an incrementing sequence, which also answers why periodic functions can carry position information.
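The table can be reproduced with a few lines of Python, which makes the different flip periods of each bit visible:

# Print positions 0-15 alongside their 4-bit binary representation.
for pos in range(16):
    bits = format(pos, '04b')   # dimension 3 (leftmost) ... dimension 0 (rightmost)
    print(pos, bits)
# The rightmost bit flips every position, the next one every 2 positions,
# then every 4 positions, and the leftmost bit every 8 positions.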

In the same way, a combination of sine and cosine functions of different frequencies, changing quickly in the low dimensions and slowly in the high dimensions, can express position information. A 2D schematic of a 128-dimensional positional encoding is shown below:
[Figure: 2D heatmap of a 128-dimensional sinusoidal positional encoding across positions]
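The schematic can be reproduced with matplotlib (a sketch, reusing the sinusoidal construction from formula (1)):

import numpy as np
import matplotlib.pyplot as plt

d_model, seq_len = 128, 60
pe = np.zeros((seq_len, d_model))
position = np.arange(seq_len)[:, None]
two_i = np.arange(0, d_model, 2)
angles = position / np.power(10000.0, two_i / d_model)
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

plt.pcolormesh(pe, cmap='RdBu')   # rows: positions, columns: encoding dimensions
plt.xlabel('encoding dimension')
plt.ylabel('position')
plt.colorbar()
plt.show()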

7. Position vector and word vector

Generally speaking, the position vector and the word vector can be combined by concatenation or by addition; the Transformer paper uses addition:

input = input_embedding + positional_encoding

Here, input_embedding maps each token from vocab_size to a d_model-dimensional vector through a regular Embedding layer. Because the two are added, positional_encoding must also be a d_model-dimensional vector. (In the original paper, d_model = 512.)
[Figure: the input embedding is added element-wise to the positional encoding to form the Transformer input]
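In code, the addition looks roughly like this (a shape sketch only; vocab_size and the random PE table are placeholders):

import torch
import torch.nn as nn

vocab_size, d_model, batch_size, seq_len = 10000, 512, 8, 30
embedding = nn.Embedding(vocab_size, d_model)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))   # (batch, seq_len) token ids
input_embedding = embedding(tokens)                            # (batch, seq_len, d_model)
positional_encoding = torch.randn(seq_len, d_model)            # stand-in for the real PE table
inputs = input_embedding + positional_encoding                 # broadcasts to (batch, seq_len, d_model)
print(inputs.shape)                                            # torch.Size([8, 30, 512])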

8. Code implementation of Positional Encoding

8.1 Method 1

Refer to the code implementation in OpenNMT: onmt/modules/embeddings.py

import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):

    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
        # 1 / 10000^(2i / d_model), computed in log space for numerical stability
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        pe = pe.unsqueeze(0).transpose(0, 1)           # (max_len, 1, d_model)
        # register as a buffer: saved with the model, but not a trainable parameter
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (seq_len, batch_size, d_model); the middle dimension of pe broadcasts over the batch
        return x + self.pe[:x.size(0), :]
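A usage sketch for this module (note that it expects the sequence dimension first, i.e. input of shape (seq_len, batch_size, d_model)):

x = torch.zeros(30, 8, 512)            # (seq_len, batch_size, d_model) dummy embeddings
pos_enc = PositionalEncoding(d_model=512)
out = pos_enc(x)                       # pe[:30] is broadcast across the batch dimension
print(out.shape)                       # torch.Size([30, 8, 512])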

8.2 Method 2

Reference: Core technology of large language models - Detailed explanation of the Transformer

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """
    compute sinusoid encoding.
    """
    def __init__(self, d_model, max_len, device):
        """
        constructor of sinusoid encoding class

        :param d_model: dimension of model
        :param max_len: max sequence length
        :param device: hardware device setting
        """
        super(PositionalEncoding, self).__init__()

        # same size with input matrix (for adding with input matrix)
        self.encoding = torch.zeros(max_len, d_model, device=device)
        self.encoding.requires_grad = False  # we don't need to compute gradient

        pos = torch.arange(0, max_len, device=device)
        pos = pos.float().unsqueeze(dim=1)
        # 1D => 2D unsqueeze to represent word's position

        _2i = torch.arange(0, d_model, step=2, device=device).float()
        # '_2i' is the even dimension index 2i (e.g. d_model = 512 gives _2i = [0, 2, ..., 510])

        self.encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i / d_model)))
        self.encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / d_model)))
        # fill even dimensions with sin and odd dimensions with cos, per formula (1)

    def forward(self, x):
        # self.encoding
        # [max_len = 512, d_model = 512]

        batch_size, seq_len = x.size()
        # [batch_size = 128, seq_len = 30]

        return self.encoding[:seq_len, :]
        # [seq_len = 30, d_model = 512]
        # it will add with tok_emb : [128, 30, 512]         

class TokenEmbedding(nn.Embedding):
    """
    Token Embedding using torch.nn
    maps token ids to dense word representations via a learned weight matrix
    """

    def __init__(self, vocab_size, d_model):
        """
        class for token embedding (does not itself include positional information)
        :param vocab_size: size of vocabulary
        :param d_model: dimensions of model
        """
        super(TokenEmbedding, self).__init__(vocab_size, d_model, padding_idx=1)

class TransformerEmbedding(nn.Module):
    """
    token embedding + positional encoding (sinusoid)
    positional encoding can give positional information to network
    """

    def __init__(self, vocab_size, max_len, d_model, drop_prob, device):
        """
        class for word embedding that included positional information
        :param vocab_size: size of vocabulary
        :param d_model: dimensions of model
        """
        super(TransformerEmbedding, self).__init__()
        self.tok_emb = TokenEmbedding(vocab_size, d_model)
        self.pos_emb = PositionalEncoding(d_model, max_len, device)
        self.drop_out = nn.Dropout(p=drop_prob)

    def forward(self, x):
        tok_emb = self.tok_emb(x)
        pos_emb = self.pos_emb(x)
        return self.drop_out(tok_emb + pos_emb)
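
A usage sketch for this second implementation (the concrete sizes are arbitrary):

device = torch.device('cpu')
emb = TransformerEmbedding(vocab_size=10000, max_len=512, d_model=512,
                           drop_prob=0.1, device=device)
tokens = torch.randint(0, 10000, (128, 30))   # (batch_size, seq_len) token ids
out = emb(tokens)                             # token embedding + positional encoding, then dropout
print(out.shape)                              # torch.Size([128, 30, 512])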
