Transformer code flow study notes

This article is mainly my reading notes on the following post: https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec

There is also a very detailed code walkthrough on YouTube: https://www.youtube.com/watch?v=ISNdQcPhsts

This GitHub project is also very detailed: hyunwoongko/transformer – PyTorch Implementation of "Attention Is All You Need"

References:

https://medium.com/ching-i/transformer-attention-is-all-you-need-c7967f38af14

Transformer Diagram – Li Li's Blog

【Li Hongyi Machine Learning 2021】Transformer (Part 2), Bilibili

The overall structure:

Picture from the paper "Attention Is All You Need"

The Transformer is mainly divided into two parts: the Encoder and the Decoder. Each part is a stack of N identical blocks (N=6 in the figure), and each encoder block takes the output of the previous encoder block as its input.

The Encoder receives the source sequence, i.e. the information used to make predictions; the Decoder receives both the target sequence (shifted, as described later) and the Encoder's output. Finally, the loss is computed by comparing the Decoder's predictions with the ground-truth targets.

Encoder

Input Embedding

The first step of the Encoder is to convert the input tokens into vectors. The tensor obtained in this step has shape (batch_size, seq_len, d_model).

# vocab_size is the number of tokens in the vocabulary,
# d_model is the Transformer's embedding (output) dimension
import torch.nn as nn

class Embedder(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        # lookup table mapping each token id to a d_model-dimensional vector
        self.embed = nn.Embedding(vocab_size, d_model)
    def forward(self, x):
        return self.embed(x)

Note: the code above defines a forward method. forward is a special method of PyTorch modules: whenever you pass input to a model or custom module (for example, model(input)), forward is called. It is generally used to describe how the input is transformed into the output.
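A quick usage example (the vocab_size and d_model values are just illustrative):

import torch

embedder = Embedder(vocab_size=1000, d_model=512)
tokens = torch.randint(0, 1000, (2, 10))   # (batch_size=2, seq_len=10)
out = embedder(tokens)                     # this call is dispatched to Embedder.forward
print(out.shape)                           # torch.Size([2, 10, 512])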

Positional encoding 

After converting the input into a vector, the next step is to add positional encoding. The positional encoding method is as follows:

import math
import torch
import torch.nn as nn

class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_len = 80):
        super().__init__()
        self.d_model = d_model
        
        # create constant 'pe' matrix with values dependent on pos and i
        pe = torch.zeros(max_seq_len, d_model)
        for pos in range(max_seq_len):
            for i in range(0, d_model, 2):
                # i already runs over the even dimensions, so it plays the role
                # of 2i in the paper's formula; the sin/cos pair at dimensions
                # i and i+1 share the same frequency
                pe[pos, i] = math.sin(pos / (10000 ** (i / d_model)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** (i / d_model)))
                
        pe = pe.unsqueeze(0)   # shape (1, max_seq_len, d_model)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        # make embeddings relatively larger than the positional encoding
        x = x * math.sqrt(self.d_model)
        # add the constant positional encoding to the embedding
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len].to(x.device)
        return x

The next step is to add this positional information to the input embedding. The process is shown in the figure:

The embedding values are scaled up (by \sqrt{d_model}) before the addition so that the positional encoding stays relatively small. This means the original meaning in the embedding vectors is not lost when the two are added together.

The data dimensions obtained in this step are: (batch_size, seq_len, d_model)
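A minimal shape check, chaining the two modules above (sizes again just for illustration):

import torch

embedder = Embedder(vocab_size=1000, d_model=512)
pos_enc = PositionalEncoder(d_model=512, max_seq_len=80)

tokens = torch.randint(0, 1000, (2, 10))   # (batch_size, seq_len)
x = pos_enc(embedder(tokens))              # positional encoding keeps the shape unchanged
print(x.shape)                             # torch.Size([2, 10, 512])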

Multi-head Attention 

The following is the structure of multi-head attention: 

Multi-head attention is formed by running h identical attention mechanisms in parallel. First, the principle of a single attention head is introduced:

This Bilibili video about Q, K, and V is very intuitive: "The essence of the attention mechanism | Self-Attention | Transformer | QKV matrix" (Bilibili).

There are three important vectors Q, K, and V in the attention mechanism, respectively representing:

Q (Query): The query is usually the information you want to find, and represents the content we are currently paying attention to.

K (Key): The key is paired with the query and is used to calculate the attention score. This score determines the importance of each value.

V (Value): The actual content being attended to. The attention scores are used to weight the values, so the resulting output is a weighted combination of the values, with the weights determined by the similarity between the keys and the query.

For a given query q, take the inner product of q with each k to get the matching scores α1, α2, ... between q and the keys, divide by \sqrt{d_k} (the dimension of the key vectors) to keep the gradients stable, and then use softmax to turn the scores into weights.

This step is called: scaled dot-product attention. 

In Li Hongyi's course, an example is used to illustrate this.

Q, K, and V are all obtained by projecting the same input X: each is produced by multiplying X with a different weight matrix, projecting it to a lower dimension. Doing this h times produces h different results, so different heads can learn different features and increase diversity. In all of these calculations, only W^Q, W^K, and W^V are unknown; they are learnable parameters determined during training.

So the detailed formula is as follows:
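In the notation of "Attention Is All You Need", each head computes scaled dot-product attention, and multi-head attention concatenates the h heads:

Attention(Q, K, V) = softmax(Q K^T / \sqrt{d_k}) V

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)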

class MultiHeadAttention(nn.Module):
    def __init__(self, heads, d_model, dropout = 0.1):
        super().__init__()
        
        self.d_model = d_model
        self.d_k = d_model // heads
        self.h = heads
        
        self.q_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(d_model, d_model)
    
    def forward(self, q, k, v, mask=None):
        
        bs = q.size(0)
        
        # perform linear operation and split into h heads
        # (splitting into heads is usually done with view, as below)
        k = self.k_linear(k).view(bs, -1, self.h, self.d_k)
        q = self.q_linear(q).view(bs, -1, self.h, self.d_k)
        v = self.v_linear(v).view(bs, -1, self.h, self.d_k)
        
        # transpose to get dimensions bs * h * seq_len * d_k
        k = k.transpose(1,2)
        q = q.transpose(1,2)
        v = v.transpose(1,2)
        
        # calculate attention using the function defined below
        scores = attention(q, k, v, self.d_k, mask, self.dropout)
        
        # concatenate heads and put through final linear layer
        concat = scores.transpose(1,2).contiguous()\
        .view(bs, -1, self.d_model)
        
        output = self.out(concat)
    
        return output
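The attention function called above is not shown elsewhere in these notes, so here is a minimal sketch of scaled dot-product attention following the formula above (the masking value of -1e9 is just the usual choice for "effectively minus infinity"):

import math
import torch
import torch.nn.functional as F

def attention(q, k, v, d_k, mask=None, dropout=None):
    # (bs, h, seq_len, d_k) x (bs, h, d_k, seq_len) -> (bs, h, seq_len, seq_len)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    
    if mask is not None:
        # give masked positions a very large negative score so that
        # softmax assigns them (almost) zero weight
        mask = mask.unsqueeze(1)
        scores = scores.masked_fill(mask == 0, -1e9)
    
    scores = F.softmax(scores, dim=-1)
    
    if dropout is not None:
        scores = dropout(scores)
    
    # weighted sum of the values
    output = torch.matmul(scores, v)
    return output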

For the multi-head attention mechanism, the above procedure is repeated h times in parallel, using h different projections. The reason for doing it multiple times is to let different heads focus on different information. Finally, the results computed by all the heads are concatenated and passed through the final linear layer.

Add&Norm

After the multi-head attention mechanism comes the add&norm layer, which performs a residual connection (Residual) and layer normalization.

This picture makes it even clearer: the output of the self-attention layer has the original input x added to it (the residual connection) before it is passed on to the fully connected layer.

Normalization is very important in deep neural networks. It prevents the range of values in a layer from changing too much, which means the model trains faster and generalizes better.

The Transformer uses layer normalization rather than batch normalization. The main reason is that sequence lengths differ between samples; normalizing over each sample's feature dimension reduces the impact of these differing lengths.

class Norm(nn.Module):
    def __init__(self, d_model, eps = 1e-6):
        super().__init__()
    
        self.size = d_model
        # create two learnable parameters to calibrate normalisation
        self.alpha = nn.Parameter(torch.ones(self.size))
        self.bias = nn.Parameter(torch.zeros(self.size))
        self.eps = eps
    def forward(self, x):
        norm = self.alpha * (x - x.mean(dim=-1, keepdim=True)) \
        / (x.std(dim=-1, keepdim=True) + self.eps) + self.bias
        return norm
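For reference, this hand-written Norm behaves roughly like PyTorch's built-in layer normalization (up to small numerical details such as biased vs. unbiased variance):

import torch.nn as nn

d_model = 512                                   # assumed embedding size, for illustration
builtin_norm = nn.LayerNorm(d_model, eps=1e-6)  # learnable scale/bias, normalizes over the last dim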

Feed-Forward Network

This layer is actually an MLP.

It consists of two linear layers: the output of the previous layer first goes through a linear transformation, then a ReLU (with dropout), and then a second linear transformation.

import torch.nn.functional as F

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff=2048, dropout = 0.1):
        super().__init__() 
        # We set d_ff as a default to 2048
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)
    def forward(self, x):
        # expand to d_ff, apply ReLU and dropout, then project back to d_model
        x = self.dropout(F.relu(self.linear_1(x)))
        x = self.linear_2(x)
        return x
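Putting the encoder pieces together: below is a rough sketch of how one encoder block could combine MultiHeadAttention, Norm, and FeedForward with the residual connections described in the Add&Norm section. The exact placement of the normalization varies between implementations; this is only an illustration, not the definitive layout.

import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = Norm(d_model)
        self.norm_2 = Norm(d_model)
        self.attn = MultiHeadAttention(heads, d_model, dropout=dropout)
        self.ff = FeedForward(d_model, dropout=dropout)
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
    def forward(self, x, mask):
        # self-attention sublayer with residual connection
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn(x2, x2, x2, mask))
        # feed-forward sublayer with residual connection
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.ff(x2))
        return x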

Decoder

The Decoder is the decoding part, and most of it is the same as the Encoder. Unlike the encoder block, which only has multi-head attention and feed-forward sublayers, the decoder block has an additional sublayer: masked multi-head attention.

During training, the input of the Decoder is the ground-truth target sequence (teacher forcing).
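As a small sketch of what teacher forcing looks like in code (trg is assumed to be a batch of ground-truth target token ids of shape (batch_size, seq_len)):

# the decoder is fed the target sequence without its last token ...
trg_input = trg[:, :-1]
# ... and is trained to predict the same sequence shifted one step to the left
targets = trg[:, 1:].contiguous().view(-1)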

Create masks

Here we mainly discuss the masked multi-head attention part. Because the decoder generates its predictions one token at a time, the positions after the current one must be blocked when producing each output.

The main purposes of masks are: 

  • Padding mask (used in both the encoder and decoder): wherever the input is just padding, the output is zeroed out. Padding is a special token added to sequences to make them equal in length when batching multiple sequences together.
  • Sequence mask (used in the decoder): prevents the decoder from "peeking" ahead at the rest of the target sentence when predicting the next word.
import numpy as np
import torch

# train_iter, EN_TEXT and FR_TEXT come from the (legacy) torchtext data pipeline
batch = next(iter(train_iter))
input_seq = batch.English.transpose(0,1)
input_pad = EN_TEXT.vocab.stoi['<pad>']
# creates mask with 0s wherever there is padding in the input
input_msk = (input_seq != input_pad).unsqueeze(1)

# create the padding mask for the target in the same way
target_seq = batch.French.transpose(0,1)
target_pad = FR_TEXT.vocab.stoi['<pad>']
target_msk = (target_seq != target_pad).unsqueeze(1)

size = target_seq.size(1)  # get seq_len for the matrix
# 1s strictly above the diagonal mark the "future" positions to be blocked
nopeak_mask = np.triu(np.ones((1, size, size)), k=1).astype('uint8')
nopeak_mask = torch.from_numpy(nopeak_mask) == 0
target_msk = target_msk & nopeak_mask

How the mask works:

From an individual perspective: 

For a^2, in masked self-attention, q^2 only queries k^1 and k^2; the later k^3 and k^4 are blocked. This is what distinguishes it from ordinary self-attention.

From a matrix perspective:

A very large negative number is added to the scores of the subsequent keys before the softmax, so the values corresponding to those keys receive (almost) zero weight. In other words, a matrix whose upper triangle is 1 is used to make the information after the current word invisible.
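A tiny numeric illustration of this no-peek mask for a sequence of length 4 (True means the position may be attended to):

import torch

size = 4
nopeak = torch.triu(torch.ones(size, size), diagonal=1) == 0
print(nopeak)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])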

Cross attention

In the multi-head attention of the middle sublayer, the query q comes from the output of the previous decoder sublayer, while k and v come from the output of the last of the N encoder layers. When the decoder generates a new word, it uses its own output as the query (Q) and the encoder's output as the keys (K) and values (V) to decide which parts of the input sequence are most relevant to the current generation step.

For example, in the figure below, q is generated in the decoder and then queries the encoder's k and v to produce the output. In layman's terms: the decoder's q selects, among the encoder's keys, the values that are most relevant to itself.

This step is what lets the model take the source-side context into account when generating.
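To make the wiring concrete, here is a rough sketch of one decoder block, reusing the MultiHeadAttention, FeedForward, and Norm classes from above; e_outputs stands for the encoder's final output, and src_mask / trg_mask are the padding and no-peek masks built earlier. This is only an illustrative sketch, not a definitive implementation.

import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model, heads, dropout=0.1):
        super().__init__()
        self.norm_1 = Norm(d_model)
        self.norm_2 = Norm(d_model)
        self.norm_3 = Norm(d_model)
        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)
        self.dropout_3 = nn.Dropout(dropout)
        self.attn_1 = MultiHeadAttention(heads, d_model, dropout=dropout)  # masked self-attention
        self.attn_2 = MultiHeadAttention(heads, d_model, dropout=dropout)  # cross attention
        self.ff = FeedForward(d_model, dropout=dropout)
    def forward(self, x, e_outputs, src_mask, trg_mask):
        # masked self-attention over the decoder's own (shifted) inputs
        x2 = self.norm_1(x)
        x = x + self.dropout_1(self.attn_1(x2, x2, x2, trg_mask))
        # cross attention: q from the decoder, k and v from the encoder output
        x2 = self.norm_2(x)
        x = x + self.dropout_2(self.attn_2(x2, e_outputs, e_outputs, src_mask))
        # position-wise feed-forward network
        x2 = self.norm_3(x)
        x = x + self.dropout_3(self.ff(x2))
        return x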
