Detailed explanation of mask in attention mechanism

The attention mask allows us to send a batch of sequences with different lengths to the transformer at once. In code, this is done by padding all sequences to the same length and then using the "attention_mask" tensor to identify which tokens are padding. This article explains the principle and mechanism of this mask in detail.

Let's first look at how things work without a mask. Here GPT-2 performs inference on a single sequence at a time; because only one sequence is processed per call, generation is slow:

 from transformers import GPT2LMHeadModel, GPT2Tokenizer
 
 tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
 gpt2 = GPT2LMHeadModel.from_pretrained('gpt2')
 
 context = tokenizer('It will rain in the', return_tensors='pt')
 
 prediction = gpt2.generate(**context, max_length=10)
 tokenizer.decode(prediction[0])
 # prints 'It will rain in the morning, and the rain'

Batched input is faster, as long as GPU memory allows, because we process multiple sequences in a single inference pass. Running inference on many samples this way is much faster, but also slightly more complicated. Here is the batched-inference code using the transformers library:

 tokenizer.padding_side = "left"
 tokenizer.pad_token = tokenizer.eos_token
 
 sentences = ["It will rain in the",
             "I want to eat a big bowl of",
             "My dog is"]
 inputs = tokenizer(sentences, return_tensors="pt", padding=True)
 
 output_sequences = gpt2.generate(**inputs)
 
 for seq in output_sequences:
     print(tokenizer.decode(seq))

The transformers library handles a lot of details for us; let's now go through what it actually does.

We feed tokens into language models such as GPT-2 and BERT as tensors for inference. A tensor is like a Python list, but with some extra features and restrictions. In particular, for a tensor with two or more dimensions, all the rows in a given dimension must have the same length. For example:

 from torch import tensor
 
 tensor([[1,2], [3,4]])  # ok
 tensor([[1,2], [3]])   # error!

When we tokenize the input, it is converted into a tensor of integer sequences, where each integer corresponds to an entry in the model's vocabulary. Here is an example of tokenization in GPT-2:
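
(A quick sketch; the ids below are the same values that appear, left-padded, for the first sentence in the batch printed later in this article.)

 context = tokenizer('It will rain in the', return_tensors='pt')
 print(context['input_ids'])
 # tensor([[1026,  481, 6290,  287,  262]])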

Now suppose we want to include the second sentence in the same input:
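
(Again a sketch; these ids match the second row of the batch printed later. The variable name is just for illustration.)

 second = tokenizer('I want to eat a big bowl of', return_tensors='pt')
 print(second['input_ids'])
 # tensor([[  40,  765,  284, 4483,  257, 1263, 9396,  286]])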

Since the two sequences have different lengths (five tokens versus eight), they cannot be combined into a single tensor. This is where the shorter sequence needs to be padded with dummy tokens so that every sequence has the same length. Because we want the model to keep adding text to the right side of the sequence, we pad the left side of the shorter sequence.

This is where the attention mask comes in. The attention mask tells the model which tokens are padding, placing 0s at the positions of padding tokens and 1s at the positions of actual tokens. Now that we understand this, let's go through the code line by line.

 tokenizer.padding_side = "left"

This line tells the tokenizer to pad from the left (the default is the right), because the logits of the rightmost token will be used to predict the next token.
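
To see why this matters, compare what the first sentence would look like with right padding versus left padding (the ids are taken from the batch shown later in this article):

 # right padding: the final position holds a pad token, and generation
 # would continue from the logits at that pad position
 #   [1026, 481, 6290, 287, 262, 50256, 50256, 50256]
 #
 # left padding: the final position holds the last real token ("the"),
 # so generation continues from the actual end of the sentence
 #   [50256, 50256, 50256, 1026, 481, 6290, 287, 262]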

 tokenizer.pad_token = tokenizer.eos_token

This line specifies which token is used for padding. Since the padded positions are masked out anyway, the exact choice doesn't matter much; here we use the "end of sequence" token.

 sentences = ["It will rain in the",
             "I want to eat a big bowl of",
             "My dog is"]

The three sentences above have different lengths once tokenized, so we pad them as follows:

 inputs = tokenizer(sentences, return_tensors="pt", padding=True)

After tokenizing and adding padding, we get the following result:

 {'input_ids': tensor([
     [50256, 50256, 50256,  1026,   481,  6290,   287,   262],
     [   40,   765,   284,  4483,   257,  1263,  9396,   286],
     [50256, 50256, 50256, 50256, 50256,  3666,  3290,   318]
   ]),
 'attention_mask': tensor([
     [0, 0, 0, 1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1, 1, 1, 1],
     [0, 0, 0, 0, 0, 1, 1, 1]
   ])}

As you can see, the first and third sequences are padded at the beginning, and the attention_mask tensor marks the positions of that padding with 0s.

Now let's pass this input to the model to generate new text:

 output_sequences = gpt2.generate(**inputs)

If you're not familiar with the **kwargs syntax for a function call: it passes a dictionary as named arguments, using the keys as argument names and the values as the corresponding argument values.
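
Given the inputs dictionary printed above, that call is roughly equivalent to writing the arguments out by hand:

 output_sequences = gpt2.generate(input_ids=inputs['input_ids'],
                                  attention_mask=inputs['attention_mask'])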

We then loop over each generated sequence and print the result in human-readable form, using the decode() function to convert the token ids back into strings.

 for seq in output_sequences:
     print(tokenizer.decode(seq))

In the attention mask itself we pass 0s and 1s, but in the actual computation the attention score at each invalid (padded) position is set to a very small value, usually negative infinity (-inf), so that its attention weight is suppressed to nearly zero when the weights are computed.

This is because the attention weights are obtained by applying a softmax to the scores:
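
For reference, this is the standard scaled dot-product attention with an additive mask M, where M is 0 at valid positions and -inf at padded positions:

 Attention(Q, K, V) = softmax( Q K^T / sqrt(d_k) + M ) V
 softmax(x_i) = exp(x_i) / sum_j exp(x_j)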

The nature of the softmax function: the attention mechanism uses softmax to convert attention scores into attention weights. Softmax exponentiates each input value and then normalizes. When an input value is very small or negative infinity, its exponential is close to zero, so setting masked positions to negative infinity guarantees that their attention weights come out of the softmax close to zero.

Excluding the influence of invalid positions: by setting the scores of invalid positions to negative infinity, those positions are effectively weighted out. After the softmax their attention weights are close to zero, so the model ignores them. This ensures the model focuses on the valid tokens, which improves accuracy and generalization.

Negative infinity isn't the only option, though. A sufficiently large negative number achieves a similar effect; the specific choice depends on the task and the model.
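
To make this concrete, here is a minimal sketch (not GPT-2's actual internal code) of how a 0/1 attention_mask can be turned into masked scores before the softmax; the function name and the uniform scores are just for illustration:

 import torch
 import torch.nn.functional as F
 
 def masked_softmax(scores, attention_mask):
     # scores: (batch, num_queries, num_keys), e.g. Q @ K^T / sqrt(d_k)
     # attention_mask: (batch, num_keys), 1 for real tokens, 0 for padding
     # Use a very large negative number rather than -inf so that a fully
     # masked row does not produce NaNs after the softmax.
     neg = torch.finfo(scores.dtype).min
     scores = scores.masked_fill(attention_mask[:, None, :] == 0, neg)
     return F.softmax(scores, dim=-1)
 
 # mask of the first sentence from the batch above: 3 pad tokens, 5 real tokens
 mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 1]])
 scores = torch.zeros(1, 8, 8)          # uniform scores, just for illustration
 weights = masked_softmax(scores, mask)
 print(weights[0, -1])
 # the 3 padded positions get ~0 weight, the 5 real ones get ~0.2 each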

https://avoid.overfit.cn/post/0538d928a1c14940b3861437ea2fcffa

Author: Prudhviraju Srivatsavaya

