BERT model introduction and code analysis (PyTorch)

BERT (pretrained model)

Motivation

  • Fine-tuning-based NLP models
  • The pre-trained model has already extracted enough information
  • A new task only needs to add a simple output layer

Note: BERT is essentially a Transformer that keeps only the encoder

Modifications to the Transformer

  • Each training example is a sentence pair
  • Additional segment embeddings are added
  • Positional encodings are learnable

[Figure: BERT input representation as the sum of token embeddings, segment embeddings, and position embeddings]

"<cls>" is a special classification token and "<sep>" is used to separate sentences. For a sentence pair, the tokens of the first sentence get segment id 0 and the tokens of the second sentence get segment id 1.

BERT chooses the Transformer encoder as its bidirectional architecture. As is common in Transformer encoders, positional embeddings are added at each position of the input sequence. However, unlike the original Transformer encoder, BERT uses learnable positional embeddings. The figure above shows that the embedding of the BERT input sequence is the sum of the token embeddings, segment embeddings, and position embeddings.
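
As a minimal, self-contained sketch (with made-up toy sizes), the three embeddings are simply summed element-wise; the real implementation appears in BERTEncoder below.

import torch
from torch import nn

# Toy sizes for illustration only
vocab_size, num_hiddens, max_len = 100, 16, 32
token_embedding = nn.Embedding(vocab_size, num_hiddens)
segment_embedding = nn.Embedding(2, num_hiddens)
# Learnable positional embeddings, randomly initialized
pos_embedding = nn.Parameter(torch.randn(1, max_len, num_hiddens))

tokens = torch.randint(0, vocab_size, (2, 8))  # (batch size, sequence length)
segments = torch.zeros_like(tokens)            # here all tokens belong to segment A
X = token_embedding(tokens) + segment_embedding(segments)
X = X + pos_embedding[:, :tokens.shape[1], :]
print(X.shape)  # torch.Size([2, 8, 16])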

Pre-training task 1: Masked language model

1) The Transformer encoder is bidirectional, while a standard language model requires unidirectional (left-to-right) context, so BERT pre-trains with a masked language model instead

2) In this pre-training task, 15% of the tokens are randomly selected as masked tokens for prediction. To predict a masked token without cheating by using its label, a simple approach is to always replace it in the input sequence with a special "<mask>" token.

3) However, the artificial special token "<mask>" never appears during fine-tuning. To avoid this mismatch between pre-training and fine-tuning, if a token is masked for prediction (e.g., "great" is chosen to be masked and predicted in "this movie is great"), it is replaced in the input with (see the sketch after this list):

  • the special "<mask>" token 80% of the time (e.g., "this movie is great" becomes "this movie is <mask>");
  • a random token 10% of the time (e.g., "this movie is great" becomes "this movie is drink");
  • the unchanged label token 10% of the time (e.g., "this movie is great" stays "this movie is great")
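
The 80/10/10 rule can be sketched as follows. This is only an illustrative snippet: replace_mlm_token and vocab_tokens are hypothetical names, not part of the code analyzed in this post.

import random

def replace_mlm_token(token, vocab_tokens):
    """Sketch of the replacement rule for a token selected for MLM prediction."""
    r = random.random()
    if r < 0.8:
        return '<mask>'                     # 80%: replace with the special token
    elif r < 0.9:
        return random.choice(vocab_tokens)  # 10%: replace with a random token
    else:
        return token                        # 10%: keep the original token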

Pre-training task 2: Next sentence prediction

  • Predict whether two sentences in a sentence pair are adjacent

  • When constructing training examples (see the sampling sketch after this list):

    • 50% of the time an adjacent sentence pair is selected (a positive example)
    • 50% of the time a random sentence pair is selected (a negative example)
  • The output corresponding to "<cls>" is fed into a fully connected layer to predict whether the two sentences are adjacent
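
The sampling step can be sketched as follows; get_next_sentence and all_sentences are hypothetical names used only for illustration.

import random

def get_next_sentence(sentence, next_sentence, all_sentences):
    """Sketch of NSP example construction: keep the true next sentence with
    probability 0.5 (positive example), otherwise draw a random one (negative)."""
    if random.random() < 0.5:
        is_next = True
    else:
        next_sentence = random.choice(all_sentences)
        is_next = False
    return sentence, next_sentence, is_next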

Model implementation

BERT input sequences unambiguously represent both single texts and text pairs. For a single text, the BERT input sequence is the concatenation of the special classification token "<cls>", the tokens of the text sequence, and the special separator token "<sep>". For a text pair, the BERT input sequence is the concatenation of "<cls>", the tokens of the first text sequence, "<sep>", the tokens of the second text sequence, and "<sep>".

The following get_tokens_and_segments takes either one or two sentences as input and returns the tokens of the BERT input sequence together with their corresponding segment indices.

import torch
from torch import nn
from d2l import torch as d2l

def get_tokens_and_segments(tokens_a, tokens_b=None):
    """获取输入序列的词元及其片段索引"""
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # 0 and 1 mark segment A and segment B, respectively
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments
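
A quick usage check (the example sentences are made up for illustration):

tokens, segments = get_tokens_and_segments(
    ['this', 'movie', 'is', 'great'], ['i', 'like', 'it'])
print(tokens)
# ['<cls>', 'this', 'movie', 'is', 'great', '<sep>', 'i', 'like', 'it', '<sep>']
print(segments)
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]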

BERTEncoder uses segment embeddings and learnable positional embeddings:

class BERTEncoder(nn.Module):
    """BERT编码器"""
    def __init__(self, vocab_size, num_hiddens, norm_shape, ffn_num_input,
                 ffn_num_hiddens, num_heads, num_layers, dropout,
                 max_len=1000, key_size=768, query_size=768, value_size=768,
                 **kwargs):
        super(BERTEncoder, self).__init__(**kwargs)
        self.token_embedding = nn.Embedding(vocab_size, num_hiddens)
        self.segment_embedding = nn.Embedding(2, num_hiddens)
        self.blks = nn.Sequential()
        # Reuse the EncoderBlock from the Transformer implementation
        for i in range(num_layers):
            self.blks.add_module(f"{i}", d2l.EncoderBlock(
                key_size, query_size, value_size, num_hiddens, norm_shape,
                ffn_num_input, ffn_num_hiddens, num_heads, dropout, True))
        # In BERT, positional embeddings are learnable, so we create a
        # sufficiently long positional embedding parameter, randomly initialized
        self.pos_embedding = nn.Parameter(torch.randn(1, max_len,
                                                      num_hiddens))

    def forward(self, tokens, segments, valid_lens):
        # In the following code, the shape of X remains
        # (batch size, max sequence length, num_hiddens)
        X = self.token_embedding(tokens) + self.segment_embedding(segments)
        X = X + self.pos_embedding.data[:, :X.shape[1], :]
        for blk in self.blks:
            X = blk(X, valid_lens)
        return X

Assuming a vocabulary size of 10000, let's create a BERTEncoder instance and initialize its parameters to demonstrate forward inference.

vocab_size, num_hiddens, ffn_num_hiddens, num_heads = 10000, 768, 1024, 4
norm_shape, ffn_num_input, num_layers, dropout = [768], 768, 2, 0.2
encoder = BERTEncoder(vocab_size, num_hiddens, norm_shape, ffn_num_input,
                      ffn_num_hiddens, num_heads, num_layers, dropout)

Define tokens as 2 input sequences of length 8, where each token is an index into the vocabulary. Forward inference of BERTEncoder on this input returns an encoded result in which each token is represented by a vector whose length is given by the hyperparameter num_hiddens.

tokens = torch.randint(0, vocab_size, (2, 8))
segments = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]])
encoded_X = encoder(tokens, segments, None)
encoded_X.shape
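
The encoded result has shape torch.Size([2, 8, 768]): batch size 2, sequence length 8, and num_hiddens 768.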

Task 1. Masked language model:

We implement the following MaskLM class to predict masked tokens in BERT's masked language model pretraining task. Prediction uses a multilayer perceptron with a single hidden layer (self.mlp). In forward inference, it takes two inputs: the encoded result of BERTEncoder and the token positions to predict. The output is the prediction results at these positions.

class MaskLM(nn.Module):
    """BERT的掩蔽语言模型任务"""
    def __init__(self, vocab_size, num_hiddens, num_inputs=768, **kwargs):
        super(MaskLM, self).__init__(**kwargs)
        self.mlp = nn.Sequential(nn.Linear(num_inputs, num_hiddens),
                                 nn.ReLU(),
                                 nn.LayerNorm(num_hiddens),
                                 nn.Linear(num_hiddens, vocab_size))

    def forward(self, X, pred_positions):
        num_pred_positions = pred_positions.shape[1]
        pred_positions = pred_positions.reshape(-1)
        batch_size = X.shape[0]
        batch_idx = torch.arange(0, batch_size)
        # Suppose batch_size = 2 and num_pred_positions = 3;
        # then batch_idx is tensor([0, 0, 0, 1, 1, 1])
        batch_idx = torch.repeat_interleave(batch_idx, num_pred_positions)
        masked_X = X[batch_idx, pred_positions]
        masked_X = masked_X.reshape((batch_size, num_pred_positions, -1))
        mlm_Y_hat = self.mlp(masked_X)
        return mlm_Y_hat
    
mlm = MaskLM(vocab_size, num_hiddens)
mlm_positions = torch.tensor([[1, 5, 2], [6, 1, 5]])
mlm_Y_hat = mlm(encoded_X, mlm_positions)
print(mlm_Y_hat.shape) #torch.Size([2, 3, 10000])

mlm_Y = torch.tensor([[7, 8, 9], [10, 20, 30]])
loss = nn.CrossEntropyLoss(reduction='none')
mlm_l = loss(mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y.reshape(-1))
print(mlm_l.shape) #torch.Size([6])
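
The six loss values correspond to the 2 × 3 = 6 masked positions: three predictions for each of the two sequences in the batch.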

Task 2. Next sentence prediction:

The NextSentencePred class below uses a multilayer perceptron with a single hidden layer to predict whether the second sentence is the next sentence of the first one in the BERT input sequence. Thanks to self-attention in the Transformer encoder, the BERT representation of the special token "<cls>" already encodes both input sentences. Therefore, the output layer (self.output) of the MLP classifier takes X as input, where X is the output of the MLP's hidden layer, and the input of that hidden layer is the encoded "<cls>" token.

class NextSentencePred(nn.Module):
    """BERT的下一句预测任务"""
    def __init__(self, num_inputs, **kwargs):
        super(NextSentencePred, self).__init__(**kwargs)
        self.output = nn.Linear(num_inputs, 2)

    def forward(self, X):
        # Shape of X: (batch_size, num_hiddens)
        return self.output(X)
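
A minimal usage sketch, reusing the encoded_X and loss defined above: feed the representation at position 0 (the "<cls>" token) into the classifier. The labels nsp_y here are made up for illustration.

nsp = NextSentencePred(num_hiddens)
# Use the <cls> representation (position 0) as the classifier input
nsp_Y_hat = nsp(encoded_X[:, 0, :])
print(nsp_Y_hat.shape)  # torch.Size([2, 2])

nsp_y = torch.tensor([0, 1])  # hypothetical labels: 1 means "is the next sentence"
nsp_l = loss(nsp_Y_hat, nsp_y)
print(nsp_l.shape)  # torch.Size([2])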

Combining the above components into the complete BERT model:

#@save
class BERTModel(nn.Module):
    """BERT模型"""
    def __init__(self, vocab_size, num_hiddens, norm_shape, ffn_num_input,
                 ffn_num_hiddens, num_heads, num_layers, dropout,
                 max_len=1000, key_size=768, query_size=768, value_size=768,
                 hid_in_features=768, mlm_in_features=768,
                 nsp_in_features=768):
        super(BERTModel, self).__init__()
        self.encoder = BERTEncoder(vocab_size, num_hiddens, norm_shape,
                    ffn_num_input, ffn_num_hiddens, num_heads, num_layers,
                    dropout, max_len=max_len, key_size=key_size,
                    query_size=query_size, value_size=value_size)
        self.hidden = nn.Sequential(nn.Linear(hid_in_features, num_hiddens),
                                    nn.Tanh())
        self.mlm = MaskLM(vocab_size, num_hiddens, mlm_in_features)
        self.nsp = NextSentencePred(nsp_in_features)

    def forward(self, tokens, segments, valid_lens=None,
                pred_positions=None):
        encoded_X = self.encoder(tokens, segments, valid_lens)
        if pred_positions is not None:
            mlm_Y_hat = self.mlm(encoded_X, pred_positions)
        else:
            mlm_Y_hat = None
        # The hidden layer of the MLP classifier for next sentence prediction;
        # 0 is the index of the '<cls>' token
        nsp_Y_hat = self.nsp(self.hidden(encoded_X[:, 0, :]))
        return encoded_X, mlm_Y_hat, nsp_Y_hat
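
As a quick sanity check, here is a sketch of forward inference that reuses the hyperparameters and tensors defined earlier in this post:

net = BERTModel(vocab_size, num_hiddens, norm_shape, ffn_num_input,
                ffn_num_hiddens, num_heads, num_layers, dropout)
encoded_X, mlm_Y_hat, nsp_Y_hat = net(tokens, segments,
                                      pred_positions=mlm_positions)
print(encoded_X.shape)  # torch.Size([2, 8, 768])
print(mlm_Y_hat.shape)  # torch.Size([2, 3, 10000])
print(nsp_Y_hat.shape)  # torch.Size([2, 2])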

Origin: blog.csdn.net/gary101818/article/details/124137169