[Self-attention essentials] BERT-class pre-trained language models (with Python examples)


BERT-like pre-trained language model

1. Introduction to BERT

1.1 Introduction and Features of BERT

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model that marks a major milestone in Natural Language Processing (NLP) and is considered one of the current state-of-the-art approaches. Its design and structure are based on the Transformer model; it is trained through unsupervised learning and can be adapted to a wide range of NLP tasks.

A pre-trained model is one that undergoes large-scale unsupervised training on text data in order to learn rich language representations. BERT's pre-training consists of two tasks: the masked language model (Masked Language Model, MLM) and next sentence prediction (Next Sentence Prediction, NSP). In MLM, part of the input text is randomly masked and the model must predict the masked words. In NSP, the model must judge whether two sentences are consecutive in the original text.

BERT's unsupervised training enables it to learn rich sentence-level and word-level representations, since it must understand context and perform linguistic reasoning. This capability makes BERT well suited to a variety of NLP tasks. For downstream tasks, BERT can be used as a feature extractor or fine-tuned by adding a task-specific output layer on top of it.

The structure of BERT is based on the Transformer, a deep neural network architecture built on the self-attention mechanism. It captures the contextual information in a sentence, allowing the model to consider the preceding and following words at the same time. This bidirectional modeling ability is one of BERT's key features and a significant improvement over earlier models such as those based on recurrent neural networks.

The emergence of BERT brought major changes to the field of NLP. It achieved breakthrough results on multiple tasks, such as question answering, text classification, and named entity recognition. Its success also inspired the design and improvement of many subsequent models, such as RoBERTa, and fueled the broader wave of pre-trained models alongside the GPT series. These models have made tremendous progress in NLP, enabling researchers and practitioners to process natural language data more effectively and achieve higher performance on many tasks.

1.2 Traditional methods and the pre-training method

For traditional methods, the workflow is:

  • Design the model structure
  • Collect/annotate training data
  • Train the model with the labeled data
  • Predict in real scenarios

This four-step supervised process is also what is called Fine-tune, i.e. the fine-tuning stage. Compared with it, the pre-training method has an additional Pre-train stage beforehand:

  • Collect massive amounts of unlabeled text data
  • Pre-train a model on this data and reuse it in the task model

These two steps are called Pre-train. They are followed by the same steps as in the traditional method:

  • Design the model structure
  • Collect/annotate training data
  • Train the model with the labeled data
  • Predict in real scenarios

So the pre-training method is the process of Pre-train + Fine-tune.

The pre-training method of BERT includes two tasks: the masked language model (Masked Language Model, MLM) and next sentence prediction (Next Sentence Prediction, NSP).

In the masked language model (MLM) task, BERT randomly masks the input text. Specifically, about 15% of the tokens in the input are selected; of these, 80% are replaced with a special token (usually "[MASK]"), 10% are replaced with random words, and the remaining 10% are left unchanged. The goal of the model is to predict the original content of these selected words from the other words in the context.
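
As a rough illustration of this masking scheme, here is a minimal sketch (character-level tokens and a toy vocabulary are assumptions for readability; real BERT preprocessing works on WordPiece token ids):

import random

def mask_tokens(tokens, mask_token="[MASK]", vocab=None, mask_prob=0.15):
    # select ~15% of tokens; of those, 80% -> [MASK], 10% -> a random word, 10% unchanged
    vocab = vocab or list("深度学习模型语言")   # toy vocabulary, only for this sketch
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)              # the model must recover this token
            r = random.random()
            if r < 0.8:
                masked.append(mask_token)
            elif r < 0.9:
                masked.append(random.choice(vocab))
            else:
                masked.append(tok)
        else:
            labels.append(None)             # not selected: no prediction target
            masked.append(tok)
    return masked, labels

print(mask_tokens(list("深度学习是人工智能的一个分支")))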

MLM can be intuitively understood as a cloze (fill-in-the-blank) task. Unlike an n-gram language model, it is bidirectional: it uses information from both directions, before and after the masked word.
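
The cloze behavior can be observed directly with the fill-mask pipeline from the transformers library (a quick sketch; it assumes the bert-base-chinese checkpoint is available locally or from the Hugging Face hub):

from transformers import pipeline

# BERT fills the blank using context from both sides of [MASK]
fill = pipeline("fill-mask", model="bert-base-chinese")
for pred in fill("深度[MASK]习是人工智能的一个分支"):
    print(pred["token_str"], round(pred["score"], 4))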

The Next Sentence Prediction (NSP) task trains the model to capture sentence-level semantics. The model receives a pair of sentences as input and must determine whether the two sentences are consecutive in the original text. To generate training data, BERT builds sentence pairs from large-scale text: half of the pairs are genuinely consecutive and half are randomly combined, and special markers are added to the input to indicate the boundary between the two sentences.

NSP can be viewed as a sentence-level language model.
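
A minimal sketch of NSP scoring with the transformers library (the sentence pair is an illustrative assumption; BertForNextSentencePrediction is the standard NSP head shipped with the library):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForNextSentencePrediction.from_pretrained("bert-base-chinese")
model.eval()

# the tokenizer inserts [CLS] and [SEP] to mark the boundary between the two sentences
inputs = tokenizer("我今天去了图书馆", "在那里借了两本书", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# index 0 = "is the next sentence", index 1 = "is not the next sentence"
print(torch.softmax(logits, dim=-1))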

Through pre-training on these two tasks, BERT learns rich language representations. The MLM task forces the model to understand context and predict the masked words, so it learns to model semantics within a sentence. The NSP task lets the model learn the relationship and semantic coherence between sentences.

This pre-training approach lets BERT learn general-purpose language representations from large-scale unlabeled data, which can then be applied to various downstream NLP tasks, either by fine-tuning or as a feature extractor, to obtain task-specific performance gains.
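
As a sketch of the fine-tuning setup for text classification (checkpoint name, example sentence and label count are illustrative assumptions):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

inputs = tokenizer("这部电影真好看", return_tensors="pt")
labels = torch.tensor([1])                       # toy label for a single example
outputs = model(**inputs, labels=labels)
outputs.loss.backward()                          # gradients flow into BERT itself, i.e. fine-tuning
print(outputs.logits.shape)                      # torch.Size([1, 2])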

1.3 Properties of BERT

First of all, the essence of BERT is a text representation (contextual representation): it maps a text to a matrix (max_length × hidden_size) or to a vector (1 × hidden_size). word2vec can do the same thing, but word2vec is static while BERT is dynamic: BERT's output depends on the whole input passed through the Transformer, producing one contextual vector per input token.
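
A small sketch of this "dynamic" property: the same character receives different vectors in different contexts, whereas a static word2vec lookup would always return the same vector (bert-base-chinese and the two example sentences are assumptions for illustration):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def vector_of(sentence, char):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        seq = model(**inputs).last_hidden_state[0]          # [seq_len, 768]
    pos = tokenizer.tokenize(sentence).index(char) + 1      # +1 skips [CLS]
    return seq[pos]

v1 = vector_of("我买了一个苹果手机", "果")
v2 = vector_of("我吃了一个苹果", "果")
print(torch.cosine_similarity(v1, v2, dim=0))   # below 1: the same character gets different vectors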

BERT performs better than RNN for two reasons:

  • Adopted the idea of pre-training
  • Adopted a more efficient network structure

2. BERT structure

The core of the Transformer is the attention mechanism, introduced in the paper "Attention Is All You Need". The main structure of BERT uses the Transformer architecture proposed by Google in 2017.
(figure: BERT's Encoder structure)
The Encoder part of BERT is composed of multiple Transformer encoders with the same structure, and its structure can be analyzed layer by layer. Suppose BERT contains L Transformer encoder layers.

2.1 Input layer and position encoding

Input layer: The input is a sequence of word embeddings (Word Embedding), where each word is represented as a vector. The input sequence also includes special tokens such as the start of a sentence ([CLS]) and separators ([SEP]).
Segment Embedding: this part is identical for every token within the same piece of text; its parameter count is number_of_segments × 768. Token Embedding maps each token (including [CLS] and [SEP]) to a vector; its parameter count is vocab_size × 768.

Positional Encoding: In the word embedding of the input sequence, the Transformer encoder needs a way to encode the positional information of the word. To do this, BERT uses positional encoding, which embeds positional information into word embedding vectors, enabling the model to take into account the order of words in the sequence.
Position Embedding maps the position of each token to a vector; its parameter count is 512 × 768, i.e. each vector has 768 dimensions and the maximum sequence length is limited to 512. As specified in the paper, the three embeddings above are summed and then passed through Layer Normalization.
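
These parameter shapes can be read directly off a loaded checkpoint; a short sketch using the same weight names as the code in section 4 (loading bert-base-chinese from the Hugging Face hub is an assumption; any BERT-base checkpoint would do):

from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-chinese")
sd = bert.state_dict()
print(sd["embeddings.word_embeddings.weight"].shape)        # vocab_size x 768  (Token Embedding)
print(sd["embeddings.token_type_embeddings.weight"].shape)  # 2 x 768           (Segment Embedding)
print(sd["embeddings.position_embeddings.weight"].shape)    # 512 x 768         (Position Embedding)
# the three embeddings are summed token by token and the sum goes through a LayerNorm
print(sd["embeddings.LayerNorm.weight"].shape)              # 768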

2.2 Transformer encoder layer

Transformer encoder layer: the Encoder part of BERT is a stack of L identical Transformer encoder layers; in BERT-base, L = 12.

Self-Attention Layer: Each Transformer encoder layer contains multiple self-attention heads, which can be computed in parallel. The input to the self-attention layer consists of three vector sequences: query, key, and value. The self-attention mechanism interacts queries with keys by computing attention weights, then applying the attention weights to the value vectors, and finally summing the weighted value vectors. This interaction captures the relationship of each position to other positions in the input sequence and produces an updated representation for each position.
Query (Q): The query vector is used to determine how relevant the current location is to other locations. In the self-attention mechanism, the attention weight of each location to other locations is obtained by performing a dot product operation on the query vector and the keys of other locations. The query vector can be understood as the information concerned by the current location.

Key (K): The key vector is used to provide information about other locations so that the query vector can calculate the degree of association with them. Attention weights are obtained by dot producting the query vector with the key vector. Key vectors can be understood as representations of other locations.

Value (V): The value vector contains information associated with each position. Attention weights are applied to the value vector to get a weighted sum for each position as the final output. The value vector can be understood as the information carried by the current location.
The specific steps for using the Q, K, V parameters in the Transformer are:

  • Linear transformations: First, the input sequence undergoes three independent linear transformations to compute queries (Q), keys (K), and values (V). For each position in the input sequence, the input is linearly transformed using a different weight matrix to obtain the corresponding query vector Q, key vector K, and value vector V.

  • Compute attention weights: Use the query vector Q and the key vector K to perform a dot product, then scale the result (dividing by √d_k, the square root of the per-head dimension) and apply a softmax to obtain the attention weights. Scaling keeps the distribution of attention weights more stable. The attention weights measure how strongly each position is related to every other position and determine the importance of each position in the attention mechanism.

  • Weighted summation of attention weights and values: Multiply the attention weights by the value vector V, and perform weighted summation of these products to obtain the output of the attention mechanism. The attention weights determine how much each position contributes to the final representation, while the value vector provides the actual feature information.

  • Multi-head attention mechanism: In Transformer, a multi-head attention mechanism is usually used to improve the representation ability of the model by computing multiple sets of qkv parameters in parallel. In the multi-head attention mechanism, each set of qkv parameters will undergo independent linear transformation and attention calculation, and multiple sets of attention outputs will be obtained. Finally, multiple sets of attention outputs are concatenated, and the final representation is obtained through linear transformation.
Because the attention score matrix is a square matrix (sequence length × sequence length), the problem of relating distant characters is solved directly: no matter how far apart two words are, the relationship between each word and every other word is computed. This is why the mechanism is called self-attention.
Self-attention multi-head mechanism: in the traditional attention mechanism, attention weights are computed to give the degree of relevance of one position to the other positions in the sequence. The multi-head attention mechanism provides richer expressive power by using several attention heads at the same time, and each head can learn different attention weights.

Specifically, the multi-head attention mechanism applies multiple linear transformations to the input sequence to obtain multiple sets of query, key and value vectors. Then, within each attention head, the output of that head is obtained by computing attention weights and applying them to the value vectors. Finally, the outputs of all heads are concatenated, and the final attention representation is obtained through a linear transformation.

Through the multi-head attention mechanism, the model can simultaneously learn different relationships between positions. Different attention heads can focus on different semantic information; for example, one head can attend to local grammatical structure while another attends to global contextual semantics. In this way, the model considers the information in the input sequence from different perspectives and provides a more comprehensive representation.

In BERT, the multi-head attention mechanism is used in the self-attention layer of every Transformer encoder layer. By using multiple attention heads, BERT can capture the semantic information of different relations at the same time, providing a richer contextual representation. The attention weights learned by each head can be regarded as "experts" that focus on different aspects and work together to provide a global contextual understanding.
The size of each head is computed as:

attention_head_size = hidden_size / num_attention_heads

In BERT-base, hidden_size is 768 and num_attention_heads is fixed at 12, so attention_head_size is 64 (values from Google's experiments). The 768-dimensional vectors are cut into 12 slices of 64 dimensions (a reshape/transpose of the matrix), which is equivalent to running 12 attention computations in parallel. The attention scores of every head are then divided by √64; this scaling (reflected in the code) keeps the values in a reasonable range before the softmax and prevents large gaps in the vector from pushing the weights to saturate at 0 or 1. After all of these transformations, the dimension of the final output is unchanged.
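
The head splitting and the √64 scaling can be written out in a few lines of NumPy. This is a simplified sketch with random weights, no bias terms and no attention mask; the full version with pre-trained weights appears in the code of section 4:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

max_len, hidden_size, num_heads = 4, 768, 12
head_size = hidden_size // num_heads                        # 64

x = np.random.randn(max_len, hidden_size)
w_q, w_k, w_v = (np.random.randn(hidden_size, hidden_size) * 0.02 for _ in range(3))

def split_heads(m):
    # reshape 768 -> 12 heads x 64 dims and move the head axis to the front
    return m.reshape(max_len, num_heads, head_size).swapaxes(0, 1)

q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
scores = q @ k.swapaxes(1, 2) / np.sqrt(head_size)          # [12, max_len, max_len] square matrices
context = softmax(scores) @ v                               # [12, max_len, 64]
output = context.swapaxes(0, 1).reshape(max_len, hidden_size)
print(output.shape)                                         # (4, 768): dimension unchanged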

2.3 Feedforward neural network layer

Feed-Forward Neural Network Layer: After the self-attention layer, each Transformer encoder layer contains a feed-forward neural network layer. It consists of two fully connected layers with a non-linear activation function (GELU in BERT) in between. This feed-forward layer applies a non-linear transformation to the representation of each position output by the self-attention layer.
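
A sketch of this sublayer in NumPy; the 768 → 3072 → 768 shapes are those of BERT-base, the weights here are random placeholders, and gelu matches the definition used in the code of section 4:

import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * np.power(x, 3))))

def feed_forward(x, w1, b1, w2, b2):
    # first layer expands 768 -> 3072, GELU, second layer projects back 3072 -> 768
    return gelu(x @ w1 + b1) @ w2 + b2

x = np.random.randn(4, 768)
w1, b1 = np.random.randn(768, 3072) * 0.02, np.zeros(3072)
w2, b2 = np.random.randn(3072, 768) * 0.02, np.zeros(768)
print(feed_forward(x, w1, b1, w2, b2).shape)   # (4, 768)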

2.4 Residual connection layer

Residual Connection and Layer Normalization: After each sublayer, BERT employs residual connections and layer normalization to enhance information flow and alleviate the vanishing-gradient problem. The residual connection adds the input of the sublayer to the output of the sublayer, thus preserving the information of the original input. Layer normalization normalizes the results of residual connections so that the inputs to each sublayer have similar means and variances.

In other words, the input of a sublayer and its output are added together, and this sum is passed on as the input to the next layer.
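
In code this amounts to one line per sublayer; a minimal sketch with random placeholder tensors (layer_norm follows the same formula as in section 4):

import numpy as np

def layer_norm(x, w, b, eps=1e-12):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps) * w + b

x = np.random.randn(4, 768)                  # sublayer input
sublayer_output = np.random.randn(4, 768)    # e.g. output of self-attention or feed-forward
w, b = np.ones(768), np.zeros(768)
x = layer_norm(x + sublayer_output, w, b)    # residual connection, then LayerNorm
print(x.shape)                               # (4, 768)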

2.5 Output layer

Output layer: The output of the last layer of the Encoder part of BERT is used as input for downstream NLP tasks. For classification tasks, such as text classification, a fully connected layer and Softmax activation function can be added on top of the output layer for prediction.
sequence_output: the matrix of contextual vectors, one per character/token of the whole sentence; pooler_output: a single vector representing the whole sentence (computed from the [CLS] position).
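
A sketch of reading both outputs with the transformers library (return_dict=False mirrors the code in section 4; the checkpoint and input sentence are assumptions):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese", return_dict=False)
bert.eval()

inputs = tokenizer("深度学习", return_tensors="pt")
with torch.no_grad():
    sequence_output, pooler_output = bert(**inputs)
print(sequence_output.shape)   # [1, seq_len, 768]: one contextual vector per token
print(pooler_output.shape)     # [1, 768]: one vector for the whole sentence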

3. Brief notes on BERT class model

(figure: overview of BERT-family models)

4. Code engineering practice

Manually implementing the BERT structure

import torch
import math
import numpy as np
from transformers import BertModel

'''

Implement the BERT structure with manual matrix operations.
Model files can be downloaded from https://huggingface.co/models

'''

bert = BertModel.from_pretrained(r"D:\badou\pretrain_model\chinese_bert_likes\bert-base-chinese", return_dict=False)
state_dict = bert.state_dict()
bert.eval()
x = np.array([2450, 15486, 15167, 2110]) # token ids of the input "深度学习", looked up in the vocab
torch_x = torch.LongTensor([x])  # input as a PyTorch tensor
# sequence_output, pooler_output = bert(torch_x)
# print(sequence_output.shape, pooler_output.shape)
# print(sequence_output, pooler_output)

# print(bert.state_dict().keys())  # list all weight matrix names


# softmax normalization
def softmax(x):
    return np.exp(x)/np.sum(np.exp(x), axis=-1, keepdims=True)

# GELU activation function
def gelu(x):
    return 0.5 * x * (1 + np.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * np.power(x, 3))))

class DiyBert:
    # takes the whole pre-trained weight dictionary as input
    def __init__(self, state_dict):
        self.num_attention_heads = 12
        self.hidden_size = 768
        self.num_layers = 12   # must match the layer count of the pre-trained model (12 for bert-base), otherwise the outputs below will not agree
        self.load_weights(state_dict)

    def load_weights(self, state_dict):
        # embedding part
        self.word_embeddings = state_dict["embeddings.word_embeddings.weight"].numpy()
        self.position_embeddings = state_dict["embeddings.position_embeddings.weight"].numpy()
        self.token_type_embeddings = state_dict["embeddings.token_type_embeddings.weight"].numpy()
        self.embeddings_layer_norm_weight = state_dict["embeddings.LayerNorm.weight"].numpy()
        self.embeddings_layer_norm_bias = state_dict["embeddings.LayerNorm.bias"].numpy()
        self.transformer_weights = []
        # transformer part: there are multiple layers
        for i in range(self.num_layers):
            q_w = state_dict["encoder.layer.%d.attention.self.query.weight" % i].numpy()
            q_b = state_dict["encoder.layer.%d.attention.self.query.bias" % i].numpy()
            k_w = state_dict["encoder.layer.%d.attention.self.key.weight" % i].numpy()
            k_b = state_dict["encoder.layer.%d.attention.self.key.bias" % i].numpy()
            v_w = state_dict["encoder.layer.%d.attention.self.value.weight" % i].numpy()
            v_b = state_dict["encoder.layer.%d.attention.self.value.bias" % i].numpy()
            attention_output_weight = state_dict["encoder.layer.%d.attention.output.dense.weight" % i].numpy()
            attention_output_bias = state_dict["encoder.layer.%d.attention.output.dense.bias" % i].numpy()
            attention_layer_norm_w = state_dict["encoder.layer.%d.attention.output.LayerNorm.weight" % i].numpy()
            attention_layer_norm_b = state_dict["encoder.layer.%d.attention.output.LayerNorm.bias" % i].numpy()
            intermediate_weight = state_dict["encoder.layer.%d.intermediate.dense.weight" % i].numpy()
            intermediate_bias = state_dict["encoder.layer.%d.intermediate.dense.bias" % i].numpy()
            output_weight = state_dict["encoder.layer.%d.output.dense.weight" % i].numpy()
            output_bias = state_dict["encoder.layer.%d.output.dense.bias" % i].numpy()
            ff_layer_norm_w = state_dict["encoder.layer.%d.output.LayerNorm.weight" % i].numpy()
            ff_layer_norm_b = state_dict["encoder.layer.%d.output.LayerNorm.bias" % i].numpy()
            self.transformer_weights.append([q_w, q_b, k_w, k_b, v_w, v_b, attention_output_weight, attention_output_bias,
                                             attention_layer_norm_w, attention_layer_norm_b, intermediate_weight, intermediate_bias,
                                             output_weight, output_bias, ff_layer_norm_w, ff_layer_norm_b])
        # pooler layer
        self.pooler_dense_weight = state_dict["pooler.dense.weight"].numpy()
        self.pooler_dense_bias = state_dict["pooler.dense.bias"].numpy()


    # BERT embedding: the sum of three embeddings, followed by a LayerNorm
    def embedding_forward(self, x):
        # x.shape = [max_len]
        we = self.get_embedding(self.word_embeddings, x)  # shape: [max_len, hidden_size]
        # position embedding input: [0, 1, 2, 3]
        pe = self.get_embedding(self.position_embeddings, np.array(list(range(len(x)))))  # shape: [max_len, hidden_size]
        # token type embedding; for a single-sentence input it is [0, 0, 0, 0]
        te = self.get_embedding(self.token_type_embeddings, np.array([0] * len(x)))  # shape: [max_len, hidden_size]
        embedding = we + pe + te
        # a normalization layer follows the summation
        embedding = self.layer_norm(embedding, self.embeddings_layer_norm_weight, self.embeddings_layer_norm_bias)  # shape: [max_len, hidden_size]
        return embedding

    # the embedding layer is effectively an index lookup, i.e. a one-hot input multiplied by the embedding matrix
    def get_embedding(self, embedding_matrix, x):
        return np.array([embedding_matrix[index] for index in x])

    # run all transformer layers
    def all_transformer_layer_forward(self, x):
        for i in range(self.num_layers):
            x = self.single_transformer_layer_forward(x, i)
        return x

    # run a single transformer layer
    def single_transformer_layer_forward(self, x, layer_index):
        weights = self.transformer_weights[layer_index]
        # take this layer's parameters; in practice they are randomly initialized and then learned during pre-training
        q_w, q_b, \
        k_w, k_b, \
        v_w, v_b, \
        attention_output_weight, attention_output_bias, \
        attention_layer_norm_w, attention_layer_norm_b, \
        intermediate_weight, intermediate_bias, \
        output_weight, output_bias, \
        ff_layer_norm_w, ff_layer_norm_b = weights
        # self-attention layer
        attention_output = self.self_attention(x,
                                q_w, q_b,
                                k_w, k_b,
                                v_w, v_b,
                                attention_output_weight, attention_output_bias,
                                self.num_attention_heads,
                                self.hidden_size)
        # layer normalization with a residual connection
        x = self.layer_norm(x + attention_output, attention_layer_norm_w, attention_layer_norm_b)
        # feed-forward layer
        feed_forward_x = self.feed_forward(x,
                              intermediate_weight, intermediate_bias,
                              output_weight, output_bias)
        # layer normalization with a residual connection
        x = self.layer_norm(x + feed_forward_x, ff_layer_norm_w, ff_layer_norm_b)
        return x

    # self-attention computation
    def self_attention(self,
                       x,
                       q_w,
                       q_b,
                       k_w,
                       k_b,
                       v_w,
                       v_b,
                       attention_output_weight,
                       attention_output_bias,
                       num_attention_heads,
                       hidden_size):
        # x.shape = max_len * hidden_size
        # q_w, k_w, v_w  shape = hidden_size * hidden_size
        # q_b, k_b, v_b  shape = hidden_size
        q = np.dot(x, q_w.T) + q_b  # shape: [max_len, hidden_size]      linear: x * W^T + b
        k = np.dot(x, k_w.T) + k_b  # shape: [max_len, hidden_size]
        v = np.dot(x, v_w.T) + v_b  # shape: [max_len, hidden_size]
        attention_head_size = int(hidden_size / num_attention_heads)
        # q.shape = num_attention_heads, max_len, attention_head_size
        q = self.transpose_for_scores(q, attention_head_size, num_attention_heads)
        # k.shape = num_attention_heads, max_len, attention_head_size
        k = self.transpose_for_scores(k, attention_head_size, num_attention_heads)
        # v.shape = num_attention_heads, max_len, attention_head_size
        v = self.transpose_for_scores(v, attention_head_size, num_attention_heads)
        # qk.shape = num_attention_heads, max_len, max_len
        qk = np.matmul(q, k.swapaxes(1, 2))
        qk /= np.sqrt(attention_head_size)
        qk = softmax(qk)
        # qkv.shape = num_attention_heads, max_len, attention_head_size
        qkv = np.matmul(qk, v)
        # qkv.shape = max_len, hidden_size
        qkv = qkv.swapaxes(0, 1).reshape(-1, hidden_size)
        # attention.shape = max_len, hidden_size
        attention = np.dot(qkv, attention_output_weight.T) + attention_output_bias
        return attention

    # multi-head reshape: split hidden_size into heads
    def transpose_for_scores(self, x, attention_head_size, num_attention_heads):
        # hidden_size = 768  num_attent_heads = 12 attention_head_size = 64
        max_len, hidden_size = x.shape
        x = x.reshape(max_len, num_attention_heads, attention_head_size)
        x = x.swapaxes(1, 0)  # output shape = [num_attention_heads, max_len, attention_head_size]
        return x

    # feed-forward network computation
    def feed_forward(self,
                     x,
                     intermediate_weight,  # intermediate_size, hidden_size
                     intermediate_bias,  # intermediate_size
                     output_weight,  # hidden_size, intermediate_size
                     output_bias,  # hidden_size
                     ):
        # output shape: [max_len, intermediate_size]
        x = np.dot(x, intermediate_weight.T) + intermediate_bias
        x = gelu(x)
        # output shape: [max_len, hidden_size]
        x = np.dot(x, output_weight.T) + output_bias
        return x

    # layer normalization
    def layer_norm(self, x, w, b):
        x = (x - np.mean(x, axis=1, keepdims=True)) / np.std(x, axis=1, keepdims=True)
        x = x * w + b
        return x

    # pooler output layer applied to the [CLS] token
    def pooler_output_layer(self, x):
        x = np.dot(x, self.pooler_dense_weight.T) + self.pooler_dense_bias
        x = np.tanh(x)
        return x

    # final forward pass
    def forward(self, x):
        x = self.embedding_forward(x)
        sequence_output = self.all_transformer_layer_forward(x)
        pooler_output = self.pooler_output_layer(sequence_output[0])
        return sequence_output, pooler_output


# hand-made implementation
db = DiyBert(state_dict)
diy_sequence_output, diy_pooler_output = db.forward(x)
#torch
torch_sequence_output, torch_pooler_output = bert(torch_x)

print(diy_sequence_output)
print(torch_sequence_output)

# print(diy_pooler_output)
# print(torch_pooler_output)

Origin blog.csdn.net/qq_38853759/article/details/131343535