PyTorch practice notes (3): implementing sentiment analysis with BERT

This article demonstrates how to build a BERT-based sentiment analysis model with PyTorch. Chapter 1 introduces BERT in detail, covering self-attention, the Transformer Encoder, BERT's input and output, and BERT's pre-training and fine-tuning; Chapter 2 presents the core code.

1 BERT

1.1 self-attention

Self-attention takes a sequence as input and outputs a sequence of the same length. Its computation is illustrated below.
self-attention
The animation above shows only part of the computation, namely how $b_1$ is obtained; $b_2$ through $b_4$ are computed in exactly the same way as $b_1$. The steps for $b_1$ are as follows:

  1. For the input sequence $\{a_1, a_2, a_3, a_4\}$, to compute the attention vector of $a_1$ over this input sequence, $a_1$ first goes through three different linear transformations to produce the vectors $q_1$ (query), $k_1$ (key), and $v_1$ (value), with the formulas below. The q/k/v terminology can be understood by analogy with a database: q corresponds to a SQL statement that queries a certain key and finally returns that key's value. For example, q is 'select age from girlfriend', where the query is the SQL statement, the key is age, and the value is 18.
    $$q_1 = W^q a_1, \quad k_1 = W^k a_1, \quad v_1 = W^v a_1.$$
  2. For $\{a_2, a_3, a_4\}$, which are the objects whose attention is being queried, only k and v are generated (note that self-attention also attends from a token to itself, so $k_1$ and $v_1$ are generated as well).
  3. Then $q_1$ is dotted with each of $\{k_1, k_2, k_3, k_4\}$ to obtain the attention weights $\{\alpha_{1,1}, \alpha_{1,2}, \alpha_{1,3}, \alpha_{1,4}\}$, with the formula below. Note that since the k vectors are transposed, each attention weight $\alpha$ is a scalar. Also, a dot product can be viewed as a similarity computation (the cosine similarity formula is $\cos\theta = \frac{a \cdot b}{|a||b|}$, i.e. $a \cdot b = |a||b|\cos\theta$, so the inner product measures how similar two vectors are), so the inner product here can be understood as computing the similarity between $q_1$ and each of $\{k_1, k_2, k_3, k_4\}$ (and since $\alpha$ is a scalar, it acts as a similarity weight).
    $$\{\alpha_{1,1}, \alpha_{1,2}, \alpha_{1,3}, \alpha_{1,4}\} = q_1 \{k_1, k_2, k_3, k_4\}^{\rm T}.$$
  4. Finally, the similarity weights $\{\alpha_{1,1}, \alpha_{1,2}, \alpha_{1,3}, \alpha_{1,4}\}$ are multiplied with $\{v_1, v_2, v_3, v_4\}$, giving the attention vector of $a_1$ with respect to $a_1$, of $a_1$ with respect to $a_2$, of $a_1$ with respect to $a_3$, and of $a_1$ with respect to $a_4$. Combining these vectors yields $b_1$, which therefore contains all of $a_1$'s attention over the entire input sequence. A minimal code sketch of these four steps follows this list.
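
To make the four steps concrete, the following is a minimal PyTorch sketch of dot-product self-attention for a toy sequence of four vectors. The embedding size and the weight matrices are random placeholders, and the scaling and softmax that the standard formulation in [1] applies to the attention weights are included here even though the walkthrough above glosses over them.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

d = 8                              # placeholder embedding size
a = torch.randn(4, d)              # input sequence {a_1, a_2, a_3, a_4}

# the three linear transformations (random stand-ins for W^q, W^k, W^v)
W_q, W_k, W_v = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

q = a @ W_q                        # queries q_1 ... q_4
k = a @ W_k                        # keys    k_1 ... k_4
v = a @ W_v                        # values  v_1 ... v_4

# step 3: dot products of every q_i with every k_j give the scalar weights alpha_{i,j};
# scaling by sqrt(d) and the softmax follow the standard formulation in [1]
alpha = F.softmax(q @ k.T / d ** 0.5, dim=-1)

# step 4: weight the values with alpha and combine them into b_1 ... b_4
b = alpha @ v
print(b.shape)                     # torch.Size([4, 8]) -- same length as the input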

1.2 multi-head self-attention

The multi-head self-attention mechanism simply computes self-attention several times, as shown below.
multi-head self-attention
In multi-head self-attention, the input vector goes through $h$ different sets of linear transformations, producing $h$ sets of q, k, v. For example, when $h = 2$, $a_1$ produces $q_1^1$, $k_1^1$, $v_1^1$ and $q_1^2$, $k_1^2$, $v_1^2$ through the formulas below:
$$q_1^1 = W^q_1 a_1, \quad k_1^1 = W^k_1 a_1, \quad v_1^1 = W^v_1 a_1, \qquad q_1^2 = W^q_2 a_1, \quad k_1^2 = W^k_2 a_1, \quad v_1^2 = W^v_2 a_1.$$
Then the outputs of the individual self-attention heads are concatenated and passed through one more linear transformation, which yields the output of multi-head self-attention. Letting the output of the first head be $head_1$, the output of the second head be $head_2$, and the final output be $O$, the formula is:
$$O = {\rm concat}(head_1, head_2)\,W^o.$$
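
As a hedged illustration, the same computation can be done with PyTorch's built-in nn.MultiheadAttention (available in recent PyTorch versions); embed_dim and num_heads below are arbitrary example values, not settings from this article.

import torch
import torch.nn as nn

embed_dim, num_heads = 8, 2            # example values: h = 2 heads
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

a = torch.randn(1, 4, embed_dim)       # a batch with one sequence {a_1, ..., a_4}

# for self-attention the query, key and value all come from the same sequence;
# the module applies the per-head projections W^q_i, W^k_i, W^v_i, concatenates
# the head outputs and applies the final projection W^o internally
out, attn_weights = mha(a, a, a)
print(out.shape)                       # torch.Size([1, 4, 8])
print(attn_weights.shape)              # torch.Size([1, 4, 4]) -- averaged over the heads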

1.3 Encoder

The Encoder here refers specifically to the Encoder of the Transformer[1] (in the figure below, the left half is the Transformer's Encoder and the right half is the Decoder). The model structure is as follows:
Transformer
The Encoder consists of the following parts:

  • Multi-head self-attention: Already introduced before.
  • Residual connection[2]: corresponds to the Add in the figure. The residual connection is shown in the figure below. Simply put, a residual connection adds the input of a module to its output, and it is usually used in deep architectures. Why do residual connections work in deep models? If residual connections are not used, the forward pass is $F(x)$, and during backpropagation the gradient is $\frac{\partial F(x)}{\partial x}$; when the gradient vanishes, $\frac{\partial F(x)}{\partial x}$ is 0 and no gradient can flow back. With a residual connection, the forward pass becomes $F(x) + x$. Intuitively, this lets the module focus on the changed part of its input; mathematically, during backpropagation the gradient becomes $\frac{\partial (F(x)+x)}{\partial x} = \frac{\partial F(x)}{\partial x} + 1$. When the gradient vanishes, $\frac{\partial F(x)}{\partial x}$ tends to 0, so $\frac{\partial (F(x)+x)}{\partial x}$ tends to 1, and the gradient can always flow back instead of disappearing.
    residual connection
  • Layer normalization (layer norm)[3]: corresponds to the Norm in the figure. The formula for layer normalization is shown below, where $m$ is the mean of the vector $x_i$ and $\sigma$ is its standard deviation. An example plot of layer normalization is shown below. Roughly speaking, if the data are not normalized, the gradient may descend very quickly in some directions (from lower left to upper right), which leads to overshooting the optimum, while in other directions (from lower right to upper left) it may descend very slowly, so convergence toward the optimum takes a long time. After layer normalization, the loss decreases at a comparable rate in all directions, allowing faster convergence.
    $$x_i' = \frac{x_i - m}{\sigma}$$
    layer norm
  • Positional encoding: corresponds to the Positional Encoding at the bottom of the figure. Why is positional information needed? Self-attention can be viewed as shown below, where the distance between any two positions is 1 (if it is unclear why the distance is 1, look back at the animation in section 1.1). This causes a problem: for natural language processing tasks, word order is clearly very important. For example, the positional encoding discussed here is only weakly connected to the self-attention discussed in the first section, yet self-attention alone cannot tell how far apart words are, so positional encoding is needed to inject the positions of the words.
    self-attention
  • Fully connected layer: corresponds to the Feed Forward in the figure. There is nothing special to say here; the only thing to note is that it consists of two fully connected layers. The formula is below, and a code sketch combining all of the Encoder components follows this list:
    $${\rm FFN}(x) = W_2({\rm ReLU}(W_1 x + b_1)) + b_2$$
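
All of the components listed above (multi-head self-attention, Add & Norm, and the two-layer feed-forward network) are bundled in PyTorch's nn.TransformerEncoderLayer; the sizes below are arbitrary example values, and positional encoding is assumed to have been added to the input beforehand.

import torch
import torch.nn as nn

d_model, nhead, dim_feedforward = 8, 2, 32     # example values only
encoder_layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=dim_feedforward,
                                           batch_first=True)

x = torch.randn(1, 4, d_model)   # input assumed to already contain positional encoding
y = encoder_layer(x)             # multi-head attention -> Add & Norm -> FFN (ReLU) -> Add & Norm
print(y.shape)                   # torch.Size([1, 4, 8])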

1.4 Input and output of BERT

1.4.1 BERT input

BERT's input differs from that of a traditional language model. A traditional language model takes just the sentence itself as input, while BERT additionally inserts several special tokens. These include:

  • [CLS]: [CLS] always appears at the beginning of the input. The hidden state of this special token after passing through BERT represents the sentence vector of the input. [CLS] is always present.
  • [SEP]: [SEP] always appears at the end of a sentence. Since BERT supports both single-sentence and two-sentence input, [SEP] is used to mark where each sentence ends. [SEP] is always present.
  • [MASK]: [MASK] can appear anywhere between [CLS] and [SEP]. This special token asks BERT to predict which word belongs at that position. [MASK] is not necessarily present.

Take the following two sentences as an example: 练习时长两年半 and 唱跳 rap 打篮球. After being fed into BERT, the input becomes: [CLS] 练习时长两年半 [SEP] 唱跳 rap 打篮球 [SEP]. If only the first sentence is entered and its last character is masked, the input becomes: [CLS] 练习时长两年[MASK] [SEP].
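
A small sketch of how the tokenizer inserts these special tokens; it assumes the bert-base-chinese checkpoint can be downloaded (chosen here only because the example sentences are Chinese, while the model in Chapter 2 uses bert-base-uncased).

from transformers import BertTokenizer

# bert-base-chinese is assumed here only because the example sentences are Chinese;
# the sentiment analysis code in Chapter 2 uses bert-base-uncased instead
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

ids = tokenizer.encode('练习时长两年半', '唱跳 rap 打篮球')   # two-sentence input
print(tokenizer.convert_ids_to_tokens(ids))
# roughly: ['[CLS]', '练', '习', '时', '长', '两', '年', '半', '[SEP]', '唱', '跳', ..., '[SEP]']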

1.4.2 BERT’s output

Like traditional sequence models, the output of BERT has two parts:

  • Sentence vector: after passing through the model, the hidden state of [CLS] is the sentence vector. For single-sentence input it is the sentence vector of that sentence; for two-sentence input it is the sentence vector of the two sentences together.
  • Hidden state of every token: like an LSTM, BERT also outputs a hidden state for every token. Note that what is usually called a BERT word embedding actually refers to the hidden state after passing through BERT, not BERT's embedding layer. This is because BERT produces contextualized word embeddings: a vector only counts as a word embedding here once it carries contextual information. The sketch after this list shows how to read both outputs.
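
A hedged sketch of reading both outputs from the plain BertModel (the classification model used in Chapter 2 returns logits instead); the checkpoint name and the input sentence are only examples.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer('bert outputs one vector per token', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

cls_vector = outputs.last_hidden_state[:, 0]   # hidden state of [CLS]: the sentence vector
token_states = outputs.last_hidden_state       # contextualized hidden state of every token
print(cls_vector.shape)                        # torch.Size([1, 768])
print(token_states.shape)                      # torch.Size([1, seq_len, 768])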

1.5 BERT pre-training

BERT pre-training has two parts: the first is the masked language model (MLM), and the second is next sentence prediction (NSP).

  • To put it simply, MLM randomly selects 15% of the words in the input text and then processes them as follows: with 80% probability the word is replaced with [MASK], so that the model must predict from context which word [MASK] is; with 10% probability the word is replaced with another random word; with 10% probability the word is kept unchanged. MLM is shown below. MLM is a $V$-way classification task, where $V$ is the vocabulary size. (A minimal sketch of the masking rule appears after this list.)
    MLM
  • Simply put, NSP feeds two sentences into the model and lets the model judge whether the second sentence is related to the first. NSP is shown in the figure below. NSP is a binary classification task, where 1 means the two sentences are related and 0 means they are not.
    NSP
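
The following is a minimal sketch of the 80/10/10 masking rule described above, written in plain Python over token strings; the token list and the tiny replacement vocabulary are placeholders, and real implementations work on token ids and handle special tokens separately.

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply the BERT-style MLM masking rule to a list of token strings."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:          # select roughly 15% of the tokens
            labels.append(tok)                   # the model has to recover the original token
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                masked.append('[MASK]')
            elif r < 0.9:                        # 10%: replace with a random token
                masked.append(random.choice(vocab))
            else:                                # 10%: keep the token unchanged
                masked.append(tok)
        else:
            masked.append(tok)
            labels.append(None)                  # not selected, so no prediction target
    return masked, labels

print(mask_tokens(['i', 'like', 'singing', 'dancing', 'rap', 'and', 'basketball'],
                  vocab=['music', 'chicken', 'practice']))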

1.6 BERT fine-tuning

In the fine-tuning phase, BERT first loads the pre-trained parameters and then adds extra randomly initialized parameters, as shown below. In the figure, the fully connected layer marked in orange contains the parameters that are randomly initialized in the fine-tuning phase. Training in the fine-tuning phase therefore involves two parts:

  • The model itself: this part does not have to participate in training, because it has already been trained during the pre-training stage.
  • The randomly initialized parameters: this part must be trained. A hedged sketch of this split follows below.
    BERT fine-tuning
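
A sketch of this split, assuming the BertForSequenceClassification model that Chapter 2 uses: the pre-trained encoder can optionally be frozen, while the randomly initialized classification head is always trained.

import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# optional: freeze the pre-trained encoder so that only the randomly
# initialized classification head receives gradient updates
for param in model.bert.parameters():
    param.requires_grad = False

print([name for name, p in model.named_parameters() if p.requires_grad])
# ['classifier.weight', 'classifier.bias']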

2 BERT implements sentiment analysis

The specific model code is as follows:

import torch
import torch.nn as nn
from transformers import BertTokenizer, BertConfig, BertForSequenceClassification


class Config:
    def __init__(self):
        # training configuration
        self.seed = 22
        self.batch_size = 64
        self.lr = 1e-5
        self.weight_decay = 1e-4
        self.num_epochs = 100
        self.early_stop = 512
        self.max_seq_length = 128
        self.save_path = '../model_parameters/BERT_SA.bin'

        # model configuration
        self.bert_hidden_size = 768
        self.model_path = 'bert-base-uncased'
        self.num_outputs = 2


class Model(nn.Module):
    def __init__(self, config, device):
        super().__init__()
        self.config = config
        self.device = device
        tokenizer_class, bert_class, model_path = BertTokenizer, BertForSequenceClassification, config.model_path
        bert_config = BertConfig.from_pretrained(model_path, num_labels=config.num_outputs)
        self.tokenizer = tokenizer_class.from_pretrained(model_path)
        self.bert = bert_class.from_pretrained(model_path, config=bert_config).to(device)

    def forward(self, inputs):
        # inputs is a batch of raw text strings; tokenize them, add [CLS]/[SEP],
        # and pad or truncate every sequence to max_seq_length
        tokens = self.tokenizer.batch_encode_plus(inputs,
                                                  add_special_tokens=True,
                                                  max_length=self.config.max_seq_length,
                                                  padding='max_length',
                                                  truncation='longest_first')

        input_ids = torch.tensor(tokens['input_ids']).to(self.device)
        att_mask = torch.tensor(tokens['attention_mask']).to(self.device)

        # BertForSequenceClassification returns the classification logits directly
        logits = self.bert(input_ids, attention_mask=att_mask).logits

        return logits
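
A short usage sketch of the Model class above; the texts and labels are placeholders, and downloading bert-base-uncased is assumed.

if __name__ == '__main__':
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    config = Config()
    model = Model(config, device)

    texts = ['what a wonderful movie', 'a complete waste of time']   # placeholder batch
    labels = torch.tensor([1, 0]).to(device)                         # placeholder labels

    logits = model(texts)                          # shape: (batch_size, num_outputs)
    loss = nn.CrossEntropyLoss()(logits, labels)
    print(logits.shape, loss.item())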

The experimental results are as follows:

test loss 0.281900 | test accuracy 0.878846 | test precision 0.853424 | test recall 0.915280 | test F1 0.883270

reference

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention Is All You Need [EB/OL]. https://arxiv.org/abs/1706.03762
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. Deep Residual Learning for Image Recognition [EB/OL]. https://arxiv.org/abs/1512.03385
[3] Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. Layer Normalization [EB/OL]. https://arxiv.org/abs/1607.06450
