Study Notes: Deep Learning (6) - Language Model Based on Deep Learning

Study time: 2022.04.22~2022.04.25

5. Language model based on deep learning

5.1 From NNLM to word embeddings

This subsection traces the path from language models to Word Embedding. Some CNN- or RNN-based structures are not covered in this section; see: Summary of various models and applications of deep learning NLP.

5.1.1 Neural Network Language Model NNLM

The Neural Network Language Model (NNLM) was proposed in 2003 and became popular around 2013.

NNLM is still a probabilistic language model; it uses a neural network to estimate the parameters of the probabilistic language model. Like the N-gram model, its goal is $P(X_n \mid X_1^{n-1})$: NNLM builds a model by taking a piece of text as input and predicting the next word. It uses an MLP + tanh + softmax architecture, cross-entropy as the loss function, and back-propagation for training.

[Figure: NNLM model structure]

Model Explanation:

  • Input layer: map each word in context(w) to a word vector of length m (m is specified by the trainer; the matrix is usually randomly initialized, e.g. with tf.random_normal). The word vectors are random at first and are trained together with the rest of the network;
  • A lookup table of size $|m| \times N$ is created using random initialization;
  • context(w): can be seen as the length of the context window, similar to how many words an N-gram takes. For example, if the context window length is $c = 3$, then four words are taken at each step: the first three are the features and the last one is the target;
  • Projection layer: concatenate all the context word vectors into one long vector, used as the feature vector of the target word w. Its length is $m(n-1)$;
  • Hidden layer: the concatenated vector passes through a hidden layer of size h; the paper uses the tanh activation function;
  • Output layer: finally, softmax outputs a probability distribution over the whole vocabulary (a minimal sketch follows this list).
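The following is a minimal sketch of the structure described above, written in PyTorch as an assumption for illustration (the 2003 paper used a plain MLP with tanh and softmax; names such as NNLM, emb_dim and hidden_dim are chosen here, not taken from the paper):

```python
import torch
import torch.nn as nn

class NNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, context_len=3, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)             # the lookup table, trained jointly
        self.hidden = nn.Linear(context_len * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)               # scores over the whole vocabulary

    def forward(self, context_ids):                                # (batch, context_len)
        x = self.embed(context_ids).flatten(1)                     # projection layer: concatenate vectors
        h = torch.tanh(self.hidden(x))                             # hidden layer with tanh
        return self.out(h)                                         # logits; softmax lives inside the loss

# usage: predict the 4th word from the previous 3
model = NNLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (8, 3)))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10000, (8,)))
```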

Defects:

  • The computational complexity is too high and there are many parameters (word2vec is an improved method). It is also still limited by the size of N; ideally N would not be limited at all, since a word may depend on arbitrarily distant words before and after it. The RNN-based RNNLM addresses this, because the recurrent structure naturally handles dependencies over sequences of arbitrary length.
  • The model is not well optimized: the number of output categories is too large and training is too slow; as an autoregressive language model it cannot use the following context. It is an early solution with few applications, and general Python toolkits do not integrate it.

5.1.2 Language Model RNNLM Based on Recurrent Neural Network

Recurrent Neural Network Language Model (RNNLM).

An RNN is introduced to build a model that predicts the probability of the next word given the context. Thus, to compute the probability of a sentence S, the model's per-step probabilities are computed and multiplied together to obtain the sentence probability. The recurrent structure naturally handles dependencies over sequences of arbitrary length, without artificially limiting the input length. The drawback is the long computation and training time.

5.1.3 Word2Vec

Word2Vec is a tool released by Google in 2013; it can also be described as a family of models that generate word vectors. The tool mainly includes two word vector generation models, skip-gram and continuous bag of words (CBOW), as well as two efficient training (acceleration) methods: negative sampling and hierarchical softmax. Word2vec is essentially a dimensionality reduction operation: it reduces words from their one-hot representation to the word embeddings produced by Word2vec.

Since Word2vec takes context into account, it performs better than earlier embedding methods; it also has fewer dimensions, so it is faster; and it is very general and can be used in various NLP tasks. However, because words and vectors are in a one-to-one relationship, it cannot handle polysemy; and word2vec is a static method: although general-purpose, it cannot be dynamically optimized for a specific task.

Note that Word2vec belongs to the previous generation (pre-2018). For the best results after 2018, this kind of static Word Embedding is no longer the method of choice, so Word2vec is rarely used.

1. Word vector generation models

That is, how Word2Vec computes word vectors; only one of the two models can be used in a given training run.

Personal understanding: the two methods are essentially the same; the network structure and the weight matrix $W$ have the same size and position.

(1) Continuous bag of words model CBOW

The core idea is to remove a word from a sentence and, given the surrounding context, predict the probability of the removed word.

eg: Take {"The", "cat", "over", "the", "puddle"} as the context, and hope to predict or generate the central word "jumped" from these words.

The network diagram of the CBOW model is as follows, which is a simple 3-layer neural network (there is no nonlinear transformation from the input layer to the Projection layer):

[Figure: CBOW network structure]

  • Input layer: the one-hot vectors of the context words (excluding the center word itself). Assume the one-hot dimension is $V \times 1$ and the number of context words is $C$;
  • Hidden/Projection layer: each one-hot vector is multiplied by the shared input weight matrix $W$ (the initialization matrix $W$ is an $N \times V$ matrix, where $N$ is chosen by the user); the resulting vectors are summed and averaged to form the hidden-layer vector of size $N \times 1$;
  • Output layer: the $N \times 1$ hidden vector is multiplied by the output weight matrix $W'$ ($W'$ is a $V \times N$ matrix) to obtain a $V \times 1$ output vector, which is passed through Softmax to get a $V$-dimensional probability distribution; the word at the index with the highest probability is the predicted center word (also one-hot);
  • Loss function: the predicted one-hot is compared with the true one-hot using the cross-entropy loss, and the parameters $W$ are updated by gradient-descent back-propagation; $W$ is also the weight matrix we ultimately want from training. A minimal sketch follows this list.
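A minimal CBOW sketch in PyTorch, under the assumption of a full softmax output (real word2vec replaces it with hierarchical softmax or negative sampling, described later); W_in and W_out stand for the matrices $W$ and $W'$ above:

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, emb_dim=100):
        super().__init__()
        self.W_in = nn.Embedding(vocab_size, emb_dim)             # input weights W (the embedding we keep)
        self.W_out = nn.Linear(emb_dim, vocab_size, bias=False)   # output weights W'

    def forward(self, context_ids):                               # (batch, C) ids of the context words
        h = self.W_in(context_ids).mean(dim=1)                    # sum/average of the context vectors
        return self.W_out(h)                                      # logits over the vocabulary

model = CBOW(vocab_size=10000)
logits = model(torch.randint(0, 10000, (4, 6)))                   # C = 6 context words per sample
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10000, (4,)))   # target: the center word
```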
(2) Skip-gram model

In contrast to CBOW, Skip-Gram inputs a single word and asks the network to predict its context words. It is equivalent to giving you a word and asking you to guess which words may appear before and after it.

[Figure: Skip-gram network structure]

  • The vocabulary size is $V$ (the one-hot dimension), the window size of the model output is $C = 2R$, and the dimension of the embedded word vector is $N$;
  • Input layer: the one-hot vector of one word ($V \times 1$). Since the embedding matrix is unique, the weight matrix $W$ (the word vector matrix, $N \times V$) is unique and shared;
  • Hidden/Projection layer: the output of this layer is the $N \times 1$ vector obtained by multiplying by the weight matrix $W$; the output weight matrix $W'$ is a $V \times N$ matrix;
  • Output layer: multiplying by the output weights gives a $V \times C$ output, which is finally processed by softmax to obtain the one-hot probability form of all words;
  • Loss function: similarly, the predicted one-hot is compared with the true one-hot, and the cross-entropy loss with gradient-descent back-propagation updates the parameters $W$ and $W'$. The final weight matrix $W$ is the embedding matrix we are looking for.

One thing to note: because a center word may have multiple context words, there are multiple positive pairs <input_word, output_context1>, <input_word, output_context2>, etc., so on the same corpus skip-gram takes longer to train than CBOW. A usage sketch covering both models follows.
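For practical use, the gensim library (assumed installed; parameter names follow gensim 4.x) exposes both models and both acceleration methods through a single class:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "jumped", "over", "the", "puddle"],
             ["the", "dog", "slept", "on", "the", "mat"]]

cbow = Word2Vec(sentences, vector_size=100, window=2, sg=0, min_count=1)   # sg=0: CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=2, sg=1,            # sg=1: Skip-gram
                    hs=0, negative=5, min_count=1)                         # negative sampling instead of hierarchical softmax

print(skipgram.wv["cat"].shape)          # (100,) word vector
print(skipgram.wv.most_similar("cat"))   # nearest neighbours in the embedding space
```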

2. Acceleration methods for optimizing the model

When using softmax to compute probability values, every training step has to compute a score for every word in the vocabulary (normalize over all words, then take the maximum), so traversing the entire vocabulary becomes extremely time-consuming on a large corpus. Therefore, word2vec provides two mechanisms to speed up this computation: hierarchical softmax and negative sampling.

It seems unnecessary to use both methods at the same time; see: Can Hierarchical Softmax and Negative Sampling be applied to the Word2Vec model at the same time?

(1) Hierarchical Softmax

Detailed explanation: word2vec principle (2) Model based on Hierarchical Softmax .

Principle: to avoid the huge amount of computation from the hidden layer to the softmax output layer, a Huffman tree replaces the mapping from the hidden layer to the output softmax layer. The internal nodes of the Huffman tree play the role of the neurons of the previous network's projection layer (the word vector at the root corresponds to the projected word vector), and the leaf nodes play the role of the previous network's softmax output layer; the number of leaf nodes equals the vocabulary size.

In the Huffman tree, the softmax mapping from the hidden layer to the output layer is not done in one step but step by step along the tree, hence the name "Hierarchical Softmax". The time complexity drops from $O(n)$ to $O(\log n)$, and high-frequency words are reached in fewer steps, which matches the greedy optimization idea.

Personal understanding: The work from the hidden layer to the output layer is to map the obtained word vector to the corresponding value for comparison with the real value.

First, for all words of the $V$-word vocabulary, a Huffman tree is built according to word frequency: the higher the frequency, the shorter the path and the less coding information. The leaf nodes of the tree are the $V$ words, there are $V-1$ internal nodes in total, and each leaf node has a unique shortest path from the root to that node.

[Figure: Huffman tree used by hierarchical softmax]

How to "step by step along the Huffman tree"? In word2vec, the method of binary logistic regression is adopted, that is, it is stipulated that if you walk along the left subtree, then it is the negative class (Huffman tree code 1), and if you walk along the right subtree, then it is the positive class (Huffman tree). code 0). The way to discriminate between positive and negative classes is to use the sigmoid function, namely:
$$P(+) = \sigma(x_w^T\theta) = \frac{1}{1+e^{-x_w^T\theta}}, \qquad P(-) = 1 - P(+)$$

where $x_w$ is the word vector of the current internal node, and $\theta$ is the logistic-regression model parameter that we need to learn from the training samples.

Supplement: Huffman tree:

When building a tree with n nodes (all of them leaf nodes, each with its own weight), if the constructed tree has the smallest weighted path length, the tree is called an "optimal binary tree", also known as a "Huffman tree".

When constructing a Huffman tree, in order to minimize the length of the weighted path of the tree, only one principle needs to be followed: the node with greater weight is closer to the root of the tree.
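A small illustrative sketch of Huffman-tree construction by word frequency (not word2vec's actual C implementation): higher-frequency words end up closer to the root, so their codes, i.e., their paths, are shorter.

```python
import heapq
import itertools

def huffman_codes(freqs):
    """freqs: dict word -> count; returns dict word -> binary code (the path from the root)."""
    counter = itertools.count()                       # tie-breaker so tuples never compare word lists
    heap = [(f, next(counter), [w]) for w, f in freqs.items()]
    heapq.heapify(heap)
    codes = {w: "" for w in freqs}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)             # the two least-frequent subtrees
        f2, _, right = heapq.heappop(heap)
        for w in left:
            codes[w] = "1" + codes[w]                 # left branch -> code 1 (negative class)
        for w in right:
            codes[w] = "0" + codes[w]                 # right branch -> code 0 (positive class)
        heapq.heappush(heap, (f1 + f2, next(counter), left + right))
    return codes

print(huffman_codes({"the": 50, "cat": 10, "puddle": 3, "jumped": 5}))
# {'the': '0', 'cat': '10', 'puddle': '111', 'jumped': '110'} - frequent words get short codes
```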

(2) Negative Sampling

Source of this part: word2vec principle (3) Model based on Negative Sampling .

Using a Huffman tree improves the efficiency of model training, but if the center word $w$ of a training sample is a very uncommon word, it still takes a long walk down the Huffman tree. Negative Sampling abandons the complex Huffman tree: each step samples only $neg$ different center words as negative examples to train the model, an even simpler solving method than Hierarchical Softmax.

For example, suppose we have a training sample whose center word is $w$, with $2c$ surrounding context words denoted $context(w)$. Since this center word really does occur together with $context(w)$, the pair is a true positive example. Through Negative Sampling we draw $neg$ center words $w_i,\ i = 1, 2, \dots, neg$ that are different from $w$, so that $context(w)$ paired with each $w_i$ forms $neg$ negative examples that do not really occur. Using this one positive example and the $neg$ negative examples, we perform binary logistic regression and obtain, for each negative-sample word $w_i$, its model parameters $\theta_i$, as well as the word vector of each word.

How to do negative sampling?

  • If the vocabulary size is $V$, we take a line segment of length 1 and divide it into $V$ parts, each corresponding to one word in the vocabulary. The segment lengths differ: high-frequency words get longer segments and low-frequency words get shorter ones. The segment length of each word $w$ is determined by the following formula (in word2vec, both numerator and denominator are raised to the 3/4 power):

$$len(w) = \frac{count(w)^{\frac{3}{4}}}{\sum_{u \in vocab} count(u)^{\frac{3}{4}}}$$

  • Before sampling, this length-1 segment is divided into $M$ equal cells, where $M > V$; this ensures that every word's segment is split into several cells and every one of the $M$ cells falls on the segment of some word. When sampling, we only need to draw $neg$ positions out of the $M$ positions, and the word whose segment contains each sampled position is a negative-example word. In word2vec, the default value of $M$ is $10^8$. A small sketch of this sampling table is given after the figure below.

[Figure: the length-1 segment divided into M cells for negative sampling]
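A sketch of the sampling table mentioned above (illustrative; real word2vec builds a table of $M = 10^8$ cells in C). Each word occupies a number of cells proportional to $count(w)^{3/4}$, and sampling a random cell yields negative-example words with the desired frequencies.

```python
import random

def build_unigram_table(counts, M=1_000_000, power=0.75):
    """Each word gets a share of the M cells proportional to count(w)**0.75."""
    total = sum(c ** power for c in counts.values())
    table, acc = [], 0.0
    for w, c in counts.items():
        acc += (c ** power) / total          # cumulative share of the length-1 segment
        while len(table) < int(acc * M):
            table.append(w)                  # fill this word's cells
    return table

def sample_negatives(table, center, neg=5):
    out = []
    while len(out) < neg:
        w = random.choice(table)             # uniform over cells = weighted over words
        if w != center:                      # skip the true center word
            out.append(w)
    return out

table = build_unigram_table({"the": 5000, "cat": 120, "puddle": 8, "jumped": 40})
print(sample_negatives(table, center="cat"))
```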

5.1.4 Context2Vec

Context2Vec (Melamud et al., 2016) uses bidirectional long short-term memory (LSTM) to encode context around a central word.

The main goal of Context2vec is to learn a general task-independent embedding model for variable-length sequence vector representations of target word contexts. Drawing on the CBOW model of word2vec, the context is used to predict the target word. Unlike CBOW, the original context vector averaging operation is replaced by a bidirectional LSTM model.

Context2Vec can be used as a method for sentence embedding. We can directly input a sentence to generate its embedding result, which contains sequence information .

[Figure: Context2Vec architecture]

5.2 Pre-trained models: from ELMo to GPT&BERT

So far, however, Word Embedding has not worked as well as expected, because it has an inherent problem: each word is assumed to have a single fixed embedding vector, which clearly does not match how language works, since the same word often has different meanings (polysemy). This phenomenon exists in both Chinese and English. To solve it, models with a typical two-stage process, pre-trained models, emerged.

It should be mentioned that researchers of early pre-trained models experimented a lot with the model structure, such as ELMo's use of bidirectional LSTMs. After the Transformer emerged, however, the focus of research shifted from model structure to training strategy.

For example, both GPT and BERT are based on Transformer structures: GPT is based on Transformer decoder, while BERT is based on Transformer encoder.

Supplement: Transformer

After a brief look ahead, the subsequent models are basically all based on this network structure, so it really cannot be skipped. I will pause here for now and first fill in the relevant background. See: Study Notes: Deep Learning (7) - From Encoder-Decoder to Transformer.

5.2.1 ELMo

For in-depth understanding, please refer to: ELMo detailed explanation , NLP word vector articles (4) ELMo .

ELMo (Embeddings from Language Models) comes from the paper Deep contextualized word representations, i.e., dynamically changing word vectors (2018.3, University of Washington). The model diagram is as follows:

[Figure: ELMo model structure]

ELMo is an RNN-based multi-layer bidirectional LSTM language model (the paper calls the bidirectional LSTM language model a biLM).

ELMo combines the output of each forward and backward LSTM layer together with the initial word vector into one set of representations; the layer weights are normalized with a Softmax, and the final weighted output vector is the word vector produced by ELMo:

[Figure: ELMo's weighted combination of layer representations]

Unlike traditional word vectors, where each word corresponds to a single vector, ELMo uses a pre-trained bidirectional language model; given a specific input, a context-dependent representation of the current word can then be obtained from the language model (the same word receives different representations in different contexts). Three embedding features are extracted through the two-layer LSTM, and the final embedding used for other tasks is obtained by a weighted sum (i.e., feature-based fusion), so the word vectors are generated dynamically. A sketch of this weighted fusion follows.
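A hedged sketch of this ELMo-style feature fusion (often called a scalar mix): softmax-normalized weights over the layer outputs, times a task-specific scale. The layer tensors below are random placeholders standing in for the biLM's three representations; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers=3):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))    # per-layer weights, learned for each task
        self.gamma = nn.Parameter(torch.ones(1))          # task-specific scale

    def forward(self, layers):                            # list of (batch, seq, dim) tensors
        w = torch.softmax(self.s, dim=0)
        return self.gamma * sum(w_j * h_j for w_j, h_j in zip(w, layers))

layers = [torch.randn(2, 7, 1024) for _ in range(3)]      # token embedding + 2 biLSTM layers (placeholders)
elmo_vectors = ScalarMix()(layers)                        # (2, 7, 1024) context-dependent word vectors
```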

The earlier Word Embedding methods are essentially static: the representation of each word is fixed after training, and when it is used later, the word's embedding does not change no matter what the context words of the new sentence are.

The essential idea of ELMo is: first use a language model to learn a Word Embedding for each word; at that point polysemous words cannot yet be distinguished, but it does not matter. When the Word Embedding is actually used, the word already sits in a specific context, so its representation can be adjusted according to the semantics of the context words; the adjusted embedding better expresses the word's specific meaning in this context, which naturally solves the polysemy problem.

Shortcomings:

  • In terms of feature extractor choice, the LSTM's feature-extraction ability is not strong enough, relatively weaker than the Transformer (proposed in 2017);
  • Training is slow and cannot be parallelized: being an RNN model, it can only compute recursively;
  • Fusing the forward and backward features by simple concatenation is not ideal; BERT's integrated fusion works better;
  • There is no good general-purpose pre-trained Chinese ELMo model, nor good tooling for it.

5.2.2 GPT

Paper: Improving Language Understanding by Generative Pre-Training (2018.06, OpenAI).

GPT is mainly based on the Decoder part of the Transformer, with a small structural change: the Encoder-Decoder Multi-Head Attention layer is removed, because GPT uses only the Decoder and therefore has no Encoder output to attend to. The module diagram is as follows, a stack of 12 Decoders:

[Figure: GPT model structure, a stack of 12 Decoders]

Because GPT inherits the Transformer's Decoder, the Decoder's characteristics also apply to GPT: it is unidirectional rather than bidirectional. At time step $t$, the model can only see the outputs at time step $t-1$ and earlier, and knows nothing about the outputs at time step $t$ and later. The details are as follows:

Stage 1: Pre-Trained

  1. Superimpose the word vector Embedding and the Positional Encoding. $h_0$, the input to the Transformer, is the sum of the word embedding and the position vector: $U$ is the one-hot vector of the word, $W_e$ is the word embedding matrix, and $W_p$ is the position vector matrix.

$$h_0 = U \cdot W_e + W_p$$

  2. Use a Mask to hide outputs that "should not be seen", just like the Transformer. This way GPT can only scan the input from left to right (or from right to left).

[Figure: GPT's masked self-attention]

  3. Each of GPT's layers (12 in total) computes its output in turn (the computation is the same as in the Transformer): $h_l = \mathrm{transformer\_block}(h_{l-1}),\ l \in [1, n]$.

  4. Finally, the probability of the current word is output through Softmax: $P(u) = \mathrm{softmax}(h_n W_e^T)$.

  5. Loss function: $L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$, where $U = [u_1, u_2, \dots, u_n]$ is the token sequence, $k$ is the size of the context window, $P$ is the conditional probability and $\Theta$ its parameters. The parameters are then updated by back-propagation with stochastic gradient descent (SGD). A minimal sketch of this objective follows.
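A hedged sketch of the autoregressive pre-training objective with a causal mask, using PyTorch's built-in TransformerEncoder as a stand-in for GPT's decoder blocks (an assumption for brevity; the real GPT uses its own 12-layer decoder stack):

```python
import torch
import torch.nn as nn

vocab, d_model, seq_len = 10000, 256, 32
tok_emb = nn.Embedding(vocab, d_model)                    # W_e
pos_emb = nn.Parameter(torch.zeros(seq_len, d_model))     # W_p, learned position vectors
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
blocks = nn.TransformerEncoder(layer, num_layers=12)      # stand-in for 12 decoder blocks

tokens = torch.randint(0, vocab, (4, seq_len))            # U
h0 = tok_emb(tokens) + pos_emb                            # h_0 = U*W_e + W_p
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)  # hide future positions
h = blocks(h0, mask=causal)                               # h_l = transformer_block(h_{l-1})
logits = h @ tok_emb.weight.T                             # P(u) = softmax(h_n * W_e^T), tied weights

# L1(U): each position predicts the next token from the tokens before it
loss = nn.CrossEntropyLoss()(logits[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1))
```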

Stage 2: Fine-tuning

  • That is, the model is fine-tuned according to specific downstream tasks. For example, there are downstream tasks as shown in the figure below (mainly introduced in the paper: classification tasks, reasoning tasks, similarity analysis, and question-answering tasks):

[Figure: GPT fine-tuning setups for downstream tasks]

  • The fine-tuning method varies according to the task type:

    • Classification task: the input is text, the vector of the last word is directly used as the input for fine-tuning, and the final classification result is obtained;
    • Reasoning task: the input is a priori + delimiter + hypothesis, the vector of the last word is directly used as the input for fine-tuning, and the final classification result (ie, whether it is true) is obtained;
    • Similarity analysis: the two sentences are fed in both orders, the vectors of the last word from each order are added, and a Linear layer then gives the final classification result (i.e., similar or not);
    • Question-answering / multiple-choice task: the input is the context and the question together, plus the candidate answers, separated by delimiters; the vector of the last word of the sequence formed with each answer is used as the fine-tuning input, followed by a Linear layer, and softmax over the multiple Linear outputs selects the answer with the highest probability.
  • The fine-tuning process adopts supervised learning, and the specific steps are as follows:

    1. Suppose we have a labeled dataset $C$, i.e., each token sequence $x^1, x^2, \dots, x^m$ has a label $y$; the token sequence is the input for fine-tuning.
    2. Feed the token sequence into the Transformer model to obtain the final output state, pass that output through a linear layer, and predict the label $y$. In formula form: $P(y \mid x^1, x^2, \dots, x^m) = \mathrm{softmax}(h_l^m \cdot W_y)$, where $W_y$ is the weight of the linear layer.
    3. The loss function is $L_2(C) = \sum_{(x,y)} \log P(y \mid x^1, x^2, \dots, x^m)$. However, GPT also keeps the pre-training language-model loss during fine-tuning (which improves generalization and speeds up convergence), so the final objective is $L_3(C) = L_2(C) + \lambda L_1(C)$, where $\lambda$ is a weight. The parameters are again updated by back-propagation. A small sketch of this combined objective follows.
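A tiny sketch of the combined fine-tuning objective $L_3 = L_2 + \lambda L_1$; the names lm_logits and cls_logits are illustrative placeholders for the language-model head and the task head applied to the same Transformer outputs.

```python
import torch
import torch.nn as nn

def gpt_finetune_loss(lm_logits, tokens, cls_logits, labels, lam=0.5):
    # L1: auxiliary language-model loss (each position predicts the next token)
    l1 = nn.CrossEntropyLoss()(lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
                               tokens[:, 1:].reshape(-1))
    # L2: supervised task loss computed from the last position's representation
    l2 = nn.CrossEntropyLoss()(cls_logits, labels)
    return l2 + lam * l1                     # L3(C) = L2(C) + lambda * L1(C)

# usage with dummy tensors: batch of 4 sequences, vocabulary 10000, 3 classes
loss = gpt_finetune_loss(torch.randn(4, 32, 10000), torch.randint(0, 10000, (4, 32)),
                         torch.randn(4, 3), torch.randint(0, 3, (4,)))
```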

The main disadvantage of GPT is that it is a unidirectional language model; had it been made bidirectional, there would be little left for BERT to add. In addition, GPT has to restructure the input data during the fine-tuning stage.

Supplement: GPT-2 & GPT-3

① GPT-2:

GPT-2 is an upgraded version based on GPT, and the basic structure and content have not been significantly changed (2019.02). The structure diagram is as follows:

[Figure: GPT-2 model structure]

The main differences between GPT-2 and GPT are as follows:

  1. Fine-tuning step: instead of building a fine-tuned model for each task, GPT-2 does not define in advance which tasks the model should do; the model identifies the task to be done on its own. What is learned is a general NLP model;
  2. Increase the dataset: GPT-2 collects a wider and larger number of corpora to make up the dataset. The dataset contains 8 million web pages with a size of 40G, and the data (data with task information) are all filtered high-quality web pages;
  3. Increase network parameters: GPT-2 increases the number of Transformer stacking layers to 48 layers (GPT2: 12 layers, GPT2-XL: 48 layers), the tensor dimension of the hidden layer is 1600 dimensions, and the number of parameters has reached 1.5 billion ( 1558M, Bert large is 340 million);
  4. Adjust Transformer: Standardize the Layer Normalization layer before each sub-block, and add an additional Layer Normalization after the last Self-Attention, and the parameter initialization of the residual layer is adjusted according to the network depth;
  5. GPT-2 increases the vocabulary to 50,000 (BERT's English vocabulary is 30,000, Chinese is 20,000); embedding sizes include 768, 1024, 1280 and 1600; the maximum sequence length is increased from 512 in GPT to 1024; batch_size is increased to 512.

② GPT-3:

Paper: Language Models are Few-Shot Learners (2020.5).

GPT-3 aims to remove the fine-tuning stage and solve downstream tasks directly (Zero-shot Learning, 0 samples). Its main points are as follows:

  1. 175 billion parameters;

    175 billion parameters need at least 700 GB of disk to store, roughly 10 times larger than previous models. Because the parameter count is so large, training your own GPT-3 would cost about 12 million US dollars.

  2. GPT-3 has 10 times more parameters than any previous non-sparse model (a sparse model is one in which many weights are 0).

  3. There is no need to compute gradients for downstream subtasks, because the model's parameters are so numerous that they are hard to adjust.

    • Fine-tuning: Pre-training + training samples calculate loss update gradient, and then predict. The model parameters are updated.
    • Zero-shot: pre-training + task description + prompt, direct prediction. Model parameters are not updated.
    • One-shot: pre-training + task description + example + prompt, prediction. Model parameters are not updated.
    • Few-shot: pre-training + task description + examples + prompt, prediction. Model parameters are not updated.

5.2.3 BERT

BERT, Bidirectional Encoder Representations from Transformers (2018.10, Google), is based on the Encoder part of Transformer for word vector training. From the paper " BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding " (Deep Bidirectional Transformer Pre-training Language Model). Similarly, BERT is also divided into two stages: pre-training and fine-tuning for downstream tasks. Its structural framework is as follows:

[Figure: overall framework of BERT]

The core of the model is the BERT Encoder, which consists of multiple BERT Layers; each BERT Layer is in fact a Transformer Encoder layer.

Two different scales of BERT:

BERT-Base: 12-layer Transformer Encoder, 768 hidden units (768-dimensional output) and 12 heads (Multi-Head Attention), with about 110M (110 million) parameters in total;

BERT-Large: 24-layer Transformer Encoder, 1024 hidden units (1024-dimensional output) and 16 heads (Multi-Head Attention), with about 340M (340 million) parameters in total.

BERT and its variants at a glance:

| Model name | Layers | Hidden dimension | Heads | Parameters | Training corpus |
| --- | --- | --- | --- | --- | --- |
| Bert-base-uncased | 12 | 768 | 12 | 110M | Lowercased English text |
| Bert-large-uncased | 24 | 1024 | 16 | 340M | Lowercased English text |
| Bert-base-cased | 12 | 768 | 12 | 110M | Case-sensitive English text |
| Bert-large-cased | 24 | 1024 | 16 | 340M | Case-sensitive English text |
| Bert-base-multilingual-uncased | 12 | 768 | 12 | 110M | Lowercased text in 102 languages |
| Bert-large-multilingual-uncased | 24 | 1024 | 16 | 340M | Lowercased text in 102 languages |
| Bert-base-chinese | 12 | 768 | 12 | 110M | Simplified and Traditional Chinese text |

1. Stage 1: Pre-training

The pre-training process of BERT is mainly divided into two parts: MLM (Mask Language Model, masking language model <cloze>) and NSP (Next Sentence Prediction, sentence pair prediction). Mask LM can obtain context-dependent bidirectional feature representation, and NSP is good at processing sentence or paragraph matching tasks, thereby achieving the goal of multi-task training.

Mask Language Model

BERT's MASK method (auto-encoding) is not the same as Transformer (auto-regression). MLM solves the bidirectional problem by randomly masking the input text sequence, and then predicting what word the mask should be through the context. The specific methods are as follows:

  1. The Token in the input sentence is randomly selected with a probability of 15%;

  2. The selected Token is replaced with an 80% probability [MASK], a 10% probability is replaced with another Token, and the last 10% remains unchanged.

    For example, if the sentence is "my dog is hairy" and "hairy" is selected (15% probability) for masking, then with 80% probability "hairy" is replaced by [MASK], with 10% probability it is replaced by another token such as "apple", and with 10% probability it is left unchanged.

Step 2 adopts this strategy to improve the model's generalization ability: since the fine-tuning stage performs no masking, it reduces the mismatch between the input distributions of pre-training and fine-tuning that would otherwise weaken the model. A sketch of the 80/10/10 masking follows.
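A minimal sketch of the 80/10/10 masking strategy on a list of token ids (illustrative; real implementations operate on WordPiece ids and skip special tokens, and the MASK_ID value below is an assumption):

```python
import random

MASK_ID = 103              # assumed id of [MASK] in the vocabulary

def mask_tokens(token_ids, vocab_size, mlm_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)   # -100 = position ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() >= mlm_prob:
            continue                               # ~85% of tokens are left alone and not predicted
        labels[i] = tok                            # the model must recover the original token here
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID                    # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)   # 10%: replace with a random token
        # else: 10% keep the original token unchanged
    return inputs, labels

print(mask_tokens([2026, 3899, 2003, 13941], vocab_size=30522))
```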

Next Sentence Prediction

The NSP task is: given two sentences from a document, determine whether the second sentence immediately follows the first in the original text (only pairs of sentences are considered). The specific method is as follows:

  1. Randomly select 50% of the correct sentence pairs (i.e., sentence 2 is the next sentence of sentence 1) from the text corpus as positive samples;
  2. Randomly select 50% of the wrong sentence pairs from the text corpus (that is, sentence 2 is randomly selected in the corpus, not the next sentence of sentence 1) as negative samples;
  3. The positive and negative samples are used as training input, combined with the MLM task; the output at [CLS] is used for binary classification to determine whether the two sentences are adjacent.

2. Model input

The input of BERT is a linear sequence, which supports single-sentence text and sentence-pair text. The beginning of the sequence is marked with the [CLS] symbol and the end of a sentence with the [SEP] symbol; if the input is a sentence pair, a [SEP] symbol is also added between the two sentences. (The number of input tokens in BERT cannot exceed 512, i.e., the sentence length plus [CLS] and [SEP] cannot exceed 512.)

Because the Transformer uses Self-Attention, every position can learn information from words at every other position, so all positions are equivalent in this respect; using more or fewer [CLS]/[SEP] symbols is possible, and even each token could in principle be followed by a [SEP].

The final input Embedding is composed of Token Embeddings (word vector), Segment Embeddings (sentence vector) and Position Embeddings (position vector), as shown below.

[Figure: composition of BERT's input Embeddings]
  • Token Embeddings: The most traditional word vector, like the input of the previous language model, it is the result of embedding the token into a high-dimensional space.

    BERT adopts a new Tokenize (word segmentation) strategy:

    • The previous strategies are: word-level (splitting sentences into words), char-level (splitting into characters);
    • BERT uses subwords : that is, try not to decompose common words, but decompose infrequent words into commonly used subwords . For example "annoyingly" might be considered a rare word and could be broken down into "annoying" and "##ly" (and possibly "annoy", "##ing" and "##ly"). It also enables the model to deal with words that have never been seen before (OOV problem, Out Of Vocabulary) to a certain extent.
  • Segment Embeddings: This vector is used to describe the global semantic information of the text, indicating which sentence the word belongs to (NSP requires two sentences).

  • Position Embeddings: Unlike Transformer (pre-set), it is a learned Embedding vector.

    In BERT, if there are two sentences "Hello world" and "Hi there", "Hello" and "Hi" will have exactly the same Position Embedding, and "world" and "there" will also have the same Position Embedding.

    But for "I think, therefore I am", the first "I" and the second "I" have different Position Embeddings.

Token, Segment, and Position Embeddings are all learned automatically. Taking an input of 2 sentences as an example, each embedding layer and its dimensions (the word vectors of BERT-Base are 768-dimensional) can be shown as follows:

[Figure: the three Embedding layers and their dimensions for a two-sentence input]
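As a concrete illustration of this input format, here is a hedged example with the HuggingFace transformers library (assumed installed): it shows how [CLS]/[SEP] and the segment ids (token_type_ids) are produced for a sentence pair; position ids are created inside the model.

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok("my dog is hairy", "he likes playing")       # sentence pair

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# typically: ['[CLS]', 'my', 'dog', 'is', 'hairy', '[SEP]', 'he', 'likes', 'playing', '[SEP]']
print(enc["token_type_ids"])                           # segment ids: 0 for sentence A, 1 for sentence B
print(tok.tokenize("annoyingly"))                      # subword split, e.g. ['annoying', '##ly']
```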

3. Stage 2: Fine-tuning

The practice at this stage is the same as GPT. There are 4 different types of downstream tasks mentioned in the paper:

[Figure: the four types of downstream tasks for BERT fine-tuning]
  • Sentence pair classification: The input of the text matching task is two different sentences. In the end, it is still a two-category problem. If it is to be implemented with BERT, the basic code and process are consistent with the classification problem. The output dimension of the fully connected layer is 2. However, in practical engineering applications, if BERT is used directly for the text matching problem, the final effect may not be good. Generally, some structures similar to the twin network can be used to solve the text matching problem, and this can be further explored.
  • Text classification (single sentence classification): Text classification is what BERT is best at. The most basic way is to load the pre-trained BERT and add a fully connected layer to the output [CLS] for classification. The dimension of the output of the fully connected layer is the number of categories we want to classify. Of course, another network layer can also be added before the classification to achieve the corresponding purpose.
  • Extractive question answering: Note that since BERT does not have the ability to generate, it can only do extractive question and answer. It can be understood like this: this answer is actually a process of finding an answer in an article, and training is performed by predicting the id of the answer's start and end positions in the article.
  • Sequence tagging (single sentence tagging): Since the general word segmentation task, part-of-speech tagging and naming body recognition tasks are all sequence tagging problems. For this type of problem, each token of the input sentence needs to predict its label, so sequence labeling is a single-sentence multi-label classification task, and all outputs of the BERT model (except special symbols) must give a prediction result.

Migration strategy: after obtaining a pre-trained BERT model, we have two choices: ① use BERT as a feature extractor or sentence-vector encoder, without fine-tuning on the downstream task; ② use BERT as the main model of the downstream task and fine-tune it on that task. A short fine-tuning sketch follows.
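A hedged fine-tuning sketch for single-sentence classification with the transformers library (assumed installed); the two-sample batch and single optimizer step are illustrative placeholders, not the paper's training recipe.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tok(["great movie", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
out = model(**batch, labels=labels)       # the classification head sits on top of [CLS]
out.loss.backward()                       # fine-tune all BERT parameters end to end
optimizer.step()
```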

4. Summary evaluation

BERT draws on ELMo, GPT and CBOW; its main proposals are the Masked Language Model and NSP (NSP turns out to matter little, and MLM borrows the idea of CBOW). BERT's biggest strengths are its good results and broad applicability: almost any NLP task can apply its two-stage recipe, usually with a clear improvement. Its flaws are as follows:

  1. Huge consumption of hardware resources and long training time;
  2. Inputs are inconsistent between pre-training and fine-tuning: [MASK] appears only in pre-training, which hurts the model's performance during fine-tuning;
  3. The semantic distinction of words with the same sentence pattern is not obvious;
  4. Not suitable for generative tasks;
  5. It is not very friendly to very long texts and cannot be used for document-level NLP tasks. It is only suitable for sentence and paragraph-level tasks.

5.2.4 Basic Paradigm of Pretraining

GPT and BERT represent the two most basic pre-training paradigms, which are called "Auto-regressive model" (AR) and "Auto-encoding model" (AE), respectively. Suitable for different types of downstream tasks.

[Figure: comparison of the auto-regressive and auto-encoding pre-training paradigms]

Among them, GPT is usually better suited to text generation tasks and BERT to text understanding tasks; both are built from parts of the Transformer structure.

  • Auto-regressive model : A classic language modeling task that predicts the next word based on the content of the read text. For example, Transformer's decoder applies a masking mechanism in the training phase, so that only the content before a certain word can be seen during the attention calculation process, but not the content behind it. Although such pretrained models can be fine-tuned and achieve excellent results on many downstream tasks, their most natural application is text generation. The representative of this model is the GPT series model.
  • Auto-encoding model : pre-train by destroying the input text in some way (such as masking the input text in Bert) and trying to reconstruct the original text. In a sense, they correspond to Transformer's encoders, because they don't need any masks, and every position has access to the full input. These models often build bidirectional encoded representations of entire sentences, they can be fine-tuned and achieve excellent results on many downstream tasks, the most natural applications of which are sentence classification or sequence labeling. A typical representative of this type of model is BERT.

5.3 Pre-training model: PTM After GPT&BERT

5.3.1 Development Overview

Starting from GPT and BERT, some improvement schemes are proposed. The following figure shows the main improvement models at present (2021.6) (from: " Pre-Trained Models: Past, Present and Future ").

[Figure: overview of the main pre-trained models and their lineage (as of 2021.6)]

The research on pre-training models after GPT and BERT can be divided into the following aspects:

1. Improve the architecture design

Two directions, Unified Sequence Modeling and Cognitive-inspired Architectures.

  • Unified Sequence Model
    • Combining autoregressive and autoencoder modeling. For example, XLNet permutes the tokens in pre-training and then applies the autoregressive prediction paradigm.
    • Apply a generalized encoder-decoder. Neither encoder-only structures (like BERT) nor decoder-only structures (like GPT) solve an important problem: filling in blanks of variable length.
  • Cognitive Heuristic Architecture
    • Maintain working memory. For example, Transformer-XL introduces segment-level recursion and relative position encoding, and CogQA proposes to maintain cognitive maps in multi-hop reading.
    • Sustainable long- and short-term memory. For example, replacing the feed-forward network of a Transformer layer with a large key-value memory network, or extending masked pre-training to autoregressive generation.

2. Leverage multi-source data

PTMs utilizing multi-source heterogeneous data, including multilingual PTMs, multimodal PTMs, and knowledge-augmented PTMs.

  • More modalities: Besides images and text, video and audio can also be used for multimodal pre-training.
  • Deeper interpretation: Deep learning visualization tools can be used for interpretation before multimodal training.
  • More downstream applications: multimodal pre-training can be applied to image-text retrieval, image-text generation, text-to-image generation, and other downstream tasks.
  • Transfer Learning: For a multimodal multilingual model to handle different languages, pre-training requires data for each language.

3. Improve computing efficiency

It mainly focuses on three aspects: system-level optimization, efficient learning and model compression strategies.

  • System-level optimization: single-device optimization, multi-device optimization;
  • efficient learning
    • Efficient training method: adaptively use different learning rates at different levels to speed up the convergence speed when the batch size is large.
    • Efficient model architecture: More model architecture variants can also reduce computational complexity and improve the efficiency of training PTMs.
  • model compression
    • Parameter sharing: PTMs can be compressed by sharing parameters among similar units.
    • Knowledge distillation: train a small model to reproduce the behavior of a large model. Even with a small distillation model for inference, memory usage and time overhead are reduced.
    • Model pruning: pruning the weights of attention and linear layers to reduce the number of parameters in PTMs while maintaining comparable performance to the original model.
    • Model quantization: refers to compressing floating-point parameters with higher precision into floating-point parameters with lower precision.

4. Computational Efficiency

  • Data Transfer: Develop an efficient distributed deep learning framework.
  • Parallel strategy: In the choice of parallel strategy, data parallelism, model parallelism, pipeline parallelism, and various hybrid parallelism methods can all find their best use according to the structure and hardware configuration of the neural network.
  • Large-scale training: In view of the insufficient support for model parallelism and pipeline parallelism in deep learning frameworks, a framework dedicated to large-scale training was developed.
  • Wrappers and Plugins: Develop various libraries specialized for some specific algorithms by manually inserting data routing operations between computational operations on existing frameworks.

5.3.2 XLM

Paper: " Cross-lingual Language Model Pretraining " (2019.1 Facebook)

XLM is a cross-lingual language model optimized on top of BERT, with significant improvements on multiple cross-lingual understanding (XLU) benchmarks. Its contributions are: ① a new unsupervised method for learning cross-lingual representations; ② a new supervised training objective that improves cross-lingual pre-training when parallel data is available; ③ cross-lingual models can significantly improve the perplexity of low-resource languages.

Compared with BERT, the work done by XLM is as follows:

  1. A Byte-Pair Encoding (BPE) encoding method (a method of Subword) is used to replace the way Bert uses words or characters as model input.

    • First, each corpus is sampled separately, and then the sampled corpus of each language is spliced, and finally normal BPE processing is performed. Divide text input into the most common subwords in all languages, thereby increasing the vocabulary shared between different languages.
  2. Three training objectives are proposed, the first two require only monolingual data (for unsupervised training) and the third requires parallel corpora (for supervised training).

    • Causal Language Modeling (CLM): gives the probability of the next word given the preceding words;
    • Masked Language Modeling (MLM): unlike BERT pre-training, where the input is a pair of sentences, XLM uses a text stream consisting of an arbitrary number of sentences (each truncated at 256 tokens) instead of sentence pairs; that is, several physically adjacent sentences are treated as one sentence group, and two sentence groups are chosen as an input pair;
    • Translation Language Modeling (TLM): TLM extends MLM by concatenating parallel translated sentences rather than monolingual text streams. As shown in the figure below, some tokens are randomly masked in both the source and the target sentence. When predicting a masked token in the English sentence, the model can attend not only to the English tokens but also to the French translation, which guides it to align the English and French representations; in particular, the model can exploit the context of the target sentence when the source sentence alone is insufficient to infer the masked token.

[Figure: the MLM and TLM objectives in XLM]

The complete XLM model is trained with the MLM and TLM strategies used alternately.

5.3.3 ERNIE

Paper: " ERNIE: Enhanced Representation through Knowledge Integration " (2019.4 Baidu)

BERT's [MASK] covers single characters rather than entities or phrases, and does not take lexical or syntactic structure into account. ERNIE therefore exploits richer semantic knowledge and more semantic tasks during pre-training, mainly targeting Chinese.

  • Introduce knowledge (ie pre-identified entities) in the pre-training stage, and introduce 3 Mask strategies:

    • Basic-Level Masking: Like BERT, masking the Subword cannot obtain high-level semantics;

    • Phrase-Level Masking: Mask continuous phrases;

    • Entity-Level Masking: Mask pre-identified entities;

[Figure: comparison of BERT's and ERNIE's masking strategies]

  • Introduce forum dialogue data in the pre-training stage:

    • Use a Dialogue Language Model (DLM) to model the Query-Response dialogue structure: take the dialogue pair as input, introduce a Dialogue Embedding to identify the dialogue roles, and use a Dialogue Response Loss (DRS) to learn implicit dialogue relations, further improving the model's semantic representation ability.

ERNIE 2.0

Paper: " ERNIE 2.0: A Continual Pre-training Framework for Language Understanding " (2019.7 Baidu)

The architecture diagram of ERNIE 2.0 is as follows:

[Figure: ERNIE 2.0 continual pre-training architecture]

On the basis of ERNIE 1.0, multi-task learning is introduced in the pre-training stage, including three types of learning tasks, namely:

  • Word-aware Tasks: learn to predict lexical units (words) in a sentence.
  • Structure-aware Tasks: Learn syntactic-level information, and learn to reconstruct and reorder multiple sentence structures.
  • Semantic-aware Tasks: learn information at the semantic level, and learn to judge the logical relationship between sentences, such as causal relationship, transition relationship, juxtaposition relationship, etc.

5.3.4 BERT wwm

Paper: " Pre-Training with WholeWord Masking for Chinese BERT " (2019.7 Harbin Institute of Technology + Xunfei), the introduction and use can be found in: Chinese-BERT-wwm .

Whole Word Masking (wwm). Compared with Baidu's ERNIE, BERT-wwm does not only mask entity words and phrases as continuous spans; it masks as a continuous span any sequence of characters that forms a Chinese word. Concretely, for Chinese, if part of a complete word is masked, the other characters of that word are masked as well, i.e., all the Chinese characters making up one word are masked together; this is whole-word masking.

Note that "mask" here means the generalized masking operation (replace with [MASK]; keep the original word; randomly replace with another word), not only the case where a word is replaced with the [MASK] label.

The purpose of this is: During the pre-training process, the model can learn the semantic information of the word, and the embedding of the word after the training has the semantic information of the word, which is friendly to various Chinese NLP tasks.

The subsequent related improved models are: BERT-wwm-ext, RoBERTa-wwm-ext, RoBERTa-wwm-ext-large, RBT3, RBTL3.

5.3.5 XLNet

Paper: XLNet: Generalized Autoregressive Pretraining for Language Understanding (2019.6, CMU + Google Brain)

XLNet is improved based on two types of pre-training models such as BERT and GPT, and absorbs the strengths of the two types of models respectively. The main idea is to use a new method to achieve bidirectional encoding based on AR (autoregression), and to extend the left-to-right modeling of GPT into out-of-order modeling to make up for the defect that GPT cannot obtain bidirectional context information. The improvement points are as follows (specifically: Interpretation of Google XLNet Principles ):

  • Permutation Language Modeling, permutation language model (Permutation LM)
    • The essence of PLM is that the joint probability of a language model can be factorized in multiple ways: the sequential factorization is extended to random factorization orders while the original position information of each word is preserved; the factorization orders are traversed and the model parameters are shared, so the model can learn the context on both sides of the predicted word.
  • Two-Stream Self-Attention
    • Solve the problem of no target location information caused by PLM.
  • Transformer-XL
    • Using Transformer-XL to learn longer distances

5.3.6 RoBERTa

Paper: RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019.7.26, Facebook)

The RoBERTa model is a more robust and optimized BERT model. The specific adjustment points are as follows:

  • More training data / larger batch size / longer training time;

    • In terms of training data, RoBERTa uses 160 GB of training text while BERT uses only 16 GB, and trains for up to 500K steps.

      There are also two models, roberta-base and roberta-large.

  • Dynamic Masking (Dynamic Mask mechanism):

    • With the dynamic mask strategy, RoBERTa first makes 10 copies of the pre-training data, and each copy receives a different static mask (i.e., each copy independently selects 15% of the tokens to mask, so the same sentence gets 10 different masking patterns); every piece of pre-training data thus has 10 differently masked versions.
    • Each copy is then trained for N/10 epochs, so over the N epochs each sequence is seen with different masked positions. (Note that the copies are not all fed into the same epoch but into different epochs; for example, with dupe_factor=10 and epoch=40, each masking pattern is used 4 times during training.)
  • No NSP and Input Format:

    • Removed NSP (Next Sentence Prediction) task;
    • RoBERTa uses the Full-Sentences format: with next-sentence prediction removed, each training sample is a continuous span sampled from a document. The results show this works better, indicating that the NSP task is not necessary.

5.3.7 T5

Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (2019.10, Google)

T5 (Text-to-Text Transfer Transformer) is a unified text-to-text task model built on the traditional encoder-decoder Transformer. To handle all NLP tasks, every task is converted into a unified text-to-text format, where the input and output are always text strings. This framework allows the same model, loss function and hyperparameters to be used on any NLP task. For details see: Google pre-trained language model T5.

5.3.8 ALBert

Paper: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (https://arxiv.org/abs/1909.11942) (2019.10, Google)

To address the excessive parameter counts of current pre-trained models, researchers at Google designed a lightweight BERT (A Lite BERT, ALBERT) with far fewer parameters than the traditional BERT architecture. ALBERT proposes two methods that greatly reduce the parameter count of the pre-trained model, and additionally replaces BERT's Next Sentence Prediction (NSP) task with a Sentence-Order Prediction (SOP) task; it achieves State-Of-The-Art (SOTA) results on multiple natural language understanding tasks.

The backbone network of the ALBERT architecture is similar to BERT, that is, using the Transformer encoder and the GELU nonlinear activation function, the three major adjustments are as follows:

  1. Word embedding parameter factorization:

    • The authors believe that word vectors only memorize a relatively small amount of word information, and more semantic and syntactic information is memorized by the hidden layer. Therefore, they believe that the dimension of word embedding does not have to be consistent with the dimension of the hidden layer. To reduce the number of parameters by reducing the dimension of the word embedding, the word embedding parameters are factored, and they are decomposed into two small matrices.
    • Instead of mapping one-hot vectors directly into a hidden space of size H, they are first mapped to a low-dimensional word embedding space of size E and then projected into the hidden space. This decomposition reduces the word embedding parameters from $O(V \times H)$ to $O(V \times E + E \times H)$; when H is much larger than E, the parameter count drops significantly. For example, with $V = 30000$, $H = 768$ and $E = 128$, the embedding parameters drop from about 23.0M to about 3.9M.
  2. Parameter sharing across layers:

    • The parameters of the fully connected layer and the attention layer are shared to avoid the increase of the parameter amount with the increase of the network depth, that is, ALBERT still has multiple layers of deep connections, but the parameters between the layers are the same.
    • Mainly to reduce the amount of parameters ( a slight decrease in performance , a large reduction in parameters, which is a good thing overall).
  3. Inter-sentence order prediction (SOP)

    • It is proposed to replace the NSP task in BERT with the Sentence-Order Prediction (SOP) task, which is to give the model two sentences and let the model predict the order of the two sentences.

    • SOP selects positive examples the same way as BERT (two consecutive segments from the same document), but negative examples differ from BERT's: they are also two consecutive segments from the same document, just with their order swapped. This avoids topic prediction and focuses purely on modeling the coherence between sentences.

      In previous studies, it has been shown that NSP is not a suitable pre-training task. It is speculated that the reason is that the model not only considers the coherence between the two sentences, but also the topic of the two sentences when judging the relationship between the two sentences. The topics of the two documents are usually different, and the model will analyze the relationship between the two sentences more through the topic, rather than the coherence between sentences, which makes the NSP task a relatively simple task.

Two parameter reduction techniques can act as some form of regularization, making training more stable and facilitating generalization.

The smallest model, ALBERT-base, has only 12M parameters, with results 1-2 points below BERT, and the largest, ALBERT-xxlarge, has 233M. The reduction in parameter count is clearly significant, but the gain in speed is not: the method does not actually reduce the amount of computation, so inference time is not reduced.

