"Natural Language Processing" chapter7-pre-training language model

These are my study notes from reading "Natural Language Processing: Methods Based on Pre-trained Models", recording the learning process; please buy the book for the full details.
They also draw on two of Mushen's (Mu Li's) paper-reading videos:
GPT, GPT-2, GPT-3 paper walkthrough
BERT paper walkthrough, paragraph by paragraph

Overview

The core of natural language processing is how to model language well. Broadly speaking, pre-trained language models refer to language models trained in advance on large-scale data, including the static word vector models represented by early Word2vec and GloVe, as well as context-dependent dynamic word vector models such as CoVe and ELMo. However, it was only after the emergence in 2018 of deep Transformer-based representation models such as GPT and BERT that the term "pre-trained language model" became truly widely known.


Big Data

Obtaining sufficiently large-scale text data is the starting point for training a good pre-trained language model. Pre-training data therefore needs to ensure both "quality" and "quantity".

  • "Quality preservation" means that the quality of the pre-training corpus should be as high as possible to avoid mixing too much low-quality corpus.
  • "Maintenance" means that the size of the pre-training corpus should be as large as possible, so as to obtain richer context information.

In practice, training data often comes from many different sources, and it is very costly to preprocess data from every source carefully. Therefore, when preparing pre-training data, the corpus is usually not processed very finely; only problems common to the whole corpus are handled, and the proportion of low-quality text is further diluted by increasing the size of the corpus, thereby reducing its negative impact on pre-training. This requires a trade-off between data-processing effort and data quality.

Big Models

Once there is big data, there must be a model that can absorb it. Data size and model size are, to a certain extent, positively correlated: a model with sufficient capacity is needed to learn and store the various features in big data. In machine learning, "capacity" is usually reflected in the number of parameters of the model. Designing a model with a large number of parameters mainly involves two considerations:

  • The model needs a high degree of parallelism to compensate for the drop in training speed caused by its large size;
  • The model must be able to capture contextual information, so as to fully mine the rich semantic information in large text corpora.

Combining the above two points, Transformer-based neural networks have become the best choice for building pre-trained language models. First, the Transformer is highly parallelizable: the multi-head self-attention (Multi-head Self-attention) mechanism at its core does not rely on sequential modeling and can therefore be computed in parallel. In contrast, traditional neural language models are usually based on recurrent neural networks (RNN), which must process the sequence step by step and thus parallelize poorly. Second, multi-head self-attention effectively captures the degree of association between different words and, through the multiple heads, describes this association from different dimensions, allowing the model to obtain more accurate representations. Therefore, mainstream pre-trained language models all, without exception, use the Transformer as their main structure.
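As an illustration of the parallelism argument, here is a minimal single-head self-attention sketch (my own toy example, not from the book): the pairwise association scores between all positions are computed with a single matrix multiplication, with no sequential loop over time steps; a multi-head layer simply repeats this computation per head and concatenates the results.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_*: (d_model, d_head)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # (seq_len, seq_len) association scores, all pairs at once
    return F.softmax(scores, dim=-1) @ v      # weighted sum for every position in one matmul

x = torch.randn(8, 64)                        # toy sequence of 8 positions
w_q, w_k, w_v = (torch.randn(64, 16) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)        # (8, 16); one head of a multi-head layer
```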

Big Computing Power

Training pre-trained language models mainly relies on the graphics processing unit (Graphics Processing Unit, GPU) and the tensor processing unit (Tensor Processing Unit, TPU).

GPT

# GPT is the Transformer's decoder #
# BERT appeared four months later; the BERT-base model was about the same size as the GPT model #
# GPT-3: "brute force works miracles" #


In 2018, OpenAI proposed the Generative Pre-Training (GPT) model to improve the performance of natural language understanding tasks, officially bringing natural language processing into the "pre-training" era. The "pre-training" era means using larger-scale text data and deeper neural network models to learn richer text semantic representations. At the same time, GPT broke down the barriers between different natural language processing tasks: building a task-specific model no longer requires deep knowledge of the task background; one only needs to apply the pre-trained language model according to the task's input and output format to obtain good results. GPT thus proposed a new natural language processing paradigm of "generative pre-training + discriminative task fine-tuning", which makes building natural language processing models much less complicated.

# GPT: improving language understanding with generative pre-training #
# GPT chose the more difficult optimization problem, so its ceiling is higher #


Unsupervised pre-training

The overall structure of GPT is a Transformer-based unidirectional language model (a Transformer decoder), which models the input text from left to right.

GPT uses the conventional language modeling objective: for a given text sequence $x = x_1 x_2 \dots x_n$, it maximizes the log-likelihood $L^{\mathrm{PT}}$:
$$L^{\mathrm{PT}} = \sum_{i} \log P(x_i \mid x_{i-k} \dots x_{i-1}; \theta)$$
where $k$ is the window size of the language model, i.e., the word $x_i$ at the current position is predicted from the $k$ preceding words, and $\theta$ denotes the parameters of the neural network. This likelihood function can be optimized with stochastic gradient descent.

Specifically, GPT uses a multi-layer Transformer as the basic structure of the model. For a window word sequence $x' = x_{-k} \dots x_{-1}$ of length $k$, the modeling probability $P$ is computed as follows:
$$h^{[0]} = e_{x'} W^e + W^p$$
$$h^{[l]} = \text{Transformer-Block}(h^{[l-1]}), \quad \forall l \in \{1, 2, \dots, L\}$$
$$P(x) = \text{Softmax}(h^{[L]} {W^e}^\top)$$
where $e_{x'}$ is the one-hot representation of $x'$; $W^e$ is the word embedding matrix; $W^p$ is the position embedding matrix (only the position vectors for the window $x'$ are taken); and $L$ is the total number of Transformer layers.
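The three formulas above can be condensed into a small, hedged sketch (my own illustration; the Transformer blocks are left abstract, the shapes are toy values, and the output projection reuses the word embedding matrix as in $P(x) = \text{Softmax}(h^{[L]}{W^e}^\top)$):

```python
import torch
import torch.nn.functional as F

def gpt_logits(token_ids, W_e, W_p, blocks):
    # token_ids: (k,) window of word ids; W_e: (V, d) word embeddings; W_p: (k, d) position embeddings
    h = W_e[token_ids] + W_p[: len(token_ids)]   # h^[0] = e_{x'} W^e + W^p
    for block in blocks:                         # h^[l] = Transformer-Block(h^[l-1]), l = 1..L
        h = block(h)
    return h @ W_e.T                             # pre-softmax scores; Softmax over them gives P(x)

def lm_loss(token_ids, W_e, W_p, blocks):
    # L^PT = sum_i log P(x_i | x_{i-k} ... x_{i-1}); cross_entropy returns its negative
    logits = gpt_logits(token_ids[:-1], W_e, W_p, blocks)
    return F.cross_entropy(logits, token_ids[1:])

# toy check with identity "blocks", just to verify shapes
V, d, k = 100, 32, 8
W_e, W_p = torch.randn(V, d), torch.randn(k, d)
ids = torch.randint(0, V, (k,))
loss = lm_loss(ids, W_e, W_p, [torch.nn.Identity()])
```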


Supervised downstream task fine-tuning

In the pre-training phase, GPT uses large-scale data to train a deep Transformer language model, which has thus learned a general semantic representation of text. The purpose of fine-tuning (Fine-tuning) is to adapt this general semantic representation to the characteristics of the downstream task (Downstream task), i.e., to perform domain adaptation on top of the general representation.

Downstream-task fine-tuning is usually trained and optimized with labeled data. Assume the labeled dataset of the downstream task is $C$, where the input of each example is a text sequence $x = x_1 x_2 \dots x_n$ of length $n$ and the corresponding label is $y$. The text sequence is first fed into the pre-trained GPT to obtain the hidden-layer output $h_n^{[L]}$ corresponding to the last word; this output is then transformed by a fully connected layer to predict the final label.

In addition, to further improve the generality and convergence speed of the fine-tuned model, the pre-training loss can be added with a certain weight to the downstream-task fine-tuning. This alleviates the catastrophic forgetting (Catastrophic Forgetting) problem during fine-tuning: since the training goal during fine-tuning is to optimize performance on the downstream dataset, emphasizing task specificity, it inevitably overwrites or erases part of the general knowledge learned in the pre-training stage and loses some generality. Combining the downstream-task fine-tuning loss with the pre-training loss effectively mitigates catastrophic forgetting.
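A hedged sketch of this combined objective (my own illustration; the weight `lambda_pt`, the classifier matrix `W_y`, and the pre-computed pre-training loss are all assumptions, not values from the book):

```python
import torch
import torch.nn.functional as F

def finetune_loss(h_last, label, W_y, pretrain_loss, lambda_pt=0.5):
    # h_last: (d,) hidden state h_n^[L] of the last word from the pre-trained GPT
    # label:  scalar LongTensor holding the gold class index
    # W_y:    (d, num_labels) fully connected layer that predicts the downstream label
    cls_logits = h_last @ W_y
    task_loss = F.cross_entropy(cls_logits.unsqueeze(0), label.view(1))
    return task_loss + lambda_pt * pretrain_loss   # fine-tuning loss + weighted pre-training loss

# toy usage: a 2-class task on top of a 32-dimensional hidden state
h_last, W_y = torch.randn(32), torch.randn(32, 2)
label = torch.tensor(1)
pretrain_loss = torch.tensor(3.0)                  # e.g. computed with the LM sketch shown earlier
loss = finetune_loss(h_last, label, W_y, pretrain_loss)
```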

Adapt to different downstream tasks

The input forms of different tasks differ, so the input format of GPT must be adapted to each task.
The following are the GPT input and output formats for 4 typical tasks: single-sentence text classification, textual entailment, similarity calculation, and multiple-choice reading comprehension; a small formatting sketch follows the four examples.


(1) Single-sentence text classification
Suppose the input is $x = x_1 x_2 \dots x_n$. A single-sentence classification example is fed into GPT in the following form:
$$\texttt{<s>}\ x_1\, x_2 \dots x_n\ \texttt{<e>}$$

(2) Textual entailment
The input of textual entailment consists of two pieces of text, and the output is a classification label judging the entailment relation between them. Note that the premise (Premise) and hypothesis (Hypothesis) in textual entailment are ordered, so their order must be fixed:
$$\texttt{<s>}\ x_1^{(1)}\, x_2^{(1)} \dots x_n^{(1)}\ \$\ x_1^{(2)}\, x_2^{(2)} \dots x_m^{(2)}\ \texttt{<e>}$$

(3) Similarity calculation
Similarity calculation also takes two pieces of text as input, but unlike textual entailment, there is no order between the two texts. The two orderings are therefore each fed into GPT, yielding two hidden-layer representations; the two representations are added together, and the similarity is predicted through a fully connected layer:
$$\texttt{<s>}\ x_1^{(1)}\, x_2^{(1)} \dots x_n^{(1)}\ \$\ x_1^{(2)}\, x_2^{(2)} \dots x_m^{(2)}\ \texttt{<e>}$$
$$\texttt{<s>}\ x_1^{(2)}\, x_2^{(2)} \dots x_m^{(2)}\ \$\ x_1^{(1)}\, x_2^{(1)} \dots x_n^{(1)}\ \texttt{<e>}$$

(4) Multiple-choice reading comprehension
Multiple-choice reading comprehension asks the machine to read a passage and select the correct answer to a question from several options, i.e., the input is a (passage, question, option) triple and the label is the index of the correct option. Suppose the passage is $p = p_1 p_2 \dots p_n$, the question is $q = q_1 q_2 \dots q_m$, the $i$-th option is $c^{(i)} = c_1^{(i)} c_2^{(i)} \dots c_k^{(i)}$, and there are $N$ options in total:
$$\texttt{<s>}\ p_1 p_2 \dots p_n\ \$\ q_1 q_2 \dots q_m\ \$\ c_1^{(1)} c_2^{(1)} \dots c_k^{(1)}\ \texttt{<e>}$$
$$\texttt{<s>}\ p_1 p_2 \dots p_n\ \$\ q_1 q_2 \dots q_m\ \$\ c_1^{(2)} c_2^{(2)} \dots c_k^{(2)}\ \texttt{<e>}$$
$$\dots$$
$$\texttt{<s>}\ p_1 p_2 \dots p_n\ \$\ q_1 q_2 \dots q_m\ \$\ c_1^{(N)} c_2^{(N)} \dots c_k^{(N)}\ \texttt{<e>}$$
Each (passage, question, option) triple is modeled by GPT to obtain the corresponding hidden-layer representation, and a fully connected layer produces a score for each option. Finally, the scores of the $N$ options are concatenated and normalized with a Softmax function to obtain the multiple-choice probabilities, and the model is trained with the cross-entropy loss.
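To make the four formats concrete, here is a small formatting sketch (the marker strings and function names are my own and only illustrative; the real GPT tokenizer handles special tokens differently):

```python
# Serialize task inputs into single token sequences with start, separator, and end markers.
START, SEP, END = "<s>", "$", "<e>"

def single_sentence(tokens):
    # (1) single-sentence text classification
    return [START, *tokens, END]

def text_pair(tokens_a, tokens_b):
    # (2) textual entailment (premise $ hypothesis) and, in both orders, (3) similarity calculation
    return [START, *tokens_a, SEP, *tokens_b, END]

def multiple_choice(passage, question, options):
    # (4) one sequence per option; each is scored separately and the scores are softmax-normalized
    return [[START, *passage, SEP, *question, SEP, *opt, END] for opt in options]

print(text_pair("the cat sat".split(), "a cat was sitting".split()))
```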


GPT-2: Language Models are Unsupervised Multitask Learners
By reformulating different natural language processing tasks as text generation, the model can generalize across tasks.
Highlight: zero-shot, with no training on the downstream tasks.


A prompt tells the model which task to perform: a natural-language description or instruction is used as a prefix to represent the target task.
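For illustration only (the exact wording is my own, not taken from the book), zero-shot prompts in this style might look like the following; the model simply continues the text, and the continuation is read off as the task output.

```python
prompts = [
    "Translate English to French: cheese =>",   # translation
    "Article: <article text> TL;DR:",           # summarization
    "Question: Who wrote Hamlet? Answer:",      # question answering
]
```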

GPT-3: Language Models are Few-Shot Learners
The GPT-3 paper is essentially a technical report.


GPT-3 performs no gradient updates or fine-tuning; it can generalize well with only a small number of labeled samples of the target task.
GPT-3 mainly demonstrates the few-shot learning (Few-shot learning) ability of a super-large-scale language model.

Why does an autoregressive model have the ability to learn from only a few samples? The key lies in the ordered nature of the data itself: sequential data that appears contiguously often contains the input-output patterns of the same task.


"Context learning":
The learning process of language models can actually be seen as a process of meta-learning from many different tasks.

The training of the language model on a single sequence is the inner loop (Inner loop), also known as in-context learning (In-Context Learning); training across different sequences corresponds to the outer loop (Outer loop) of meta-learning, which provides generalization across tasks and keeps the model from overfitting to any specific task. The scale and quality of the data play a key role in GPT-3's few-shot learning ability.

Because a number of labeled samples must be included as conditioning context, the input sequence of GPT-3 can become long; GPT-3 uses an input length of 2048, which places higher demands on memory and computation than other models.
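A hedged sketch of how such a few-shot context can be assembled (the formatting, field names, and example task are my own, purely for illustration):

```python
def few_shot_prompt(examples, query, task_description=""):
    # examples: list of (input_text, output_text) pairs from the target task
    lines = [task_description] if task_description else []
    for x, y in examples:
        lines.append(f"Input: {x}\nOutput: {y}")
    lines.append(f"Input: {query}\nOutput:")   # the model continues from here, with no gradient updates
    return "\n\n".join(lines)

print(few_shot_prompt(
    [("I loved this film!", "positive"), ("Terribly boring.", "negative")],
    "The plot was gripping.",
    task_description="Classify the sentiment of each movie review.",
))
```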


The loss decreases roughly linearly as the amount of computation increases exponentially.
In short, given enough scale, language models can work miracles through brute force.

BERT

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model based on deep Transformers, proposed by Google in 2018. BERT not only makes full use of large-scale unlabeled text to mine rich semantic information, but also further increases the depth of natural language processing models.

The basic model structure of BERT consists of multi-layer Transformers and includes two pre-training tasks: the masked language model (Masked Language Model, MLM) and next sentence prediction (Next Sentence Prediction, NSP).

The input of the model is the concatenation of two pieces of text, $x^{1}$ and $x^{2}$; BERT then models the contextual semantic representation and finally learns the masked language model and next sentence prediction tasks. Note that the masked language model places no special requirement on the input form, which can be one piece of text or two, whereas next sentence prediction requires the input to be two segments of text. Therefore, the input of BERT in the pre-training stage is unified as the concatenation of two text segments.
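A hedged sketch of this unified pre-training input (my own simplification; the [CLS]/[SEP]/[MASK] token names and the 15% masking rate follow the original BERT paper, and the paper's 80/10/10 mask/replace/keep strategy is omitted for brevity):

```python
import random

def make_pretraining_example(tokens_a, tokens_b, is_next, mask_prob=0.15):
    # Concatenate two text segments, mask some words for MLM, and attach the NSP label.
    tokens = ["[CLS]", *tokens_a, "[SEP]", *tokens_b, "[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    mlm_labels = [None] * len(tokens)               # None = this position is not predicted
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if random.random() < mask_prob:
            mlm_labels[i] = tok                     # the model must recover the original word
            tokens[i] = "[MASK]"
    return tokens, segment_ids, mlm_labels, int(is_next)   # NSP label: 1 = the second segment follows the first
```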


BERT configurations mainly vary in the number of Transformer layers, the hidden-layer dimension, and the number of attention heads; the model comes mainly in two sizes, BERT-base and BERT-large.


The parameter count of the BERT model can be estimated as follows. The vocabulary size is 30k, so the word embedding layer contributes roughly $30\text{k} \times H$ parameters. Within each layer, the Q, K, and V projection matrices merged across all heads, each of size $H \times H$, plus the $H \times H$ output projection give about $4H^2$ parameters, and the two feed-forward (MLP) matrices, of sizes $H \times 4H$ and $4H \times H$, give about $8H^2$. Multiplying the per-layer count by the number of layers $L$ and adding the embeddings yields rough estimates of the parameters of BERT-base and BERT-large.
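Following this reasoning, a rough back-of-the-envelope calculation (my own sketch; biases, layer normalization, and the position and segment embeddings are ignored):

```python
def approx_bert_params(L, H, vocab=30_000):
    embedding = vocab * H          # word embedding matrix
    attention = 4 * H * H          # merged Q, K, V projections plus the output projection
    feed_forward = 8 * H * H       # the H x 4H and 4H x H feed-forward matrices
    return embedding + L * (attention + feed_forward)

print(f"BERT-base : ~{approx_bert_params(12, 768) / 1e6:.0f}M parameters")   # roughly 108M
print(f"BERT-large: ~{approx_bert_params(24, 1024) / 1e6:.0f}M parameters")  # roughly 333M
```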


WordPiece is used because it allows a relatively small vocabulary; if the vocabulary were too large, most of the learned parameters would be concentrated in the word embedding layer.

BERT's input representation (Input Representation) is the sum of word vectors (Token Embeddings), block vectors (Segment Embeddings), and position vectors (Position Embeddings):


For convenience of computation, these three vectors have the same dimension $e$. The input representation $v$ corresponding to the input sequence is:
$$v = v^t + v^s + v^p$$
(1) Word vector
As in traditional neural network models, the word vector in BERT converts the input text into a real-valued vector representation through a word embedding matrix. Suppose the one-hot representation of the input sequence $x$ is $e^t$; then its word vector representation $v^t$ is:
$$v^t = e^t W^t$$
(2) Block vector
The block vector encodes which block (Segment) the current word belongs to. The block encoding (Segment Encoding) corresponding to each word in the input sequence is the index of the block containing that word (counting from 0):

  • When the input sequence is a single block (e.g., single-sentence text classification), the block encoding of every word is 0;
  • When the input sequence consists of two blocks (e.g., sentence-pair text classification), the block encoding of each word in the first segment is 0 and that of each word in the second segment is 1.
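A minimal sketch of the three-way sum $v = v^t + v^s + v^p$ (my own illustration; the vocabulary size, maximum length, and hidden dimension are assumed defaults):

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    def __init__(self, vocab_size=30_000, max_len=512, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)     # word vector v^t
        self.segment = nn.Embedding(2, hidden)            # block vector v^s (block 0 or 1)
        self.position = nn.Embedding(max_len, hidden)     # position vector v^p (learned)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))               # broadcast over the batch dimension

emb = BertInputEmbedding()
ids = torch.randint(0, 30_000, (1, 10))        # batch of 1, sequence of 10 words
seg = torch.zeros(1, 10, dtype=torch.long)     # single-block input: all block encodings are 0
v = emb(ids, seg)                              # (1, 10, 768)
```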
