[Overview of 100 large models] XLNet (Google)


Welcome to subscribe and read [Large Model & NLP & Algorithm].


Author: Wang Jianing. This article is original content. Repository link: https://github.com/wjn1996/LLMs-NLP-Algo

Subscribe to the column [Large Model & NLP & Algorithm] to get all the NLP, large-model, and algorithm materials the blogger has accumulated over the years: nearly 200 papers, 300 markdown notes written by the blogger, and nearly 100 large-model data cards, to help with NLP research, study, and job hunting.


XLNet large model basic information data card

  • Serial number: 1
  • Model name: XLNet
  • Affiliation: Google
  • Release time: 2019-06
  • Scale: <1B parameters
  • Pre-training corpus: Wikipedia and BookCorpus (13 GB in total), plus Giga5 (16 GB), ClueWeb 2012-B, and Common Crawl
  • Benchmarks: GLUE, SuperGLUE, SQuAD, RACE, and other NLU datasets
  • Model and training method: (1) Permutation language modeling objective: if the sampled factorization order is x4 -> x3 -> x1 -> x2, then when predicting x3 the model can see x4 while x1 and x2 are masked out; if the order is x2 -> x4 -> x3 -> x1, then when predicting x3 the model can see x2 and x4 while x1 is masked out. Whichever order is sampled, the actual word order of the input sentence never changes; masking simply hides everything that comes after the position to be predicted in the factorization order, so training stays autoregressive while still exploiting context on both sides. (2) Two-stream self-attention: the content stream sees all information of a word, including both its content and its position; the query stream can be understood as a position stream that carries only the position information of the word.
  • Open source: GitHub
  • Paper: XLNet
  • Model address: xlnet-large-cased

The XLNet model was jointly released by the CMU and Google Brain teams in June 2019 and published at NeurIPS 2019. It was the next model to top the leaderboards less than a year after BERT: XLNet surpassed BERT on 20 NLP tasks and achieved state-of-the-art results on 18 of them at the time.
Before XLNet appeared, pre-trained language models in NLP fell into two camps: AR models and AE models. Let's first compare the two in the figure below.

AR model vs. AE model
AR (auto-regressive LM) models predict the next word from the words already seen in the sentence; typical representatives are RNN/LSTM language models. For example, to predict the word "a" in the sentence "New York is a city", the model uses "New York is", i.e. each predicted word depends on the words before it. The AR objective is to maximize the likelihood of each word given the words in front of it, and it has an exact mathematical formulation.
AE (auto-encoding LM) models, with BERT as the typical representative, randomly mask some words in the sentence and predict the tokens marked as [MASK] from the surrounding context. Because BERT adds noise to the input and then restores the original words, it is also called a denoising autoencoder. The AE objective is to maximize the likelihood of the [MASK]ed words given the context, but the formulation carries an "≈" sign: BERT assumes the [MASK]ed words are independent of each other, so the information carried by the other [MASK] positions is lost when predicting any one of them, which makes the expression less exact than the AR objective.
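For concreteness, the two objectives can be written out (a sketch following the notation of the XLNet paper, where x̂ denotes the corrupted input and m_t = 1 marks a masked position):

```latex
% AR objective: likelihood under a forward factorization (exact)
\max_\theta \; \log p_\theta(\mathbf{x}) \;=\; \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid \mathbf{x}_{<t}\right)

% AE (BERT) objective: reconstruct the masked tokens \bar{x} from the corrupted
% input \hat{x}; the "≈" comes from assuming the masked tokens are independent
\max_\theta \; \log p_\theta(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \;\approx\; \sum_{t=1}^{T} m_t \, \log p_\theta\!\left(x_t \mid \hat{\mathbf{x}}\right)
```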

Comparison of the advantages and disadvantages of the two language models

The AR model is trained in one direction only: it can predict left-to-right or right-to-left, but it cannot exploit both sides of the context at once, which caps its performance. Its advantage is that generation proceeds left to right, which matches the AR structure perfectly, so AR models do well on generation tasks. The AE model's strength is bidirectional training: both sides of the context are considered when predicting. Its weaknesses are, first, the independence assumption (the [MASK]ed words are treated as independent of one another, so the objective is only an approximation) and, second, the [MASK] token introduced during pre-training does not appear during fine-tuning, creating a discrepancy between the two stages.

Model improvements

Improvement direction

Combining the strengths and weaknesses of the AR and AE models, we want an improved model that retains generation ability, can be trained bidirectionally, and avoids the discrepancy between pre-training and fine-tuning. Two lines of improvement follow: the first starts from the BERT model, the second from the (autoregressive) LM. Since XLNet aims mainly at improving generation ability, the final choice is to start from the LM and give it the ability to use bidirectional context.

Improvement 1 - Permutation

Drawing on the idea of NADE (Neural Autoregressive Distribution Estimation), XLNet introduces permutations so that an AR model can also learn contextual information from both sides. The idea: for a sequence of length n, the prediction order of the words can be shuffled, giving n! possible factorization orders. A simple example: if the original sentence is x = x1 x2 x3 x4 and one sampled order is x1 -> x4 -> x2 -> x3, then the probability of the word x3 at the third position can be expressed as p(x) = p(x3 | x1, x4, x2) · p(x2 | x1, x4) · p(x4 | x1) · p(x1), where p(x3 | x1, x4, x2) is the probability that the third word is x3 given that the first word is x1, the second word is x2, and the fourth word is x4. Even though x4 sits after x3 in the sentence, its information can now be used when predicting x3; in other words, the model gets to use the context on both sides.
As shown in the figure below, for the first factorization order x4 -> x3 -> x1 -> x2, the model can see x4 when predicting x3, while x1 and x2 are masked out; for the second order x2 -> x4 -> x3 -> x1, the model can see x2 and x4 when predicting x3, while x1 is masked out. Whichever order is sampled, the actual word order of the input sentence never changes. Masking simply hides everything that comes after the position to be predicted in the factorization order, so training still works exactly like an autoregressive language model while also exploiting the information in the context.

Factorization order diagram
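To make the masking concrete, here is a minimal sketch (illustrative code, not taken from the official XLNet repository) that turns a sampled factorization order into an attention mask over the original, unshuffled token positions:

```python
import numpy as np

def permutation_attention_mask(order):
    """Build an attention mask from a factorization order.

    order: a permutation of token indices, e.g. [3, 2, 0, 1] for x4 -> x3 -> x1 -> x2.
    Returns mask[i, j] = True if position i may attend to position j, i.e. if
    j comes before i in the factorization order. The token positions themselves
    are never reordered; only the visibility pattern changes.
    """
    n = len(order)
    rank = np.empty(n, dtype=int)          # rank[pos] = step at which pos is predicted
    rank[order] = np.arange(n)
    mask = rank[:, None] > rank[None, :]   # i sees j only if j is predicted earlier
    return mask

# Example from the text: order x4 -> x3 -> x1 -> x2 (0-based: 3, 2, 0, 1).
# When predicting x3 (index 2), only x4 (index 3) is visible; x1 and x2 are masked.
mask = permutation_attention_mask([3, 2, 0, 1])
print(mask[2])   # [False False False  True]
```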
The permutation language model objective can be expressed by the formula below, where Z_T is the set of all factorization orders of length T and z is one sampled order: given the contents of the first t-1 positions of the order, maximize the likelihood that the word at step t is x_{z_t}.

PLM objective function
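Since the figure is not reproduced here, the objective can be reconstructed as follows (following the XLNet paper, with $\mathcal{Z}_T$ the set of all length-$T$ factorization orders and $z_t$, $\mathbf{z}_{<t}$ the $t$-th element and the first $t-1$ elements of a sampled order $\mathbf{z}$):

```latex
\max_\theta \;\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_\theta\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
```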
In actual training, sentences are long, so the number of permutations n! is enormous; training on every one would require a huge amount of computation. Therefore only one factorization order is sampled per sentence. Moreover, not every position is predicted. XLNet uses partial prediction: positions early in the order have little context available, so predicting them is inefficient. Only the tokens in the last 1/K of the order are used as prediction targets, which gives them more context words and makes training converge faster. K is typically 6-7 (roughly 14%-16% of tokens), comparable to BERT masking 15% of the words.
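A short sketch of the partial-prediction rule (hypothetical helper, assuming K = 6):

```python
import numpy as np

def sample_prediction_targets(seq_len, K=6, rng=None):
    """Sample one factorization order and keep only its last 1/K positions as targets."""
    rng = rng or np.random.default_rng()
    order = rng.permutation(seq_len)            # one of the n! factorization orders
    num_targets = max(1, seq_len // K)          # roughly the last 1/K of the order
    targets = order[-num_targets:]              # these enjoy the richest visible context
    return order, targets

order, targets = sample_prediction_targets(seq_len=512, K=6)
print(len(targets) / 512)                       # ~0.17, comparable to BERT masking 15%
```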

Improvement 2 - Two-stream attention

With permutation in place, masking out some words in the sentence and predicting them from the remaining context is equivalent to combining the advantages of a language model with those of BERT. But permutation alone has a problem: within a factorization order, predicting the word at step t from the first t-1 known words yields the same likelihood for the same ground-truth word regardless of where that word sits in the original sentence order.
A simple example: in the sentence "I love China very much", the original positions are [1, 2, 3, 4, 5]. If the shuffled order is z = [2, 4, 1, xxx, xxx], i.e. the three words "love", "very", and "I" are known, then when the next word is predicted to be "China", the predicted probability is the same whether it belongs at position 3 or position 5 of the original order; in other words, "I love much very China" is treated as equally likely. Clearly, "China" placed right after "love" should have the higher probability.
The further improvement is that, besides the words at the preceding positions, the model is also given the position of the prediction target. For example, in the factorization order x3 -> x2 -> x4 -> x1 in the figure below, when predicting x4 the model needs the positions and contents of x3 and x2 (solid lines in the figure) as well as the position of x4 (dotted line), but must not be shown the content of x4 in advance. For this, XLNet introduces a two-stream self-attention mechanism, the two streams being a content stream and a query stream. In the content stream all information about a word is visible, including both its content and its position; the query stream can be understood as a position stream that carries only the word's position information.

Concretely, in content-stream self-attention, Q is taken from all the information of the current position (represented by h(m-1)), and K and V are taken from all the information of all positions (Figure a). In query-stream self-attention, Q is taken from only the position information of the current position (represented by g(m-1)), and K and V are taken from the full information of the other positions (Figure b).

Content-stream and Query-stream
XLNet runs the two streams together, computing both the content-stream encoding h(m-1) and the query-stream encoding g(m-1) for every position at every layer. When the word at a position needs to be predicted, only that position's g(m-1) hidden encoding is used; otherwise, when the word at a position is known information, its h(m-1) hidden encoding is used. The complete training process is shown in the figure below.

Two-stream co-work and attention masks
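The following simplified single-head sketch (illustrative, not the official implementation) shows how the two streams share keys and values but differ in what their queries are allowed to see:

```python
import torch
import torch.nn.functional as F

def two_stream_attention(h, g, content_mask, query_mask, Wq, Wk, Wv):
    """One simplified layer update for both streams.

    h, g: [seq_len, d] content and query streams from the previous layer.
    content_mask[i, j] = True lets position i attend to j, including j == i;
    query_mask is the same but without the diagonal, so a prediction target
    never sees its own content. Every row is assumed to have at least one
    visible position (in the full model, cached memory guarantees this).
    Wq, Wk, Wv: [d, d] projection matrices shared by both streams.
    """
    k, v = h @ Wk, h @ Wv                      # keys/values always come from the content stream

    def attend(q_in, mask):
        scores = (q_in @ Wq) @ k.T / k.shape[-1] ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

    h_next = attend(h, content_mask)           # content stream: queries carry content + position
    g_next = attend(g, query_mask)             # query stream: queries carry position only
    return h_next, g_next
```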
Building on the above, the optimized XLNet objective adds the position of the prediction target: in the formula below, z_t denotes the position of the target being predicted. This is also called target-aware prediction.

PLM optimized objective function
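The target-aware distribution behind this objective can be reconstructed as follows (following the XLNet paper; $g_\theta$ is the query-stream representation, which depends on the target position $z_t$ but not on its content, and $e(x)$ is the embedding of candidate word $x$):

```latex
p_\theta\!\left(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)
= \frac{\exp\!\left(e(x)^{\top} g_\theta(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\right)}
       {\sum_{x'} \exp\!\left(e(x')^{\top} g_\theta(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\right)}
```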

Improvement 3 - Borrowing from Transformer-XL

Compared with RNN models, the Transformer handles long-range dependencies well, but it consumes a lot of memory: for an input of length N, self-attention computes pairwise interactions between all tokens, on the order of N^2 operations, so its space complexity is the more pressing problem, sometimes even requiring the memory to be distributed across machines. Before Transformer-XL, the workaround was to split long text into several equal-length short segments and train on them separately; this controls the computation but also breaks the connections between the segments.
Transformer-XL's solution is to split the long text into segments and chain the segments together in an RNN-like fashion so that the relationships between them can still be learned. The figure below shows the changes made on top of the Transformer structure. There are two consecutive segments, segment t and segment t+1. The input is word embedding + positional embedding, fed into the first encoder block. For segment t, the hidden-state output of each encoder block serves two purposes: it is the input of the next encoder block, and it is also passed on as "memory" to the same-layer encoder block of segment t+1, where it is concatenated with that layer's hidden states. Every layer repeats the same operation, so information is linked across the two segments while the memory needed for computation is also greatly reduced.

Segment recurrence mechanism
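A minimal sketch of the segment-recurrence idea (illustrative names; the real model also handles multiple layers, heads, and a bounded memory length):

```python
import torch

def layer_with_memory(x, memory, attention_layer):
    """x: [cur_len, d] hidden states of the current segment at one layer.
    memory: [mem_len, d] cached hidden states of the previous segment at the
    same layer (or None for the first segment).
    attention_layer(q, kv): any self-attention block; queries come only from
    the current segment, while keys/values also cover the cached one, which is
    what links the two segments together."""
    kv = x if memory is None else torch.cat([memory.detach(), x], dim=0)
    out = attention_layer(q=x, kv=kv)          # attend over [memory; current segment]
    new_memory = x.detach()                    # cached for segment t+1; no gradient flows back
    return out, new_memory
```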

Improvement 4 - Relative positional encoding

Splitting the long text into segments raises another problem: with the Transformer's absolute positional encoding, the encoding of a given position is identical in every segment, so the model cannot distinguish the same position across different segments. Transformer-XL therefore also introduces relative positional encoding.
With absolute positional encoding, the Transformer attention score can be decomposed as in the left formula below, where i and j are the positions of the query and the key, E is the word embedding, and U is the positional encoding; the score (E_{x_i} + U_i)^T W_q^T W_k (E_{x_j} + U_j) expands by the distributive law into four terms. Term (a) captures the correlation between the two word vectors, terms (b) and (c) capture the correlation between the word vector at one position and the positional encoding of the other, and term (d) captures the correlation between the two positional encodings.

Attention score of absolute and relative positional encoding
To replace absolute positional encoding with relative positional encoding, three changes are made to the decomposition (shown in the formula on the right):

  1. Replace every U_j with R_{i-j}, i.e. replace the key's absolute position with the relative position between the key and the query;
  2. Since the key's absolute position has been replaced by a relative one, the query's absolute position U_i no longer carries meaning, so the two U_i terms are replaced by two learnable vectors u and v respectively;
  3. In the Transformer, the word embedding and the positional encoding share the same linear transformation; as a refinement, the two encodings are given separate transformations, splitting W_k into W_{k,E} and W_{k,R}.

With the new parameterization, each term has a more intuitive meaning: term (a) is purely content-based addressing, term (b) is a content-dependent positional bias, term (c) is a global content bias, and term (d) is a global positional bias.
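Because the original figure with the left and right formulas is not reproduced here, both decompositions are written out below (reconstructed following the Transformer-XL paper):

```latex
% absolute positional encoding: (E_{x_i} + U_i)^T W_q^T W_k (E_{x_j} + U_j)
A^{\mathrm{abs}}_{i,j}
= \underbrace{E_{x_i}^{\top} W_q^{\top} W_k E_{x_j}}_{(a)}
+ \underbrace{E_{x_i}^{\top} W_q^{\top} W_k U_j}_{(b)}
+ \underbrace{U_i^{\top} W_q^{\top} W_k E_{x_j}}_{(c)}
+ \underbrace{U_i^{\top} W_q^{\top} W_k U_j}_{(d)}

% relative version: U_j -> R_{i-j}, the query-side U_i^T W_q^T -> u^T or v^T, W_k split in two
A^{\mathrm{rel}}_{i,j}
= \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j}}_{(a)}
+ \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j}}_{(b)}
+ \underbrace{u^{\top} W_{k,E} E_{x_j}}_{(c)}
+ \underbrace{v^{\top} W_{k,R} R_{i-j}}_{(d)}
```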

Improvement 5 - Relative segment encoding

For sentence-pair tasks, BERT's input has the structure [CLS, sentence A, SEP, sentence B, SEP]. To distinguish the two sentences, BERT uses absolute segment encoding: the front part [CLS, sentence A, SEP] is segment A, encoded with the value 0, and the remaining part is segment B, encoded with the value 1.
Borrowing the idea of relative positional encoding, XLNet uses relative segment encoding: segments are not encoded independently; instead, only the relative relationship between segments is encoded. When computing attention, if positions i and j belong to the same segment, the encoding is taken as s_ij = s+; if they belong to different segments, s_ij = s-. An extra term a_ij is then added to the usual attention score between i and j: a_ij = (q_i + b)^T s_ij, where q_i is the query vector at position i and b is a bias.
With relative segment encoding, the representation of the relationship between segments generalizes better, and downstream tasks are no longer limited to sentence pairs but extend to inputs with multiple segments.
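A short sketch of the relative segment bias described above (variable names are illustrative):

```python
import torch

def segment_attention_bias(seg_ids, q, s_same, s_diff, b):
    """seg_ids: [seq_len] segment id of every position (e.g. 0 for sentence A, 1 for B).
    q: [seq_len, d] query vectors; s_same, s_diff, b: learnable [d] vectors.
    Returns a_ij = (q_i + b)^T s_ij, where s_ij is s_same if i and j share a segment."""
    same = seg_ids[:, None] == seg_ids[None, :]              # [seq_len, seq_len] bool
    s_ij = torch.where(same[..., None], s_same, s_diff)      # pick s+ or s- per pair
    return torch.einsum("id,ijd->ij", q + b, s_ij)           # added to the usual attention score
```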

Test results

With the same model size and data scale, XLNet outperforms BERT almost across the board. Comparing XLNet-Large with BERT-Large: the gains are most obvious on the reading-comprehension tasks SQuAD and RACE, with an average improvement of about 2 points, while on the GLUE suite XLNet improves over BERT by about 1 point on average.

Fair comparison between BERT and XLNet
Because XLNet combines several improvements, the paper ran the following ablation experiments to see how much each one contributes. Removing the caching mechanism (the memory item) hurts the most, with scores dropping by 0.6 on average across tasks. Removing bidirectional (permutation-based) training drops scores by 0.4 on average, and removing span-based prediction drops them by 0.3. Finally, adding the NSP task to XLNet makes results worse on every task except for a slight gain on RACE.

XLNet ablation test result

Summary

At the macro level, XLNet stands on the shoulders of the giant BERT and makes a new breakthrough by organically combining the AR model with bidirectional training. At the micro level, each of its improvements has its own strength: permutation LM lets the language model fully use contextual information during training; two-stream attention cleanly separates the attention computation for prediction targets from that for non-targets, making the results more reasonable; Transformer-XL greatly reduces computation memory and addresses the model's long-range dependency problem; and relative positional encoding effectively avoids positional clashes between segments. Moreover, XLNet's AR nature naturally matches generation tasks in NLP, so it is bound to spur further improved models for text summarization, machine translation, and question answering.



Original text: https://zhuanlan.zhihu.com/p/477860047

  This blog records my learning and shares the latest techniques. Thank you very much for reading. It will be updated continuously, and I hope it helps you technically.


【Large Model & NLP & Algorithm】Column

Nearly 200 papers and 300 markdown notes written by the blogger. Subscribe to the [Large Model & NLP & Algorithm] column, or visit https://github.com/wjn1996/LLMs-NLP-Algo to get all of the following materials:

  • Machine learning & deep learning fundamentals and advanced materials (notes, PPT, code)
  • NLP fundamentals and advanced materials (notes, PPT, code)
  • A complete large-model curriculum: pre-trained language model foundations, knowledge-enhanced pre-training, large-model overview, large-model training and optimization, large-model tuning, ChatGPT-style reproduction and applications, etc.;
  • Algorithm interview problem practice for big tech companies;

