Beating BERT: Google Open-Sources Its Best NLP Pre-Training Model

2020-03-16 19:35

Lead: a small but accurate model whose efficiency significantly surpasses MLM-based methods.

Editor's note: Google recently open-sourced the AI language model ELECTRA as a TensorFlow release. The new method uses a pre-training task called replaced token detection (RTD), which lets the model learn from all input positions while being trained as a bidirectional model.

Moreover, given the same computing resources, ELECTRA outperforms existing methods; with only 1/30 of the parameters, it achieves performance no worse than the most advanced BERT-series models. Google published a blog post describing this open-source release; Lei Feng Network's AI Source Review compiled and edited it as follows.

The current state of language models and their problems

In recent years, advances in language pre-training have driven significant progress in natural language processing, producing some of the most advanced models to date, such as BERT, RoBERTa, XLNet, ALBERT, and T5.

Although these methods differ in design, they share the same idea when fine-tuned for specific NLP tasks (for example, sentiment analysis and question answering): use large amounts of unlabeled text to build a general model of language understanding.

Existing pre-training methods generally fall into two categories. The first is the language model (LM), such as GPT, which processes the input text from left to right and predicts the next word given the preceding context.

The second is the masked language model (MLM), such as BERT, RoBERTa, and ALBERT. These models predict the content of a small number of words that have been masked out of the input. Compared with LMs, MLMs have the advantage of bidirectional prediction, because they can see the text both to the left and to the right of the word being predicted.

However, MLM predictions have a drawback: the model's predictions are confined to a small subset of the input tokens (the masked 15%), which reduces the amount of information obtained from each sentence and increases the computational cost.

Existing pre-training methods and their shortcomings. Arrows indicate which tokens are used to produce a given output representation (rectangle). Left: a traditional language model (such as GPT) uses only the context to the left of the current word. Right: a masked language model (such as BERT) uses context from both the left and the right for every input, but predicts only a small fraction of the words.
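To make the cost of that 15% masking rate concrete, here is a minimal sketch (with simplified assumptions, not BERT's actual masking code) of how a masked language model selects the small subset of positions that produce a training signal:

```python
import numpy as np

def sample_mlm_positions(seq_len, mask_prob=0.15, seed=0):
    """Randomly pick the ~15% of positions that will be masked and predicted."""
    rng = np.random.default_rng(seed)
    num_masked = max(1, int(round(seq_len * mask_prob)))
    return np.sort(rng.choice(seq_len, size=num_masked, replace=False))

# For a 128-token sequence, only ~19 positions produce a loss term;
# the remaining ~109 tokens are read but never predicted.
masked = sample_mlm_positions(seq_len=128)
print(len(masked), "of 128 positions contribute to the MLM loss")
```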

The new pre-training model ELECTRA

To overcome the shortcomings of these two types of language model, Google proposed ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately). The key point of this new pre-training method is to train the text encoder as a discriminator rather than a generator, thereby addressing the problems of existing language models.

Paper address: https://openreview.net/pdf?id=r1xMH1BtvB

With the same model size, data, and compute, this method significantly outperforms MLM-type methods such as BERT and XLNet; moreover, the small ELECTRA model can be trained on a single GPU in only four days.

Experimental data show that this small model scores 5 points higher on GLUE than a comparably small BERT model, and it even achieves better results than the much larger GPT model (which uses more than 30 times the compute).

Using less than 1/4 of the compute, ELECTRA matches the performance of RoBERTa and XLNet on the GLUE natural language understanding benchmark. When more compute is used to train a large ELECTRA model, it achieves state-of-the-art performance on the SQuAD 2.0 question answering dataset and on the GLUE leaderboard for language understanding tasks. (See the detailed data in the fourth section.)

The core idea: replaced token detection

ELECTRA uses a new pre-training task called replaced token detection (RTD), which learns from all input positions (like an LM) while training a bidirectional model (like an MLM).

Specifically, ELECTRA's goal is to learn to distinguish real input words from replacements. Instead of masking, it replaces words in the input with samples drawn from a proposal distribution, which resolves the mismatch between pre-training and fine-tuning introduced by masking.

The model then trains a discriminator to predict whether each word is the original word or a replacement. The advantage of the discriminator is that the model learns from every input word, rather than only from the masked-out words as in MLM, which makes the computation more efficient.

As many developers will have guessed from this adversarial-sounding setup, ELECTRA was indeed inspired by generative adversarial networks (GANs). The difference, however, is that the model uses a similar but non-adversarial objective: maximum likelihood learning.

For example, in the figure below, the word "cooked" can be replaced with "ate." Although this is somewhat plausible, it does not fit the whole context. The pre-training task requires the model (i.e., the discriminator) to determine which tokens in the original input have been replaced and which remain the same.

Because this binary classification task is applied to every input word, rather than to only a small number of masked words (15% in BERT-style models), the RTD method is more efficient than MLM. This also explains why ELECTRA needs far fewer examples to reach the same performance as other language models.

Replaced token detection trains a bidirectional model while learning from all input positions
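As a rough illustration of this objective, the sketch below (a simplified toy, not the released implementation) builds the per-token labels for an input along the lines of the figure's example, in which the generator swaps "cooked" for "ate", and computes the binary cross-entropy that the discriminator is trained on over every position:

```python
import numpy as np

original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",    "the", "meal"]   # generator replaced one token

# Discriminator targets: 1 where the token was replaced, 0 where it is original.
is_replaced = np.array([float(o != c) for o, c in zip(original, corrupted)])

# Toy per-token logits that a discriminator might output (illustrative values only).
logits = np.array([-3.0, -2.5, 1.8, -3.1, -2.8])

def rtd_loss(disc_logits, labels, eps=1e-9):
    """Binary cross-entropy of 'replaced vs. original', averaged over ALL positions."""
    probs = 1.0 / (1.0 + np.exp(-disc_logits))            # sigmoid per token
    return float(np.mean(-(labels * np.log(probs + eps)
                           + (1.0 - labels) * np.log(1.0 - probs + eps))))

print("labels:", is_replaced)        # [0. 0. 1. 0. 0.]
print("RTD loss:", rtd_loss(logits, is_replaced))
```

Every one of the five tokens contributes a loss term here; a BERT-style MLM loss on the same sentence would be computed only at the roughly 15% of positions that were masked.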

The replacement tokens come from a generator neural network. The generator is trained with the masked language model objective: given an input sequence, a certain proportion of the input words (typically 15%) is replaced with a mask; the network then produces vector representations, which pass through a softmax layer to predict the original words at the masked positions.

Although the generator's structure resembles that of a GAN, it is difficult to apply GAN methods to text tasks, so the training objective is instead the maximum likelihood of the masked words.

The discriminator and the generator share the same input word embeddings. The discriminator's objective is to determine, for each position in the input sequence, whether the word has been replaced by the generator: if the word at a position differs from the word at the corresponding position in the original input sequence, it is judged to be a replacement.

The generator and discriminator neural network models
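A conceptual form of the combined pre-training objective described above is sketched below (a simplified illustration, not the released TensorFlow code): the generator's masked-LM loss and the discriminator's replaced-token loss are summed and minimized jointly, with the discriminator term up-weighted (the paper reports a weight of 50). Unlike a GAN, there is no adversarial min-max, since sampling discrete tokens from the generator is not differentiable.

```python
def electra_pretraining_loss(mlm_loss: float, disc_loss: float,
                             disc_weight: float = 50.0) -> float:
    """Joint loss minimized over the generator and discriminator together.

    mlm_loss    : generator's maximum-likelihood loss at the masked positions
    disc_loss   : discriminator's replaced-vs-original loss over all positions
    disc_weight : weight on the discriminator term (reported as 50 in the paper)
    """
    return mlm_loss + disc_weight * disc_loss

# Toy values only, to show how the two terms combine into one objective:
print(electra_pretraining_loss(mlm_loss=7.2, disc_loss=0.05))
```

After pre-training, the generator is discarded and the discriminator is the network that gets fine-tuned on downstream tasks.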

Comparison of experimental results

The researchers compared ELECTRA with other state-of-the-art NLP models and found that, given the same compute budget, it improves substantially over previous methods, performing on par with RoBERTa and XLNet while using less than a quarter of the compute.

The x-axis shows the compute used to train the model (in FLOPs); the y-axis shows the dev GLUE score. ELECTRA learns much more efficiently than existing pre-trained NLP models. Note that the current best models on GLUE (such as T5 (11B)) do not fit on this plot, because they use far more compute than the others (more than 10 times that of RoBERTa).

To further improve efficiency, the researchers tried a small ELECTRA model that can be trained well on a single GPU in four days.

Although it cannot reach the accuracy of large models that require many TPUs to train, ELECTRA-Small still performs very well, even better than GPT, while requiring only about 1/30 of the compute.

Next, to test whether this result holds at large scale, the researchers trained a large ELECTRA model using more compute (roughly the same amount as RoBERTa, and about 10% of T5's).

The researchers compared the performance of the large ELECTRA, RoBERTa, XLNet, BERT, and ALBERT models on the SQuAD 2.0 question answering dataset; the results are shown in the table below. On the GLUE leaderboard, ELECTRA outperformed all of these other models.

Compared with the large T5-11B model, however, the latter still scores higher on GLUE. It is worth noting, though, that ELECTRA is only one-third its size and was trained with 10% of the compute.

SQuAD 2.0 scores for ELECTRA-Large and other state-of-the-art models

The code for pre-training ELECTRA and fine-tuning it on downstream tasks has now been released; currently supported tasks include text classification, question answering, and sequence tagging.

The code supports rapid training of a small ELECTRA model on a single GPU. Google has also released pre-trained models for ELECTRA-Large, ELECTRA-Base, and ELECTRA-Small. (The ELECTRA models are currently available only in English; versions for more languages will be released later.)
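As a purely illustrative sketch of what sentence-level fine-tuning looks like on top of a pre-trained encoder (the function and variable names below are hypothetical and are not the repository's actual API; see the GitHub README for the real entry points), a classification task typically adds a small task-specific head over the encoder's first-token representation:

```python
import numpy as np

def classify_sequence(sequence_output, head_weights, head_bias):
    """Apply a task-specific linear head to the pooled (first-token) representation."""
    pooled = sequence_output[0]                 # [hidden_size] vector for the first token
    logits = pooled @ head_weights + head_bias  # [num_labels] scores
    return int(np.argmax(logits))

# Toy shapes standing in for a real encoder output: 128 tokens x 256 hidden units.
rng = np.random.default_rng(0)
encoder_output = rng.normal(size=(128, 256))
weights, bias = rng.normal(size=(256, 2)), np.zeros(2)
print("predicted label:", classify_sequence(encoder_output, weights, bias))
```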

Original Address:

https://ai.googleblog.com/2020/03/more-efficient-nlp-model-pre-training.html 

GitHub Address:

https://github.com/google-research/electra 

