Beating BERT! An NLP pre-training tool: small models with high accuracy, trainable on a single GPU

2020-03-13 12:37:59

Shisan from Aofeisi
QbitAI Report | WeChat official account QbitAI

Here is an NLP pre-training model you deserve to have.

It is called ELECTRA, and it comes from Google AI. It not only keeps the advantages of BERT, but is also more efficient.

ELECTRA is a new pre-training method that efficiently learns to tell the real tokens in collected sentences from plausible fakes, i.e. what is usually called replaced-token detection.

How effective?

With only a quarter of the compute of RoBERTa and XLNet, it reaches their performance on GLUE, and it sets a new record on SQuAD.

This means "small scale, big impact": training on a single GPU takes only four days, and the resulting accuracy even exceeds that of OpenAI's GPT model.

ELECTRA has been released as an open-source TensorFlow model, and the release includes a number of ready-to-use pre-trained language representation models.

Making pre-training faster

Existing pre-training models fall into two broad categories: language models (LM) and masked language models (MLM).

GPT is an example of an LM: it processes the input text from left to right and predicts the next word given the preceding context.

BERT, RoBERTa and ALBERT, on the other hand, are MLMs: they predict a small number of masked words in the input. MLMs have the advantage of being bidirectional, since they can "see" the text on both sides of the token being predicted.

But MLMs also have a drawback: instead of predicting every input token, these models predict only a very small subset (the 15% that is masked), which reduces the amount of information learned from each sentence. A minimal sketch of the difference follows below.
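To make the difference concrete, here is a small, purely illustrative Python sketch (my own toy example, not taken from any of these models' codebases) of how training targets are built for a left-to-right LM versus an MLM; the example sentence and the 15% masking rate are the only details carried over from the text.

```python
# Toy illustration: how LM and MLM training examples differ.
import random

sentence = ["the", "chef", "cooked", "the", "meal"]

# Left-to-right LM (GPT-style): every position predicts the *next* token,
# so all tokens give a training signal, but only left context is visible.
lm_inputs = sentence[:-1]          # ["the", "chef", "cooked", "the"]
lm_targets = sentence[1:]          # ["chef", "cooked", "the", "meal"]

# Masked LM (BERT-style): ~15% of tokens are replaced with [MASK]; only
# those masked positions are predicted, but context on both sides is visible.
random.seed(0)
mask_positions = sorted(random.sample(range(len(sentence)),
                                      k=max(1, int(0.15 * len(sentence)))))
mlm_inputs = ["[MASK]" if i in mask_positions else tok
              for i, tok in enumerate(sentence)]
mlm_targets = {i: sentence[i] for i in mask_positions}

print("LM  :", lm_inputs, "->", lm_targets)
print("MLM :", mlm_inputs, "-> predict", mlm_targets)
```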

ELECTRA uses a new pre-training task called Replaced Token Detection (RTD).

Like an MLM, it trains a bidirectional model; like an LM, it learns from all input positions.

Inspired by generative adversarial networks (GANs), ELECTRA trains the model to distinguish "real" input data from "fake".

BERT corrupts the input by replacing tokens with "[MASK]"; this method instead corrupts the input by replacing some tokens with incorrect, but somewhat plausible, fake tokens.

For example, as in the figure below, "cooked" might be replaced with "ate".

[Figure: replaced token detection example, with "cooked" in the input replaced by "ate"]

First, a generator predicts the masked-out tokens in a sentence; next, the predicted tokens are used to replace the [MASK] tags in the sentence; finally, a discriminator judges, for every token in the sentence, whether it is the original token or a replacement.
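Put together, the construction of a single RTD training example looks roughly like the sketch below. This is my own toy illustration in plain Python, not the official TensorFlow implementation; the toy_generator lookup table is a stand-in for the small masked language model that ELECTRA actually trains jointly with the discriminator.

```python
# Toy sketch: building one Replaced Token Detection (RTD) training example.
import random

random.seed(0)
sentence = ["the", "chef", "cooked", "the", "meal"]

# 1. Mask ~15% of the tokens (just one position in this tiny example).
mask_positions = [2]
masked = ["[MASK]" if i in mask_positions else t for i, t in enumerate(sentence)]

# 2. A generator (a small MLM) proposes plausible fillers for the masks.
#    Here it is faked with a hand-written lookup table.
def toy_generator(tokens, position):
    plausible = {"[MASK]": ["ate", "cooked", "made"]}
    return random.choice(plausible[tokens[position]])

corrupted = list(masked)
for pos in mask_positions:
    corrupted[pos] = toy_generator(masked, pos)   # e.g. "cooked" -> "ate"

# 3. The discriminator sees the corrupted sentence and must label EVERY token,
#    so the training signal comes from all positions, not just the masked 15%.
#    If the generator happens to produce the original word, the label stays
#    "original".
labels = ["replaced" if tok != orig else "original"
          for tok, orig in zip(corrupted, sentence)]

print("input :", corrupted)
print("labels:", labels)
```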

After pre-training, the discriminator is used for the downstream tasks.

Beating BERT, with the best performance on SQuAD 2.0

Comparing ELECTRA with other state-of-the-art NLP models shows:

Under the same compute budget, it is a big improvement over previous methods, and with less than 25% of the compute it matches the performance of RoBERTa and XLNet.

To further improve efficiency, the researchers also tried a small ELECTRA model that can be trained on a single GPU in 4 days.

Although it does not reach the accuracy of the large models that require many TPUs to train, this small ELECTRA still performs remarkably well, even surpassing GPT while requiring only 1/30 of the compute.

Finally, to see whether the approach also works at scale, the researchers used more compute (roughly the same amount as RoBERTa, and about 10% of T5's) to train a large ELECTRA model.

The results show that it achieves the best reported performance on the SQuAD 2.0 test set.

Moreover, on GLUE it surpasses RoBERTa, XLNet and ALBERT.

The code is now open source

In fact, the study was already published in early September last year. What is exciting is that, in recent days, the code has finally been open-sourced!

The ELECTRA release mainly contains code for pre-training and for fine-tuning on downstream tasks. Currently supported tasks include text classification, question answering, and sequence tagging.
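As an illustration of using one of the released models for text classification, here is a short sketch. It uses the Hugging Face transformers port rather than the official TensorFlow code in the release, and assumes the "google/electra-small-discriminator" checkpoint name; the classification head is freshly initialized, so it still has to be fine-tuned on labeled data before its predictions mean anything.

```python
# Sketch only: load an ELECTRA discriminator for text classification via the
# Hugging Face transformers port (not the official TensorFlow release).
import torch
from transformers import ElectraTokenizerFast, ElectraForSequenceClassification

# "google/electra-small-discriminator" is assumed to be the hub name of the
# released small discriminator checkpoint.
tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
model = ElectraForSequenceClassification.from_pretrained(
    "google/electra-small-discriminator", num_labels=2)  # head is untrained

inputs = tokenizer("This pre-training method is remarkably efficient.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # shape: [1, num_labels]
print(logits)
```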

The open-source code also supports quickly training a small ELECTRA model on a single GPU.

The ELECTRA models are currently available only in English, but the researchers say they hope to release multilingual pre-trained models in the future.

Portal

Google AI blog:
https://ai.googleblog.com/2020/03/more-efficient-nlp-model-pre-training.html

GitHub:
https://github.com/google-research/electra

Paper:
https://openreview.net/pdf?id=r1xMH1BtvB

- End -

Source: blog.csdn.net/weixin_42137700/article/details/104855439