Chinese ELECTRA pre-trained models open-sourced: only 1/10 the parameters, performance still comparable to BERT

Thanks to the original article: http://bjbsair.com/2020-03-27/tech-info/7050/
In November last year, ELECTRA, jointly released by NLP heavyweight Christopher Manning's group and Google, quickly took the NLP community by storm. Its ELECTRA-small model has only 1/10 the parameters of BERT-base, yet its performance remains comparable to BERT, RoBERTa, and other models.

More recently, Google has finally open-sourced ELECTRA and released pre-trained models, which is a godsend for universities and companies that lack large-scale compute.


However, the released pre-trained models are English-only; unlike BERT, there is no multilingual version. For researchers working on other languages (such as Chinese), this is a real pity.


To address this, the HIT-iFLYTEK Joint Laboratory (HFL) today released Chinese ELECTRA pre-trained models built on the open-sourced ELECTRA code.

1. ELECTRA


The ELECTRA pre-training model comes from Manning's group at Stanford's SAIL lab and the Google Brain research team, and first appeared at the 2019 Beijing Zhiyuan (BAAI) Conference. As a new text pre-training model, ELECTRA's innovative design, lower compute consumption, and smaller parameter count quickly attracted a large following, especially after its acceptance at ICLR 2020 was announced last November, which caused quite a stir in NLP circles.


Paper link:

https://openreview.net/forum?id=r1xMH1BtvB

One chart from the paper explains it all:


Legend: the right panel is an enlarged view of the left.

As shown above, ELECTRA achieves better results than other pre-trained models while using fewer training steps. Moreover, with the same model size, data, and compute, ELECTRA outperforms MLM-based methods such as BERT and XLNet.

Compared with conventional language-representation learning methods, ELECTRA therefore offers higher efficiency with fewer parameters and less computation (ELECTRA-small has only 1/10 the parameters of BERT-base).

ELECTRA owes these results to its new pre-training framework, which consists of two parts: a generator and a discriminator.


  • Generator: a small MLM that predicts the original token at each [MASK] position. The generator's predictions are used to replace part of the tokens in the input text.
  • Discriminator: judges, for each token in the input sentence, whether it has been replaced. This Replaced Token Detection (RTD) pre-training task replaces BERT's original Masked Language Model (MLM) objective. Note that the Next Sentence Prediction (NSP) task is not used.

After the pre-training phase ends, only the discriminator is kept and fine-tuned on downstream tasks.

In other words, the authors brought the GAN idea from computer vision into natural language processing.

It is noteworthy that, although the training objective resembles a GAN's, there are still some key differences. First, if the generator happens to produce the correct token, that token is treated as "real" rather than "fake", which moderately improves results on downstream tasks. More importantly, the generator is trained with maximum likelihood rather than adversarially trained to fool the discriminator.
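To make the two-part framework concrete, here is a minimal, simplified PyTorch sketch of the combined MLM + replaced-token-detection objective. The tiny stand-in encoders, vocabulary size, masking rate, and loss weight are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, MASK_ID, MASK_PROB = 1000, 64, 0, 0.15

class TinyEncoder(nn.Module):
    """Stand-in for a transformer encoder: embeds tokens and projects them."""
    def __init__(self, out_dim):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.proj = nn.Linear(HIDDEN, out_dim)

    def forward(self, ids):
        return self.proj(self.emb(ids))

generator = TinyEncoder(VOCAB)   # small MLM: predicts the original token at each [MASK]
discriminator = TinyEncoder(1)   # RTD head: "was this token replaced?" per position

def electra_step(input_ids):
    # 1) Mask a random subset of positions.
    mask = torch.rand(input_ids.shape) < MASK_PROB
    masked = input_ids.masked_fill(mask, MASK_ID)

    # 2) The generator is trained with maximum likelihood on the masked
    #    positions (not adversarially).
    gen_logits = generator(masked)
    mlm_loss = F.cross_entropy(gen_logits[mask], input_ids[mask])

    # 3) Sample the generator's predictions and splice them into the input.
    #    Sampling is discrete, so no gradient flows back to the generator here.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits[mask]).sample()
    corrupted = input_ids.clone()
    corrupted[mask] = sampled

    # 4) A token counts as "replaced" only if it differs from the original,
    #    so a correct generator guess is labeled "real".
    is_replaced = (corrupted != input_ids).float()

    # 5) The discriminator classifies every position (Replaced Token Detection).
    rtd_logits = discriminator(corrupted).squeeze(-1)
    rtd_loss = F.binary_cross_entropy_with_logits(rtd_logits, is_replaced)

    # Combined objective; the heavy weighting of the RTD term follows the paper.
    return mlm_loss + 50.0 * rtd_loss

loss = electra_step(torch.randint(1, VOCAB, (8, 128)))
loss.backward()
```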

2. Chinese ELECTRA pre-trained models

At present, the open-sourced ELECTRA offers only English pre-trained models. But researchers working on the world's many other languages (such as Chinese) need pre-trained models for their own languages.

However, apart from Google's official BERT, RoBERTa, and a few other models that have multilingual versions, others such as XLNet and T5 have no multilingual release and exist only in English. One reason is that, compared with pre-training only on English, multilingual pre-training requires collecting corpora for each language and balancing their proportions, which is far more troublesome. So in all likelihood, ELECTRA will not get an official Chinese or multilingual pre-trained release.

On the other hand, the Chinese community understands how to do Chinese pre-training better than anyone, and models pre-trained by its own researchers may well turn out better than official Google releases.

The team led by Cui Yiming, senior researcher and research director at the HIT-iFLYTEK Joint Laboratory, has done similar open-source work before: taking officially released pre-training code and adding Chinese datasets to train Chinese versions of the models. For example, their Chinese BERT-series models and Chinese XLNet were well received after being open-sourced on GitHub, and many teams have used these open-source pre-trained models to boost their results in Chinese evaluation tasks.


Open-source address: https://github.com/ymcui/Chinese-BERT-wwm


Open-source address: https://github.com/ymcui/Chinese-XLNet

After Google open-sourced ELECTRA, Cui Yiming's team once again followed up with a Chinese version.

The training data is the same as that used for their earlier Chinese BERT-series models, drawn mainly from large-scale Chinese Wikipedia and general text (crawled and cleaned Chinese web pages), for a total of 5.4B tokens. The WordPiece vocabulary follows Google's original Chinese BERT vocabulary of 21,128 tokens.
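As a quick sanity check on that vocabulary claim, one might inspect Google's Chinese BERT tokenizer, which this release reuses; the `bert-base-chinese` hub id below is an assumed stand-in for that vocabulary file.

```python
from transformers import BertTokenizerFast

# Load the WordPiece vocabulary shared with Google's Chinese BERT.
tok = BertTokenizerFast.from_pretrained("bert-base-chinese")
print(tok.vocab_size)   # expected: 21128
print(tok.tokenize("哈工大讯飞联合实验室发布中文ELECTRA"))
```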

In this release, Cui Yiming's team has published only two models, ELECTRA-base and ELECTRA-small. According to Cui Yiming, the large version has many more parameters and is harder to tune, so its release has been postponed accordingly.

Each of the two released models took about seven days to train. Because the small version has only 1/10 the parameters of the base version, its batch size was increased to 1024 (four times that of base) during training. The specific details and hyperparameters are as follows (parameters not mentioned keep their defaults; a config sketch follows the list):

  • ELECTRA-base: 12 layers, hidden size 768, 12 attention heads, learning rate 2e-4, batch size 256, maximum length 512, 1M training steps
  • ELECTRA-small: 12 layers, hidden size 256, 4 attention heads, learning rate 5e-4, batch size 1024, maximum length 512, 1M training steps
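For reference, the two configurations could be written as `transformers` `ElectraConfig` objects. Only the values stated above are filled in; everything else (embedding size, intermediate size, etc.) is left at library defaults, so treat this as a sketch rather than the exact released configs.

```python
from transformers import ElectraConfig

base_cfg = ElectraConfig(
    vocab_size=21128,             # Google's Chinese BERT WordPiece vocabulary
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    max_position_embeddings=512,  # maximum sequence length 512
)

small_cfg = ElectraConfig(
    vocab_size=21128,
    num_hidden_layers=12,
    hidden_size=256,
    num_attention_heads=4,
    max_position_embeddings=512,
)
```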


The ELECTRA-small model is only 46 MB.

For effectiveness, Cui Yiming's team compared the Chinese ELECTRA models against the series of Chinese pre-trained models they had released previously.

The comparison models include: ELECTRA-small/base, BERT-base, BERT-wwm, BERT-wwm-ext, RoBERTa-wwm-ext, and RBT3.

The comparison covers six tasks:

  • CMRC 2018 (Cui et al., 2019): span-extraction machine reading comprehension (Simplified Chinese)
  • DRCD (Shao et al., 2018): span-extraction machine reading comprehension (Traditional Chinese)
  • XNLI (Conneau et al., 2018): natural language inference (three classes)
  • ChnSentiCorp: sentiment analysis (binary classification)
  • LCQMC (Liu et al., 2018): sentence-pair matching (binary classification)
  • BQ Corpus (Chen et al., 2018): sentence-pair matching (binary classification)

For downstream fine-tuning, the learning rates of ELECTRA-small and ELECTRA-base follow the defaults of the original paper: 3e-4 and 1e-4 respectively. Notably, no hyperparameters were tuned for any individual task. To ensure reliable results, each model was fine-tuned 10 times with different random seeds, and both the maximum and the average performance are reported (average in parentheses).
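A hedged sketch of that reporting protocol, assuming a hypothetical `fine_tune_and_eval` helper that stands in for a full fine-tuning script:

```python
import statistics

# Default fine-tuning learning rates from the original ELECTRA paper.
LEARNING_RATE = {"electra-small": 3e-4, "electra-base": 1e-4}

def report(model_name, fine_tune_and_eval):
    """Fine-tune with 10 random seeds and report 'max (mean)' as in the tables."""
    scores = [
        fine_tune_and_eval(model_name, lr=LEARNING_RATE[model_name], seed=seed)
        for seed in range(10)
    ]
    return f"{max(scores):.2f} ({statistics.mean(scores):.2f})"
```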

Results are as follows (the detailed score tables appear as images in the original post):

  • Simplified Chinese reading comprehension, CMRC 2018 (metrics: EM / F1)
  • Traditional Chinese reading comprehension, DRCD (metrics: EM / F1)
  • Natural language inference, XNLI (metric: Accuracy)
  • Sentiment analysis, ChnSentiCorp (metric: Accuracy)
  • Sentence-pair matching, LCQMC (metric: Accuracy)
  • Sentence-pair matching, BQ Corpus (metric: Accuracy)

From these results, the ELECTRA-small model significantly outperforms the three-layer RoBERTa (RBT3) on most tasks and even comes close to BERT-base, while having only 1/10 of BERT-base's parameters. The ELECTRA-base model outperforms not only BERT-base but even RoBERTa-wwm-ext on most tasks.

For specific usage, see the GitHub project:

https://github.com/ymcui/Chinese-ELECTRA
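For example, assuming the models are mirrored on the HuggingFace hub under the `hfl` organization (the exact model ids and loading steps should be checked against the repository), a fine-tuning setup might start like this:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "hfl/chinese-electra-base-discriminator"   # assumed hub id; verify in the repo
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

inputs = tokenizer("这家餐厅的菜很好吃", return_tensors="pt")
logits = model(**inputs).logits   # classification head is untrained: fine-tune before use
```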


Origin: blog.csdn.net/zxjoke/article/details/105139843