ALBERT and ELECTRA: small models, big performance

Preface

Since the emergence of BERT, excellent pre-trained language models have been appearing faster than anyone can keep up with.
When BERT first came out, I felt that, for the foreseeable future, a good enough pre-training objective alone would keep pushing results further.

Pre-trained language model

| Name | Features | Organization |
| --- | --- | --- |
| ELMo | Autoregressive language model; two-layer BiLSTM | AllenNLP |
| BERT | Autoencoding language model; Transformer | Google |
| GPT, GPT-2 | Autoregressive; Transformer | OpenAI |
| ERNIE | BERT combined with a knowledge graph | Baidu |
| MASS | Joint pre-training of the encoder and decoder | Microsoft |
| XLNet | Permutation language model; Transformer-XL | CMU & Google |
| RoBERTa | Better training data than BERT; drops the next-sentence-prediction task | Facebook |
| SG-Net | BERT with syntactic structure | Shanghai Jiao Tong University |
| ALBERT | Embedding factorization, cross-layer parameter sharing, dropout removal; an order of magnitude fewer parameters than BERT (Base 110M -> 11M) | Google |
| T5 | A unified text-to-text framework for NLP pre-training, plus the C4 pre-training corpus | Google |
| ELECTRA | Replaces generative masked language modeling (MLM) with a discriminative replaced token detection (RTD) task: predict whether each token has been replaced | Stanford |

Two recent articles

Here are two recent papers from academia. What they have in common: compared with BERT, they use an order of magnitude fewer parameters, yet their results improve.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

1. ELECTRA

Abstract (original)

Although masked language modeling (MLM) pre-training methods like BERT produce good results on downstream NLP tasks, they require large amounts of compute to be effective. These methods corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. As an alternative, we propose a more efficient pre-training task called replaced token detection (RTD), which asks whether each token has been replaced. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model to predict the original identities of the masked tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample. Experiments show that this pre-training task is more efficient because the model learns from all input positions, not just the masked ones. As a result, given the same model size, data, and compute, the contextual representations learned by our approach substantially outperform those learned by methods such as BERT and XLNet. For example, a model trained on one GPU for 4 days outperforms GPT (trained using more than 30x the compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale: it matches the performance of RoBERTa using less than 1/4 of the compute.

  • With only 14M parameters it can achieve the effect of RoBERTa, using less training corpus and training faster.
  • With 33M parameters it surpasses RoBERTa and ALBERT on many GLUE tasks. It is not yet on the GLUE leaderboard, so many comparison results are still incomplete.
  • Not yet open source; in Stanford's usual style, it should be released soon.
| Model | MNLI | QQP |
| --- | --- | --- |
| XLNet | 89.8 | 91.8 |
| RoBERTa | 90.2 | 92.2 |
| ALBERT | 88 | * |
| T5 | 92 | 90.2 |
| ELECTRA | 90.5 | 92.4 |

1.1 The introduction of GANs

GANs are a big hit in CV, but they have had little impact in NLP, where the results have not been impressive. An outstanding contribution of this paper is introducing a GAN-style setup into pre-trained language models and achieving SOTA (state-of-the-art) results.

Replaced Token Detection (RTD)


1.1.1 Generator

The generator takes the masked sentence and produces sample tokens for the masked positions. It is trained with MLM (maximum likelihood) rather than adversarially, because of the differences between NLP and CV: sampling discrete tokens is not differentiable, so adversarial training is hard to apply to text.
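
To make the corruption step concrete, here is a minimal PyTorch sketch. It is illustrative only, not the paper's implementation: `ToyGenerator`, `VOCAB_SIZE`, and `MASK_ID` are made-up stand-ins, and the real generator is a small Transformer trained with MLM.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, MASK_ID = 1000, 0   # toy vocabulary size and [MASK] id (assumptions)

class ToyGenerator(nn.Module):
    """Stand-in for the small MLM generator (a Transformer in the paper)."""
    def __init__(self, vocab=VOCAB_SIZE, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        return self.head(self.emb(ids))           # (batch, seq, vocab) logits

def corrupt_with_generator(ids, generator, mask_prob=0.15):
    """Mask ~15% of the tokens, then fill the masked slots with tokens sampled
    from the generator's MLM distribution instead of leaving [MASK]."""
    mask = torch.rand(ids.shape) < mask_prob       # positions to corrupt
    masked = ids.clone()
    masked[mask] = MASK_ID                         # BERT-style [MASK]ing
    logits = generator(masked)
    samples = torch.distributions.Categorical(logits=logits).sample()
    corrupted = torch.where(mask, samples, ids)    # keep the unmasked tokens as-is
    return corrupted, mask
```

For example, `corrupt_with_generator(torch.randint(1, VOCAB_SIZE, (2, 16)), ToyGenerator())` returns a corrupted batch plus the mask of positions whose tokens were re-sampled.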

1.1.2 Discriminator

The discriminator determines, by sequence labeling, whether each token is the original or a replacement (one binary label per position: original / replaced). A minimal sketch follows the notes below.

  • What a GAN generates is always treated as fake, but the generator here can produce "real" samples: a sampled token may be identical to the original, in which case it is labeled as original.
  • Gradients cannot be passed from D back to G through the discrete sampling step; the paper explores training G with reinforcement learning, but plain MLM (maximum likelihood) training of G works better.
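
Here is a matching toy sketch of the discriminator side, again my own illustration rather than the paper's code: `ToyDiscriminator` and `rtd_loss` are assumed names, and the real discriminator is a full Transformer encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDiscriminator(nn.Module):
    """Stand-in for the discriminator (a full Transformer in the paper):
    one binary logit per token, i.e. sequence labeling with the labels
    original / replaced."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, ids):
        return self.head(self.emb(ids)).squeeze(-1)   # (batch, seq) logits

def rtd_loss(discriminator, original_ids, corrupted_ids):
    # Label = 1 where the corrupted token differs from the original, else 0.
    # A sampled token that happens to equal the original counts as "original",
    # and the loss is computed over every position, not only the masked ones.
    labels = (corrupted_ids != original_ids).float()
    logits = discriminator(corrupted_ids)
    return F.binary_cross_entropy_with_logits(logits, labels)
```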

1.2 Weight sharing

  • Under normal circumstances the Generator and Discriminator would be the same size, but experiments show that a smaller Generator works better.

  • The Generator and Discriminator share only the token embeddings. Sharing all the weights gives almost the same effect (see the sketch after this list).
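
A minimal sketch of what "sharing only the token embeddings" can look like, under the assumption that both toy modules above use the same embedding size (the official implementation handles the generator/discriminator size difference differently):

```python
# Tie only the token-embedding table; every other weight stays separate.
gen = ToyGenerator()
disc = ToyDiscriminator()
disc.emb.weight = gen.emb.weight   # standard PyTorch weight-tying idiom
```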

1.3 Smaller Generators

  • Experiments show that a Generator 1/4 to 1/2 the size of the Discriminator works best. The authors conjecture that an overly large Generator makes the task too hard for the Discriminator.

1.4 Training Algorithms

  • Joint training of the Generator and Discriminator (the paper's default setting; a minimal sketch follows this list).

  • An alternative two-stage procedure: first train only the Generator with MLM for n steps; then initialize the Discriminator with the Generator's weights and train the Discriminator with the discriminative (RTD) loss for n steps while keeping the Generator frozen.
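
A minimal sketch of one joint-training step, reusing the toy modules and helpers from the earlier sketches. The discriminator loss weight (the paper describes a combined loss of the form L_MLM + λ·L_Disc) and all other details here are illustrative assumptions, not the official recipe.

```python
import torch
import torch.nn.functional as F

def joint_step(gen, disc, ids, optimizer, mlm_prob=0.15, disc_weight=50.0):
    """One joint step: loss = L_MLM(generator) + lambda * L_RTD(discriminator).
    No gradient flows from the discriminator into the generator, because the
    replacement tokens come from a non-differentiable sampling step."""
    # Generator: standard MLM loss on the masked positions.
    mask = torch.rand(ids.shape) < mlm_prob
    masked = ids.clone()
    masked[mask] = MASK_ID
    gen_logits = gen(masked)
    mlm_loss = F.cross_entropy(gen_logits[mask], ids[mask])

    # Corrupt the input by sampling replacements from the (detached) generator.
    samples = torch.distributions.Categorical(logits=gen_logits.detach()).sample()
    corrupted = torch.where(mask, samples, ids)

    # Discriminator: replaced-token-detection loss over every position.
    disc_loss = rtd_loss(disc, ids, corrupted)

    loss = mlm_loss + disc_weight * disc_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A single optimizer over both models works for this sketch, e.g. `torch.optim.Adam(list(gen.parameters()) + list(disc.parameters()), lr=1e-4)`.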

1.5 Contrastive learning

  • Contrastive learning distinguishes constructed negative samples from positive (real) samples.
  • This paper can be seen as a combination of contrastive learning and GANs.

2. ALBERT

Other

References

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
XLNet: Generalized Autoregressive Pretraining for Language Understanding
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Origin blog.csdn.net/u013741019/article/details/102883553