A brief introduction to the ELECTRA model

Table of contents

1. Overview

2. Generator

3. Discriminator

4. Model training

5. Other improvements


1. Overview

ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) adopts a "generator-discriminator" structure, which is very similar to that of a Generative Adversarial Network (GAN). The overall structure of ELECTRA is shown in the figure below.

As can be seen in the figure, ELECTRA consists of a generator and a discriminator connected in series. The roles of these two parts are as follows.
 
(1) Generator. A small masked language model (MLM) that predicts the original word at each [MASK] position;
 
(2) Discriminator. Determines whether each word in the input sentence has been replaced; that is, the Replaced Token Detection (RTD) pre-training task is used in place of BERT's original MLM task. Note that the next sentence prediction (NSP) task is not used here.
Next, we describe the modeling of the generator and the discriminator in detail, using the example in the figure.

2. Generator

The generator's purpose is to take the masked input text, learn contextual semantic representations $h^G = h_1^G \cdots h_n^G$ through a multi-layer Transformer model, and restore the original text $x = x_1 \cdots x_n$ at the masked positions, which is exactly the MLM task in BERT. Note that only the masked words are predicted: for a masked position $t$, the generator outputs a probability distribution $P_t^G \in \mathbb{R}^{|V|}$ over the vocabulary for the original token $x_t$, where $|V|$ is the vocabulary size.
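Using the symbols defined in the next paragraph, this distribution can be written as follows (a reconstruction consistent with the standard ELECTRA formulation):

$$P_t^G = \mathrm{softmax}\left(W_e\, h_t^G\right)$$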

In the formula, $W_e \in \mathbb{R}^{|V| \times d}$ is the word vector (embedding) matrix, and $h_t^G$ is the hidden-layer representation at the masked position $t$ corresponding to the original token $x_t$.
Still taking the figure above as an example, the original sentence $x = x_1 x_2 x_3 x_4 x_5$ is as follows:
 
the chef cooked the meal
After random masking, let $M = \{1, 3\}$ denote the set of masked position indices, and let $x^m = m_1 x_2 m_3 x_4 x_5$ denote the masked input sentence, as follows:
 
[MASK] chef [MASK] the meal
 
The goal of the generator is then to restore $m_1$ to $x_1$ (i.e., "the") and $m_3$ to $x_3$ (i.e., "cooked"). In the ideal case, that is, if the generator were 100% accurate, every [MASK] token would be restored to the corresponding word of the original sentence. In practice, however, MLM accuracy is not that high. Feeding the masked sentence $x^m$ into the generator may therefore produce a sampled sentence $x^s$ such as:
 
the chef ate the meal
As can be seen from this example, the generator successfully restores $m_1$ to the word "the", while at $m_3$ it samples (or predicts) the word "ate" instead of "cooked" from the original sentence.
 
The sentence produced by the generator is then used as input to the discriminator. Since this sentence no longer contains any artificially introduced symbols (such as [MASK]), ELECTRA avoids the mismatch between pre-training inputs and downstream-task inputs.
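The following toy PyTorch sketch illustrates this sampling step on the example sentence; the vocabulary, token ids, and random logits are placeholders rather than outputs of a real generator.

```python
import torch

# Toy vocabulary and ids for "the chef cooked the meal" (placeholders, not real ELECTRA data).
vocab = ["the", "chef", "cooked", "ate", "meal", "[MASK]"]
original_ids = torch.tensor([0, 1, 2, 0, 4])      # the chef cooked the meal
masked_ids = torch.tensor([5, 1, 5, 0, 4])        # [MASK] chef [MASK] the meal
masked_positions = torch.tensor([0, 2])           # M = {1, 3} in 1-based indexing

# Stand-in for the generator's output logits at every position.
gen_logits = torch.randn(len(masked_ids), len(vocab))

# Sample x^s: replace each [MASK] with a token drawn from the generator's
# softmax distribution at that position.
probs = torch.softmax(gen_logits[masked_positions], dim=-1)
sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
x_s = masked_ids.clone()
x_s[masked_positions] = sampled

print([vocab[i] for i in x_s.tolist()])   # e.g. ['the', 'chef', 'ate', 'the', 'meal']
```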

3. Discriminator

Affected by the accuracy of the MLM, the sentence $x^s$ sampled by the generator differs somewhat from the original sentence. The goal of the discriminator is to identify which words in the sampled sentence still match the corresponding words of the original sentence $x$ and which have been replaced, i.e., the replaced token detection task. This task can be solved with binary classification.
For a given sampled sentence $x^s$, the discriminator's Transformer produces the hidden-layer representations $h^D = h_1^D \cdots h_n^D$. The hidden representation at each position is then mapped to a probability through a fully connected layer.
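Concretely, the probability that the token at position $t$ has been replaced can be written as follows (a reconstruction consistent with the ELECTRA paper):

$$D(x^s, t) = \sigma\left(w^\top h_t^D\right)$$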

In the formula, $w \in \mathbb{R}^d$ is the weight of the fully connected layer ($d$ is the hidden-layer dimension) and $\sigma$ is the Sigmoid activation function. Assuming that label 1 means a token has been replaced and 0 means it has not, the prediction labels for the sentence "the chef ate the meal" produced by the generator's sampling, written as $y = y_1 \cdots y_n$, are:
 
0 0 1 0 0
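A toy PyTorch sketch of this RTD head, with illustrative (not actual ELECTRA) dimensions:

```python
import torch
import torch.nn as nn

# Illustrative sizes; the real hidden dimension of ELECTRA-base is larger.
d, seq_len = 8, 5
h_D = torch.randn(seq_len, d)      # stand-in for h_1^D ... h_n^D
rtd_head = nn.Linear(d, 1)         # the weight w (plus a bias) of the fully connected layer

# sigma(w^T h_t^D): probability that the token at each position t was replaced.
p_replaced = torch.sigmoid(rtd_head(h_D)).squeeze(-1)   # shape (seq_len,)

# Target labels y for "the chef ate the meal": only "ate" (position 3) was replaced.
y = torch.tensor([0., 0., 1., 0., 0.])
```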

4. Model training

The generator and discriminator are trained with the following loss functions respectively:
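Reconstructed from the ELECTRA paper with the notation introduced above (where $P_t^G(x_t)$ denotes the probability the generator assigns to the original token $x_t$), the two losses are:

$$\mathcal{L}^{G} = \mathbb{E}\left[\sum_{t \in M} -\log P_t^{G}(x_t)\right]$$

$$\mathcal{L}^{D} = \mathbb{E}\left[\sum_{t=1}^{n} -y_t \log D(x^{s}, t) - (1 - y_t)\log\left(1 - D(x^{s}, t)\right)\right]$$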

Ultimately, the model learns the model parameters by minimizing the following loss:
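Following the ELECTRA paper, this combined objective can be written as:

$$\min_{\Theta^{G},\,\Theta^{D}} \sum_{x \in X} \mathcal{L}^{G}(x, \Theta^{G}) + \lambda\, \mathcal{L}^{D}(x, \Theta^{D})$$

Here $\lambda$ is a weighting hyperparameter for the discriminator loss (the original paper uses $\lambda = 50$).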

where $X$ denotes the entire large-scale pre-training corpus; $\Theta^G$ and $\Theta^D$ denote the parameters of the generator and the discriminator, respectively.
Note: since the connection between the generator and the discriminator involves a sampling step, the discriminator's loss is not propagated back to the generator, because the sampling operation is not differentiable. In addition, after pre-training is finished, only the discriminator is used for downstream task fine-tuning; the generator is discarded.
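The toy PyTorch sketch below illustrates this joint training step; all modules and tensors are simplified stand-ins rather than the actual ELECTRA implementation. The sampled token ids are discrete, so the computation graph is cut at the sampling step and the RTD loss contributes no gradient to the generator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the generator / discriminator outputs (not real ELECTRA modules).
vocab_size, d, n = 100, 16, 5
gen_lm_head = nn.Linear(d, vocab_size)   # maps h_t^G to vocabulary logits
rtd_head = nn.Linear(d, 1)               # maps h_t^D to a replaced/original logit
lambda_rtd = 50.0                        # loss weight lambda (value used in the paper)

h_gen = torch.randn(n, d)                # stand-in for generator hidden states h^G
h_disc = torch.randn(n, d)               # stand-in for discriminator hidden states h^D
original_ids = torch.randint(0, vocab_size, (n,))
masked_positions = torch.tensor([0, 2])  # M = {1, 3} in 1-based indexing

# Generator (MLM) loss: computed only over the masked positions.
logits = gen_lm_head(h_gen)
mlm_loss = F.cross_entropy(logits[masked_positions], original_ids[masked_positions])

# Sampling step: produces integer ids, which cuts the gradient path to the generator.
probs = torch.softmax(logits[masked_positions], dim=-1)
x_s = original_ids.clone()
x_s[masked_positions] = torch.multinomial(probs, num_samples=1).squeeze(-1)

# Discriminator (RTD) loss: computed over all positions, label 1 = replaced.
rtd_labels = (x_s != original_ids).float()
rtd_logits = rtd_head(h_disc).squeeze(-1)
rtd_loss = F.binary_cross_entropy_with_logits(rtd_logits, rtd_labels)

# Joint objective L^G + lambda * L^D; backward() computes gradients for both heads,
# but the RTD term contributes none to gen_lm_head.
total_loss = mlm_loss + lambda_rtd * rtd_loss
total_loss.backward()
```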

5. Other improvements

(1) Smaller generators. From the previous introduction, it can be seen that both the generator and the discriminator use a BERT-style Transformer as their backbone, so the two could in principle use the same parameter scale. However, this would make pre-training take roughly twice as long as training a single model. To improve pre-training efficiency, the generator in ELECTRA has fewer parameters than the discriminator. Concretely, the hidden-layer dimension, fully connected (feed-forward) layer dimension, and number of attention heads of the generator's Transformer are reduced. The scaling ratio differs for discriminators of different sizes, usually between 1/4 and 1/2; for the ELECTRA-base model, the ratio is 1/3. The table below compares the parameters of the generator and discriminator of the ELECTRA-base model.
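Based on the hyperparameters reported in the ELECTRA paper, the approximate configurations for ELECTRA-base are:

Component        Layers   Hidden size   Feed-forward size   Attention heads
Generator        12       256           1024                4
Discriminator    12       768           3072                12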
 

Why reduce the size of the generator rather than the discriminator? As mentioned above, the generator is only used in the pre-training phase and not in the downstream fine-tuning phase, so it is reasonable to reduce the size of the generator.
 
(2) Parameter sharing. To allow more flexible modeling, ELECTRA first applies word-vector factorization, mapping the word-vector dimension to the hidden-layer dimension through a fully connected layer. As mentioned above, ELECTRA uses a smaller generator, so the generator and the discriminator cannot directly share all their parameters. Parameter sharing in ELECTRA is therefore limited to the input-layer weights, namely the word vector and position vector matrices.
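A minimal PyTorch sketch of this input-layer sharing plus factorized projection, assuming illustrative sizes only (a 128-dimensional shared embedding projected up to hypothetical generator and discriminator hidden sizes):

```python
import torch
import torch.nn as nn

# Illustrative sizes only; not the actual ELECTRA configuration.
vocab_size, max_len, embed_dim = 30522, 512, 128
gen_hidden, disc_hidden = 256, 768

# Shared input-layer weights: the same Embedding objects are used by both models.
token_emb = nn.Embedding(vocab_size, embed_dim)    # word vector matrix
pos_emb = nn.Embedding(max_len, embed_dim)         # position vector matrix

# Each model maps the shared embeddings to its own hidden dimension
# through a fully connected layer (the factorization described above).
gen_proj = nn.Linear(embed_dim, gen_hidden)
disc_proj = nn.Linear(embed_dim, disc_hidden)

ids = torch.randint(0, vocab_size, (1, 16))
positions = torch.arange(16).unsqueeze(0)
shared = token_emb(ids) + pos_emb(positions)
gen_input = gen_proj(shared)      # fed into the generator's Transformer layers
disc_input = disc_proj(shared)    # fed into the discriminator's Transformer layers
```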

Origin blog.csdn.net/weixin_45684362/article/details/130941601