[AI Theory Learning] Language Model: BERT’s Optimization Method


BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model in natural language processing with powerful text understanding capabilities. However, BERT also has some shortcomings, which are mainly reflected in the following aspects:
1) The training and testing procedures are inconsistent. During training, 15% of the input tokens are randomly replaced with the [MASK] token, but this token never appears at test or fine-tuning time, which hurts model performance.
2) For the replaced [MASK] tokens, BERT's loss function uses an approximation: it assumes that the masked words are independent of each other given the unmasked words. This assumption does not (always) hold.

In addition, when the number of model parameters is relatively large, BERT does well on natural language understanding tasks, but not on natural language generation tasks, and it lacks dependency modeling between segments. Therefore, many new models have been proposed, such as XLNet, ALBERT, and ELECTRA.

The XLNet model, illustrated

XLNet: Generalized Autoregressive Pretraining for Language Understanding
Paper abstract: Denoising autoencoding pre-training models with the ability to model bidirectional context, such as BERT, perform better than pre-training methods based on autoregressive language modeling. However, because it relies on masks to corrupt the input, BERT ignores the dependencies between masked positions and suffers from a discrepancy between pre-training and fine-tuning. In view of these pros and cons, we propose XLNet, a generalized autoregressive pre-training method that learns bidirectional context by maximizing the expected likelihood over all permutations of the factorization order, and that overcomes the limitations of BERT thanks to its autoregressive formulation. In addition, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pre-training. Empirically, under comparable experimental settings, XLNet outperforms BERT on 20 tasks, often by a wide margin, including question answering, natural language inference, sentiment analysis, and document ranking.

To keep the advantage of BERT's bidirectional learning while fixing problems such as the train/test inconsistency caused by replacing 15% of the input tokens with [MASK] and the failure to model dependencies among those [MASK]-ed tokens, XLNet uses Permutation Language Modeling (PLM), and adopts the Transformer-XL architecture to address the lack of dependency between segments.

1. Permutation Language Modeling

Permutation Language Modeling (PLM) trains a model to predict a token given the preceding text. It is similar to a traditional language model, but instead of predicting tokens in left-to-right order, it predicts them in some random order. To illustrate, here is an example:

“Sometimes you have to be your own hero.”

A traditional language model will predict tokens in the following order:

“Sometimes”, “you”, “have”, “to”, “be”, “your”, “own”, “hero”

Each token uses all previous tokens as context.

However, in permutation language modeling, the prediction order is not necessarily left to right. For example, it might be

“own”, “Sometimes”, “to”, “be”, “your”, “hero”, “you”, “have”

Among them, "Sometimes" will be conditional on seeing "own", "to" will be conditional on seeing "own" & "Sometimes", and so on.

Suppose there is an input sequence $\{x_1, x_2, x_3, x_4\}$. According to the permutation language model, the sequence can be factorized in multiple orders. In this way, when the autoregressive model predicts $x_3$, it can simultaneously see the preceding context $(x_1, x_2)$ and the following context $(x_4)$. As shown below:
Figure: different factorization orders of the permutation language model when predicting $x_3$
Examples of the permutation language modeling objective for predicting $x_3$ under different factorization orders, with the same input sequence $\mathbf{x}$. In the upper-left figure the factorization order is $3 \to 2 \to 4 \to 1$, so when predicting $x_3$ the model cannot attend to any other word and can only rely on the previous hidden state. In the upper-right figure the factorization order is $2 \to 4 \to 3 \to 1$, so $x_3$ can be predicted from $x_2$ and $x_4$, i.e., using words both to the left and to the right of $x_3$. The lower-left and lower-right figures can be read in the same way.

Note that the model in permutation language modeling is forced to model bidirectional dependencies. From an expectation perspective, the model should learn to model dependencies between all combinations of inputs, while traditional language models only learn one-way dependencies.
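
To make this concrete, here is a minimal sketch in plain Python (the example sentence and variable names are purely illustrative) that samples one factorization order and prints the context each prediction is conditioned on:

import random

tokens = ["Sometimes", "you", "have", "to", "be", "your", "own", "hero"]

# Sample one factorization order (a permutation of the token positions).
order = list(range(len(tokens)))
random.shuffle(order)

# Under permutation language modeling, the token at order[t] is predicted
# conditioned on the tokens at order[0..t-1], regardless of their
# left-to-right positions in the original sentence.
for t, pos in enumerate(order):
    context = [tokens[p] for p in order[:t]]
    print(f"predict {tokens[pos]!r:12} given {context}")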

2.XLNet integrates Transformer-XL concept

In addition to using permutation language modeling, XLNet also leverages Transformer-XL, further improving its results.
Key ideas behind the Transformer-XL model:

  1. Relative positional embeddings. Each segment needs its own position information, so Transformer-XL adopts relative position encodings.
  2. Recurrence mechanism.
    • The representations computed for the previous segment are fixed and cached so they can be reused as extended context when the model processes the next segment.
    • The maximum possible dependency length is increased by a factor of N, where N is the depth of the network.
    • It resolves the context-fragmentation problem by providing the necessary context for tokens at the beginning of a new segment.
    • Because repeated computation is avoided, Transformer-XL is over 1,800 times faster than the vanilla Transformer during evaluation on language modeling tasks.

The hidden states cached and frozen from the previous segment remain unchanged while permutation language modeling is performed on the current segment. Since all words of the previous segment are used as input, there is no need to know the factorization order of the previous segment.
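
A simplified sketch of the recurrence mechanism (an illustration only, not the actual Transformer-XL code; relative positional encodings are omitted and the layer and dimension choices are made up): the hidden states of the previous segment are cached and detached, then concatenated with the current segment's states to form the keys and values, while the queries come only from the current segment.

import torch
import torch.nn as nn

d_model, mem_len = 64, 8

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

def segment_step(current, memory=None):
    """One attention step with Transformer-XL style recurrence.

    current: (batch, seg_len, d_model) hidden states of the current segment
    memory:  (batch, mem_len, d_model) cached states of the previous segment, or None
    """
    if memory is None:
        context = current
    else:
        # Extended context: cached previous segment + current segment.
        context = torch.cat([memory, current], dim=1)
    # Queries come from the current segment only; keys/values also see the memory.
    out, _ = attn(query=current, key=context, value=context)
    # Cache (and freeze) the last `mem_len` states for the next segment.
    new_memory = out[:, -mem_len:].detach()
    return out, new_memory

x1 = torch.randn(2, 16, d_model)      # segment 1
x2 = torch.randn(2, 16, d_model)      # segment 2
out1, mem = segment_step(x1)          # no memory yet
out2, mem = segment_step(x2, mem)     # reuses segment-1 states as extended context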

3. Use Two-Stream Self-Attention mechanism

For a language model built on the standard Transformer, when predicting the token at position i, the entire embedding of that word is masked out, including its positional embedding. This means the model is cut off from knowledge about the position of the token it is predicting.


What problems does Permutation Language Modeling bring?

Permutation can enable the AR model to see the context from both directions, but it also brings problems that the original Transformer cannot solve. The goal of permutation language modeling is as follows:
$$\max_{\theta} \; \mathbb{E}_{\mathbf{z}\sim \mathcal{Z}_T}\left[\sum_{t=1}^{T} \log p_\theta\left(x_{z_t}\mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]$$
where:

  • $\mathbf{z}$: a factorization order, sampled from $\mathcal{Z}_T$, the set of all permutations of a length-$T$ sequence
  • $p_\theta$: the likelihood function
  • $x_{z_t}$: the $t^{th}$ token in the factorization order
  • $\mathbf{x}_{\mathbf{z}_{<t}}$: the tokens before the $t^{th}$ token

This formula is the objective function of permutation language modeling: the first t-1 tokens of the factorization order are used as context to predict the t-th token.
The standard Transformer fails to meet two requirements:

  1. To predict the token $x_t$, the model should see only the position of $x_t$, not the content of $x_t$.
  2. To predict the token $x_t$, the model should encode all tokens before $x_t$ as content.

Considering the first requirement above, BERT merges positional encoding with token embedding (see the figure below), so position information cannot be separated from token embedding:
BERT encoding

Does BERT have a problem separating positional embeddings from token embeddings?

BERT is an AE language model and does not need separate position information the way an AR language model does. Unlike XLNet, which requires position information to predict the t-th token, BERT uses [MASK] to represent the token to be predicted (we can regard [MASK] as a placeholder). For example, if BERT uses $x_2$, $x_1$ and $x_4$ to predict $x_3$, then the embeddings of $x_2$, $x_1$ and $x_4$ contain their position information and other information related to [MASK]. Therefore the model has a good chance of predicting that [MASK] is $x_3$.

BERT's embeddings contain two types of information, namely positional embeddings and token/content embeddings (here, we skip sequence embeddings because we don't care about the next sentence prediction (NSP) task), as shown in the figure below.
BERT Embeddings
Position information is easy to understand. It tells the model the location of the current token. Content information (semantics and syntax) contains the "meaning" of the current token, as shown in the figure below.
BERT embedding
An intuitive example of such an embedding relation, from the Word2Vec paper, is: $queen = king - man + woman$.


In order to solve this problem, XLNet introduces the Two-Stream Self-Attention mechanism , as shown in the following figure:
Figure 2 Calculation process of dual-stream self-attention mechanism
Figure 2: Two-stream self-attention for target-aware representations. (a) Content stream attention, which is the same as standard self-attention. (b) Query stream attention, which does not have access to the content information of $x_{z_t}$. (c) Overview of permutation language modeling training with two-stream attention.

As the name suggests, two-stream self-attention contains two kinds of self-attention. One is content stream attention, which is the standard self-attention in the Transformer. The other is query stream attention; XLNet introduces it to replace the [MASK] token used in BERT.

For example, if BERT wants to predict $x_3$ given knowledge of the context words $x_1$ and $x_2$, it can use [MASK] to stand for the $x_3$ token; [MASK] is just a placeholder. Meanwhile, the embeddings of $x_1$ and $x_2$ contain positional information that helps the model "know" that [MASK] sits at the position of $x_3$.

But the situation is different for XLNet. A token $x_3$ plays two roles. When it is used as content to help predict other tokens, we can use its content representation (learned via content stream attention) to represent $x_3$. But when we want to predict $x_3$ itself, we should only know its position, not its content. That is why XLNet uses a query representation (learned via query stream attention), which retains the context information before $x_3$ and only the position information of $x_3$.

To understand two-stream self-attention intuitively, we can simply think of XLNet's query representation as playing the role of [MASK] in BERT: the two models just choose different ways to accomplish the same task.

In this way, the query stream can be used at the position to be predicted without leaking the content information of that position. Concretely, two sets of hidden states are used, $g$ and $h$: $g$ contains only position information and serves as the query Q in self-attention, while $h$ contains the content information and serves as K and V.
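
A minimal sketch of the two streams (the shapes, initialization, and masks here are illustrative rather than XLNet's actual implementation): the content stream uses $h$ as query, key, and value, while the query stream uses $g$ as the query but takes its keys and values from $h$, so the predicted position contributes its position information (carried by $g$) but not its content. The permutation-based masks are constructed in the sketch a few paragraphs below.

import torch
import torch.nn.functional as F

seq_len, d = 4, 16
h = torch.randn(seq_len, d)   # content stream: carries each token's content
g = torch.randn(seq_len, d)   # query stream: carries position info only

def attention(q, kv, mask):
    # mask[i, j] = True means position i may attend to position j.
    scores = q @ kv.t() / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ kv

# Placeholder masks: in XLNet these come from the factorization order
# (see the mask-construction sketch below); the only difference shown here
# is that the query stream may not attend to its own position.
content_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
query_mask = ~torch.eye(seq_len, dtype=torch.bool)

h_next = attention(h, h, content_mask)  # content stream: Q = h, K = V = h
g_next = attention(g, h, query_mask)    # query stream:   Q = g, K = V = h (target's content hidden)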

As shown in Figure 2, the original order of the sentence is $[x_1, x_2, x_3, x_4]$, and we randomly sample a factorization order $[x_3, x_2, x_4, x_1]$. The upper-left part shows the computation of the content representation: to compute the content representation of $x_1$, we should have the token content information of all four tokens, so $KV = [h_1, h_2, h_3, h_4]$ and $Q = h_1$. The lower-left part shows the computation of the query representation: to predict the query representation of $x_1$, we cannot see the content of $x_1$ itself, so $KV = [h_2, h_3, h_4]$ and $Q = g_1$.

The figure on the right shows the entire computation process, read from bottom to top. First, $h_i$ and $g_i$ are initialized to $e(x_i)$ and $w$, respectively. Then the content mask and the query mask are used to compute the first-layer outputs $h^{(1)}$ and $g^{(1)}$, followed by the second layer, the third layer, and so on.

Note the content mask and the query mask on the far right; both are matrices. Look at the content mask first. The first row has 4 red dots, indicating that the first token ($x_1$) can attend to all other tokens, including itself (following the order $x_3 \to x_2 \to x_4 \to x_1$). The second row has two red dots, indicating that the second token ($x_2$) can attend to two tokens ($x_3 \to x_2$). The difference between the query mask and the content mask is that a position cannot attend to itself, so the diagonal entries are all white dots.

In summary: there is only one order for input sentences. But we can use different attention masks to implement different factorization orders.
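
The following sketch builds the content mask and query mask for the factorization order $3 \to 2 \to 4 \to 1$ from Figure 2 (1-based positions; an illustration of the idea, not XLNet's code):

import torch

order = [3, 2, 4, 1]          # factorization order (1-based token positions)
n = len(order)
# rank[pos] = step at which original position `pos` is predicted
rank = {pos: step for step, pos in enumerate(order)}

content_mask = torch.zeros(n, n, dtype=torch.bool)
query_mask = torch.zeros(n, n, dtype=torch.bool)
for i in range(1, n + 1):          # row: the token being computed
    for j in range(1, n + 1):      # column: the token being attended to
        earlier = rank[j] < rank[i]
        content_mask[i - 1, j - 1] = earlier or i == j   # may see itself
        query_mask[i - 1, j - 1] = earlier               # may NOT see itself

print(content_mask.int())   # row 1 (x1) is all ones: x1 is last in the order
print(query_mask.int())     # diagonal is zero: a position never sees its own content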


Autoregressive vs. Autoencoder Models

Unsupervised representation learning has achieved great success in the field of natural language processing. Typically, these methods first pretrain neural networks on large-scale unlabeled text corpora and then fine-tune the model or representation on downstream tasks. Under this shared high-level idea, different unsupervised pre-training objectives have been explored in the literature. Among them, autoregressive (AR) language modeling and autoencoding (AE) are the two most successful pre-training objectives. Relating this to the Transformer architecture, the Transformer encoder is an AE model, while the Transformer decoder is an AR model .

The following tree diagram (source) shows Transformer encoder/AE models (blue), Transformer decoder/AR models (red), and Transformer encoder-decoder/seq2seq models (grey):
Transformer Encoder vs Decoder
An AR model learns over a series of time steps and takes the measurements from previous steps as input to a regression model in order to predict the value at the next time step. AR models are often used for generative tasks, such as those in the field of natural language generation (NLG): summarization, translation, or abstractive question answering. Representatives include ELMo, GPT, etc.
AR model
AE-based pre-training does not perform explicit density estimation; instead, it aims to reconstruct the original data from a corrupted input ("fill in the blanks"). AE models are often used for content understanding tasks, such as natural language understanding (NLU) tasks involving classification, e.g., sentiment analysis or extractive question answering. A famous example is BERT, which has been the state-of-the-art pre-training method: given a sequence of input tokens, a certain portion of them is replaced by the special symbol [MASK], and the model is trained to recover the original tokens from the corrupted version.

AE language models are designed to reconstruct the original data from a corrupted input. Since density estimation is not part of the objective, BERT can leverage bidirectional context for reconstruction. As a direct benefit, this closes the bidirectional information gap of AR language modeling, thereby improving performance. However, the artificial symbols such as [MASK] used by BERT during pre-training are absent from real data at fine-tuning time, leading to a pre-train/fine-tune discrepancy. Furthermore, since the predicted tokens are masked in the input, BERT cannot model their joint probability with the product rule the way AR language modeling does. In other words, BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is an oversimplification, since high-order, long-distance dependencies are pervasive in natural language.

Masked language modeling is the common training objective of pre-trained AE models: we predict the original values of the masked tokens in the corrupted input. BERT (and all its variants such as RoBERTa, DistilBERT, ALBERT, etc.) and XLM are examples of AE models.
bi-direction


ALBERT method

Models such as BERT and GPT come in different sizes. In many cases, if the corpus is sufficient, a larger model performs better. However, there are exceptions where a larger model with more parameters performs worse; this is called model degradation (Model Degradation).
Model Degradation
From the graph given in the original paper, we can see how performance degrades. BERT-xlarge performs worse than BERT-large, even though it is larger and has more parameters.
BERT-large vs BERT-xlarge
How to reduce model complexity while maintaining performance or even improving performance? ALBERT is one of these methods.
ALBERT
Paper abstract: Increasing the size of pre-trained models for natural language representation often improves performance on downstream tasks. However, at some point, further increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we propose two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters than BERT-large.

Simply put, ALBERT (A Lite BERT) reduces the number of parameters while maintaining BERT's performance, but it only reduces space complexity, cutting the parameter count from 108M to 12M; it does not reduce time complexity. That is, ALBERT reduces the number of parameters, but not the amount of computation.

So, how does ALBERT reduce the number of parameters ?

  • Factorized embedding parameterization (factorization of word embedding): Decompose the word embedding matrix into two fewer matrices.
  • Cross-layer parameter sharing (cross-layer parameter sharing): This technology can reduce parameters in deep networks.

Parameter reduction techniques are like regularization methods. ALBERT will have 18 times fewer parameters than BERT-large, and the training speed will be 1.7 times faster.

ALBERT also proposed another method to replace NSP (Next-Sentence Prediction Loss) technology. This new technology is called Sentence-Order Prediction (SOP). SOP is a Self-Supervised Loss.

Therefore, ALBERT utilizes three techniques:

  1. factorized embedding parameterization (factorization of word embedding)
  2. cross-layer parameter sharing (cross-layer parameter sharing)
  3. sentence-order prediction (SOP, sentence order prediction)

The backbone of ALBERT is the BERT model, and it also uses the GELU activation function. Denote the vocabulary embedding size as $E$, the number of encoder layers as $L$, and the hidden size as $H$; the number of attention heads is $H/64$.

1. Decompose the Vocabulary Embedding matrix

In BERT and subsequent modeling improvements such as XLNet and RoBERTa, the WordPiece embedding size $E$ is tied to the Transformer hidden size $H$, that is, $E \equiv H$. These embeddings are learned from a one-hot representation over a vocabulary of 30,000 WordPieces and are projected directly into the hidden space of the hidden layers.

Suppose we have a vocabulary of size 30,000, the word-piece embedding dimension is $E = 768$, and the hidden size is $H = 768$. If we increase the number of hidden units in the block, then we must also add a new dimension to every embedding. This problem also exists in XLNet and RoBERTa.
Vocabulary Embedding Matrix
For both modeling and practical reasons, this decision appears to be suboptimal, for the following reasons:

  • From a modeling perspective, WordPiece embeddings are meant to learn context-independent representations, while hidden-layer embeddings are meant to learn context-dependent representations. As experiments on context length demonstrate (Liu et al., 2019), the power of BERT-like representations comes from using context to provide the signal for learning such context-dependent representations. Decoupling the WordPiece embedding size $E$ from the hidden size $H$ therefore lets us use the total model parameters more efficiently; the modeling requirements suggest $H \gg E$.
  • From a practical perspective, the vocabulary size $V$ is usually very large. If $E \equiv H$, then increasing $H$ makes the $V \times E$ embedding matrix very large, which makes the model parameters excessive and slows down training.

Therefore, ALBERT solves this problem by decomposing the large vocabulary embedding matrix into two smaller matrices . This separates the size of the hidden layer from the size of the vocabulary embedding . This allows us to increase the size of the hidden layers without significantly increasing the parameter size of the vocabulary embedding .
Schematic diagram of decomposing the Vocabulary Embedding matrix
We project the one-hot encoding vector into a lower-dimensional embedding space of dimension $E = 100$, and then project this embedding into the hidden space of dimension $H = 768$. In other words, ALBERT decomposes the embedding matrix into two matrices, reducing the embedding parameters from $O(V \times H)$ to $O(V \times E + E \times H)$.

At implementation time, a $V \times E$ matrix and an $E \times H$ matrix are randomly initialized. Computing the representation of a word requires multiplying the word's one-hot vector by the $V \times E$ matrix (i.e., a lookup), and then multiplying the result by the $E \times H$ matrix. The parameters of both matrices are learned by the model.

We choose to use the same $E$ for all word pieces because they are much more evenly distributed across documents than whole words are, for which choosing different embedding sizes for different words can matter.
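
As a quick check of the savings, here is a minimal standalone sketch of the factorization (plain PyTorch; it is not the reference implementation quoted below): with $V = 30{,}000$, $H = 768$ and $E = 128$, a tied embedding needs $V \times H \approx 23.0$M parameters, while the factorized version needs $V \times E + E \times H \approx 3.9$M.

import torch.nn as nn

V, E, H = 30_000, 128, 768

# Tied embedding (BERT-style): one V x H matrix.
tied = nn.Embedding(V, H)

# Factorized embedding (ALBERT-style): V x E lookup followed by an E x H projection.
factorized = nn.Sequential(nn.Embedding(V, E), nn.Linear(E, H, bias=False))

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(tied))        # 23,040,000
print(count(factorized))  # 3,938,304  (30000*128 + 128*768)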

The code for the above matrix decomposition process is as follows (the reference code is PyTorch version ALBERT ):

import torch
import torch.nn as nn

# AlbertTransformer and AlbertLayerNorm are defined elsewhere in the reference repository.

class AlbertEncoder(nn.Module):
    def __init__(self, config):
        super(AlbertEncoder, self).__init__()
        self.hidden_size = config.hidden_size
        self.embedding_size = config.embedding_size
        # The E -> H projection: this Linear layer is the second factor (E x H)
        # of the decomposed embedding matrix.
        self.embedding_hidden_mapping_in = nn.Linear(self.embedding_size, self.hidden_size)
        self.transformer = AlbertTransformer(config)

    def forward(self, hidden_states, attention_mask=None, head_mask=None):
        # Project E-dimensional embeddings up to the H-dimensional hidden space
        # before they enter the (parameter-shared) transformer stack.
        if self.embedding_size != self.hidden_size:
            prev_output = self.embedding_hidden_mapping_in(hidden_states)
        else:
            prev_output = hidden_states
        outputs = self.transformer(prev_output, attention_mask, head_mask)
        return outputs  # last-layer hidden state, (all hidden states), (all attentions)


class AlbertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings.

    Note that all three embedding tables use config.embedding_size (E), not the
    hidden size (H) -- this V x E lookup is the first factor of the decomposition.
    """
    def __init__(self, config):
        super(AlbertEmbeddings, self).__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.embedding_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.embedding_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.embedding_size)
        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        self.LayerNorm = AlbertLayerNorm(config.embedding_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, token_type_ids=None, position_ids=None):
        seq_length = input_ids.size(1)
        if position_ids is None:
            position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
            position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        words_embeddings = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)
        embeddings = words_embeddings + position_embeddings + token_type_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

2. Cross-layer parameter sharing

The BERT-large model has 24 layers, while the base version has 12 layers. As we stack more layers, the number of parameters grows quickly.
BERT model parameters
To solve this problem, ALBERT uses the concept of cross-layer parameter sharing. To illustrate, let’s look at an example of a 12-layer BERT base model. Instead of learning unique parameters for each of the 12 layers, we only learn the parameters of the first block and reuse that block in the remaining 11 layers .
ALBERT cross-layer shared parameters
We can share only the parameters of the feed-forward layer, only the attention parameters, or the parameters of the entire block; the ALBERT paper shares the parameters of the entire block.

Compared with the 110 million parameters of the BERT base, the ALBERT model has only 31 million parameters using the same number of layers and 768 hidden units. For an embedding size of 128, the impact on accuracy is minimal. The main drop in accuracy is due to feedforward network parameter sharing. The impact of shared attention parameters is minimal.
Figure: Impact of the cross-layer parameter-sharing strategy on performance

In ALBERT's source code, model parameter sharing is implemented through modeling_utils: PreTrainedModel is implemented as a base class, and its subclasses call init_weights / tie_weights to set up parameter sharing; the core of tie_weights is _tie_or_clone_weights.
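
Conceptually, cross-layer sharing simply means applying one set of layer parameters at every depth. A minimal sketch of the idea (using PyTorch's generic nn.TransformerEncoderLayer for illustration; this is not ALBERT's own layer class or the Hugging Face implementation):

import torch
import torch.nn as nn

num_layers, d_model = 12, 768

# One shared encoder block instead of 12 independent ones
# (nhead=12 gives the H/64 = 768/64 heads mentioned above).
shared_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=12,
                                          dim_feedforward=3072, batch_first=True)

def albert_style_encoder(x):
    # The same module (same parameters) is applied at every layer.
    for _ in range(num_layers):
        x = shared_layer(x)
    return x

x = torch.randn(2, 16, d_model)
out = albert_style_encoder(x)

shared = sum(p.numel() for p in shared_layer.parameters())
print(shared, "parameters in total (~7.1M, independent of depth)")
print(num_layers * shared, "parameters if the 12 layers were not shared")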

3. Use SOP instead of NSP

BERT uses NSP (next-sentence prediction) as a loss. NSP is a binary classification problem: the positive training samples are consecutive sentences from the same document, and the negative samples are sentences taken from different documents. However, subsequent research found NSP to be unreliable, mainly because the task is too easy.
Next sentence prediction
Figure. Next Sentence Prediction

NSP actually mixes two sub-tasks: topic prediction and coherence prediction. Compared with coherence prediction, topic prediction is simpler and overlaps with what MLM already learns. Because the positive samples come from the same document while the negative samples come from different documents (for example, the first sentence from entertainment news and the second from social news), the two sentences differ both in coherence and in topic, so the negative pairs are easy to tell apart.

MLM is similar to a cloze task: the model must predict the word at each [MASK] position. The MLM training samples are continuous streams of text, and each stream comes from a single topic, which is why MLM overlaps with the topic-prediction part of NSP.

ALBERT focuses on sentence coherence and proposes a new task, SOP (sentence-order prediction). Positive samples are obtained in the same way as in BERT; negative samples are the same consecutive sentences with their order swapped. This forces the model to focus on predicting sentence continuity.
Sentence Order Prediction
Figure: Sentence-order prediction takes two consecutive segments from the same document as a positive example, and the same segments with their order swapped as a negative example. This forces the model to learn finer-grained distinctions about discourse-level coherence properties.

The ALBERT authors conjecture that NSP is ineffective because, compared with masked language modeling, it is not a difficult task: it mixes topic prediction and coherence prediction in a single objective, and the topic-prediction part is easy to learn because it overlaps with the masked-language-model loss. NSP can therefore score well even without learning coherence prediction.
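
A hedged sketch of how SOP training pairs can be constructed (illustrative only; the real pipeline operates on tokenized segments and balances positives and negatives across the corpus):

import random

def make_sop_example(seg_a, seg_b):
    """Build one SOP example from two consecutive segments of the same document.

    Positive (label 1): segments in their original order.
    Negative (label 0): the same two segments with their order swapped.
    """
    if random.random() < 0.5:
        return (seg_a, seg_b), 1   # correct order
    else:
        return (seg_b, seg_a), 0   # swapped order

pair, label = make_sop_example("He went to the store.", "He bought a gallon of milk.")
print(pair, label)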

4. Other optimization methods

Other optimization methods are as follows:
1) Instead of masking single tokens as BERT does, ALBERT uses n-gram masking with n from 1 to 3, which alleviates the independence problem between [MASK] tokens to some extent (a sampling sketch follows this list). For Chinese, masking whole words after word segmentation performs somewhat better than masking individual characters.
2) Remove dropout. Because the model did not overfit during training, dropout is removed.
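
For the n-gram masking in 1), the ALBERT paper samples the span length n (up to 3) with probability proportional to 1/n. A sketch of that sampling step (the example tokens are made up):

import random

MAX_N = 3
# p(n) proportional to 1/n, as described in the ALBERT paper.
weights = [1 / n for n in range(1, MAX_N + 1)]   # [1, 0.5, 0.333...] -> roughly 6:3:2

def sample_ngram_length():
    return random.choices(range(1, MAX_N + 1), weights=weights, k=1)[0]

# Example: choose a span length, then mask that many consecutive tokens.
tokens = ["the", "model", "reduces", "the", "number", "of", "parameters"]
n = sample_ngram_length()
start = random.randrange(0, len(tokens) - n + 1)
masked = tokens[:start] + ["[MASK]"] * n + tokens[start + n:]
print(n, masked)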

ELECTRA method

Existing pre-training methods generally fall into two categories. The first is language models (LM), such as ELMo, GPT, and GPT-2, which process text from left to right (or right to left) and predict the next word given the previous context. The other is masked language models (MLM), such as BERT and ALBERT, which predict the small number of words in the input that have been masked.

Compared with LMs, MLMs have the advantage of bidirectional prediction, but their prediction is limited to a small subset of the input tokens (15% of the input sequence), which reduces the amount of information learned from each sentence and increases the computational cost. In addition, because the [MASK] token never appears at test time, there is a mismatch between the training and testing phases, which hurts model performance.

To overcome the shortcomings of MLM, XLNet was proposed; it uses a permutation language model to achieve better results. However, XLNet's pre-training relies on permutation-specific attention masks (the rows and columns of the mask matrix), while the fine-tuning stage is ordinary Transformer processing.

To further improve the learning efficiency of pre-trained language models, ELECTRA proposes the RTD (replaced token detection) task as a replacement for MLM; the architecture is somewhat similar to a GAN. First, a smaller Generator fills in the special [MASK] tokens of a BERT-style masked input with replacement tokens, and then a Discriminator is trained to predict, for every word of the input, whether it has been replaced. The model therefore learns from all tokens of the input sequence, not just BERT's 15%, which is also believed to be why ELECTRA trains faster than BERT. As the figure shows, ELECTRA consistently achieves better results than models such as BERT with less compute and fewer model parameters.
Figure 1 Comparison chart of model computing power consumption
Figure: Replaced token detection pre-training consistently outperforms masked language model pre-training under the same computational budget. The image on the left is a magnified view of the dashed box. The vertical axis is the GLUE score and the horizontal axis is FLOPs (floating point operations, as reported by TensorFlow). As the figure shows, ELECTRA of the same size consistently outperforms BERT, and after training for more steps it reaches the performance of the then-SOTA model, RoBERTa. The curve on the left also suggests that ELECTRA still has room for further improvement.

1. Overview of ELECTRA

The innovation of ELECTRA lies in:

  • It proposes a new pre-training framework that combines a generator and a discriminator; unlike a GAN, however, ELECTRA trains with maximum likelihood estimation rather than adversarial learning.
  • The generative masked language model (MLM) pre-training task is replaced by the discriminative replaced token detection (RTD) task, which judges whether each token has been replaced by the language model.
  • The generator still uses MLM, because the masked language model can effectively learn context information: it predicts the 15% of words that were masked out and substitutes them. If a substituted word differs from the original, that token is labeled as replaced, and the other words in the sentence are labeled as not replaced, so the generator learns good word embeddings; weight sharing passes the generator's embedding information to the discriminator.
  • The discriminator predicts whether each token produced by the generator is original, which updates the Transformer parameters efficiently and speeds up training. The prediction problem becomes binary classification, which improves efficiency, and predicting at every position makes convergence much faster.
  • A small generator is trained together with the discriminator, and their losses are summed, so the discriminator's learning difficulty increases gradually and it learns to handle harder (more plausible) tokens.
  • At fine-tuning time, the generator is discarded and only the discriminator is used.

2. RTD structure

BERT's MLM implementation is not very efficient: only 15% of the tokens are useful for updating the parameters, while the other 85% do not contribute to the gradient. There is also a mismatch between pre-training and fine-tuning, because no [MASK] token appears in the fine-tuning stage.

Therefore, ELECTRA adopts a new structure and uses a new pre-training task: RTD (Replaced Token Detection), which determines whether all words in each sample have been replaced to speed up training. As shown below:
Replaced token detection diagram
Figure: Schematic diagram of replaced token detection. The generator can be any model that predicts the randomly masked tokens; it is usually a smaller BERT model and is trained together with the discriminator. The discriminator's task is to distinguish which tokens have been tampered with by the generator, i.e., which tokens are no longer consistent with the original ones. Although the structure resembles a GAN, the generator is trained with maximum likelihood rather than adversarially, because of the difficulty of applying GANs to text. After pre-training, the generator is discarded and only the discriminator (the ELECTRA model) is fine-tuned on downstream tasks.

The model consists of two parts, namely the generator and the discriminator. Both are Encoder structures of Transformer, but their sizes are different:

  1. Generator
    The generator is a small masked language model (usually about 1/4 the size of the discriminator). It applies the classic BERT MLM recipe:

    • First, randomly select 15% of the tokens and replace them with the [MASK] token (it drops BERT's 80% [MASK] / 10% unchanged / 10% random-replacement scheme, since that trick is unnecessary here: only the discriminator is used for fine-tuning).
    • Then, train the generator to predict the masked tokens, producing the corrupted tokens. The generator's objective is the same as BERT's: recover the original tokens from the masked ones (in the figure above, the tokens "the" and "cooked" are randomly selected to be masked, and the generator's predictions at the corrupted positions become "the" and "ate").
  2. Discriminator
    The discriminator receives the input rewritten by the generator. Its role is to determine whether each input token is original or replaced. Note: if the token produced by the generator is identical to the original token, that token still counts as original. For every token the discriminator performs a binary classification, and the losses are summed.

The above method is called Replaced Token Detection.

3. Loss function

Specifically, the generator G and the discriminator D are the two neural networks we train. Each contains an encoder (i.e., a Transformer network) that maps an input sequence $x = [x_1, ..., x_n]$ to contextual representations $h(x) = [h_1, ..., h_n]$. Their tasks differ: the generator is still trained with MLM (the authors later verify that this works better), while the discriminator's objective is sequence labeling (judging whether each token is original or replaced). Both are trained at the same time, but the discriminator's gradient is not propagated back to the generator, so different loss functions are used to measure their errors. The objective function is as follows:
RTD objective function
Because the discriminator's task is relatively easy, the RTD loss is very small compared with the MLM loss, so a weighting coefficient is added; the authors used 50 during training.
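
Written out (following the ELECTRA paper), the combined objective minimized over the corpus $\mathcal{X}$ is roughly:

$$\min_{\theta_G,\theta_D} \sum_{\mathbf{x}\in\mathcal{X}} \mathcal{L}_{\mathrm{MLM}}(\mathbf{x},\theta_G) + \lambda\, \mathcal{L}_{\mathrm{Disc}}(\mathbf{x},\theta_D)$$

where $\lambda$ is the weighting coefficient mentioned above (50 in the paper's experiments).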

Another thing to note is that the discriminator's loss is computed over all tokens, whereas BERT's MLM loss ignores the tokens that were not masked. In later experiments, the authors verify that computing the loss over all tokens improves both efficiency and effectiveness.
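
Putting the pieces together, here is a heavily simplified sketch of one RTD training step (the stand-in encoders, sizes, and the MASK_ID value are assumptions for illustration; this is not the official ELECTRA code):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, seq_len, batch = 30522, 256, 32, 8
LAMBDA = 50.0   # weight on the discriminator loss, as in the paper

# Tiny stand-in "encoders" (real ELECTRA uses Transformer encoders of different sizes).
gen_encoder = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, hidden), nn.GELU())
disc_encoder = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, hidden), nn.GELU())
gen_mlm_head = nn.Linear(hidden, vocab_size)
disc_head = nn.Linear(hidden, 1)

def rtd_step(input_ids, masked_ids, mlm_mask):
    """input_ids: (B, T) original tokens; masked_ids: input_ids with [MASK] at mlm_mask positions."""
    # 1) Generator: predict the original tokens at the masked positions (MLM loss).
    gen_logits = gen_mlm_head(gen_encoder(masked_ids))
    mlm_loss = F.cross_entropy(gen_logits[mlm_mask], input_ids[mlm_mask])

    # 2) Sample replacements from the generator; no gradient flows back through sampling.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mlm_mask, sampled, input_ids)

    # 3) Discriminator: for EVERY token of the corrupted input, predict "replaced or not".
    #    A sampled token that happens to equal the original counts as "not replaced".
    labels = (corrupted != input_ids).float()
    disc_logits = disc_head(disc_encoder(corrupted)).squeeze(-1)
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, labels)

    return mlm_loss + LAMBDA * disc_loss

input_ids = torch.randint(5, vocab_size, (batch, seq_len))
mlm_mask = torch.rand(batch, seq_len) < 0.15
MASK_ID = 4  # hypothetical id of the [MASK] token
masked_ids = torch.where(mlm_mask, torch.full_like(input_ids, MASK_ID), input_ids)
loss = rtd_step(input_ids, masked_ids, mlm_mask)
loss.backward()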

4. The difference between ELECTRA and GAN

In fact, there are still many differences between the Generator-Discriminator architecture used by ELECTRA and GAN. The author lists the following points:
The difference between ELECTRA and GAN

5. Disadvantages of ELECTRA

Note that the binary nature of the discriminator may make it less suitable for some downstream tasks. It seems reasonable for BERT to pre-train with MLM and then take on downstream tasks, because in the process of predicting whole words MLM effectively builds a context-based representation for every token. The discriminator's task, however, is binary classification, i.e., splitting token representations into two classes, which may cause the information in its hidden space to degrade prematurely.

In addition, because the discriminator itself is pre-trained with a binary classification task, it clearly helps on tasks that are "close to binary classification" (such as GLUE's CoLA task), but for tasks that are less classification-like, such as sequence labeling or text generation, the results may be less strong.

Supplement: An ELECTRA code implemented using PyTorch

References

  1. Detailed explanation of XLNet
  2. ALBERT: Lightweight BERT language model ICLR2020
  3. ELECTRA code interpretation beyond the BERT model
  4. ELECTRA Chinese pre-training model is open source, with only 1/10 the number of parameters, and its performance is still comparable to BERT
  5. XLNet: Generalized Autoregressive Pretraining for Language Understanding
  6. Hugging face: XLNet
  7. XLNet Fine-Tuning Tutorial with PyTorch
  8. Understand how the XLNet outperforms BERT in Language Modelling
  9. Autoregressive vs. Autoencoder Models
  10. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
  11. BERT and ALBERT
  12. Visual Paper Summary: ALBERT (A Lite BERT)
  13. ELECTRA detailed explanation
  14. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
