A Detailed Explanation of the Unified Language Model (UniLM)


Foreword

Pre-training models can be divided into three categories according to their training method or network structure:

The first is the auto-encoding (Auto-Encoding) language model represented by BERT [2], which uses the masked language model (MLM) as its pre-training task. Auto-encoding pre-training models tend to be better at discriminative tasks, also called natural language understanding (NLU) tasks, such as text classification and NER.

The second is the auto-regressive (Auto-Regressive) language model represented by GPT [3], which is generally pre-trained with a generation task, much like the way we write an article. Auto-regressive language models are better at generation tasks (Natural Language Generation, NLG), such as article generation.

The third is the pre-training model based on the encoder-decoder architecture, such as MASS [4], which encodes the input sentence into a feature vector with the encoder and then converts that feature vector into an output text sequence with the decoder. The advantage of encoder-decoder pre-training models is that they combine the strengths of the auto-encoding and auto-regressive language models: a classification layer can be attached after the encoder for discriminative tasks, while using the encoder and decoder together enables generation tasks.

The Unified Language Model (UniLM) [1] introduced here has, from the point of view of network structure, the same encoder architecture as BERT. Judging from its pre-training tasks, however, it can not only be trained on masked context like an auto-encoding language model, but also be trained from left to right like an auto-regressive language model. It can even, like an encoder-decoder model, first encode the input text and then generate an output sequence from left to right.

1. Detailed explanation of UniLM

As just introduced, the three types of pre-training architectures usually require different pre-training tasks. These tasks, however, can all be summarized as predicting unknown content from known content; the difference lies in which content is known and which needs to be predicted. The core of UniLM is to unify the tasks used to train these different architectures into a single framework similar to the masked language model, and then adapt it to different tasks through a configurable mask matrix (Mask Matrix). The whole core of UniLM can be summarized in Figure 1.

 

Figure 1: The network structure of UniLM and its different pre-training tasks.

1.1 Model input

First, for an input sentence, UniLM uses WordPiece [5] to tokenize it. In addition to the token embedding obtained from tokenization, UniLM adds a position embedding (computed in the same way as in BERT) and a segment embedding (Segment Embedding) for distinguishing the two segments of a text pair. To obtain a feature vector for the whole sentence, UniLM adds an [SOS] token at the beginning of the sentence, and to separate the different segments it adds [EOS] tokens. For concrete examples, refer to the content inside the blue dotted box in Figure 1.
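To make the input format concrete, here is a minimal sketch (not the official implementation; the vocabulary size, hidden size, and the helper `build_input` are assumptions made for illustration) of how a text pair could be assembled into `[SOS] S1 [EOS] S2 [EOS]` and turned into the sum of token, position, and segment embeddings:

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, MAX_LEN, NUM_SEGMENTS = 30000, 768, 512, 2

tok_emb = nn.Embedding(VOCAB, HIDDEN)         # token embedding
pos_emb = nn.Embedding(MAX_LEN, HIDDEN)       # position embedding (as in BERT)
seg_emb = nn.Embedding(NUM_SEGMENTS, HIDDEN)  # segment embedding for S1 / S2

def build_input(s1_ids, s2_ids, sos_id=1, eos_id=2):
    """Assemble [SOS] S1 [EOS] S2 [EOS] and return H^0 = token + position + segment."""
    ids = [sos_id] + s1_ids + [eos_id] + s2_ids + [eos_id]
    segs = [0] * (len(s1_ids) + 2) + [1] * (len(s2_ids) + 1)
    ids_t = torch.tensor(ids).unsqueeze(0)            # (1, seq_len)
    segs_t = torch.tensor(segs).unsqueeze(0)          # (1, seq_len)
    pos_t = torch.arange(ids_t.size(1)).unsqueeze(0)  # (1, seq_len)
    h0 = tok_emb(ids_t) + pos_emb(pos_t) + seg_emb(segs_t)
    return h0, ids_t, segs_t

h0, ids, segs = build_input([10, 11, 12, 13, 14], [20, 21, 22])
print(h0.shape)  # torch.Size([1, 11, 768])
```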

1.2 Network structure

As shown in the red dotted box in Figure 1, UniLM uses an $L$-layer Transformer architecture. So that the different pre-training tasks can share this single network, UniLM adds a mask matrix operator to it. Specifically, assume the input text is $x = x_1, \dots, x_{|x|}$. After the embedding layer it becomes the input of the first layer, $H^0 = [\mathbf{x}_1, \dots, \mathbf{x}_{|x|}]$, and each of the $L$ Transformer layers then computes $H^l = \mathrm{Transformer}_l(H^{l-1})$, $l \in [1, L]$, yielding the final feature vectors $H^L$. Different from the original Transformer, UniLM adds a mask matrix to the self-attention. Taking the $l$-th layer as an example, the attention computation takes the form shown in equations (1) to (3).

$$ Q = H^{l-1} W_l^Q, \qquad K = H^{l-1} W_l^K, \qquad V = H^{l-1} W_l^V \tag{1} $$

$$ M_{ij} = \begin{cases} 0, & \text{token } i \text{ may attend to token } j \\ -\infty, & \text{otherwise} \end{cases} \tag{2} $$

$$ A_l = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} + M \right) V \tag{3} $$

Here $W_l^Q$, $W_l^K$, and $W_l^V$ are the weight matrices of the Transformer's Query, Key, and Value, and $M$ is the mask matrix, mentioned several times above, that controls the pre-training task. By masking part of the attention scores, it ensures that each prediction can only attend to the features relevant to the specific task, which is how the different pre-training modes are realized. So how exactly does it work?

 

 
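Before answering that, here is a minimal sketch of equations (1) to (3) (a single attention head and made-up dimensions, not the official implementation): the mask matrix $M$ is simply added to the attention scores before the softmax, so positions masked with $-\infty$ receive zero attention weight.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttention(nn.Module):
    """Single-head self-attention with an additive mask matrix M (Eqs. 1-3)."""
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W_l^Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W_l^K
        self.w_v = nn.Linear(d_model, d_model, bias=False)  # W_l^V

    def forward(self, h, mask):
        # h:    (batch, seq_len, d_model) -- output of the previous layer H^{l-1}
        # mask: (seq_len, seq_len), 0 where attention is allowed, -inf where blocked
        q, k, v = self.w_q(h), self.w_k(h), self.w_v(h)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)  # QK^T / sqrt(d_k)
        scores = scores + mask                                      # add the mask matrix M
        attn = F.softmax(scores, dim=-1)                            # blocked entries -> weight 0
        return attn @ v                                             # A_l
```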

1.3 Unification of tasks

UniLM has four pre-training tasks in total: besides the three language models shown in Figure 1, there is also the classic NSP (next sentence prediction) task. They are introduced separately below.

Bidirectional language model: The bidirectional language model is the top task in Figure 1. Like the masked language model, it uses the full context to predict the masked tokens. In this training mode, because the model needs to attend to the entire context, $M$ is the all-zero matrix.

Unidirectional language model: The unidirectional language model can run from left to right or from right to left. The example in Figure 1 is left to right, which is the masking method used in GPT [3]. In this prediction mode, when the model predicts the token at the $t$-th time step it can only see the content before the $t$-th time step, so $M$ is a matrix whose strictly upper triangle is all $-\infty$ (the second shaded mask matrix in Figure 1). Similarly, when the unidirectional language model runs from right to left, the $-\infty$ entries occupy the lower triangle of $M$.

Seq-to-Seq language model: In Seq-to-Seq tasks such as machine translation, we usually first encode the input sentence into a feature vector with an encoder, and then decode that feature vector into the predicted output with a decoder. UniLM's structure is very different from the traditional encoder-decoder model: it consists only of a multi-layer Transformer. For pre-training, UniLM first splices the two sentences into one sequence, separated by [EOS] and expressed as [SOS] S1 [EOS] S2 [EOS]. When encoding, we need to see the complete content of the input sentence, so the input text is not masked; when decoding, the decoder part behaves as a left-to-right unidirectional language model. Therefore the columns of $M$ corresponding to the first segment (the S1 part) are all zeros, so every token can attend to all of S1; the block where S1 attends to S2 (the upper-right block) is all $-\infty$; and the block where S2 attends to S2 (the lower-right block) has its strictly upper triangle set to $-\infty$. This gives the bottom mask matrix $M$ in Figure 1. It can be seen that although UniLM adopts an encoder architecture, when training the Seq-to-Seq language model it can still, like a classic encoder-decoder, attend to all the features of the input as well as the already generated features of the output. A sketch of all three mask matrices is given after this list.

NSP: Like BERT, UniLM also adds next sentence prediction (NSP) as a pre-training task.
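As promised above, the following sketch builds the three mask matrices just described (purely illustrative helper functions; 0 means attention is allowed, $-\infty$ means it is blocked, and the segment lengths are made up):

```python
import torch

NEG_INF = float("-inf")

def bidirectional_mask(n):
    # Every token sees every other token: M is the all-zero matrix.
    return torch.zeros(n, n)

def left_to_right_mask(n):
    # Token t only sees tokens <= t: the strictly upper triangle is -inf.
    return torch.triu(torch.full((n, n), NEG_INF), diagonal=1)

def seq_to_seq_mask(len_s1, len_s2):
    # Columns for S1 (including [SOS]/[EOS]) are visible to every token;
    # S1 rows cannot see S2 at all; S2 rows see S2 causally (left to right).
    n = len_s1 + len_s2
    m = torch.zeros(n, n)
    m[:len_s1, len_s1:] = NEG_INF                     # S1 cannot attend to S2
    m[len_s1:, len_s1:] = left_to_right_mask(len_s2)  # causal within S2
    return m

print(seq_to_seq_mask(3, 3))
```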

1.4 Training and fine-tuning

Training: During pre-training, 1/3 of the time is used to train the bidirectional language model, 1/3 of the time is used to train the unidirectional language model (half of that time from left to right and half from right to left), and the remaining 1/3 is used to train the Seq-to-Seq language model, i.e. the encoder-decoder style task.
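As a hypothetical illustration of this time allocation (the text only states the proportions; sampling one task per batch with these weights is an assumption of this sketch):

```python
import random

# (task name, fraction of training time)
TASKS = [
    ("bidirectional", 1 / 3),
    ("left_to_right", 1 / 6),
    ("right_to_left", 1 / 6),
    ("seq_to_seq",    1 / 3),
]

def sample_task():
    """Pick the pre-training task for the next batch according to the stated proportions."""
    names, weights = zip(*TASKS)
    return random.choices(names, weights=weights, k=1)[0]

print(sample_task())
```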

Fine-tuning: For NLU tasks, we can treat UniLM directly as an encoder, obtain the feature vector of the entire sentence from the [SOS] token, and then obtain the predicted category by adding a classification layer on top of that feature vector. For NLG tasks, we splice the sentences into one sequence "[SOS] S1 [EOS] S2 [EOS]" as described above, where S1 is the entire input text. For fine-tuning, we randomly mask out parts of the target sentence S2; at the same time we can also mask the [EOS] of the target sentence, so that the model learns to predict [EOS] and stop generating on its own, instead of generating up to a length we set in advance.
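For the NLU case, a minimal sketch of such a classification head might look as follows (the `encoder` argument stands in for the pre-trained UniLM Transformer stack, and its interface here is hypothetical):

```python
import torch
import torch.nn as nn

class UniLMClassifier(nn.Module):
    """Sentence classification on top of the [SOS] feature vector."""
    def __init__(self, encoder, d_model, num_labels):
        super().__init__()
        self.encoder = encoder                   # pre-trained UniLM body (hypothetical interface)
        self.classifier = nn.Linear(d_model, num_labels)

    def forward(self, h0, mask):
        h = self.encoder(h0, mask)               # (batch, seq_len, d_model)
        sos_vec = h[:, 0]                        # final feature at the [SOS] position
        return self.classifier(sos_vec)          # logits over the classes
```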

2. Summary

Both UniLM and many encoder-decoder models (such as MASS) aim to unify NLU and NLG tasks, but UniLM's architecture is arguably more elegant. When MASS performs NLU tasks it only uses the encoder part of the model, thereby discarding all the features learned by the decoder. One problem with UniLM is that, for classic Seq-to-Seq tasks such as machine translation, its masking mechanism means that generation does not rely on a single whole-sentence feature such as the one at the [SOS] token, but instead attends to the sequence of input token features. This may weaken the capture of whole-sentence characteristics and lead to weaker control over global information in the generated content.

Reference

[1] Dong, Li, et al. "Unified language model pre-training for natural language understanding and generation." Advances in Neural Information Processing Systems 32 (2019).

[2] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[3] Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).

[4] Song, Kaitao, et al. "MASS: Masked sequence to sequence pre-training for language generation." arXiv preprint arXiv:1905.02450 (2019).

[5] Wu, Yonghui, et al. "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv preprint arXiv:1609.08144 (2016).

 
