A brief introduction to the UniLM model

Table of contents

1. Summary

2. In-depth expansion

2.1 Pre-training task

2.2 Model fine-tuning 


1. Summary

Comparing a Transformer-based bidirectional language model (such as the masked language model in BERT) with a unidirectional autoregressive language model (such as the decoder of the BART model) shows that the two differ mainly in which parts of the input sequence the model may use when computing the hidden representation at each position. In a bidirectional Transformer, the hidden representation at every position can use any word in the sequence; in a unidirectional Transformer, only the current position and the words in its "history" can be used. Based on this idea, researchers proposed the Unified Language Model (UniLM), which is built on a single Transformer structure.
Unlike the encoder-decoder structure of the BART model, UniLM uses a single Transformer network to pre-train for both language representation and text generation, and is then applied to language understanding and text generation tasks through fine-tuning. Its core idea is to control the attention range of each word with different self-attention mask matrices, thereby realizing the information-flow constraints of the different language models.

2. In-depth expansion

2.1 Pre-training task

The UniLM model provides a unified framework for pre-training with bidirectional, unidirectional, and sequence-to-sequence language models. Pre-training with the bidirectional language model gives the model language-representation ability, suited to downstream language understanding tasks, while the unidirectional and sequence-to-sequence objectives give the model text-generation ability. The figure below shows the self-attention mask patterns corresponding to the different pre-training tasks.

Assume that the self-attention matrix of the l-th Transformer layer is $A_l$. In UniLM, $A_l$ can be calculated by the following formula:

$$A_l = \mathrm{softmax}\left(\frac{Q_l K_l^{\top}}{\sqrt{d}} + M\right)$$

In the formula, $Q_l$ and $K_l$ are the query and key vectors obtained by linearly mapping the context representations of the l-th layer, and $d$ is the vector dimension. On top of the original self-attention formula, UniLM adds a mask matrix $M \in \mathbb{R}^{n \times n}$, where $n$ is the length of the input sequence. $M$ is a constant matrix defined as follows:

$$M_{ij} = \begin{cases} 0, & \text{if position } i \text{ is allowed to attend to position } j \\ -\infty, & \text{otherwise} \end{cases}$$
By controlling M, different pre-training tasks can be achieved.
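To make the formula above concrete, here is a minimal NumPy sketch of the masked self-attention matrix. The function name and the use of plain NumPy are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def masked_attention_matrix(Q, K, M):
    """Sketch of A_l = softmax(Q K^T / sqrt(d) + M).

    Q, K: (n, d) query/key matrices of one layer.
    M:    (n, n) mask with 0 where attention is allowed and -inf where blocked.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + M                  # additive mask before softmax
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
```

Blocked positions receive −∞ before the softmax, so their attention weights become exactly zero; as in standard Transformer attention, the layer output is then obtained by applying this matrix to the value vectors.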

(1) Bidirectional language model. The input sequence consists of two text fragments separated by a special token [EOS]. As in the BERT model, words in the input text are randomly sampled and replaced with the [MASK] token with a certain probability, and the model predicts the correct word at the corresponding position of the output layer. In this task, any two words in the sequence are "visible" to each other and can therefore attend to each other during the forward pass. In the Transformer model, this corresponds to fully connected self-attention, as shown in Figure (a). In this case no change is made to the original self-attention mask matrix, i.e., M = 0 (all three mask patterns are constructed in the code sketch at the end of this subsection).
(2) Unidirectional language model. This includes forward (left-to-right) and backward (right-to-left) autoregressive language models. Taking the forward language model (Figure (b)) as an example, the hidden representation at a given position can only use the context representations (from the previous layer) at that position and to its left. The corresponding self-attention pattern is a triangular matrix, with gray indicating an attention value of 0; accordingly, the entries of the mask matrix M in the gray region are negative infinity (−∞).
(3) Sequence-to-sequence language model. The mask matrix also makes it easy to implement a sequence-to-sequence language model, which can then be applied to conditional generation tasks. Here the input sequence consists of two text fragments, serving as the condition and the target text (to be generated), respectively. The words in the conditional fragment are mutually "visible", so fully connected self-attention is used; the target fragment is generated word by word in an autoregressive manner, and at each step it can attend to all context representations in the conditional text as well as the target-side context already generated to its left, as shown in Figure (c). In the literature, this structure is also called a prefix language model (PrefixLM).
Unlike the encoder-decoder framework of the BART model, the encoding and decoding parts here share the same set of parameters, and the way the target text attends to the conditional text during autoregressive generation also differs: it happens within the shared self-attention rather than through a separate cross-attention mechanism.
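The three mask patterns described above can be written down directly. The following NumPy sketch is an illustrative construction (the function names are made up for this example); the resulting matrices can be passed as M to the attention sketch earlier in this subsection.

```python
import numpy as np

def bidirectional_mask(n):
    """(a) Bidirectional LM: every position may attend to every position, M = 0."""
    return np.zeros((n, n))

def causal_mask(n):
    """(b) Forward LM: position i attends only to positions j <= i."""
    M = np.zeros((n, n))
    M[np.triu_indices(n, k=1)] = -np.inf   # block everything to the right
    return M

def seq2seq_mask(n_src, n_tgt):
    """(c) Sequence-to-sequence (prefix) LM: the conditional text attends
    bidirectionally to itself; each target position attends to the whole
    conditional text and to the target positions already on its left."""
    n = n_src + n_tgt
    M = np.full((n, n), -np.inf)   # start fully blocked
    M[:, :n_src] = 0.0             # every position may see the conditional text
    for i in range(n_src, n):
        M[i, n_src:i + 1] = 0.0    # causal attention within the target text
    return M
```

For a backward (right-to-left) language model, the lower triangle would be blocked instead; the gray regions in Figures (a)-(c) correspond to the −∞ entries of these matrices.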

2.2 Model fine-tuning 

(1) Classification tasks. For classification tasks, UniLM is fine-tuned in much the same way as BERT. A bidirectional Transformer encoder is used (M = 0), the final hidden-layer representation at the first token [BOS] of the input sequence is taken as the representation of the text and fed into the target classifier, and the model parameters are then fine-tuned with labeled data of the target task.
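A minimal PyTorch-style sketch of this setup is shown below. The `encoder` object, its call signature (`encoder(input_ids, attention_mask=M)`), and the `hidden_size`/`num_labels` arguments are assumptions for illustration, not UniLM's actual interface.

```python
import torch
import torch.nn as nn

class UniLMClassifier(nn.Module):
    """Sketch: classify a text by the final hidden state at its first token [BOS]."""

    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder                        # pre-trained Transformer (assumed interface)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        n = input_ids.size(1)
        M = torch.zeros(n, n)                         # bidirectional setting: M = 0
        hidden = self.encoder(input_ids, attention_mask=M)  # (batch, n, hidden), assumed call
        return self.classifier(hidden[:, 0])          # logits from the [BOS] representation
```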
(2) Generation tasks. For generation tasks, words in the target text fragment are randomly sampled and replaced with the [MASK] token, and the learning objective during fine-tuning is to recover these replaced words. Notably, the [EOS] token at the end of the input sequence can also be randomly replaced, which lets the model learn when to stop generating.
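A small sketch of how target-side tokens (including the trailing [EOS]) might be corrupted for generative fine-tuning. The masking rate, helper name, and toy token strings are illustrative assumptions.

```python
import random

random.seed(0)

def mask_target_tokens(tokens, n_src, mask_rate=0.7):
    """Randomly replace target-segment tokens (possibly including the final
    [EOS]) with [MASK]; fine-tuning learns to recover the replaced words."""
    corrupted, labels = list(tokens), {}
    for i in range(n_src, len(tokens)):        # only the target segment
        if random.random() < mask_rate:
            labels[i] = tokens[i]              # original word to be recovered
            corrupted[i] = "[MASK]"
    return corrupted, labels

src = ["[BOS]", "how", "are", "you", "?", "[EOS]"]
tgt = ["i", "am", "fine", "[EOS]"]
corrupted, labels = mask_target_tokens(src + tgt, n_src=len(src))
```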
