Improving Sequence-to-Sequence Pre-training via Sequence Span Rewriting (Translation)

Summary

In this article, we generalize text infilling (e.g., masked language models) to sequence span rewriting (SSR) as a self-supervised sequence-to-sequence (seq2seq) pre-training objective. SSR trains the model to rewrite machine-generated imperfect text spans into the ground-truth text, which provides more fine-grained learning signals for text representations and is more consistent with many downstream seq2seq tasks that rewrite a source sentence into a target sentence. Our experiments with T5 models on various seq2seq tasks show that SSR substantially improves seq2seq pre-training. In addition, we observe that SSR is particularly useful for improving the pre-training of small seq2seq models with a powerful imperfect span generator, which points to a new way of transferring knowledge from large models to smaller ones through seq2seq pre-training.

1. Introduction

[Figure 1: text infilling vs. sequence span rewriting]
Text infilling (e.g., masked language modeling) has become a common learning objective for pre-trained models in natural language processing (NLP). As shown in Figure 1, it provides a self-supervised signal by reconstructing the masked parts of plain text, enabling the model to learn text representations by predicting the masked content from its context.
In this article, we generalize text infilling to text rewriting. Specifically, we propose sequence span rewriting (SSR) as a sequence-to-sequence (seq2seq) pre-training objective. As shown in Figure 1, SSR provides a self-supervised signal by rewriting imperfect text spans into the ground-truth text. Compared with text infilling, SSR has two advantages: 1) it provides more numerous and more fine-grained learning signals about how to improve a text span through rewriting, because the model's prediction depends not only on the context but also on the imperfect span; 2) it is more consistent with downstream seq2seq tasks (e.g., text summarization), in which a source sentence is mapped to a target sentence according to certain rewriting patterns.
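To make the difference concrete, here is a made-up example of the two input/output formats; the tokens, span markers and separators are illustrative and may differ from the paper's actual setup:

```python
# Made-up example contrasting the two objectives (markers are illustrative).

# Text infilling: predict the masked spans from context alone.
infill_source = "Thank you [MASK] me to your party [MASK] week."
infill_target = "for inviting <sep> last"

# Sequence span rewriting (SSR): rewrite imperfect, machine-generated spans
# back into the original text, conditioning on both context and the bad span.
ssr_source = "Thank you <s_0> for invite </s_0> me to your party <s_1> next </s_1> week."
ssr_target = "<s_0> for inviting <s_1> last"
```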
As mentioned above, text infilling always replaces a text span with an empty placeholder (i.e., [MASK]), whereas SSR replaces it with imperfect text. Although there are several straightforward ways (e.g., random or rule-based noising and corruption) to generate imperfect spans, most of them cannot produce diverse and informative span text. The model would then learn only limited, low-value rewriting patterns, degrading SSR back to text infilling. To fully exploit SSR, we propose to use a pre-trained text infilling model as the imperfect span generator (inspired by ELECTRA), which produces imperfect spans of reasonable quality and diversity, as shown in Figure 1. In this way, SSR enables the model not only to learn to reconstruct masked sequences, but also to learn meaningful and diverse rewriting patterns, including correcting paraphrasing, grammatical, commonsense and even factual errors to improve a text sequence.
In our experiments, we apply SSR to T5 models that have already been pre-trained with the text infilling objective, and use a T5-large model as the imperfect span generator. We show that SSR improves the original T5 models and their continually trained variants on monolingual seq2seq tasks, including text summarization, question generation, and grammatical error correction. Moreover, we observe that SSR is particularly helpful for improving the pre-training of smaller seq2seq models with a powerful imperfect span generator, which demonstrates its potential for transferring knowledge from large models to smaller models for seq2seq pre-training.

2. Related work

(1) NLP pre-training
Early pre-training methods such as ELMo and GPT are based on language modeling. Recently, Radford et al. (2019) and Brown et al. (2020) showed that very large language models can act as unsupervised multi-task/few-shot learners.
BERT introduced the masked language modeling objective, which masks certain tokens in the text and predicts them from their left and right context. Recent work has shown that BERT's performance can be further improved by training longer, by sharing parameters across layers, and by replacing generation with a discriminative objective. However, in masked language models like BERT the prediction is not autoregressive, which reduces their effectiveness for NLG tasks. To make masked language models usable for natural language generation, Song et al. (2019) generate the masked tokens autoregressively with a decoder, instead of attaching an MLP directly after the encoder.
UniLM fine-tunes BERT with an ensemble of attention masks, some of which allow attending only to leftward context. This allows UniLM to be used for both generative and discriminative tasks.
Recently, BART and T5 proposed to use text infilling as a self-supervised objective to pre-train text-to-text transformers that are suitable for a wide range of NLU and NLG tasks. Specifically, they remove text spans from the input text and train the model to recover the original text autoregressively. CALM proposes to pre-train text-to-text transformers by composing sentences from keywords or concepts. All of the above methods can be regarded as denoising autoencoders, trained to take a noised input as the source sequence and restore the original input. However, the noised sentences they receive during pre-training are incomplete sentences, so there is a gap between pre-training and fine-tuning.
(2) Compression and acceleration of pre-trained models
Recently, many attempts have been made to compress and accelerate large-scale pre-trained language models. For example, Shen et al. (2020) used Hessian information to quantize BERT to 2 bits. Michel et al. (2019) pruned unnecessary attention heads in Transformer layers to reduce the number of parameters in BERT. DistilBERT uses knowledge distillation to compress BERT. More recently, Xu et al. (2020) introduced progressive module replacement to train more compact BERT models. In addition, Zhou et al. (2020c) and Schwartz et al. (2020) proposed input-adaptive inference to speed up the inference stage of pre-trained models. However, so far there has been very little work on compressing large-scale pre-trained text-to-text transformers for downstream NLG tasks.

3. Sequence span rewriting

[Figure 2: the sequence span rewriting pre-training process]
The key idea of the sequence span rewriting objective is to train a text-to-text transformer to rewrite machine-generated text spans, which may contain various kinds of noise such as paraphrasing, lexical substitution, grammatical errors, wrong world knowledge or commonsense, and text expansion or simplification. Specifically, self-supervised training with the sequence span rewriting objective involves three steps: (1) text span masking, (2) text infilling, and (3) sequence span rewriting. We describe these steps in detail and then discuss the advantages of the sequence span rewriting objective for general pre-training and for distilling large text-to-text transformers.
  (1) Text Span Masking
To generate training data for sequence span rewriting in a self-supervised manner, we first randomly sample multiple text spans and mask them. Specifically, each span is replaced with a single [MASK] token, with span lengths drawn from a Poisson distribution (λ=3). The number of spans is controlled so that approximately 30% of all tokens are masked. A span of length 0 corresponds to the insertion of a [MASK] token. This step ensures that the noised input text used for the text infilling step matches the pre-training data distribution of the BART or T5 model used for data generation.
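As an illustration, a minimal masking routine along these lines might look as follows; the 30% mask ratio and Poisson(λ=3) span lengths follow the description above, while the function name, the plain [MASK] string and other details are assumptions of this sketch:

```python
import numpy as np

MASK = "[MASK]"  # placeholder; a T5-style generator would use sentinel tokens instead

def mask_text_spans(tokens, mask_ratio=0.3, poisson_lambda=3.0, seed=None):
    """Randomly replace text spans with a single [MASK] token.

    Span lengths are drawn from Poisson(lambda=3) and spans are sampled until
    roughly `mask_ratio` of the tokens are covered. A length-0 span corresponds
    to inserting a [MASK] between tokens. Returns the noised token list and the
    original spans, which later become the rewriting targets.
    """
    rng = np.random.default_rng(seed)
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)
    covered = [False] * len(tokens)
    spans, used = [], 0

    while used < budget:
        length = int(rng.poisson(poisson_lambda))
        start = int(rng.integers(0, len(tokens)))
        end = min(start + length, len(tokens))
        # reject spans that overlap an already-masked position
        if any(covered[start:end]) or (length == 0 and covered[start]):
            continue
        for i in range(start, end):
            covered[i] = True
        spans.append((start, end))
        used += max(end - start, 1)  # count length-0 insertions so the loop terminates

    noised, originals = [], []
    cursor = 0
    for start, end in sorted(spans):
        noised.extend(tokens[cursor:start])
        noised.append(MASK)                  # each span collapses to a single [MASK]
        originals.append(tokens[start:end])  # may be empty for an insertion span
        cursor = end
    noised.extend(tokens[cursor:])
    return noised, originals
```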
(2) Text infilling
The second step is to use a text-to-text transformer pre-trained with the text infilling objective to recover the masked text spans. Specifically, we feed the noised sentences into the text infilling model, which generates the predicted text spans autoregressively. To increase the diversity of rewriting patterns, we use nucleus sampling, which cuts off the unreliable tail of the probability distribution and samples from the dynamic nucleus of tokens that contains most of the probability mass.
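For the infilling step, a minimal sketch with the Hugging Face transformers library could look like the following; the sentinel-token input format is T5's, the generator matches the T5-large model mentioned earlier, and the decoding hyperparameters (top_p, length) are illustrative assumptions:

```python
# Sketch of the infilling step with a pre-trained T5 generator and nucleus sampling.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
generator = T5ForConditionalGeneration.from_pretrained("t5-large")

# Noised input: each masked span is represented by one sentinel token.
noised = "Thank you <extra_id_0> me to your party <extra_id_1> week."
inputs = tokenizer(noised, return_tensors="pt")

# Nucleus (top-p) sampling keeps the generated spans diverse and imperfect.
outputs = generator.generate(
    **inputs,
    do_sample=True,      # sample instead of greedy/beam decoding
    top_p=0.9,           # keep the smallest token set covering 90% of the probability mass
    max_new_tokens=32,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
# e.g. "<extra_id_0> for invite <extra_id_1> next ..." (sampled, possibly imperfect spans)
```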
(3) Sequence span rewriting
Using the data generated in the previous steps, we then train the model to rewrite the machine-generated text spans back into the original text. Specifically, we use special tokens <s_i> and </s_i> to mark the start and end of the i-th text span to be rewritten in the source sequence, and use <s_i> to separate the corresponding original text spans in the target sequence. The model is trained to generate the target text spans from left to right, autoregressively, via maximum likelihood estimation. The whole pre-training process is shown in Figure 2.
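One possible way to assemble an SSR training pair from the outputs of the previous steps is sketched below; the helper name and treating <s_i>/</s_i> as plain marker strings are assumptions of this sketch:

```python
def build_ssr_example(noised_tokens, generated_spans, original_spans, mask_token="[MASK]"):
    """Build one (source, target) pair for sequence span rewriting.

    Each [MASK] in the noised sequence is replaced by the generator's imperfect
    span wrapped in <s_i> ... </s_i>; the target is the original spans joined by <s_i>.
    """
    source, idx = [], 0
    for tok in noised_tokens:
        if tok == mask_token:
            source.append(f"<s_{idx}>")
            source.extend(generated_spans[idx])  # machine-generated (imperfect) span
            source.append(f"</s_{idx}>")
            idx += 1
        else:
            source.append(tok)

    target = []
    for i, span in enumerate(original_spans):
        target.append(f"<s_{i}>")
        target.extend(span)                      # ground-truth span to recover
    return " ".join(source), " ".join(target)

# Example:
#   noised_tokens   = ["Thank", "you", "[MASK]", "me", "to", "your", "party", "[MASK]", "week", "."]
#   generated_spans = [["for", "invite"], ["next"]]
#   original_spans  = [["for", "inviting"], ["last"]]
# -> source: "Thank you <s_0> for invite </s_0> me to your party <s_1> next </s_1> week ."
# -> target: "<s_0> for inviting <s_1> last"
```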
  (4) Pre-training through rewriting
Here we discuss several key advantages of the proposed sequence span rewriting objective over the conventional text infilling objective used for pre-training text-to-text transformers.
First, sequence span rewriting is closer to downstream sequence transduction tasks, because the source sequence contains a reference for generating the target text span, which reduces the gap between pre-training and fine-tuning. Many important NLG tasks, such as machine translation, text summarization, grammatical error correction, and text style transfer, can be regarded as text rewriting problems: the input text is rewritten into another language, a compressed text, a corrected sentence, or another style.
Second, sequence span rewriting introduces more diverse noise patterns, including paraphrased and simplified text spans, missing or redundant information, grammatical errors, and errors related to world knowledge or commonsense. In contrast, conventional self-supervised pre-training usually relies on rule-based noising functions, such as text span masking, token masking/deletion, sentence permutation, etc.
Finally, the sequence span rewriting objective enables the model to learn from the errors made by the infilling model used for data generation (which is the model itself, if the rewriting model is initialized from the infilling model), thereby providing more informative self-supervision.
  (5) Distillation by rewriting
Self-supervised sequence span rewriting also provides a new perspective on improving small models with the knowledge of large-scale pre-trained text-to-text models. This can be done by using a large teacher model pre-trained with the text infilling objective to generate the infilled data, and then pre-training a small student model on the generated data with the sequence span rewriting objective. Unlike conventional knowledge distillation or sequence-level knowledge distillation, sequence span rewriting lets the student model exploit both the teacher's outputs and the ground truth. In a sense, the student only needs to learn what the teacher model fails to predict, which also relates this approach to boosting and residual learning. In addition, the sequence span rewriting objective alleviates the multi-modality problem described in Zhou et al. (2020a). Providing a source-side reference also reduces the difficulty of the pre-training task, because the model can refer to the machine-generated text spans during generation instead of generating unconditionally as in text infilling.
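A minimal sketch of this teacher-student setup, assuming the Hugging Face transformers library; the model names, sampling settings, optimizer and the single-example training step are illustrative, not the paper's exact configuration:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# t5-small and t5-large share the same vocabulary, so one tokenizer suffices.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
teacher = T5ForConditionalGeneration.from_pretrained("t5-large").eval()  # infilling teacher
student = T5ForConditionalGeneration.from_pretrained("t5-small")         # SSR student
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

@torch.no_grad()
def teacher_infill(noised_text: str) -> str:
    """The frozen teacher fills masked spans with nucleus sampling (imperfect by design)."""
    batch = tokenizer(noised_text, return_tensors="pt")
    out = teacher.generate(**batch, do_sample=True, top_p=0.9, max_new_tokens=32)
    return tokenizer.decode(out[0], skip_special_tokens=False)

def student_ssr_step(ssr_source: str, ssr_target: str) -> float:
    """One maximum-likelihood step: rewrite the teacher's imperfect spans into the originals."""
    batch = tokenizer(ssr_source, return_tensors="pt")
    labels = tokenizer(ssr_target, return_tensors="pt").input_ids
    loss = student(**batch, labels=labels).loss  # cross-entropy over the target spans
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```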
