In-Depth Understanding of Deep Learning - BERT-Derived Models: T5 (Text-to-Text Transfer Transformer)

The full name of T5 is Text-to-Text Transfer Transformer, a general-purpose pre-trained language model proposed by Google. It converts all natural language problems into a text-to-text form and solves them with a single unified model. To obtain a unified, high-quality pre-trained language model, T5 inevitably took the road of "brute force works miracles", using a larger model and more data. However, model and data scale are only one of the means by which T5 pursues the strongest model; its core idea is to unify the inputs and outputs of all natural language processing tasks by prepending a task-declaration prefix to the input text and generating the answer as text. Almost all earlier pre-trained language models required adding a nonlinear layer during downstream fine-tuning to convert the model's output into the format specified by the task. T5 requires no change to the model at all: one only needs to provide fine-tuning data for the downstream task, and the only thing to add is a task-declaration prefix before the input. T5's input and output formats are shown in the figure below. The green part represents the translation task, the red and yellow parts represent the CoLA task and the STS-B task respectively, and the blue part represents the summarization task; the left box shows T5's input samples and the right box shows the corresponding outputs.
Figure: T5 input format and output format
T5 converts all natural language processing tasks into a nearly uniform format: the input is a text sequence prefixed with a task statement, and the output is a text sequence containing the result of the corresponding task. This input/output format is similar to that of GPT-3 under few-shot learning settings. Unlike GPT-3, however, T5 is suitable for all natural language processing tasks, while GPT-3 is limited by its model structure and has a unique advantage only on text generation tasks. Since the T5 paper is very rich in details and comparative experiments, this article selects only the key algorithm and model details for introduction.
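
As an illustration of this unified text-to-text interface, here is a minimal sketch assuming the Hugging Face transformers library and its public "t5-small" checkpoint (both are assumptions for illustration, not the setup used in the original paper); the same generate call serves every task, and only the prefix in the input string changes.

```python
# A minimal sketch of T5's text-to-text interface, assuming the Hugging Face
# `transformers` library and the public "t5-small" checkpoint are available.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Different tasks differ only in the task prefix of the input text.
inputs = [
    "translate English to German: That is good.",
    "cola sentence: The course is jumping well.",
    "stsb sentence1: The rhino grazed. sentence2: A rhino is grazing.",
    "summarize: state authorities dispatched emergency crews tuesday ...",
]

for text in inputs:
    batch = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**batch, max_new_tokens=32)
    # The answer is always plain text: a German sentence, "acceptable" /
    # "unacceptable", a score string such as "3.2", or a summary.
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```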

Algorithm Details

Model Structure

When selecting its structure, T5 considered three model structures, as shown in the figure below: the Encoder-Decoder structure (the traditional Transformer structure), the Decoder-only structure (the GPT structure), and the Prefix LM structure (the UniLM structure).
Figure: the three candidate model structures considered by T5
T5 tested all three structures and found that the Encoder-Decoder Transformer structure works best, so, following the principle that practice yields true knowledge, T5 adopts the traditional Transformer structure. A sketch of the attention patterns that distinguish the three structures is shown below.
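
The three candidates differ mainly in which positions each token may attend to. Below is a minimal NumPy sketch (an illustration, not code from the paper) that builds the three visibility masks: fully visible attention as used in the encoder of the Encoder-Decoder structure, causal attention as in the Decoder-only (GPT) structure, and the Prefix LM (UniLM) mask that is bidirectional over the prefix and causal afterwards.

```python
import numpy as np

def full_mask(n: int) -> np.ndarray:
    """Fully visible: every position attends to every position (encoder-style)."""
    return np.ones((n, n), dtype=bool)

def causal_mask(n: int) -> np.ndarray:
    """Lower-triangular: position i may attend only to positions <= i (GPT-style)."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(n: int, prefix_len: int) -> np.ndarray:
    """UniLM-style prefix mask: the first `prefix_len` tokens are mutually
    visible; the remaining tokens attend causally."""
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: a sequence of 6 tokens whose first 3 tokens form the task prefix.
print(full_mask(6).astype(int))          # encoder of the Encoder-Decoder structure
print(causal_mask(6).astype(int))        # Decoder-only (GPT) structure
print(prefix_lm_mask(6, 3).astype(int))  # Prefix LM (UniLM) structure
```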

Training Data

The text for training T5 comes from the Common Crawl dataset, which is crawled from the Internet (about 20 TB of text data is crawled every month). T5 selected the data from April 2019 and, after cleaning, obtained 750 GB of data that met the training requirements; this was used as the training data and named the C4 dataset (Colossal Clean Crawled Corpus). The specific cleaning rules are as follows (a simplified filtering sketch is given after the list):

  • Keep only sentences that end with normal terminal punctuation such as periods, exclamation marks, question marks, and quotation marks.
  • Delete all pages with fewer than 5 sentences, and delete sentences shorter than 3 words.
  • Delete all pages containing obscene or pornographic words.
  • Delete lines containing the word "Javascript" (common boilerplate on web pages).
  • Delete pages containing the placeholder "lorem ipsum" (often seen in typesetting tests).
  • Delete pages containing the curly braces "{" or "}", which usually indicate programming-language code.
  • For any sentence repeated more than 3 times in a row, keep only one occurrence.
  • Using the language detection tool langdetect, keep only pages detected as English with a confidence above 0.99 (the model is trained for English tasks).
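
For illustration, the sketch below applies a simplified subset of these rules to a single page. It is only an approximation of the C4 pipeline (the de-duplication rule is omitted), and it assumes the langdetect package mentioned above is installed.

```python
from langdetect import detect_langs  # the language detection tool mentioned above

TERMINAL = (".", "!", "?", '"', "'")

def clean_page(page: str) -> str | None:
    """Apply a simplified subset of the C4 cleaning rules to one page.
    Returns the cleaned text, or None if the whole page is discarded."""
    lines = [l.strip() for l in page.splitlines() if l.strip()]
    # Keep sentences ending in terminal punctuation with at least 3 words,
    # and drop lines mentioning Javascript.
    lines = [l for l in lines
             if l.endswith(TERMINAL) and len(l.split()) >= 3
             and "javascript" not in l.lower()]
    if len(lines) < 5:                      # pages with fewer than 5 sentences
        return None
    text = "\n".join(lines)
    if "lorem ipsum" in text.lower():       # typesetting placeholder
        return None
    if "{" in text or "}" in text:          # curly braces: likely source code
        return None
    # Keep only pages detected as English with confidence above 0.99.
    if not any(l.lang == "en" and l.prob > 0.99 for l in detect_langs(text)):
        return None
    return text
```
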
Formatting of Input and Output

The biggest difference between T5 and other models is that it is friendlier to fine-tuning on downstream tasks: no change to the model is needed, and simply rewriting the downstream task's training data is enough to use T5 for that task. Specifically, a task-declaration prefix is added before the input text, and the output is converted into a text representation; for classification tasks the category name can be used directly. For tasks whose output is a continuous value, the numeric labels are quantized into text according to the label distribution of the training data. For example, continuous scores between 1 and 5 can be bucketed at intervals of 0.2, yielding a set of string labels such as "1", "1.2", "4.8", and "5" (a quantization sketch is given after the two examples below). The rewriting of two classic tasks is described next.

  • CoLA (The Corpus of Linguistic Acceptability): judge whether a sentence is grammatically acceptable; a binary classification task in which 0 means unacceptable and 1 means acceptable:

Original input: John made Bill master of himself
Original label: 1
T5 input: cola sentence: John made Bill master of himself.
T5 label: acceptable

  • STS-B (Semantic Textual Similarity Benchmark): a semantic textual similarity task whose output is a continuous value between 1 and 5:

Original input 1: Representatives for Puretunes could not immediately be reached for comment Wednesday.
Original input 2: Puretunes representatives could not be located Thursday to comment on the suit.
Original label (numeric type): 3.25
T5 input: stsb sentence1: Representatives for Puretunes could not immediately be reached for comment Wednesday. sentence2: Puretunes representatives could not be located Thursday to comment on the suit.

T5 label (string type): 3.2
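
A minimal sketch of the quantization used for such continuous labels, assuming the 0.2-wide buckets over the 1-to-5 range described above:

```python
def score_to_label(score: float, lo: float = 1.0, hi: float = 5.0, step: float = 0.2) -> str:
    """Quantize a continuous score into a string label by rounding to the
    nearest 0.2 increment, e.g. 3.25 -> "3.2", 4.79 -> "4.8"."""
    score = min(max(score, lo), hi)                # clamp into the valid range
    bucket = round(round(score / step) * step, 1)
    # Render whole numbers without a trailing ".0" so labels look like "1", "5".
    return str(int(bucket)) if bucket == int(bucket) else str(bucket)

print(score_to_label(3.25))   # "3.2"  (matches the STS-B example above)
print(score_to_label(5.0))    # "5"
```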

Training Process

T5 conducted many comparative experiments to select the most suitable training objective. Specifically, there are the following three types (using "|" to separate T5's input and output):

  • Standard language model: given the first half of a sentence, predict the second half. For example: Thank you for inviting | me to your party last week.
  • BERT style: mask out some words and restore the masked words. For example: Thank you <MASK> <MASK> me to your party <MASK> week | Thank you for inviting me to your party last week.
  • Out-of-order restoration: shuffle the order of the words and restore the correct order. For example: party me for your to. last you inviting week Thank | Thank you for inviting me to your party last week.

Experiments showed that the BERT-style objective works best; it is in fact one of the objectives used by BART (which also uses other noising methods). The BERT-style objective still has many details to settle, such as the mask span length and the replacement strategy, so T5 ran three more sets of experiments to choose the best replacement method. Specifically, there are the following three types (using "|" to separate T5's input and output):

  • Replace individual words with <MASK>, as in BERT. For example: Thank you <MASK> <MASK> me to your party <MASK> week | Thank you for inviting me to your party last week.
  • Replace several consecutive words with a single sentinel, and predict only the replaced spans. For example: Thank you <X> me to your party <Y> week. | <X> for inviting <Y> last <Z>
  • Randomly drop several words and predict only the dropped words. For example: Thank you me to your party week. | for inviting last

Experiments show that replacing several consecutive words together works best. Although the idea is consistent with BART's Text Infilling noising method, it is simpler in form: BART uses the original text as the decoder target, while T5 only needs to predict the masked spans. T5 also ran more detailed experiments with different mask span lengths to find the most suitable one; the results show that an average span length of 3 works best, and that a replacement probability of 15%, following BERT, is optimal. A sketch of this span-corruption objective is given below.
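
The sketch below is an illustrative approximation of this span-corruption objective (not the paper's implementation): roughly 15% of the tokens are masked in spans of average length 3, each masked span is replaced by a single sentinel token, and the target consists of the sentinels followed by the dropped words. The sentinel names <extra_id_0>, <extra_id_1>, ... stand in for the <X>, <Y>, <Z> used in the examples above.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Corrupt roughly `corruption_rate` of the tokens in spans of average
    length `mean_span_len`; return (input_tokens, target_tokens)."""
    rng = random.Random(seed)
    n = len(tokens)
    n_to_mask = max(1, round(n * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = rng.randrange(n)
        length = max(1, round(rng.gauss(mean_span_len, 1)))
        masked.update(range(start, min(n, start + length)))
    sentinels = (f"<extra_id_{i}>" for i in range(100))
    inputs, targets, i = [], [], 0
    while i < n:
        if i in masked:
            s = next(sentinels)           # one sentinel replaces the whole span
            inputs.append(s)
            targets.append(s)
            while i < n and i in masked:  # the target keeps the dropped words
                targets.append(tokens[i])
                i += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(next(sentinels))       # closing sentinel, <Z> in the example above
    return inputs, targets

src = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(src)
print(" ".join(inp))   # e.g. Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(tgt))   # e.g. <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```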

To sum up, T5 draws on the techniques used in pre-trained language models in recent years and performs a large number of comparative experiments. Although T5 does not propose a new model structure or a new training paradigm, it cleverly rewrites inputs and outputs so that the model can be fine-tuned on downstream task datasets without changing its structure. With the help of a large model size and abundant training data, the 11-billion-parameter T5 achieved state-of-the-art results on almost all tasks, once again confirming the rule that "brute force works miracles" in the field of pre-trained language models. From the introduction of the Transformer to the emergence of T5, the wheels of history rolled forward, Transformer-based pre-trained language models returned to their original form, and the position of the Transformer as the feature extractor of choice in natural language processing was firmly established.

