MASS: a universal pre-training method for sequence-to-sequence language generation

Since 2018, pre-training has undoubtedly been one of the hottest research topics in natural language processing. Using general-purpose language models such as BERT, GPT, and XLNet, researchers in the field have made many significant breakthroughs in natural language understanding. However, these mainstream pre-training methods have not brought significant improvements to natural language generation tasks. To address this, Microsoft Research Asia proposed a new universal pre-training method, MASS, which achieves better results than BERT and GPT on these tasks.

BERT and XLNet have achieved great success in natural language understanding tasks such as sentiment classification, natural language inference, and reading comprehension. However, beyond natural language understanding, the NLP field also has many sequence-to-sequence language generation tasks, such as neural machine translation, text summarization, dialogue generation, question answering, and text style transfer. For these tasks, the encoder-attention-decoder framework is the mainstream approach.
Figure 1: The encoder-attention-decoder framework
As shown in Figure 1, the encoder takes the source sequence X as input and transforms it into a sequence of hidden representations; the decoder then extracts information from these hidden representations through the attention mechanism and generates the target text sequence Y autoregressively.
BERT and XLNet generally pre-train an encoder for natural language understanding, while GPT pre-trains a decoder as a language model. When applying BERT or GPT to sequence-to-sequence language generation tasks, we usually have to pre-train the encoder and the decoder separately. In that case, the encoder-attention-decoder framework and the attention mechanism are never trained jointly. Yet the attention mechanism is crucial for this kind of task; without it being jointly trained, BERT and GPT cannot reach their best performance.
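To make the framework in Figure 1 concrete, here is a minimal sketch of an encoder-attention-decoder model, assuming PyTorch; the class and parameter names are illustrative and not the authors' implementation.

```python
# Minimal encoder-attention-decoder sketch (illustrative only, not the
# authors' code). The encoder maps the source sequence x to hidden
# representations; the decoder attends to them and generates y.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Causal mask so each decoder position only sees previous target tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids),
                                  tgt_mask=tgt_mask)
        return self.out(hidden)  # logits over the vocabulary per target position

model = Seq2Seq(vocab_size=10000)
src = torch.randint(0, 10000, (2, 7))   # batch of source token ids
tgt = torch.randint(0, 10000, (2, 5))   # shifted target token ids
logits = model(src, tgt)                # shape: (2, 5, 10000)
```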
A new pre-training method
For sequence-to-sequence natural language generation tasks, the machine learning group at Microsoft Research Asia proposed a new pre-training method: masked sequence to sequence pre-training (MASS). MASS randomly masks a contiguous sentence fragment of length k and predicts the masked fragment with the encoder-attention-decoder framework.
Figure 2: The MASS framework
As shown in Figure 2, the third to sixth tokens on the encoder side are masked, while on the decoder side only the masked tokens are predicted and all other tokens are masked out of the decoder input (a simplified sketch of this masking scheme follows the list below). MASS pre-training has the following advantages:
The tokens on the decoder side that are not masked on the encoder side are masked in the decoder input, which pushes the decoder to extract more information from the encoder side when predicting the contiguous sentence fragment, and thus promotes joint training of the encoder-attention-decoder structure;
In order to provide more useful information to the decoder, the encoder is forced to extract the meaning of the unmasked tokens, which improves the encoder's ability to understand the source text sequence;
The decoder is trained to predict consecutive tokens (a sentence fragment), which improves the decoder's language modeling ability.
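The following is a simplified sketch of how MASS-style training examples could be constructed, assuming Python; the function mass_example and the [M] symbol are illustrative names, not the official code, and the decoder input is collapsed to the masked span for brevity.

```python
# Illustrative sketch of MASS-style input construction (not the official code).
# A contiguous span of length k is masked on the encoder side; the decoder
# predicts that span, with the other positions masked out of its input.
import random

MASK = "[M]"

def mass_example(tokens, k):
    m = len(tokens)
    u = random.randint(0, m - k)          # start of the masked fragment
    v = u + k                             # end of the fragment (exclusive)
    enc_input = tokens[:u] + [MASK] * k + tokens[v:]   # fragment hidden from encoder
    dec_target = tokens[u:v]                           # decoder predicts the fragment
    dec_input = [MASK] + dec_target[:-1]               # shifted right, rest masked
    return enc_input, dec_input, dec_target

tokens = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
enc_in, dec_in, dec_out = mass_example(tokens, k=4)
print(enc_in)   # e.g. ['x1', 'x2', '[M]', '[M]', '[M]', '[M]', 'x7', 'x8']
print(dec_in)   # e.g. ['[M]', 'x3', 'x4', 'x5']
print(dec_out)  # e.g. ['x3', 'x4', 'x5', 'x6']
```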
A unified pre-training framework
An important hyperparameter of MASS is k, the length of the masked fragment. By adjusting the value of k, MASS can subsume both the masked language modeling of BERT and the standard language modeling of GPT, extending MASS into a unified pre-training framework.
When k = 1, by MASS's design a single token on the encoder side is masked and the decoder predicts that masked token, as shown in Figure 3. The decoder receives no input information, so MASS is equivalent to the masked language model in BERT.
Figure 3: When k = 1, one token on the encoder side is masked and the decoder predicts that masked token
When k = m (where m is the sequence length), all tokens on the encoder side are masked in MASS and the decoder predicts all tokens, as shown in Figure 4. The decoder cannot extract any information from the encoder, so MASS is equivalent to the standard language model in GPT.
Figure 4: When k = m, all words on the encoder side are masked and the decoder predicts all tokens, which is equivalent to the standard language model in GPT
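Using the illustrative mass_example helper from the sketch above, the two extremes look like this (again only a sketch of the idea, not the authors' code):

```python
tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]

# k = 1: one encoder token is masked and the decoder predicts only that token
# from a fully masked context -- the BERT-style masked LM case.
print(mass_example(tokens, k=1))

# k = m: every encoder token is masked and the decoder predicts the whole
# sequence from left to right -- the GPT-style standard LM case.
print(mass_example(tokens, k=len(tokens)))
```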
Table 1 shows the probability formulas of MASS under different k values, where m is the length of the sequence, u and v are the start and end positions of the masked fragment, and x^{u:v} denotes the fragment of tokens from position u to position v that is masked. As can be seen, when k = 1 or k = m, the probability formula of MASS corresponds to the masked language model of BERT and the standard language model of GPT, respectively.
Table 1: Probability formulas of MASS under different k values
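For reference, the pre-training objective behind these formulas, reconstructed here from the original MASS paper (Song et al., 2019) using the notation above, with $x^{\setminus u:v}$ denoting the sentence with positions $u$ to $v$ masked out, can be written as:

$$
L(\theta; \mathcal{X}) = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \log P\left(x^{u:v} \mid x^{\setminus u:v}; \theta\right)
= \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \log \prod_{t=u}^{v} P\left(x^{u:v}_{t} \mid x^{u:v}_{<t}, x^{\setminus u:v}; \theta\right)
$$

At the two extremes this reduces to $P(x^{u} \mid x^{\setminus u}; \theta)$ for $k = 1$ (the masked LM of BERT) and $P(x^{1:m} \mid x^{\setminus 1:m}; \theta)$ for $k = m$ (the standard LM of GPT), matching the table.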
The researchers experimentally analyzed the performance of MASS under different values of k, as shown in Figure 5:
Figure 5: Performance of MASS during pre-training and fine-tuning under different masked lengths k, including (a) the PPL of the pre-trained model on English sentences from WMT newstest2013, (b) the PPL on French sentences, (c) the BLEU score of unsupervised English-French translation on WMT newstest2013, (d) the ROUGE score of text summarization, and (e) the PPL of dialogue generation
When k is about half the sentence length, the downstream tasks achieve their best performance. Masking half of the words in a sentence strikes a good balance between pre-training the encoder and pre-training the decoder. If pre-training is biased toward the encoder side (k = 1, i.e., BERT) or toward the decoder side (k = m, i.e., the LM in GPT), optimal performance cannot be achieved, which again demonstrates the advantage of MASS on sequence-to-sequence language generation tasks.
Experiments on sequence-to-sequence language generation tasks
Pre-training
It is worth noting that MASS requires only unsupervised monolingual data for pre-training (for example, WMT News Crawl data or Wikipedia data). MASS supports both cross-lingual tasks (such as machine translation) and monolingual tasks (such as text summarization and dialogue generation). When pre-training for a cross-lingual task such as English-French translation, the researchers pre-train English-English and French-French in a single model, using an additional language embedding vector to distinguish the languages. The researchers then fine-tuned MASS on four tasks, unsupervised machine translation, low-resource machine translation, text summarization, and dialogue generation, to verify its effectiveness.
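One way such a language embedding might be combined with the token embedding is sketched below; this is only an assumption-laden illustration (the class InputEmbedding and its layout are hypothetical, not the official MASS code).

```python
# Rough sketch: distinguishing languages with an extra language embedding
# added to the token embedding (illustrative, not the official MASS code).
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size, num_langs, d_model=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.lang_embed = nn.Embedding(num_langs, d_model)  # e.g. 0 = English, 1 = French

    def forward(self, token_ids, lang_id):
        # Every position in the sequence receives the same language embedding.
        lang_ids = torch.full_like(token_ids, lang_id)
        return self.token_embed(token_ids) + self.lang_embed(lang_ids)

embed = InputEmbedding(vocab_size=10000, num_langs=2)
en_batch = torch.randint(0, 10000, (2, 7))
vectors = embed(en_batch, lang_id=0)   # English sentences, shape (2, 7, 512)
```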
Unsupervised Machine Translation
On the unsupervised machine translation task, the researchers compared MASS with previous methods, including Facebook's state-of-the-art XLM. XLM pre-trains the encoder with BERT-style masked language modeling and the decoder with a standard language model.
The results in Table 2 show that MASS outperforms XLM in all six translation directions of WMT14 English-French, WMT16 English-German, and WMT16 English-Romanian, achieving new state-of-the-art results.
Table 2: Comparison between MASS and previous unsupervised machine translation methods; English-French results are reported on newstest2014 and the others on newstest2016. Since XLM uses different combinations of MLM and CLM on the encoder and decoder, the table reports the highest BLEU value of XLM for each language pair.
Low-resource machine translation
Low-resource machine translation refers to machine translation with limited bilingual training data. The researchers simulated low-resource scenarios for WMT14 English-French, WMT16 English-German, and WMT16 English-Romanian translation (with 10K, 100K, and 1M bilingual sentence pairs, respectively).
Figure 6: Comparison between MASS and the baseline method on low-resource machine translation
Figure 6 shows that MASS outperforms the baseline model without pre-training, to varying degrees, at every data scale, and the improvement is more significant when less supervised data is available.
Text summarization
The researchers compared MASS with BERT+LM (an encoder pre-trained with BERT and a decoder pre-trained with a standard language model) and with DAE (denoising auto-encoder). As can be seen from Table 3, MASS outperforms both BERT+LM and DAE.
Table 3: Comparison between MASS and the two other pre-training methods on the text summarization task
Dialogue generation
The researchers compared MASS with BERT+LM. Table 4 shows that MASS achieves a lower PPL than BERT+LM.
Table 4: Comparison between MASS and BERT+LM on dialogue generation
MASS has achieved significant gains on sequence-to-sequence language generation tasks. The researchers said they plan to test the performance of MASS on natural language understanding tasks in the future, and hope to extend its applications to other sequence-to-sequence generation tasks involving speech and video.
