XLNet, the latest language representation learning method, surpasses BERT on 20 tasks

Original Address: https://blog.csdn.net/qq_31456593/article/details/93015488

Abstract: With the capability of modeling bidirectional contexts, pretraining based on denoising autoencoding, such as BERT, achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects the dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks, including question answering, natural language inference, sentiment analysis, and document ranking.

Unsupervised representation learning has been highly successful in the field of natural language processing [7, 19, 24, 25, 10]. Typically, these methods first pretrain neural networks on large-scale unlabeled text corpora, and then fine-tune the models or representations on downstream tasks. Under this shared high-level idea, the literature has explored different unsupervised pretraining objectives. Among them, autoregressive (AR) language modeling and autoencoding (AE) have been the two most successful pretraining objectives.

AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model [7, 24, 25]. Specifically, given a text sequence x = (x1, ..., xT), AR language modeling factorizes the likelihood into a forward product $p(x)=\prod_{t=1}^{T} p(x_t \mid x_{<t})$ or a backward one $p(x)=\prod_{t=T}^{1} p(x_t \mid x_{>t})$, and a parametric model (e.g., a neural network) is trained to model each conditional distribution. Since an AR language model is only trained to encode a unidirectional context (forward or backward), it is not effective at modeling deep bidirectional contexts. In contrast, downstream language understanding tasks often require bidirectional context information. This leaves a gap between AR language modeling and effective pretraining.
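As a concrete illustration of the forward factorization (a minimal sketch, not code from the paper; `model` is an assumed causal language model that maps a prefix of token ids to next-token logits):

```python
import torch
import torch.nn.functional as F

def forward_ar_log_likelihood(model, x):
    """Toy illustration of log p(x) = sum_t log p(x_t | x_{<t}).

    `model(prefix)` is assumed to return logits of shape (1, len(prefix),
    vocab_size); `x` is a 1-D LongTensor of token ids. For simplicity the
    first token is treated as given (conditioned on an empty prefix).
    """
    log_likelihood = torch.tensor(0.0)
    for t in range(1, len(x)):
        logits = model(x[:t].unsqueeze(0))             # condition on x_{<t} only
        log_probs = F.log_softmax(logits[0, -1], dim=-1)
        log_likelihood = log_likelihood + log_probs[x[t]]
    return log_likelihood
```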

In comparison, AE based pretraining does not perform explicit density estimation, but instead aims to reconstruct the original data from a corrupted input. A notable example is BERT [10], which has been the state-of-the-art pretraining approach. Given the input token sequence, a portion of the tokens is replaced by a special symbol [MASK], and the model is trained to recover the original tokens from the corrupted version. Since density estimation is not part of the objective, BERT is allowed to use bidirectional contexts for reconstruction. As an immediate benefit, this closes the bidirectional information gap of AR language modeling and improves performance. However, the artificial symbols like [MASK] that BERT uses during pretraining are absent from real data at fine-tuning time, resulting in a pretrain-finetune discrepancy. Moreover, since the predicted tokens are masked in the input, BERT cannot model their joint probability with the product rule as in AR language modeling. In other words, BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is an oversimplification, since high-order, long-range dependencies are prevalent in natural language [9].
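A rough sketch of this masked-reconstruction objective (hypothetical interface, not BERT's actual code): all masked positions are predicted in parallel from the same corrupted input, so each prediction depends only on the unmasked tokens and not on the other predicted tokens, which is exactly the independence assumption discussed above.

```python
import torch
import torch.nn.functional as F

MASK_ID = 103  # assumed id of the [MASK] symbol in some vocabulary

def masked_lm_loss(encoder, x, mask_positions):
    """Toy illustration of the AE (BERT-style) objective.

    `encoder(batch)` is assumed to return logits of shape
    (1, seq_len, vocab_size); `x` is a 1-D LongTensor of token ids and
    `mask_positions` a 1-D LongTensor of indices to corrupt and predict.
    """
    corrupted = x.clone()
    corrupted[mask_positions] = MASK_ID            # replace targets with [MASK]
    logits = encoder(corrupted.unsqueeze(0))       # bidirectional context
    log_probs = F.log_softmax(logits[0], dim=-1)
    # Per-position losses are simply summed: no product rule over a
    # factorization order, i.e., the targets are treated as independent.
    return -log_probs[mask_positions, x[mask_positions]].sum()
```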

Facing the pros and cons of existing language pretraining objectives, in this work we propose XLNet, a generalized autoregressive pretraining method that leverages the best of both AR language modeling and AE while avoiding their limitations.

  • First, instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log-likelihood of a sequence with respect to all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both the left and the right. In expectation, each position learns to utilize contextual information from all positions, i.e., it captures bidirectional context (see the sketch after this list).

  • Second, as a generalized AR language model, XLNet does not rely on data corruption. Hence, XLNet does not suffer from the pretrain-finetune discrepancy that BERT is subject to. Meanwhile, the autoregressive objective also provides a natural way to use the product rule to factorize the joint probability of the predicted tokens, eliminating the independence assumption made in BERT.
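To make the permutation objective concrete, here is a minimal sketch (again with an assumed model interface, not the actual XLNet code): a factorization order is sampled at random, and each token is predicted using only the tokens that come earlier in that order, so both left and right neighbors can end up in the context.

```python
import torch
import torch.nn.functional as F

def permutation_lm_loss(model, x):
    """Toy illustration of the permutation language modeling objective.

    `model(context_ids, target_position)` is assumed to return logits over
    the vocabulary for the token at `target_position`, given only the tokens
    in `context_ids` (their original positions are assumed to be encoded by
    the model). `x` is a 1-D LongTensor of token ids.
    """
    T = len(x)
    z = torch.randperm(T)                  # sample a factorization order
    loss = torch.tensor(0.0)
    for t in range(1, T):                  # first token in the order is skipped
        context_positions = z[:t]          # positions earlier in the order
        target_position = z[t].item()
        logits = model(x[context_positions], target_position)
        log_probs = F.log_softmax(logits, dim=-1)
        loss = loss - log_probs[x[target_position]]
    return loss
```

Averaged over many sampled orders, every position ends up being predicted from contexts that include tokens on both of its sides.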

In addition to the new pretraining objective, XLNet also improves the architectural design for pretraining.

  • XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL [9] into pretraining, which empirically improves performance, especially for tasks involving longer text sequences.

  • Naively applying the Transformer(-XL) architecture to permutation-based language modeling does not work, because the factorization order is arbitrary and the target is ambiguous. As a solution, we propose to reparameterize the Transformer(-XL) network to remove the ambiguity (a sketch of the target-aware formulation is given after this list).
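For reference, the paper's reparameterization makes the next-token distribution aware of the target position: the hidden representation $g_\theta$ takes the target position $z_t$ in the sampled order $z$ as an extra input (notation follows the paper; this is a sketch of the formulation only, not of the full attention implementation):

$$
p_\theta\big(X_{z_t} = x \mid \mathbf{x}_{z_{<t}}\big)
  = \frac{\exp\big(e(x)^{\top} g_\theta(\mathbf{x}_{z_{<t}}, z_t)\big)}
         {\sum_{x'} \exp\big(e(x')^{\top} g_\theta(\mathbf{x}_{z_{<t}}, z_t)\big)},
$$

where $e(x)$ denotes the embedding of token $x$.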

Empirically, XLNet achieves state-of-the-art results on 18 tasks, namely 7 GLUE language understanding tasks, 3 reading comprehension tasks including SQuAD and RACE, 7 text classification tasks including Yelp and IMDB, and the ClueWeb09-B document ranking task. Under a set of fair-comparison experiments, XLNet consistently outperforms BERT [10] on multiple benchmarks.

Moreover, XLNet is essentially order-aware through its positional encodings. This matters for language understanding, because an orderless model degenerates into a bag of words and lacks basic expressiveness. The difference stems from a fundamental difference in motivation: previous models aimed to improve density estimation by baking an "orderless" inductive bias into the model, whereas XLNet is motivated by enabling AR language models to learn bidirectional contexts.

XLNet's performance on each type of task:

Reading Comprehension

(results table images)

Text Categorization

(results table image)

GLUE Tasks

(results table image)

Document Ranking

(results table image)
Paper: [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237)

Code: [zihangdai/xlnet](https://github.com/zihangdai/xlnet)
