Interpretation of two highly cited NLP papers | BERT model, SQuAD dataset

This article is an interpretation of two papers that have been highly cited in the field of natural language processing (NLP) in recent years.

 

1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Google AI)

Published in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies


Research problem

This paper introduces a new language representation model, BERT (Bidirectional Encoder Representations from Transformers), which pre-trains deep bidirectional representations by jointly conditioning on left and right context in unlabeled text. With just one additional output layer, the pre-trained model can be fine-tuned to obtain strong results on a wide variety of language tasks, without extensive modifications to task-specific architectures.

Research methods

The framework involves two steps: pre-training and fine-tuning. In the pre-training stage, the model is trained on unlabeled data over several pre-training tasks. In the fine-tuning stage, the BERT model is first initialized with the pre-trained parameters, and all parameters are then fine-tuned using labeled data from the downstream task.

BERT is a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017). Its input representation is the sum of three parts: the token embedding, the segment embedding of the sentence the token belongs to, and the position embedding of the token. This is shown in the figure below, where [CLS] is a special classification symbol added in front of every input and [SEP] is a special token that separates sentences.
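
As a rough illustration of this input scheme (a hedged sketch, not the authors' code; the token ids, vocabulary size, and hidden size below are made-up placeholders), the three embeddings are simply summed element-wise before entering the Transformer encoder:

    # Sketch of BERT's input representation: the element-wise sum of token,
    # segment, and position embeddings. All ids and sizes here are made up.
    import torch
    import torch.nn as nn

    vocab_size, max_len, hidden = 30522, 512, 768
    token_emb = nn.Embedding(vocab_size, hidden)    # one vector per WordPiece token
    segment_emb = nn.Embedding(2, hidden)           # sentence A vs. sentence B
    position_emb = nn.Embedding(max_len, hidden)    # learned absolute positions

    # Fake ids for "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]"
    token_ids = torch.tensor([[101, 2026, 3899, 2003, 10140, 102, 2002, 7777, 2652, 2075, 102]])
    segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])
    position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

    embeddings = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
    print(embeddings.shape)  # torch.Size([1, 11, 768])

In the actual model, layer normalization and dropout are applied to this sum before it reaches the first encoder layer.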

The article proposes two unsupervised tasks to pre-train BERT: Masked Language Model (MLM) and Next Sentence Prediction (NSP). MLM masks some of the tokens in the input and trains the model to predict the masked tokens; in the experimental setting, about 15% of the tokens are masked at random. This training method has a drawback, however: the masked tokens never appear during fine-tuning, so the pre-training phase can be inconsistent with the fine-tuning phase. The selected tokens are therefore handled in three ways: 80% are replaced with [MASK], 10% are replaced with a random token, and the remaining 10% are left unchanged. The NSP task strengthens the model's ability to understand the relationship between sentences: for each sentence pair A and B chosen during training, B is the actual next sentence of A with 50% probability, and a random sentence from the corpus otherwise. The pre-training corpus uses text passages from BooksCorpus and English Wikipedia.

[Figure: BERT input representation]
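
A minimal sketch of the 80%/10%/10% masking rule described above (an illustrative re-implementation, not the authors' code; the [MASK] id and vocabulary size are placeholders):

    # Apply the MLM corruption rule to a list of WordPiece token ids.
    import random

    MASK_ID = 103        # placeholder id for the [MASK] token
    VOCAB_SIZE = 30522   # placeholder vocabulary size

    def mask_tokens(token_ids, mask_prob=0.15):
        """Return (corrupted inputs, labels); label -100 marks unmasked positions."""
        inputs, labels = list(token_ids), [-100] * len(token_ids)
        for i, tok in enumerate(token_ids):
            if random.random() < mask_prob:      # select ~15% of the positions
                labels[i] = tok                  # the model must predict the original token
                r = random.random()
                if r < 0.8:                      # 80%: replace with [MASK]
                    inputs[i] = MASK_ID
                elif r < 0.9:                    # 10%: replace with a random token
                    inputs[i] = random.randrange(VOCAB_SIZE)
                # remaining 10%: keep the original token unchanged
        return inputs, labels

    print(mask_tokens([2026, 3899, 2003, 10140, 2001, 5870, 1996, 2154]))

Keeping 10% of the selected tokens unchanged and replacing 10% with random tokens is what reduces the mismatch between pre-training (where [MASK] appears) and fine-tuning (where it never does).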

Model fine-tuning was evaluated on 11 natural language processing tasks, including 8 tasks from the General Language Understanding Evaluation (GLUE) benchmark, the SQuAD 1.1 and SQuAD 2.0 reading comprehension datasets, and the Situations With Adversarial Generations (SWAG) dataset. BERT outperforms the baseline methods across these tasks. The following table shows the comparison results on GLUE.

[Table: GLUE benchmark results]
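
To make the fine-tuning step concrete, here is a minimal sketch using the Hugging Face transformers library; the library postdates the paper but follows the same recipe of adding one task-specific output layer on top of the pre-trained encoder, and the sentences, labels, and learning rate below are illustrative only:

    # Fine-tune a pre-trained BERT with a single added classification head.
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    batch = tokenizer(["a great movie", "a terrible movie"],
                      padding=True, return_tensors="pt")
    labels = torch.tensor([1, 0])

    # All pre-trained parameters plus the new output layer are updated end to end.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    print(float(loss))

The same pattern applies to the other GLUE tasks; only the number of labels and the input formatting change.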

Analysis and conclusions

The BERT model proposed in the article achieved state-of-the-art results on 11 natural language processing tasks. The improvements brought by language model transfer learning show that rich, unsupervised pre-training is an integral part of many language understanding systems; in particular, even low-resource tasks can benefit from deep unidirectional architectures. The main contribution of the article is to further generalize these findings to deep bidirectional architectures, allowing the same pre-trained model to be successfully applied to a broad range of NLP tasks.

 

2. Know What You Don’t Know: Unanswerable Questions for SQuAD

Authors: Pranav Rajpurkar, Robin Jia, Percy Liang (Stanford University)

Published in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics


Research problem

Reading comprehension systems (models) can usually find the correct answer to a question in the context document, but they are far less reliable on questions whose answer is not stated in the context. Existing datasets either focus exclusively on answerable questions, or use automatically generated unanswerable questions that are easy to recognize. To address these shortcomings, the article introduces the latest version of the Stanford Question Answering Dataset (SQuAD), SQuAD 2.0, which combines the answerable questions of the existing SQuAD with over 50,000 unanswerable questions written by crowdworkers to look similar to answerable ones. To perform well on SQuAD 2.0, a system must not only answer questions when possible, but also determine when the paragraph does not support any answer and abstain from answering. The SQuAD 2.0 dataset is thus a challenge to existing models on natural language understanding tasks.

Research content

Dataset: Crowdworkers are hired on the Daemo platform to write unanswerable questions. Each task consists of an entire article from SQuAD 1.1. For every paragraph in the article, workers pose up to five questions that cannot be answered from the paragraph alone, while still referring to entities in the paragraph and ensuring that a plausible (but incorrect) answer exists in it. Workers are also shown the SQuAD 1.1 questions for each paragraph as a reference, so that the unanswerable questions resemble the answerable ones.

The article evaluates three existing model architectures on both datasets, extending them so that they not only learn the answer distribution but also predict the probability that a question is unanswerable; when the predicted no-answer probability exceeds a threshold, the model abstains from answering (a minimal sketch of this abstention rule follows the table below). The following table shows the performance of the three models on SQuAD 1.1 and SQuAD 2.0. The results show:

  • The best-performing model (DocQA + ELMo) still trails human performance by 23.2 F1 points on SQuAD 2.0, which means the models have substantial room for improvement;
  • With the same model architectures on both datasets, the gap between the best model and human F1 is considerably larger on SQuAD 2.0 than on SQuAD 1.1, indicating that SQuAD 2.0 is a more difficult dataset for existing models to learn.

[Table: model performance (F1) on SQuAD 1.1 and SQuAD 2.0]
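
A minimal sketch of the abstention rule mentioned above (the function name and threshold value are hypothetical, not taken from the paper):

    # Answer only when the predicted no-answer probability is below a threshold.
    def predict(span_answer, no_answer_prob, threshold=0.5):
        """Return the best span, or "" (no answer) when the model abstains."""
        if no_answer_prob > threshold:
            return ""        # the paragraph does not support an answer
        return span_answer

    print(predict("in 1876", no_answer_prob=0.12))  # -> "in 1876"
    print(predict("in 1876", no_answer_prob=0.87))  # -> ""  (abstain)

In practice the threshold is tuned on development data so that the model trades off answering and abstaining as well as possible.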

To demonstrate how difficult the questions in SQuAD 2.0 are, the article also generates unanswerable questions on the SQuAD 1.1 data automatically, using TFIDF-based and rule-based methods, and evaluates the same models for comparison. The results (shown in the table below) indicate that the best model still scores lowest on SQuAD 2.0, which again shows that SQuAD 2.0 poses a difficult challenge for existing language understanding models.

[Table: model performance on automatically generated unanswerable questions vs. SQuAD 2.0]
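
For reference, the F1 values compared in the two tables are token-overlap scores; the following is a simplified sketch of this metric (following the commonly described evaluation logic rather than the official script, and omitting its answer-normalization steps):

    # Token-level F1 between a predicted answer and a gold answer.
    from collections import Counter

    def f1_score(prediction, gold):
        pred_tokens, gold_tokens = prediction.split(), gold.split()
        # Unanswerable questions: credit only if both sides predict "no answer".
        if not pred_tokens or not gold_tokens:
            return float(pred_tokens == gold_tokens)
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    print(f1_score("in 1876", "1876"))  # 0.67: partial token overlap
    print(f1_score("", ""))             # 1.0: correctly abstained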

Research results

The article shows that SQuAD 2.0 is a challenging, diverse, and large-scale dataset that forces models to learn when a question cannot be answered from the given context. There is good reason to believe that SQuAD 2.0 will drive the development of new reading comprehension models that know what they don't know, and therefore understand language at a deeper level.

 



Origin: blog.csdn.net/AMiner2006/article/details/103458461