A Summary of Datasets for Machine Reading Comprehension in NLP

1. Cloze-Style

  1. CNN/Daily Mail dataset

From the paper "Teaching machines to read and comprehend." by Hermann et al., 2015

This is a cloze-style English reading comprehension dataset created from CNN and Daily Mail news articles using heuristics. Cloze-style means that a missing word must be inferred. Here, each "question" is created by removing an entity from one of the bullet points that summarize aspects of the article, and all coreferent entities are replaced with anonymized tags @entityn, where n is a distinct index. The model's task is to infer the missing entity in the bullet point from the content of the corresponding article, and models are evaluated by accuracy.
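To make the format concrete, here is a minimal sketch of what an anonymized example might look like (the field names, entity indices, and the scoring helper are illustrative assumptions, not the official file layout):

```python
# A hypothetical CNN/Daily Mail-style cloze example. Coreferent entities in the
# document and the bullet point are replaced by @entityN tags; one entity in the
# bullet point is removed and must be predicted.
example = {
    "document": "@entity0 met @entity1 in @entity2 on Tuesday to discuss trade ...",
    "question": "@placeholder met @entity1 in @entity2 on Tuesday",  # bullet with one entity removed
    "answer": "@entity0",
    "candidates": ["@entity0", "@entity1", "@entity2"],  # entities appearing in the document
}

def is_correct(prediction: str, ex: dict) -> bool:
    # Systems are scored by accuracy: the fraction of questions where the
    # predicted entity matches the removed one.
    return prediction == ex["answer"]
```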

This is one of the classic datasets in the field of machine reading comprehension in NLP; many proposed MRC models use it for validation and comparison.

  2. Children’s Book Test

From the paper "The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations" by Hill et al., 2016.

Each example consists of 21 consecutive sentences from a children’s book: the first 20 sentences serve as the context, and the task is to infer a missing word in the 21st sentence.
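A rough sketch of how such an example could be assembled from 21 raw sentences (the helper function and the choice of which word to blank out are hypothetical, not the official preprocessing script):

```python
from typing import List, Tuple

def make_cbt_example(sentences: List[str], blank_token: str = "XXXXX") -> Tuple[List[str], str, str]:
    """Build one Children's Book Test-style example: the first 20 sentences are
    the context, and one word in the 21st sentence is blanked out and must be inferred."""
    assert len(sentences) == 21
    context = sentences[:20]
    words = sentences[20].split()
    idx = len(words) // 2              # illustrative choice of which word to remove
    answer = words[idx]
    words[idx] = blank_token
    query = " ".join(words)
    return context, query, answer
```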

 

2. Multi-Choice (Multiple-Choice Questions)

  1. MCTest

 Richardson et al. constructed MCTest in 2013, the first comprehensive reading comprehension dataset since the neural network wave began. The dataset contains 660 fictional stories, each with 4 questions, and each question has 4 candidate answers.

Paper: "MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text."

  2. RACE

RACE was collected by Lai et al. in 2017 from English exams for Chinese middle and high school students; it contains over 20,000 passages and about 100,000 questions covering a very wide range of topics. The questions were written by experts to test human reading comprehension, so answering them requires a certain amount of reasoning ability from the machine.

Paper: "RACE: Large-scale ReAding Comprehension Dataset From Examinations."

3. Span-Prediction (Q&A)

  1. SQuAD

An English reading comprehension dataset proposed by Rajpurkar et al. in 2016 in "SQuAD: 100,000+ Questions for Machine Comprehension of Text". SQuAD poses free-form questions rather than multiple-choice ones, so there are no candidate answers to refer to, but each answer is constrained to be a contiguous span of the original text. The dataset is large: crowdworkers created more than 100,000 questions from 536 Wikipedia articles. Each question corresponds to a specific paragraph, and the answer is located within a span of that paragraph. The SQuAD challenge greatly promoted the prosperity of MRC research.

Rajpurkar et al. released SQuAD 2.0, which adds unanswerable questions, in 2018. SQuAD is the most classic English dataset for machine reading comprehension; many excellent papers and SOTA models (such as BERT) report results on it.

The official SQuAD 1.0 and 2.0 data can be obtained from: https://rajpurkar.github.io/SQuAD-explorer/
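As a small sketch of how the span annotation works: in the SQuAD v1.x JSON format, each answer is stored as its text plus a character offset into the paragraph, so the span can be recovered by slicing the context (the file name below is an assumption, using the training file downloadable from the page above):

```python
import json

# Iterate over a SQuAD-style file and recover each answer span from its character offset.
with open("train-v1.1.json", encoding="utf-8") as f:
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for ans in qa["answers"]:
                start = ans["answer_start"]                  # character offset into the paragraph
                span = context[start:start + len(ans["text"])]
                assert span == ans["text"]                   # the answer is a contiguous span of the context
```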

  2. DuReader

DuReader is a Chinese machine reading comprehension dataset released by Baidu at ACL 2018. All questions and passages come from Baidu search engine data and the Baidu Zhidao Q&A community, and the answers are manually written. Experiments are commonly conducted on DuReader's single-document, extraction-style subset: the training set contains 15,763 documents and questions, and the validation set contains 1,628 documents and questions. The goal is to extract a contiguous span from the text as the answer. [Link: https://arxiv.org/pdf/1711.05073.pdf]

 

  3. AI2 Reasoning Challenge (ARC) dataset

This is an English question-answering dataset containing 7,787 genuine grade-school-level multiple-choice science questions. The dataset is split into a Challenge Set and an Easy Set; the Challenge Set contains only questions that are answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. Models are evaluated by accuracy.

The ARC dataset is publicly available at: http://data.allenai.org/arc/
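Since the cloze-style and multiple-choice datasets above are all scored by accuracy, the metric is simply the fraction of questions answered correctly; a minimal sketch:

```python
def accuracy(predictions, gold_answers):
    """Fraction of questions whose predicted choice matches the gold answer."""
    assert len(predictions) == len(gold_answers)
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

# e.g. accuracy(["B", "C", "A"], ["B", "D", "A"]) == 2/3
```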

  4. DRCD

DRCD is a Traditional Chinese reading comprehension dataset released by the Delta Research Institute. The goal is to extract a contiguous span from the text as the answer; in experiments it is typically first converted to Simplified Chinese. [Link: https://github.com/DRCKnowledgeTeam/DRCD ]

  5. TriviaQA

TriviaQA contains more than 650K question-answer-evidence triples. Compared with other datasets, it exhibits considerable syntactic and lexical variability between the question and the corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find the answer.

Paper: "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension."

4. Other Forms

  1. NLPCC2016-DBQA

NLPCC2016-DBQA is an evaluation task organized in 2016 by the International Conference on Natural Language Processing and Chinese Computing (NLPCC). The goal is to select a suitable document from a set of candidates as the answer to a question. [Link: http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf ]

  2. NarrativeQA

 [Kocisky et al., 2018] proposed NarrativeQA, a more difficult dataset designed to increase question difficulty and make answers hard to find. The dataset contains 1,567 complete stories (books and movie scripts). Questions and answers are written by humans, and most of them take more complex forms, such as "when/where/who/why" questions.

5. References

[1] A Survey on Neural Machine Reading Comprehension
 

Continuously updated...
