Summary of large model training data sets

GLUE

Introduction

Most existing NLU models are designed for specific tasks; a general model that performs well across diverse tasks has not yet been realized. To address this, the authors proposed GLUE, an evaluation platform intended to promote the development of general-purpose NLU systems.

Task

The GLUE benchmark contains nine English sentence-understanding tasks covering a wide range of domains and data sizes:
  • CoLA: Grammatical acceptability, judging whether a sentence is acceptable English
  • SST-2: Sentiment analysis, determining the sentiment polarity (positive/negative) of movie-review sentences
  • MRPC: Paraphrase detection, judging whether two sentences are semantically equivalent
  • STS-B: Semantic similarity, scoring how similar two sentences are in meaning on a 0–5 scale
  • QQP: Paraphrase detection, judging whether Quora question pairs are semantically equivalent
  • MNLI: Natural language inference, determining whether a hypothesis is entailed by, contradicts, or is neutral with respect to a premise
  • QNLI: The SQuAD question-answering task recast as a natural language inference task (does a sentence contain the answer to a question?)
  • RTE: Textual entailment, determining whether one text fragment can be inferred from another
  • WNLI: The Winograd Schema Challenge recast as a natural language inference task

Data set size

The approximate training-set size of each task is as follows (see the loading sketch after this list):
  • CoLA: 8,500
  • SST-2: 67,000
  • MRPC: 3,700
  • STS-B: 7,000
  • QQP: 364,000
  • MNLI: 393,000
  • QNLI: 108,000
  • RTE: 2,500
  • WNLI: 634
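
A quick way to check these sizes is to load each task programmatically. The sketch below is a minimal example assuming the Hugging Face `datasets` package, which is not part of the GLUE release itself; there, the task config names are lowercase (`cola`, `sst2`, `mrpc`, `stsb`, `qqp`, `mnli`, `qnli`, `rte`, `wnli`).

```python
# Minimal sketch: print the training-set size of each GLUE task.
# Assumes the Hugging Face `datasets` package (pip install datasets);
# an illustration, not part of the official GLUE distribution.
from datasets import load_dataset

TASKS = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte", "wnli"]

for task in TASKS:
    ds = load_dataset("glue", task)  # downloads and caches the task
    print(f"{task}: {len(ds['train'])} training examples")
```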


SQuAD

Introduction

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset of more than 100,000 questions posed by crowdworkers on Wikipedia articles. The answer to each question is a segment of text from the corresponding reading passage.

The construction of SQuAD proceeded in three stages: (1) curating articles; (2) collecting question-answer pairs on these articles through crowdsourcing; (3) collecting additional answers. First, the authors obtained the top 10,000 English Wikipedia articles ranked by Project Nayuki's Wikipedia internal PageRanks, then sampled 536 articles from them at random. Individual paragraphs were extracted from these 536 articles, with images, figures, and tables removed.

Wikipedia articles were chosen as the corpus because they cover a wide range of topics, from musical celebrities to abstract concepts, and because Wikipedia's internal PageRanks help surface high-quality articles. Crowdsourcing the questions and answers allows the dataset to be expanded quickly and increases its diversity.

Task

Extractive reading comprehension: given a passage and a question, the model must identify the answer span within the passage.

Data set size

The dataset contains in total:
  • 536 Wikipedia articles
  • 23,215 paragraphs
  • 100,000+ question-answer pairs

The articles are partitioned randomly: 80% into the training set, 10% into the development set, and 10% into the test set.

An overview of the resulting split sizes is as follows (a counting sketch appears after this list):

  • Training set: 429 articles, 18,572 paragraphs, approximately 80,000 questions
  • Development set: 53 articles, 2,321 paragraphs, approximately 10,000 questions
  • Test set: 54 articles, 2,322 paragraphs, approximately 10,000 questions
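
The JSON layout of the released files makes these counts easy to reproduce. Below is a minimal sketch assuming a local copy of the v1.1 training file from the download links below; the field names (`data`, `paragraphs`, `qas`, `answers`) follow the published SQuAD JSON schema.

```python
# Minimal sketch: count articles, paragraphs, and questions in a SQuAD JSON file.
# Assumes train-v1.1.json has already been downloaded locally (see links below).
import json

with open("train-v1.1.json", encoding="utf-8") as f:
    squad = json.load(f)

articles = squad["data"]
paragraphs = [p for article in articles for p in article["paragraphs"]]
questions = [qa for p in paragraphs for qa in p["qas"]]

print(f"articles:   {len(articles)}")
print(f"paragraphs: {len(paragraphs)}")
print(f"questions:  {len(questions)}")

# Each answer is a span of the passage, given as text plus a character offset:
example = questions[0]["answers"][0]
print(example["text"], example["answer_start"])
```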

Download links

https://data.deepai.org/squad1.1.zip

https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json

XSUM

Introduction

The XSum dataset consists of 226,711 Wayback-archived BBC articles (URL list: XSum-WebArxiveUrls.txt), spanning nearly a decade (2010 to 2017) and covering a wide range of domains, such as news, politics, sports, weather, business, technology, science, health, family, education, entertainment, and the arts.
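
For readers who want to inspect the data without rebuilding it from the archived URLs, a mirrored copy can be loaded through the Hugging Face `datasets` package. This is a minimal sketch under that assumption, not part of the official XSum pipeline, which reconstructs the corpus from the Wayback URLs (see link below).

```python
# Minimal sketch: load a mirrored copy of XSum and inspect one example.
# Assumes the Hugging Face `datasets` package; the official pipeline
# instead rebuilds the corpus from the archived BBC URLs.
from datasets import load_dataset

xsum = load_dataset("EdinburghNLP/xsum")
print({split: len(xsum[split]) for split in xsum})  # train/validation/test sizes

sample = xsum["train"][0]
print(sample["document"][:200])  # start of the BBC article body
print(sample["summary"])         # its one-sentence summary
```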

Download link

https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset

Continuously updating...
