[Large-model catch-up class] The cornerstone datasets of contemporary AI

The Internet is AI's greatest piece of foreshadowing

AI has exploded in popularity recently. As an NLP engineer who has fallen a bit behind, I feel I have a lot of catching up to do. It just so happens that Meta published a new large-model paper [1] yesterday, and after skimming it I found it makes an excellent entry point for catching up.

Today's installment covers the pre-training datasets, which are the most important part. It is no exaggeration to say that data is the cornerstone of contemporary AI. The data used by GPT-3 was never made public, so what Meta describes in this paper is probably the most complete account we have for an open-source model. The datasets they used are listed in the table below; let's go through them one by one.

| Dataset | Sampling prop. | Epochs | Disk size |
| --- | --- | --- | --- |
| CommonCrawl | 67.0% | 1.10 | 3.3 TB |
| C4 | 15.0% | 1.06 | 783 GB |
| Github | 4.5% | 0.64 | 328 GB |
| Wikipedia | 4.5% | 2.45 | 83 GB |
| Books | 4.5% | 2.23 | 85 GB |
| ArXiv | 2.5% | 1.06 | 92 GB |
| StackExchange | 2.0% | 1.03 | 78 GB |

CommonCrawl

This is the largest dataset; its website is https://commoncrawl.org/ [2]. I think it is a genuinely great project: it has been crawling Internet pages for years, and the content covers some 40 languages.

Screenshot of the CommonCrawl website

According to the latest numbers on their blog, the February 2023 crawl contains 400 TiB of data (the extracted plain text alone is more than 9 TB) and more than three billion web pages.

The crawl archive for January/February 2023 is now available! The data was crawled January 26 – February 9 and contains 3.15 billion web pages or 400 TiB of uncompressed content. Page captures are from 40 million hosts or 33 million registered domains and include 1.3 billion new URLs, not visited in any of our prior crawls.

In LLaMA, by contrast, the CommonCrawl portion is only a little over 3 TB, roughly a third of that plain-text total. You can see that the post-processing of the data involves a substantial amount of work.

We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline (Wenzek et al.,2020). This process deduplicates the data at the line level, performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an ngram language model.
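To make those three steps concrete, here is a minimal sketch of a CCNet-style filter in Python. This is not the real CCNet code; it assumes you have fastText's public lid.176.bin language-ID model and some pre-trained KenLM n-gram model on hand, and the perplexity threshold is invented purely for illustration.

```python
# Hypothetical sketch of a CCNet-style filter: line-level dedup,
# fastText language ID, and n-gram LM quality filtering.
import hashlib
import fasttext   # pip install fasttext
import kenlm      # pip install kenlm

lid = fasttext.load_model("lid.176.bin")   # public fastText language-ID model
lm = kenlm.Model("en.arpa.bin")            # assumed pre-trained English n-gram LM

seen_hashes = set()                        # global line-level deduplication state

def keep_page(text: str, ppl_threshold: float = 1000.0):
    """Return cleaned page text, or None if the page should be dropped."""
    # 1) line-level deduplication across the whole dump
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        h = hashlib.sha1(stripped.encode("utf-8")).digest()
        if stripped and h not in seen_hashes:
            seen_hashes.add(h)
            lines.append(stripped)
    if not lines:
        return None
    flat = " ".join(lines)

    # 2) language identification: keep English pages only
    labels, _ = lid.predict(flat)
    if labels[0] != "__label__en":
        return None

    # 3) quality filter: drop pages the n-gram LM finds too surprising
    ppl = 10 ** (-lm.score(flat) / max(1, len(flat.split()) + 1))
    if ppl > ppl_threshold:
        return None
    return "\n".join(lines)
```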

C4

The second-largest dataset by sampling proportion is C4, whose full name is the Colossal Clean Crawled Corpus (four C's, hence "C4"). It is itself derived from CommonCrawl through additional post-processing.

According to the C4 page on the TensorFlow Datasets site [3], processing the CommonCrawl data into C4 with 500 workers takes about 16 hours:

The C4 dataset we created for unsupervised pre-training is available in TensorFlow Datasets, but it requires a significant amount of bandwidth for downloading the raw Common Crawl scrapes (~7 TB) and compute for its preparation (~335 CPU-days). We suggest you take advantage of the Apache Beam support in TFDS, which enables distributed preprocessing of the dataset and can be run on Google Cloud Dataflow. With 500 workers, the job should complete in ~16 hours.
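If you just want to read C4 rather than rebuild it with Beam, a lighter option is to stream the already-processed corpus from the Hugging Face hub. This is only a sketch and assumes the allenai/c4 mirror is reachable from your environment:

```python
# Stream the processed C4 corpus instead of re-running the Beam pipeline.
from datasets import load_dataset   # pip install datasets

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(c4):
    print(example["url"])
    print(example["text"][:200])
    if i >= 2:                      # just peek at a few documents
        break
```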

As for why this subset of CommonCrawl is added on top of CommonCrawl itself, we will come back to that at the end.

Github

The third-largest share is the GitHub dataset, which was almost never used in the pre-trained language models of a few years ago, such as BERT and GPT. I seem to recall the claim that adding code data greatly helps a language model's logical reasoning ability. I plan to spend some dedicated time studying this part later.

Wikipedia

Wikipedia data has long been a standard corpus for pre-trained language models thanks to its high quality and broad coverage; people have loved using it for years, and together with the Books datasets it is basically part of the default recipe. An interesting number here: the entire Wikipedia dump is under 100 GB, even smaller than the GitHub code above, and yet it carries a very large share of human knowledge.

Comparison of the data used by different pre-trained models, from the DeBERTa paper. Compared with the large models of 2023, every one of them uses an order of magnitude less data.

Books

The Books dataset in the paper refers specifically to Books3. It has no particularly formal homepage; its description appears in a GitHub issue [4]. According to its author, it contains roughly 200,000 books.

books3.tar.gz (37GB), aka "all of bibliotik in plain .txt form", aka 197,000 books processed in exactly the same way as I did for bookcorpus here. So basically 11x bigger.
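Since the archive is described as plain .txt files packed into a tarball, reading it might look roughly like the sketch below. The internal directory layout is my assumption, not something documented in the issue:

```python
# Hypothetical reader for a tarball of plain-text books such as books3.tar.gz.
# The internal layout is assumed; only the ".txt" suffix check matters here.
import tarfile

def iter_books(path: str = "books3.tar.gz"):
    """Yield (member_name, text) pairs for every .txt file in the archive."""
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".txt"):
                f = tar.extractfile(member)
                if f is not None:
                    yield member.name, f.read().decode("utf-8", errors="replace")

# Example: count books without extracting anything to disk.
# n_books = sum(1 for _ in iter_books())
```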

This dataset, too, is the result of a community effort:

This is possible thanks to two organizations. First and foremost, thank you to the-eye.eu. They have a wonderful community (see discord), and they are extremely interested in archiving data for the benefit of humanity. Secondly, thank you to "The Pile", which is the project that has been meticulously gathering and preparing this training data. Join their discord if you're interested in ML: https://www.eleuther.ai/get-involved

Somewhat ironically, OpenAI, despite the "Open" in its name, has never released the Books2 dataset used in its papers.

books3.tar.gz seems to be similar to OpenAI's mysterious "books2" dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect it's "all of libgen", but it's purely conjecture. Nonetheless, books3 is "all of bibliotik", which is possibly useful to anyone doing NLP work.

ArXiv

This dataset also seems to have become a popular addition to pre-training corpora only in recent years. Academic papers are tightly argued, and I suspect this is closely related to the recent improvements in models' reasoning ability.

StackExchange

Readers, especially fellow programmers, are probably more familiar with StackOverflow; StackExchange can be thought of as its superset, containing high-quality Q&A across all sorts of domains, by no means limited to computing. For LLaMA's training data, Meta kept only a subset of the sites.

We kept the data from the 28 largest websites, removed the HTML tags from text and sorted the answers by score (from highest to lowest).
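Here is a minimal sketch of those two operations (strip HTML, sort answers by score) on a single made-up question thread; the dict structure and field names are illustrative, not the schema of the real StackExchange dumps:

```python
# Hypothetical post-processing of one StackExchange question thread:
# strip HTML tags from each answer and sort answers by score, highest first.
import re

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML tag stripper, good enough for a sketch

def clean(html: str) -> str:
    return TAG_RE.sub("", html).strip()

def process_thread(question: dict) -> dict:
    answers = sorted(question["answers"], key=lambda a: a["score"], reverse=True)
    return {
        "title": question["title"],
        "question": clean(question["body"]),
        "answers": [clean(a["body"]) for a in answers],
    }

example = {
    "title": "How do I reverse a list in Python?",
    "body": "<p>Is there a built-in way?</p>",
    "answers": [
        {"score": 3, "body": "<p>Use <code>reversed(xs)</code>.</p>"},
        {"score": 10, "body": "<p>Use <code>xs[::-1]</code>.</p>"},
    ],
}
print(process_thread(example))
```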

Summary

My feeling is that, apart from Books, CommonCrawl should already contain all of the other datasets. By adding them separately at training time, isn't Meta in effect re-weighting the data so that high-quality web content shows up more often? The paper gives one reason at the C4 entry: adding differently pre-processed versions of the data helps the model.

During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance.
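To see what such a re-weighting looks like in practice, here is a minimal sketch that samples training sources with the weights from the "Sampling prop." column of the table above. It is an illustration of mixture sampling, not Meta's actual training loader:

```python
# Minimal sketch: draw training documents from several sources with the
# mixture weights from the LLaMA table ("Sampling prop." column).
import random

MIXTURE = {
    "CommonCrawl":   0.670,
    "C4":            0.150,
    "Github":        0.045,
    "Wikipedia":     0.045,
    "Books":         0.045,
    "ArXiv":         0.025,
    "StackExchange": 0.020,
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source with probability proportional to its mixture weight."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)   # roughly 6700 CommonCrawl, 1500 C4, ...
```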

Seen from this angle, training a high-quality foundation model really does involve getting a great many details right; it is not something you can pull off just by piling up data and compute.

Also, on Weibo today I saw someone sneer that now that Meta has open-sourced its model, domestic "independent innovation" is just around the corner. But it is not hard to see that the proportion of Chinese in this corpus must be very low. The biggest chunk, CommonCrawl, keeps English pages only; the Wikipedia slice keeps only 20 languages, all written in Latin or Cyrillic scripts; and ArXiv and StackExchange contain almost no Chinese to begin with. In other words, Chinese can only show up at any real scale in the Books and GitHub portions. So this model's Chinese ability is unlikely to be much good, and that blogger comes across as being contrarian for its own sake.

Are domestic models about to get their communism?

The GPT-3 repo from three years ago [5] includes a per-language breakdown of the training data: at the document level, Chinese accounts for only 0.11631%. From this angle, parents should really insist their kids learn English well; even if artificial intelligence truly arrives one day, the best version of it will speak English.
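If you want to check that number yourself, the sketch below downloads the CSV and prints whichever rows mention Chinese. I have not verified the file's exact column layout, so the code deliberately avoids assuming column names:

```python
# Hypothetical check of the per-language document share in the GPT-3 repo's CSV.
# The file's exact column layout is an assumption; we just print any row that
# mentions Chinese so you can eyeball the percentage yourself.
import csv
import urllib.request

URL = ("https://raw.githubusercontent.com/openai/gpt-3/master/"
       "dataset_statistics/languages_by_document_count.csv")

with urllib.request.urlopen(URL) as resp:
    rows = list(csv.reader(resp.read().decode("utf-8").splitlines()))

print(rows[0])                                   # header row, whatever it is
for row in rows:
    if any("chinese" in cell.lower() for cell in row):
        print(row)                               # the Chinese entry (~0.116%)
```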

References

[1] LLaMA paper: https://ai.facebook.com/blog/large-language-model-llama-meta-ai/

[2] CommonCrawl: https://commoncrawl.org/

[3] C4 in the TensorFlow Datasets catalog: https://www.tensorflow.org/datasets/catalog/c4

[4] Books3: https://github.com/soskek/bookcorpus/issues/27#issuecomment-716104208

[5] GPT-3 training data by language: https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_document_count.csv
