ChatGPT Speed Run Manual - Introduction to GPT Training Dataset


All artificial intelligence algorithms involve two phases: training and inference. How well an algorithm performs depends largely on the quality of its training data. OpenAI has not separately disclosed the details of the training data used by ChatGPT. However, since ChatGPT was developed on top of the earlier GPT models, we can analyze it indirectly by looking at the training dataset of GPT-3.

Dr. Alan D. Thompson, a well-known figure in the field of artificial intelligence, published an article introducing the datasets currently in common use for large language models. Based on the token counts disclosed in the OpenAI paper, the size of the training dataset used by GPT-3 is estimated at 753.4GB. The specific distribution is as follows (the component sizes are tallied in the sketch after the list):

  • Wikipedia: 11.4GB. Wikipedia is the world's leading free, multilingual, online encyclopedia, with more than 300,000 volunteers contributing content. The English edition is what is generally used for training; it contains 6.62 million articles and more than 4.2 billion words. By topic, biographies account for 27.8%, geography 17.7%, culture and art 15.8%, history 9.9%, biomedicine 7.8%, sports 6.5%, business 4.8%, and science, engineering, and mathematics 3.5%.
  • Gutenberg books: 21GB. Project Gutenberg, created by e-book inventor Michael Hart, is the world's first free e-book website. The site collects books in many languages, more than 50,000 titles in all, including about 500 books in Chinese, though these are mostly classical works. What is generally used for training is the Standardized Project Gutenberg Corpus (SPGC), a curated selection from the collection. Because it is a live website, we can look directly at the daily top-100 download list: on March 10, 2023, for example, Shakespeare's "Romeo and Juliet" ranked first, and the only Chinese book in the top 100 happened to be Tang Xianzu's "The Peony Pavilion" in 88th place.
  • Bibliotik: 101GB. Bibliotik is one of the largest e-book sites on the Internet; it distributes downloads via P2P and hosts more than 500,000 torrents. To train the GPT-Neo large model in 2021, the EleutherAI lab curated this e-book collection into a dataset (the Books3 component), which accounts for 12.07% of EleutherAI's final Pile dataset.
  • Reddit links: 50GB. Reddit is a popular social media platform. The WebText dataset was built by crawling the web pages behind outbound links from Reddit posts that received at least 3 karma, making it a barometer of popular content.
  • Common Crawl: 570GB. Common Crawl has been crawling the web since 2011 and comprises raw web pages, metadata, and extracted text. It is stored on AWS, totals more than 1PB, and keeps growing at roughly 20TB per month. What is generally used for training is only the filtered C4 (Colossal Clean Crawled Corpus) subset of Common Crawl. Analyses of the data show that apart from Google's patents site, which accounts for a comparatively high 0.48%, the share of any other source website is fairly even, remaining below 0.04%.
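
As a quick sanity check, the component sizes listed above do add up to the 753.4GB total cited earlier. A minimal sketch in plain Python, using only the figures quoted in this section:

```python
# Component sizes (GB) as listed above, per Alan D. Thompson's estimate
components = {
    "Wikipedia": 11.4,
    "Gutenberg books (SPGC)": 21.0,
    "Bibliotik": 101.0,
    "Reddit links (WebText)": 50.0,
    "Common Crawl (C4)": 570.0,
}

total_gb = sum(components.values())
print(f"Total: {total_gb:.1f} GB")  # -> Total: 753.4 GB

# Share of each component in the estimated GPT-3 training set
for name, size in components.items():
    print(f"{name:26s} {size:6.1f} GB  ({size / total_gb:6.2%})")
```

Common Crawl dominates at roughly three quarters of the total, which is why its filtering quality matters so much.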

In openai's own public training data statistics by language ( https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_word_count.csv ), the proportion of English words in the training data set is as high as 92%. In addition, French accounted for 1.81%, German accounted for 1.47%, other languages ​​accounted for less than 1%, and Chinese accounted for 0.1%. But the actual Q&A ability of ChatGPT in various languages ​​is far beyond openai's own expectations. Human languages ​​may communicate to some extent beyond human comprehension.
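
For readers who want to reproduce these percentages, here is a minimal sketch that downloads the CSV linked above and recomputes each language's share of total words. The GitHub blob URL is mapped to its raw-content equivalent, and the exact column names are an assumption, so the script locates them by header keyword rather than hard-coding a schema:

```python
import csv
import io
import urllib.request

# GitHub "blob" URL from the text, mapped to its raw-content equivalent
URL = ("https://raw.githubusercontent.com/openai/gpt-3/"
       "master/dataset_statistics/languages_by_word_count.csv")

with urllib.request.urlopen(URL) as resp:
    rows = list(csv.DictReader(io.StringIO(resp.read().decode("utf-8"))))

# Column names are assumptions: locate them by keyword in the header,
# since the exact schema is not quoted in the text.
headers = list(rows[0].keys())
lang_col = next(h for h in headers if "language" in h.lower())
word_col = next(h for h in headers if "word" in h.lower())

def to_number(s: str) -> float:
    # Tolerate thousands separators or percent signs in the column
    return float(s.replace(",", "").rstrip("%").strip() or 0)

counts = {r[lang_col]: to_number(r[word_col]) for r in rows}
total = sum(counts.values())

# Top 10 languages by share of total word count
for lang, n in sorted(counts.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{lang:15s} {n / total:7.2%}")
```

English should come out on top by a wide margin, consistent with the roughly 92% figure quoted above.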

There are also reports putting the size of GPT-3's training corpus as high as 45TB. The gap between the two figures is large; most likely, 45TB refers to the raw size of the sources before filtering. In fact, the GPT-3 paper notes that the unfiltered Common Crawl data alone amounted to about 45TB of compressed plaintext, which was filtered down to 570GB.

To what extent can these datasets represent the entire Internet? The site www.worldwidewebsize.com has long tracked the total number of web pages indexed by search engines such as Google and Bing; at present the total stands at about 5.85 billion pages. Another long-running measurement tracks the HTML size of web pages, putting the average page at about 1.2MB. Multiplying the two gives an estimated 7,000TB of raw HTML for the whole indexed web. After stripping HTML tags and roughly discarding similar long-tail content according to the 80/20 rule, we can loosely assume the text of the entire Internet amounts to something on the order of 1,000TB. But directly training a conversational AI on that 1,000TB may not be the best approach: the incident years ago in which Microsoft's XiaoIce chatbot "learned" to swear is clear proof of that.
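
The arithmetic behind those figures, as a minimal sketch. The two reduction factors at the end are assumptions chosen purely for illustration, since the source only states the rough 7,000TB and 1,000TB endpoints:

```python
# Back-of-the-envelope estimate of the text on the indexed web,
# using the figures cited in the text.
indexed_pages = 5.85e9   # pages indexed by Google/Bing (worldwidewebsize.com)
avg_page_mb = 1.2        # average HTML size per page, in MB

raw_html_tb = indexed_pages * avg_page_mb / 1e6   # MB -> TB
print(f"Raw HTML of the indexed web: ~{raw_html_tb:,.0f} TB")   # ~7,020 TB

# Assumed reduction factors (illustrative only): strip HTML markup,
# then keep the "head" 20% of content per the 80/20 rule applied to
# near-duplicate long-tail pages.
text_fraction_after_stripping = 0.7
head_fraction_kept = 0.2

usable_text_tb = raw_html_tb * text_fraction_after_stripping * head_fraction_kept
print(f"Rough usable text estimate: ~{usable_text_tb:,.0f} TB")  # ~983 TB, on the order of 1,000 TB
```

Even this order-of-magnitude estimate is more than a thousand times larger than the 753.4GB actually used, which underlines how aggressively the training data was selected.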

In addition, because ChatGPT's chain-of-thought ability requires deliberate training of logical reasoning, the training data likely also includes code datasets from GitHub, programming Q&A from StackExchange, and the like.

We can see that ChatGPT's current training data comes almost entirely from the English-language Internet, with little coverage of Chinese Internet data. This is also an opportunity for China's Internet giants. However, the Chinese Internet genuinely lacks open, standardized datasets of this magnitude, and in some cases even lacks the corresponding formats: there are almost no Chinese social media platforms like Reddit or Hacker News that are centered on outbound links and Q&A-style comments. Almost all existing Chinese corpora come from major universities and research institutions, such as the BCC corpus of Beijing Language and Culture University, OpenSLR resources from Tsinghua University, the CCL corpus of Peking University, the NEPD of Nanjing Agricultural University, and WuDaoCorpora from the Beijing Academy of Artificial Intelligence (BAAI). When Fudan University released its MOSS conversational AI, it acknowledged using the standard corpora of the English-language Internet world, without any special Chinese data.

It is difficult for research institutions to maintain a continuously updated dataset over the long term, so progress here depends on the efforts of Chinese Internet companies themselves: for example, Baidu Baike and Zhihu providing curated content, JD.com and Dangdang distributing free e-books, CNKI publishing journals and magazines for free, WeChat Moments opening up outbound links, and Weibo integrating its trending lists and comments. Exploration at the regulatory level is also under consideration. Yao Qian, director of the Science and Technology Supervision Bureau of the China Securities Regulatory Commission, recently published a signed article, "Custody and Governance of ChatGPT Large Model Training Data," in issue 6 (2023) of China Finance. On the supply of high-quality data, he writes that "self-reliance and openness must be considered as a whole. It may be considered to establish filtered domestic mirror sites for specific data sources such as Wikipedia and Reddit for use by domestic data processors."
