Li Wei: Data Transformation in the Era of Large Models

Datawhale Insights

Author: Li Wei, Shanghai Artificial Intelligence Lab

Foreword

Today I would like to share some thoughts on the data transformation in the era of large models with those who want to learn more about this field. As the product director of OpenDataLab at the Shanghai Artificial Intelligence Laboratory, I will introduce our work on open data and large model data. I hope you find it helpful.

Development and Research Directions of Large Models

First, let me briefly introduce the development and research directions of large models. A large model is called "large" mainly because of the dramatic growth in its parameter scale. An important research direction in this field is the "scaling law": model performance follows a smooth power-law relationship with the model's parameter count, the amount of training data, and the amount of computation.


According to this law, as model parameters and training data (measured in tokens) grow exponentially and training compute increases, the model's loss on the test set falls along a power law and the model performs better. The study also shows that parameter scale is the main driver of model capability: for a given compute budget and a relatively small parameter scale, increasing the model's parameter count contributes far more than adding data or training steps. This research, conducted by OpenAI in 2020, had a profound impact on how subsequent large models were trained; later models such as GPT-3 also validated these findings.
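
To make the relationship concrete, here is a minimal sketch of the power-law form described above, using the approximate fitted constants reported in OpenAI's 2020 scaling-law paper; the constants and helper functions are illustrative only, not a reproduction of the original study.

```python
# A minimal sketch of the power-law scaling relation described above,
# using the functional form from Kaplan et al. (2020). The constants
# below are the paper's approximate fitted values; treat them as illustrative.

def loss_from_params(n_params: float,
                     n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Test loss predicted from non-embedding parameter count alone."""
    return (n_c / n_params) ** alpha_n

def loss_from_tokens(n_tokens: float,
                     d_c: float = 5.4e13, alpha_d: float = 0.095) -> float:
    """Test loss predicted from the number of training tokens alone."""
    return (d_c / n_tokens) ** alpha_d

if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10, 1e11):          # 100M -> 100B parameters
        print(f"{n:>8.0e} params -> predicted loss ~ {loss_from_params(n):.2f}")
```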

Subsequently, more research institutions joined the exploration of large model scale. For example, DeepMind carried out a more systematic study than OpenAI in 2022. Through quantitative experiments they found that, as parameter count and training data vary, the training loss has an optimal balance point. Measured against that point, hundred-billion-parameter models such as GPT-3 had not reached their theoretical optimum and may only have achieved the performance that a well-trained ten-billion-parameter model could reach in theory.

DeepMind therefore built the Chinchilla model, which has about a quarter of Gopher's parameter count but four times its training data. With fewer parameters but far more training data, Chinchilla outperforms the larger model trained on insufficient data. This confirms that parameter scale and data volume should be expanded in a balanced way.
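
A back-of-the-envelope sketch of this balance, assuming two widely cited rules of thumb from the Chinchilla work (training FLOPs roughly 6 × parameters × tokens, and roughly 20 training tokens per parameter at the optimum); the function and numbers are illustrative, not DeepMind's actual fitting procedure.

```python
# Rough sketch of the Chinchilla-style "compute-optimal" balance.
# Assumes C ~ 6 * N * D FLOPs and ~20 training tokens per parameter
# at the optimum, so the outputs are back-of-the-envelope estimates.

def compute_optimal_split(flops_budget: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly balance a given FLOPs budget."""
    # C ~ 6 * N * D and D ~ k * N  =>  N ~ sqrt(C / (6 * k))
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Roughly the Gopher/Chinchilla training budget (~5.8e23 FLOPs).
    n, d = compute_optimal_split(5.8e23)
    print(f"~{n / 1e9:.0f}B params trained on ~{d / 1e12:.1f}T tokens")
```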

Figure: The number of parameters and the amount of training data in language models such as Chinchilla and Gopher (source: DeepMind)

Indeed, the trend in large model research is to seek the best balance between parameter scale and data volume. In 2023 Meta released LLaMA, a model at the tens-of-billions-parameter scale trained on 4.7 times as much data as GPT-3, and it outperforms GPT-3 on a variety of downstream tasks. During training, Meta experimented with parameter counts from 7 billion to 65 billion and found that downstream performance was still improving as the training data approached or exceeded a trillion tokens.


This shows that, with limited computing resources, tens-of-billions-parameter models still have room for optimization, and increasing the amount of training data can significantly improve model performance. Recently, Stanford built the Alpaca 7B model by fine-tuning the LLaMA 7B model on additional instruction data, providing new ideas for subsequent research.

In general, research on large models is not only about increasing parameter scale; more data and better training methods are equally important. We expect to see more new models and research methods that push the field forward.

Large Model Data Composition

As model parameter counts grow, more and more researchers have begun to study large model data in depth, with pre-trained models as the main object of study. Alan D. Thompson conducted a thorough study of the data used for pre-training, covering well-known large models from 2018 to 2022 such as the GPT series (from GPT-1 to GPT-4) and models such as Gopher, and analyzed the ratio and composition of their training data in detail.


According to this analysis, the main components of large model training data include encyclopedia data (such as Wikipedia), book data, journal data, and social news. The largest share is web data from Common Crawl (CC).

The data compositions of these large models share many similarities, which lays a good foundation for follow-up research. We can also see that, as the field develops, new large models keep emerging and the scale of pre-training data keeps doubling.

Although the researchers behind these models claim that the training data they use is public, most institutions or teams do not disclose the actual data sources, including the number of tokens used by each model, the proportions of different data types, or details of the content. Only the portion of information that is public can serve as a reference for data research.

Based on the above research, we found that the data recipe of the GPT series has kept changing as the models evolve. GPT-1 mainly used book corpora such as BookCorpus; books are an important source of people's everyday written language and their quality is relatively high. GPT-2 mainly used WebText, web pages linked from Reddit, which is relatively formal overall but also contains a large amount of social data, such as the way people communicate in everyday conversation. Then came GPT-3: its pre-training data grew by dozens of times and its data mix became more fine-grained and diversified, including Reddit-linked pages, various books, encyclopedia and Wiki data, and web data such as WebText2 and Common Crawl. The largest part is Common Crawl; after a round of high-quality filtering, essentially all of this web corpus was fed into GPT-3, which is part of why we later saw astonishing results such as ChatGPT.


At the GPT-4 stage, we can see that it added data that GPT-3 did not have, such as conversational data, code from GitHub, and specially added elementary-school and university-level math problems. This is GPT-4's breakthrough beyond GPT-3's already large corpus.

It can be seen that introducing code and math corpora greatly strengthens the model's chain-of-thought ability, bringing a qualitative improvement in reasoning, including answering mathematical word problems. We also noted the Pile dataset, a very well-known pre-training corpus for large models; its overall composition combines more than twenty different types of data.

1315fc0e7253357e9d85a0ea3a0e40e6.png

We have done some research and analysis on these data, which can be subdivided along different levels and dimensions. ChatGPT's language and text capabilities are closely tied to its pre-training data. The pre-training data can be roughly divided into the following types of corpus (a toy sketch of one possible mixture weighting follows the list):

  • The first category is dialogue, including multi-turn or single-turn dialogue between users, in formal or informal question-and-answer form;

  • The second category is community forums; forum text is highly diverse because different people publish different content in different speaking styles, which adds to the diversity of language;

  • The third category is course materials from schools and institutions, which give ChatGPT an in-depth understanding of texts in knowledge-intensive fields;

  • The fourth category is books and encyclopedia data, the most common data type;

  • The fifth category is official documents, a special type of text that is nonetheless part of the large model corpus;

  • The sixth category is papers, another special form of text corpus;

  • The last category is news reports from news and entertainment media, an independent type of corpus that can also serve as input for large language models.
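
As mentioned above, here is a toy sketch of how such corpus categories could be turned into a sampling mixture for pre-training; the category names and weights are invented for illustration and do not reflect the recipe of ChatGPT or any other real model.

```python
# Illustrative only: turning corpus categories into a sampling mixture.
# The weights below are made up and do not reflect any real model's recipe.
import random

rng = random.Random(0)  # fixed seed for reproducibility

CORPUS_MIX = {
    "dialogue":        0.05,
    "forums":          0.10,
    "course_material": 0.05,
    "books_wiki":      0.20,
    "official_docs":   0.03,
    "papers":          0.07,
    "news":            0.10,
    "web_crawl":       0.40,   # filtered Common-Crawl-style web text
}

def sample_source() -> str:
    """Pick the source category of the next training document."""
    names, weights = zip(*CORPUS_MIX.items())
    return rng.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    print([sample_source() for _ in range(10)])
```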

Introduction to OpenDataLab

OpenDataLab is an open data platform dedicated to providing data support for large models from three aspects. First, we provide open data resources for algorithm models: the platform hosts a large amount of data and corpora where users can find what they need. We provide flexible data support and optimize download speed so that users in China can obtain data faster. In addition, we provide a command-line interface for faster access to the relevant open-source datasets.


Datasets

Currently, our platform hosts more than 5,400 public datasets with a total volume of 80 TB. We run compliance checks on the data to ensure that the copyright or license information of every dataset is clear. In addition, we categorize the datasets by annotation type, task type, data type, and applicable application scenario, so that users can find the data they need more easily.


We hope to support the training and fine-tuning of large domestic models. To this end, we have set up several thematic data sections on OpenDataLab. We provide basic corpora for large language model pre-training, and with the search and filter functions in the filter bar, users can query all corpora relevant to large models such as ChatGPT with one click.

At present, we already have more than 1,000 text corpora suitable for large models, including the well-known Pile dataset, which covers high-quality data from 22 different fields, as well as public high-quality web data. The C4 dataset, for example, is a large-scale high-quality dataset derived from Common Crawl and is used as a corpus basis by many GPT-style models. In addition, we have collected evaluation data related to large models; the platform currently contains dozens of language-capability evaluation sets, which users can obtain and download on our platform.
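
For readers who simply want to peek at a web-scale corpus like C4, one option (independent of the OpenDataLab download path) is to stream a mirror from the Hugging Face Hub; the `allenai/c4` mirror name is an assumption that may change over time.

```python
# Stream a few C4 records without downloading the full multi-terabyte corpus.
# "allenai/c4" is the Hugging Face Hub mirror of C4 at the time of writing.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(c4):
    # Each record carries the page text plus its source URL and timestamp.
    print(record["text"][:200].replace("\n", " "), "...")
    if i >= 2:          # just peek at the first few documents
        break
```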

Multimodal pre-training and evaluation data


We are also collecting state-of-the-art multimodal pre-training and evaluation data. These datasets can be used in generative scenarios such as AIGC, including research on multimodal large models for image-text and video-text. Our platform hosts the largest public image-text dataset, LAION-5B, which contains 80 TB of image data and 5.85 billion image-text pairs; the data have been processed by the LAION team and are very suitable for research. We also host the SA-1B dataset, the largest image segmentation dataset, recently released alongside the well-known Segment Anything model. It contains 11 million images and 1.1 billion masks and is well suited to research on large vision and multimodal models, especially in image segmentation.

At the same time, we have collected the most comprehensive set of benchmark data in the multimodal field. Besides pre-training data, we also include fine-tuning data. As we all know, the capabilities of large models such as ChatGPT are largely brought out through instruction fine-tuning.


We have collected the existing public instruction fine-tuning data, including Databricks Dolly 15k, the recently released open data from OpenAssistant, and Firefly's high-quality Chinese instruction data. We have also standardized these instruction datasets so that different instruction sets can be combined for fine-tuning through a one-click DataLoader, which greatly simplifies data acquisition and processing.
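
As a rough sketch of what "standardized instruction data plus a one-click DataLoader" can look like, the snippet below assumes each dataset has already been normalized to JSON Lines records with `instruction`, `input`, and `output` fields; the file names are placeholders, and this is not OpenDataLab's actual loader implementation.

```python
# Sketch: once every instruction dataset shares one schema,
# mixing sources for fine-tuning is just concatenation.
import json
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class InstructionDataset(Dataset):
    def __init__(self, path: str):
        with open(path, encoding="utf-8") as f:
            self.records = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        r = self.records[idx]
        prompt = f"{r['instruction']}\n{r.get('input', '')}".strip()
        return {"prompt": prompt, "response": r["output"]}

# Placeholder paths for normalized copies of the datasets mentioned above.
combined = ConcatDataset([
    InstructionDataset("dolly_15k.jsonl"),
    InstructionDataset("oasst.jsonl"),
    InstructionDataset("firefly_zh.jsonl"),
])
loader = DataLoader(combined, batch_size=8, shuffle=True)
```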

In addition, OpenDataLab provides a series of data-related tools. For large model data collection, we have developed tools during the data acquisition process and made them available to developers to support more flexible ways of obtaining data. For example, for very large datasets such as LAION-5B, we have open-sourced a download tool on GitHub so that users can download all of the original LAION-5B image data in a more flexible, distributed way. For the Segment Anything dataset, users can also obtain the data faster with a single line of code.

Data Collection Tools

At the same time, we are developing data collection tools that can support large models. The platform provides flexible configuration of data labeling and collection workflows and supports real-time human-machine dialogue, online model evaluation, and different tool configurations for model outputs. Our tools also support image-text collection, gathering and reviewing image-text pairs through flexible configuration. For video, we support video capture and video description to meet the data collection and labeling needs of multimodal and generative models.


Smart Labeling Tool

In addition, we have open-sourced LabelU, an intelligent labeling tool that can meet most 2D data labeling needs, including labeling domain-specific data for subsequent fine-tuning scenarios as well as different forms of image or text annotation.

Data Description Language

At the same time, we are working on a common data standard language to support the data requirements of large models. Data remains a major pain point: whether at teams such as OpenAI or DeepMind, a large data team is needed for time-consuming, labor-intensive work such as data collection, processing, and cleaning. In the course of its own data research, OpenDataLab standardizes data, provides it on the open platform in a unified format, and open-sources the processing tools, including data conversion and cleaning, so that users of the platform can prepare data more quickly.


We have also proposed an innovative data description language called the Dataset Description Language (DSDL). It is general enough to describe entire datasets in a unified way, covering datasets from different fields and modalities and making data easier to interconnect. Built on JSON, DSDL decouples annotations from the underlying media files, which is especially valuable in the multimodal field, and its annotation files support lightweight distribution. It is also extensible and can support different types of data. We recently released standardized annotations for nearly a hundred datasets on OpenDataLab.
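
Purely as an illustration of the decoupling idea (this is not the actual DSDL schema), a JSON-based description that keeps lightweight annotations separate from the heavy media files they reference might look roughly like the following.

```python
# Illustrative structure only -- not the real DSDL specification.
# The key idea: the description carries URIs and labels, not the media itself.
import json

illustrative_description = {
    "meta": {"dataset": "example-detection-set", "task": "object_detection"},
    "samples": [
        {
            "media": {"uri": "images/000001.jpg", "type": "image"},  # reference only
            "annotations": [
                {"label": "person", "bbox": [34, 50, 120, 260]},
                {"label": "dog",    "bbox": [200, 180, 310, 300]},
            ],
        },
    ],
}

# Because only references and labels are stored, the description can be
# versioned and distributed cheaply while raw media are fetched on demand.
print(json.dumps(illustrative_description, indent=2))
```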

The platform's standardized datasets can be found through filtered search. We provide DSDL standard packages, and after downloading, users can follow our instructions to use them together with the original raw data. Through a unified DataLoader, different types of data can be combined for large model training. In the past, the corpora for large models came from many sources in different formats; after DSDL standardization, dozens or even hundreds of related corpora can be used for training in a unified way with one click. This also enables multi-task dataset normalization across multiple modalities, supporting faster training and inference of large models.

We also hope that more students will join large model research and jointly advance this field. If you have any questions or topics you would like to discuss further, feel free to ask us at any time.

Source: blog.csdn.net/Datawhale/article/details/130695492