Collection and organization of data sets for large models

The emergence of GPT has made large models popular. Large models are closely tied to artificial intelligence, machine learning, and deep learning, and can be applied to image recognition, language processing, prediction, and more. However, they require huge amounts of data for training and optimization, so collecting and organizing data is an essential part of large-model training.

How to collect data sets

1. Data collection method

Determine appropriate data sources: based on the model to be trained, select suitable public data sets published on open platforms;

Obtain data sets from partners;

Collect data through self-developed tools;

Purchase data sets from specialized agencies.

Here is a list of some publicly available dataset websites:

ImageNet: a large-scale image data set built by computer scientists at Stanford in the United States to simulate the human visual recognition system.

MS COCO: an image data set released by the Microsoft team. It collects a large number of everyday-scene images containing common objects and provides pixel-level instance annotations so that detection and segmentation algorithms can be evaluated more accurately; it is dedicated to advancing research on scene understanding.

Google Open Images: a data set released by the Google team. It contains 16M bounding boxes for 600 object categories on 1.9M images, making it the largest existing data set annotated with object locations.

GitHub (github.com): although primarily a software development platform, it hosts many data scientists and researchers who share data sets.

Government websites: the national meteorological, agricultural, statistics, and seismological bureaus, among others.

Data Lake (www.data-lake.org): a platform that brings together a variety of public data sets covering multiple subject areas.

Quandl (www.quandl.com): an open platform for financial and economic data, providing a rich set of financial-market and macroeconomic data.

Amazon Open Data (registry.opendata.aws): a registry of public data sets hosted on Amazon Web Services, covering a wide range of domains.


2. Related technologies

Web crawlers are the most common method: the needed data sets can be collected quickly with a crawler. Be careful, however, not to crawl data that is not public or that cannot be used commercially.
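As an illustration, here is a minimal crawler sketch in Python using the requests and BeautifulSoup libraries. The URL and the choice of <h2> elements are placeholders, and a real crawler should also honor robots.txt and the site's terms of use.

```python
# Minimal crawler sketch (hypothetical URL and element selector; check the
# target site's robots.txt and terms of use before collecting anything).
import time

import requests
from bs4 import BeautifulSoup

def fetch_titles(url: str) -> list[str]:
    """Download one page and extract the text of every <h2> element."""
    resp = requests.get(url, headers={"User-Agent": "dataset-collector/0.1"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

if __name__ == "__main__":
    pages = ["https://example.com/articles?page=1",
             "https://example.com/articles?page=2"]
    for page in pages:
        for title in fetch_titles(page):
            print(title)
        time.sleep(1)  # throttle requests so the site is not overloaded
```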

For in-house enterprise R&D, the enterprise itself usually has a database, and the corresponding data can be extracted by querying it and combined with the associated sensor readings to form a data set.
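As a sketch of that idea, assume (hypothetically) a SQLite database with a sensor_readings table; the relevant rows can then be exported to a training file like this:

```python
# Sketch: export rows from an internal database to a CSV training file.
# The database path, table, and column names are hypothetical.
import csv
import sqlite3

conn = sqlite3.connect("enterprise.db")
cursor = conn.execute(
    "SELECT device_id, timestamp, temperature, humidity "
    "FROM sensor_readings WHERE timestamp >= ?",
    ("2023-01-01",),
)

with open("sensor_dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["device_id", "timestamp", "temperature", "humidity"])
    writer.writerows(cursor)  # cursor yields one tuple per matching row

conn.close()
```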


3. Data organization method

After a data set is obtained, the data needs to be organized and cleaned to improve its quality and, in turn, the effectiveness of model training.

First, the data must be preprocessed: cleaned, deduplicated, denoised, and standardized. This means removing unnecessary data, repairing missing values and errors in the data set, handling abnormal data and noise, and converting the data into unified formats and units. Doing so ensures data quality, avoids interference with the model, and improves training efficiency.
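A minimal preprocessing sketch with pandas might look like the following; the file and column names are made up for illustration:

```python
# Preprocessing sketch: deduplicate, repair missing values, drop outliers,
# and standardize scale. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("raw_dataset.csv")

df = df.drop_duplicates()                                                # deduplication
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())  # repair missing values
df = df[df["temperature"].between(-40, 60)]                              # drop implausible readings (denoising)

# Standardize to zero mean and unit variance so features share one scale.
df["temperature"] = (df["temperature"] - df["temperature"].mean()) / df["temperature"].std()

df.to_csv("clean_dataset.csv", index=False)
```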

During data organization, labels and annotations often need to be added so that the model can better learn from and understand the data. Manual annotation is the most common method; in addition, automatic annotation technology uses machine-learning algorithms to add labels to the data, as in the sketch below.
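One common form of automatic annotation is pseudo-labeling, where a model trained on a small hand-labeled subset predicts labels for the rest; the random arrays below merely stand in for real features and labels:

```python
# Pseudo-labeling sketch: train on a small hand-labeled set, then let the
# model annotate the unlabeled remainder. Arrays are illustrative stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(100, 8))      # stand-in for hand-labeled features
y_labeled = rng.integers(0, 2, size=100)   # stand-in for manual labels
X_unlabeled = rng.normal(size=(1000, 8))   # features still needing labels

model = LogisticRegression().fit(X_labeled, y_labeled)

proba = model.predict_proba(X_unlabeled)
confident = proba.max(axis=1) >= 0.9       # keep only high-confidence predictions
pseudo_labels = proba.argmax(axis=1)[confident]
print(f"auto-annotated {confident.sum()} of {len(X_unlabeled)} samples")
```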

To facilitate model training and evaluation, the data set also needs to be split. A common scheme divides it into three subsets: a training set, a validation set, and a test set, with cross-validation used to evaluate model performance. Stratified sampling ensures that every category is representatively present in all three subsets and avoids data bias; a split sketch follows.
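For example, with scikit-learn the split can be done in two calls, using stratify to keep class proportions consistent across the subsets; the 80/10/10 ratio is a common choice, not one mandated above:

```python
# Split sketch: 80% train / 10% validation / 10% test with stratified
# sampling so every class keeps the same proportion in each subset.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 16)             # stand-in features
y = np.random.randint(0, 3, size=1000)   # stand-in class labels

X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```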


Ensuring the quality of data sets

The quality of a data set is crucial to the training result. In practice, therefore, the data set must be evaluated to ensure the accuracy, consistency, and completeness of the data. In addition, like a database, a data set needs to be updated and maintained so that it remains timely and so that its different versions can be recorded.
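One lightweight way to operationalize such checks is a small quality report plus a content hash that identifies each version of the data set; the metrics and file name below are illustrative:

```python
# Quality-report sketch: measure completeness and duplication, and record a
# content hash so different versions of the data set can be told apart.
import hashlib

import pandas as pd

df = pd.read_csv("clean_dataset.csv")

report = {
    "rows": len(df),
    "duplicate_ratio": df.duplicated().mean(),   # consistency check
    "missing_ratio": df.isna().mean().mean(),    # completeness check
    "version_hash": hashlib.md5(
        pd.util.hash_pandas_object(df, index=False).values.tobytes()
    ).hexdigest(),                               # identifies this exact version
}
print(report)
```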

If you want to collect more data, start by sharing the data sets you own; in return, you gain access to what others share. Forming a shared data platform in this way is mutually beneficial.


Summary

During data collection for large models we inevitably run into problems: not knowing where to find data sets, how to determine which data can be used, how to manage the data, and so on. By purposefully selecting data sources, cooperating with professional organizations, and continuously optimizing data collection, we can complete large-model training step by step. In the future, much of our work will rely on large models trained on such data sets, and the related research will become more and more mature.
