AI models: data collection and cleaning

To train an AI model, sufficient data must be collected and prepared. The data should cover a variety of situations and scenarios so that the system behaves accurately under all circumstances, and the original sources should be authentic and representative of the system's expected usage. Data is sampled and processed according to specific needs and can come from a variety of sources, such as public datasets, third-party data providers, internal datasets, and simulated datasets.

The training data for many large models can be broadly divided into two categories. The first is general text data, which includes web pages, books, online messages, and online conversations. This type of data is widely used because it is easy to acquire and available at large scale, and it tends to improve the generalization ability of large models. The second is specialized text data, mainly multilingual data, scientific data, and code. This type of data improves a large model's capabilities on specialized tasks.

When preparing data, attention should also be paid to data quality, such as its accuracy, completeness, and consistency. In addition, privacy and security issues must be considered. If the data contains sensitive information, such as users' personally identifiable information, desensitization measures should be taken to protect the security and privacy of the data. Data collection and preparation is thus one of the most important steps in building and testing an AI system, and it requires adequate planning to ensure the accuracy and comprehensiveness of the result.
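As a minimal sketch of the desensitization step mentioned above, the snippet below masks email addresses and phone-number-like strings with regular expressions before text enters a corpus. The patterns and placeholder labels here are illustrative assumptions; real pipelines use far more thorough PII detectors.

```python
import re

# Hypothetical desensitization pass: the patterns below are simple
# illustrations, not production-grade PII detection.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-\s]?\d{3,4}[-\s]?\d{4}\b"),
}

def desensitize(text: str) -> str:
    """Replace each detected PII span with a bracketed type label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(desensitize("Contact Alice at alice@example.com or 555-123-4567."))
# → Contact Alice at [EMAIL] or [PHONE].
```

In practice the masking would run once over every document during preprocessing, and a log of how many spans were masked can serve as a sanity check that the rules are firing.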

After data collection is completed, the data usually needs to be cleaned. Cleaning here refers to handling the "bad" content in the data: noise, redundancy, toxic content, and similar problems. This ensures the quality and consistency of the dataset.
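One concrete form of the redundancy cleaning mentioned above is exact-duplicate removal. The sketch below, an assumption about one plausible implementation rather than any standard pipeline, hashes a whitespace- and case-normalized form of each document and keeps only the first occurrence; near-duplicate detection (e.g., MinHash) would be needed for fuzzier matches.

```python
import hashlib

def dedup(docs):
    """Drop exact duplicates, comparing case- and whitespace-normalized text."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.md5(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

corpus = ["Hello world", "hello   world", "Another document"]
print(dedup(corpus))  # the second entry is a normalized duplicate of the first
```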

Regardless of whether the collected dataset consists of general text data or specialized text data, it must undergo a series of data-cleaning steps before it can be used for LLM training. Given an initially collected dataset, the first step is to improve its quality through quality filtering. The conventional approach is to design a set of filtering rules that eliminate low-quality data. Commonly used rules include language-based filtering rules, metric-based filtering rules, and keyword-based filtering rules.
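The three rule families above can be sketched as simple predicates applied to each document. The thresholds, the ASCII-ratio language heuristic, and the keyword blocklist below are illustrative assumptions; real pipelines use trained language identifiers and tuned thresholds.

```python
# Illustrative blocklist; real lists are much longer and domain-specific.
BLOCKLIST = {"lorem ipsum", "click here"}

def looks_english(text: str) -> bool:
    # Language-based rule (crude stand-in for a real language identifier):
    # require the text to be mostly ASCII characters.
    ascii_chars = sum(c.isascii() for c in text)
    return ascii_chars / max(len(text), 1) > 0.9

def passes_metrics(text: str) -> bool:
    # Metric-based rule: minimum length and a plausible mean word length.
    words = text.split()
    if len(words) < 5:                  # too short to be a useful document
        return False
    mean_len = sum(map(len, words)) / len(words)
    return 2 <= mean_len <= 12          # gibberish tends to fall outside

def passes_keywords(text: str) -> bool:
    # Keyword-based rule: reject documents containing blocklisted phrases.
    lower = text.lower()
    return not any(kw in lower for kw in BLOCKLIST)

def keep(text: str) -> bool:
    return looks_english(text) and passes_metrics(text) and passes_keywords(text)

print(keep("The quick brown fox jumps over the lazy dog."))   # True
print(keep("Lorem ipsum dolor sit amet, filler web text."))   # False
```

Chaining predicates this way makes it easy to log which rule rejected each document, which helps when tuning the filters against a held-out sample.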


Origin blog.csdn.net/chenlei_525/article/details/132601028