LLM Data Pipelines: Analyzing the Complex Process of Building Training Datasets for Large Language Models

Editor's Note: Building a high-quality training dataset is a critical step in training a large language model, yet very little information is available about the general data pipelines used to construct such datasets.

This article describes a data processing pipeline based on the Common Crawl dataset. It first outlines the differences between Common Crawl's data formats (WARC, WAT, and WET) and the scenarios each is suited for. It then walks through the key stages of the pipeline, including acquiring data from the source, deduplication, language identification, filtering with a language model, and the "is-reference" filtering added in LLaMA. For each stage, it summarizes the different processing schemes and their advantages and disadvantages.

High-quality data ultimately leads to high-quality language models. The data processing pipeline requires many experiments and substantial computing resources, and every decision affects the final result, so each one needs to be evaluated carefully.

The following is the translation. Enjoy!

Author | Christian S. Perone

Translated by | Yue Yang

Erik Desmazieres's "The Library of Babel". 1997.

For many years we have been training language models (LMs), yet information about the general data pipelines used to build their training datasets is extremely scarce and hard to find. Perhaps we simply assume that these datasets must exist (or at least used to exist; they are just becoming harder and harder to reproduce). But we have to account for the many decisions involved in creating such a pipeline, each of which can have an important impact on the quality of the final model, as we discovered while trying to replicate the pipeline described in LLaMA (LLaMA: Open and Efficient Foundation Language Models [1]). Some people argue that the data pipeline has become even more important than the model itself, since current large models scale well and the architecture has not changed much; but in reality, no matter how the models evolve, data will always be critical.

This article briefly describes the pipeline used to create LLaMA's training data. Since many variants of this pipeline exist, I will also cover their details where relevant, such as RefinedWeb (The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only [2]) and The Pile (The Pile: An 800GB Dataset of Diverse Text for Language Modeling [3]).

This article is mainly based on Meta's CCNet (CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data[4]) and the process described in the LLaMA paper. CCNet is designed to work with the largest, but also most challenging data source in terms of quality: Common Crawl [5].

01 Overview: The Big Picture

The entire CCNet pipeline (plus some minor modifications made in the LLaMA paper) is shown below. It includes the following stages: acquiring data from the data source, deduplication, language identification, filtering with a language model, and the "is-reference" filtering added in LLaMA. I will walk through each of these stages in turn.

Overview image of the modified CCNet processing flow in LLaMA

02 Common Crawl

The Common Crawl (CC) dataset is a large-scale crawl of the Internet maintained by the non-profit organization of the same name [5] and made available under permissive terms for anyone to use. Assembling this dataset is no easy task: it requires filtering spam, deciding which URLs to crawl, fetching huge volumes of data from many different servers, and so on. So if you use this dataset, please consider donating [6] to support their work.

Common Crawl provides several dataset formats. Currently there are three main formats (besides the indexes): WARC, WAT, and WET.

WARC/WAT/WET formats

1) WARC format

The WARC format is by far the largest, because it contains the unprocessed raw data from the crawl. It records the HTTP response headers in a clever way, so we can even recover information about the crawled servers. WARC is rarely used in natural language processing (NLP) because of its sheer size and because it contains data not needed for training large language models. However, as one of Common Crawl's primary formats, its content is very rich and could be very useful for building multi-modal datasets, which is why I think WARC and WAT (described below) may see much wider use in the next few years.

2) WAT and WET formats

Datasets in these two formats are secondary data sources in Common Crawl; both are derived from the raw crawl. They are the formats most often used to train language models, and this is where different data pipelines start to diverge. Both contain different types of records, with WAT carrying more metadata than WET, including HTML tag content and links; WET is essentially a plain-text format. [Translator's note: "records" here means the individual data entries stored in a WAT or WET file. "Metadata" is data that describes other data; in this context it refers to additional information about each record, such as where it came from and when it was created.]

If you want to see examples of WARC/WAT/WET records, please refer to this link [7]. For brevity they are omitted here, but the data in these formats is very interesting and worth a look, both to use directly and to understand how to load and parse it.
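If you just want to peek at these files programmatically, here is a minimal sketch (my own illustration, not part of any of the pipelines discussed here) that iterates over the records of a downloaded WET file with the warcio library; the file name is only a placeholder.

# pip install warcio
from warcio.archiveiterator import ArchiveIterator

# Placeholder name: a single WET segment downloaded from Common Crawl.
wet_path = "example-segment.warc.wet.gz"

with open(wet_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        # WET files store the extracted plain text as "conversion" records;
        # in a WARC file the raw HTTP responses are "response" records instead.
        if record.rec_type == "conversion":
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            print(url, text[:200])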

Notably, CCNet (CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data [8]) uses the plain-text WET format. However, other pipelines use WAT instead, on the grounds that extracting high-quality text requires going back to WAT rather than relying on WET (i.e., bypassing Common Crawl's own text extraction). One example that does not use WET files is The Pile (The Pile: An 800GB Dataset of Diverse Text for Language Modeling [9]), which uses jusText [10]; they report that this approach extracts higher-quality text than the WET files provide.

As you can probably tell, we have only just started with CC and there are already multiple options for extracting data from it. Another recent pipeline, RefinedWeb [11] (used for Falcon), also goes straight to WARC, skipping the text-extraction step of the CC pipeline (i.e., the step that generates the WET files). RefinedWeb uses trafilatura [12] rather than jusText for text extraction.
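To give a rough feel for what that extraction step looks like, the sketch below runs trafilatura (used by RefinedWeb) and jusText (used by The Pile) on the raw HTML of a single page. It only illustrates the shape of the two library APIs, not either project's actual pipeline; the URL is a placeholder.

# pip install trafilatura justext
import trafilatura
import justext

# Placeholder page; in the real pipelines the HTML comes from WARC/WAT records.
html = trafilatura.fetch_url("https://example.com/")

# trafilatura: returns the extracted main text as a single string (or None).
text_trafilatura = trafilatura.extract(html)

# jusText: returns paragraphs, each flagged as boilerplate or content.
paragraphs = justext.justext(html, justext.get_stoplist("English"))
text_justext = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)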

03 URL Filtering

Although not mentioned in CCNet, many pipelines perform URL filtering using public block lists of adult, violent, malware, and similar sites. RefinedWeb, for example, uses a block list of 4.6 million domain names, plus word-based filtering of the URLs themselves. At this step you can get creative and aggregate multiple block lists from different sources.
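Neither CCNet nor the LLaMA paper spells out concrete rules here, but a domain- and keyword-based filter of the kind described above might look like the following sketch; the block list and keywords are placeholders I made up for illustration.

from urllib.parse import urlparse

# Placeholder lists; real pipelines aggregate public block lists
# (RefinedWeb reports about 4.6 million blocked domains) plus word rules.
BLOCKED_DOMAINS = {"badsite.example", "spam.example"}
BLOCKED_WORDS = ("casino", "porn", "malware")

def keep_url(url: str) -> bool:
    domain = urlparse(url).netloc.lower()
    if domain in BLOCKED_DOMAINS:
        return False
    return not any(word in url.lower() for word in BLOCKED_WORDS)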

04 Deduplication

Now let's discuss deduplication, which can be a controversial step. The article "Deduplicating Training Data Makes Language Models Better" [13] gives a good overview of the research in its favor; however, "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling" [14] states that "...deduplication of our training data has no obvious benefit to the performance of the language model." Deduplication therefore remains an open question, but given the excellent results achieved by LLaMA, we should not ignore this step in any new training pipeline, and we will probably see more good research on it in the near future.

Now let's look at how CCNet handles deduplication. CC snapshots are large: the WET files for the March/April 2023 snapshot total 8.7 TiB and the WAT files 21.1 TiB (and that is compressed!). CCNet first splits the WET snapshot files into 5 GB shards saved in JSON format, where each entry corresponds to one crawled web page.

The next step after sharding is paragraph normalization, since deduplication is done at the paragraph level. Each paragraph is normalized by lowercasing it, replacing numbers with placeholders, and removing (or optionally replacing) all Unicode punctuation [15] and accent marks. [Translator's note: accent marks are diacritics that indicate stress or modify pronunciation. In languages such as French, Spanish, and German they change how letters are pronounced or emphasize certain syllables. For example, French uses the acute accent (´), the grave accent (`), and the circumflex (ˆ). Removing accents during text normalization helps unify the representation of the text, making comparison, matching, and deduplication more accurate.]

Next, the SHA-1 hash of each normalized paragraph is computed and its first 64 bits are used for deduplication. Deduplication can then optionally be performed by comparing against all shards or against a fixed number of shards. If you are interested in this step, their paper [16] has more details.

It is worth noting that RefinedWeb deduplicates much more aggressively, using fuzzy deduplication with "strict settings", which leads to "removal rates far higher than other datasets" (CCNet reports that duplicate data accounts for 70% of the text). This undoubtedly has a significant impact on the diversity of the dataset.

Another important aspect of deduplication described in the CCNet paper is that this step removes a lot of boilerplate content, such as navigation menus, cookie notices, and contact information. It also removes duplicated English content from pages in other languages, which makes language identification (discussed below) more reliable.

Here is an overview of the steps:

As you can see, the first step is to strip whitespace, then lowercase the text and replace numbers with placeholders (such as zeros). Next, Unicode punctuation is removed (or replaced), the text is hashed with SHA-1, and the first 8 bytes of the hash are used for the paragraph-level duplicate comparison. Note that this normalization should not be confused with preprocessing for training: it is used only to compute the hashes for deduplication; the original text is what is kept for training the model.
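To make these steps concrete, here is a minimal sketch of the paragraph normalization and hashing described above. It is my simplified reading of the pipeline, not CCNet's actual code; the real normalizer lives in the cc_net repository [15].

import hashlib
import re
import unicodedata

def normalize_paragraph(text: str) -> str:
    # Lowercase, strip accents, replace digits with a placeholder, and drop
    # Unicode punctuation. Used only to compute the dedup hash; the original
    # paragraph is what is kept for training.
    text = text.strip().lower()
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")   # accents
    text = re.sub(r"\d", "0", text)                                      # numbers
    return "".join(c for c in text if not unicodedata.category(c).startswith("P"))

def paragraph_hash(text: str) -> bytes:
    # SHA-1 of the normalized paragraph, keeping the first 8 bytes (64 bits).
    return hashlib.sha1(normalize_paragraph(text).encode("utf-8")).digest()[:8]

seen = set()
def is_duplicate(paragraph: str) -> bool:
    h = paragraph_hash(paragraph)
    if h in seen:
        return True
    seen.add(h)
    return False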

In RefinedWeb, a method similar to Gopher [17] is used to remove documents with excessive line, paragraph, or n-gram repetitions before the deduplication filtering. (Translator's note: when a contiguous sequence of n words or characters appears in multiple places in a document, it counts as an n-gram repetition.) They then apply the MinHash algorithm (an algorithm for estimating document similarity and containment [18]) and found it very effective for removing SEO boilerplate, i.e., SEO text repeated across many websites. They also perform exact deduplication, but because of the sheer size of CC they adopt the alternative proposed by CCNet, first sharding the data and then deduplicating within each shard.
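This is not RefinedWeb's actual code, but the general idea of fuzzy deduplication can be sketched with the datasketch library: shingle each document into word n-grams, build a MinHash signature, and use locality-sensitive hashing to flag documents whose estimated Jaccard similarity exceeds a threshold. The threshold and shingle size below are illustrative, not RefinedWeb's settings.

# pip install datasketch
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128, n: int = 5) -> MinHash:
    # MinHash signature over word 5-grams (shingles) of the document.
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - n + 1, 1)):
        m.update(" ".join(words[i:i + n]).encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # approximate Jaccard threshold
docs = {"doc1": "buy cheap widgets best price free shipping today",
        "doc2": "buy cheap widgets best price free shipping now"}
for key, text in docs.items():
    sig = signature(text)
    if lsh.query(sig):       # a near-duplicate was already kept
        continue             # drop this document as a fuzzy duplicate
    lsh.insert(key, sig)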

05 Language Identification

Let's now look at language identification, scoring, and filtering. CCNet uses fastText [19] (described in "Bag of Tricks for Efficient Text Classification" [20]), trained on data from Wikipedia, Tatoeba, and SETimes. fastText supports 176 languages and outputs a score for each.

In CCNet, if the score of the most probable language falls below 0.5 (50%), the page is discarded; otherwise the page is tagged with that language for the subsequent steps.

It is important to note that although the LLaMA dataset filters out non-English data from CC, LLaMA is also trained on other datasets that contain content in other languages (e.g. Wikipedia). In my experience, LLaMA handles other languages such as Portuguese quite well.

Like CCNet, the RefinedWeb pipeline uses fastText for language identification, with two important differences: it uses a higher score threshold of 0.65 instead of 0.5, and it swaps the order of the two stages, performing language identification first and deduplication afterwards.
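A minimal sketch of this scoring step with the pretrained lid.176 model that fastText distributes [19] (the 0.5 and 0.65 thresholds are the CCNet and RefinedWeb values mentioned above):

# pip install fasttext  (and download lid.176.bin from the fastText site [19])
import fasttext

model = fasttext.load_model("lid.176.bin")

def identify_language(text: str, threshold: float = 0.5):
    # fastText expects a single line of text.
    labels, scores = model.predict(text.replace("\n", " "))
    lang, score = labels[0].replace("__label__", ""), float(scores[0])
    return lang if score >= threshold else None  # None means: discard the page

print(identify_language("Este é um exemplo de texto em português."))  # -> "pt"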

06 LM Filtering

So far we have covered deduplication, language identification, and the initial filtering, but passing these steps does not guarantee that the data is of good quality. That is why CCNet adds another filtering step: it uses the perplexity of a language model trained on the target domain's language as a reasonably good quality signal. They train a 5-gram Kneser-Ney model on Wikipedia in the same language as the target data and then use it to compute the perplexity of each paragraph.
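In practice, the published CCNet pipeline ships these per-language 5-gram models as KenLM models, with SentencePiece tokenization applied before scoring. The sketch below only shows the scoring part with the kenlm Python bindings, assuming you already have such a model; the model path is a placeholder, the tokenization step is omitted, and the last lines preview the head/middle/tail split described just below.

# pip install https://github.com/kpu/kenlm/archive/master.zip numpy
import kenlm
import numpy as np

# Placeholder path: a 5-gram Kneser-Ney model trained on Wikipedia
# in the target language.
model = kenlm.Model("en.wikipedia.5gram.arpa.bin")

def paragraph_perplexity(paragraph: str) -> float:
    # Lower perplexity means the paragraph looks more like Wikipedia text.
    return model.perplexity(paragraph)

# Bucketing the per-language perplexity distribution into thirds:
paragraphs = ["The quick brown fox jumps over the lazy dog.",
              "buy cheap pills now best price click here",
              "Paris is the capital and most populous city of France."]
ppl = np.array([paragraph_perplexity(p) for p in paragraphs])
head_cut, tail_cut = np.percentile(ppl, [33.3, 66.7])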

Once the perplexities are computed, a threshold still has to be chosen. The CCNet paper describes splitting the perplexity distribution of each language into three equal parts (head, middle, and tail), since perplexities vary widely across languages. (Translator's note: perplexity measures how well a language model predicts a piece of text; lower perplexity means the text looks more like the data the model was trained on. By computing the perplexity of many paragraphs, one can analyze the resulting distribution to determine thresholds for judging paragraph quality.) Here is an important excerpt from the paper:

(…) Some documents despite being valid text ends up in the tail because they have a vocabulary very different from Wikipedia. This includes blog comments with spokenlike text, or very specialized forums with specific jargon. We decided to not remove content based on the LM score because we think that some of it could be useful for specific applications. (…)

What this means in practice depends on your application domain: blindly filtering by threshold with a language model trained only on Wikipedia may cause you to delete important data. RefinedWeb avoids using language models for filtering and instead relies only on simple rules and heuristics. They use a process very similar to Gopher's, filtering outliers by "total length, ratio of symbols to words, and other criteria to ensure that the document is authentic natural language." They emphasize that this still requires per-language tuning, since over-reliance on heuristics tied to the features of one language is often problematic.
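The actual Gopher rule set is more extensive, but a couple of document-level heuristics of the kind mentioned above can be sketched as follows; the thresholds are illustrative placeholders, not the published values.

def passes_heuristics(text: str) -> bool:
    words = text.split()
    if not 50 <= len(words) <= 100_000:            # overall document length
        return False
    alpha = sum(1 for w in words if any(c.isalpha() for c in w))
    if alpha / len(words) < 0.8:                   # mostly real words
        return False
    symbols = text.count("#") + text.count("...")
    if symbols / len(words) > 0.1:                 # symbol-to-word ratio
        return False
    return True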

07 "Is-Reference" Filtering

This step does not appear in CCNet but was added for the LLaMA dataset, so I describe it here as well. It is not detailed in the LLaMA paper, but it appears to work by training a simple linear classifier (it is unclear which features were used) to distinguish pages cited as references in Wikipedia from randomly sampled pages, and then discarding pages not classified as references.

This step may look simple at first glance, but it can have a major impact on the quality of the dataset, depending on the threshold chosen. My guess is that the LLaMA team kept the LM filtering conservative to avoid removing relevant data and added this extra step to deal with the quality issues that remained, but that is only speculation.
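Since the LLaMA paper gives no details, the following is purely my guess at what such a classifier could look like: bag-of-words features and a logistic regression trained on "pages cited as Wikipedia references" versus "random Common Crawl pages". None of this (features, model, data) is confirmed by the paper, and the tiny corpora are placeholders.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder corpora: in reality, the text of pages cited as references on
# Wikipedia (label 1) versus randomly sampled Common Crawl pages (label 0).
reference_pages = ["peer reviewed study of regional climate data",
                   "official census statistics annual report"]
random_pages = ["win big at the online casino today",
                "celebrity gossip and rumors daily blog"]

texts = reference_pages + random_pages
labels = [1] * len(reference_pages) + [0] * len(random_pages)

classifier = make_pipeline(
    HashingVectorizer(n_features=2**18, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(texts, labels)

# Keep only pages classified as "reference-like".
candidates = ["statistics office publishes annual report", "casino bonus click here"]
kept = [t for t in candidates if classifier.predict([t])[0] == 1]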

08 Appendix: RefinedWeb diagram

The RefinedWeb paper contains a very nice Sankey diagram of the processing pipeline:

Image source: The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. Guilherme Penedo et al. 2023. https://arxiv.org/abs/2306.01116.

This is a very informative graph that tells us how much data is being discarded. Personally, I'm impressed with the amount of data removed during the deduplication step.

09 Conclusion

Hope you enjoyed this article. Its main purpose was to give a brief overview of the data processing steps and decisions that have to be made before training a large language model (LLM). There are of course many other important aspects, such as the mixing proportions of different datasets, tokenization, and so on. Since CC is usually the largest dataset in LLM training, I decided to focus on the processing that happens to it before tokenization.

Many of the design and strategy choices in this preprocessing pipeline are driven by performance requirements, since we are dealing with enormous volumes of data from CC. In my opinion, investing more compute here could find a better trade-off on the data side, especially given the cost of training an LLM. However, it is hard to predict how different pipeline decisions will affect the trained model, which is why small-scale experiments, manual data inspection, and exploratory data analysis are crucial for understanding what is going on.

In the end, every company needs a dataset that fits its own requirements. Building one is a long-term investment involving a lot of experimentation, engineering effort, attention to detail, and an intuition for making judgment calls under uncertainty, but it is an investment that pays off in the long run.

END

References

1. https://arxiv.org/abs/2302.13971v1

2. https://arxiv.org/abs/2306.01116

3. https://arxiv.org/abs/2101.00027

4. https://aclanthology.org/2020.lrec-1.494/

5. https://commoncrawl.org/

6. https://commoncrawl.org/donate/

7. https://commoncrawl.org/the-data/get-started/#WARC-Format

8. https://aclanthology.org/2020.lrec-1.494/

9. https://arxiv.org/abs/2101.00027

10. https://github.com/miso-belica/jusText

11. https://huggingface.co/datasets/tiiuae/falcon-refinedweb

12. https://github.com/facebookresearch/cc_net/blob/main/cc_net/text_normalizer.py#LL10C1-L10C14

13. https://arxiv.org/abs/2107.06499

14. https://arxiv.org/abs/2304.01373

15. https://github.com/facebookresearch/cc_net/blob/main/cc_net/text_normalizer.py#LL10C1-L10C14

16. https://aclanthology.org/2020.lrec-1.494.pdf

17. https://arxiv.org/abs/2112.11446

18. https://www.cs.princeton.edu/courses/archive/spring13/cos598C/broder97resemblance.pdf

19. https://fasttext.cc/docs/en/language-identification.html

20. https://arxiv.org/abs/1607.01759

This article was translated and published by Baihai IDP with the original author's authorization. Please contact us for permission to reprint the translation.

Original link:

https://blog.christianperone.com/2023/06/appreciating-llms-data-pipelines/
