A summary of tricks for efficient and stable training of ChatGPT-style large models


Text|python

Foreword

Recently, ChatGPT has become a hot topic on the Internet. ChatGPT is a human-computer dialogue tool built on large language model (LLM) technology. Today's mainstream large language models use the Transformer architecture and are trained in a self-supervised way on extremely large-scale data. But how is the self-supervised training data constructed? What innovations have been made on top of the basic Transformer architecture? And what "black technology" keeps the training process efficient and stable? In this post, we walk through a survey paper from Renmin University of China to decipher the training tricks behind these large models.

Paper address:
https://arxiv.org/pdf/2303.18223.pdf

Test portals for the large models discussed

ChatGPT portal (no VPN required, can be tested directly):
https://yeschat.cn

GPT-4 portal (no VPN required, can be tested directly; if your browser shows a warning, you can proceed to the site):
https://gpt4test.com

Collection and processing of training data

Large language models place higher demands on both the scale and the quality of training data. So what corpora do current large models use? What role does each type of corpus play? How should the corpus be cleaned and preprocessed? And are there any details specific to large models that deserve extra attention?

Data Sources

In terms of data sources, the training data of large language models can be divided into general corpora and specialized corpora. General corpora, such as web pages, books, and dialogue text, account for a relatively large proportion and provide the model with language knowledge on a wide range of topics; specialized corpora, such as multilingual data, scientific text, and code, give the model the ability to solve specific tasks. The composition of the training corpora of existing large models is shown in the figure below:

[Figure: ratios of various data sources in the pre-training corpora of existing large models]

Among the general corpora, web-page corpora are large in scale, but they mix high-quality sources such as Wikipedia with low-quality content such as spam, so they generally need to be filtered. Dialogue and question-answering corpora, for example from social media platforms such as Reddit, can potentially improve the model's ability to answer questions. Social media content usually involves multi-party conversations, and the dialogue data can be organized into a tree structure according to reply relationships, so that each branch forms a complete conversation. Book corpora are a rare source of long-form written text, which helps the model learn rigorous linguistic knowledge, model long-distance dependencies, and improve the coherence of generated content.

Among the specialized corpora, multilingual corpora improve the model on translation, multilingual summarization, question answering, and other tasks. Scientific corpora, obtained from arXiv papers, textbooks, and online mathematics communities, help the model understand special symbols, terminology, and expressions, improving its performance on scientific tasks and reasoning. Code corpora mainly come from question-and-answer communities such as Stack Exchange and open-source projects on GitHub, and include code, comments, and documentation. Recent studies suggest that code corpora can improve the model's complex (chain-of-thought) reasoning ability, possibly because of their long-distance dependencies and inherently rigorous logic.

For links to some open-source corpora, see our previous post: Necessary resources for training ChatGPT: a complete guide to corpora, models, and code libraries.

Cleaning and Preprocessing

[Figure: typical pipeline for cleaning and preprocessing the pre-training corpus]

After the corpus has been collected, the pipeline in the figure above is generally used to clean and preprocess it to improve its quality.

Specifically, the first step is quality filtering. One approach is to use samples such as Wikipedia as positive examples to train a binary classifier that selects high-quality text. However, recent work has shown that this kind of classifier-based filtering may introduce bias, so heuristic rules are now generally recommended instead: removing text not in the target languages, filtering data with abnormal perplexity, deleting sentences that are too long or too short or contain too many punctuation marks/symbols, and deleting sentences that contain certain specific tokens (such as HTML tags, links, swear words, or sensitive words). A minimal sketch of such rule-based filtering is shown below.
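
Here is a minimal Python sketch of the kind of heuristic filtering described above. The specific thresholds (sentence length, symbol ratio) and the tiny blocklist are illustrative assumptions, not values reported in the paper.

```python
import re

# Illustrative thresholds and blocklist -- assumptions, not values from the survey.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")
BAD_WORDS = {"viagra", "xxx"}  # placeholder blocklist

def keep_line(line: str) -> bool:
    tokens = line.split()
    if not (10 <= len(tokens) <= 1000):            # too short or too long
        return False
    symbol_ratio = sum(not c.isalnum() and not c.isspace() for c in line) / max(len(line), 1)
    if symbol_ratio > 0.3:                         # too many punctuation marks / symbols
        return False
    if HTML_TAG.search(line) or URL.search(line):  # residual markup or links
        return False
    if any(w in line.lower() for w in BAD_WORDS):  # blocked vocabulary
        return False
    return True

corpus = [
    "A clean, well-formed sentence with enough ordinary words to keep for training purposes today.",
    "<div>buy now!!!</div> http://spam.example",
]
print([keep_line(x) for x in corpus])  # [True, False]
```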

The second step is deduplication. Sentences containing large numbers of repeated words or phrases can be deleted; paragraphs with high repetition (measured by word or n-gram co-occurrence) can be deleted; and content in the training set that overlaps too strongly with the test set can be deleted. This improves training-set quality, alleviates the tendency of language models to generate repetitive content, and avoids the overfitting and evaluation problems caused by test-set leakage. A minimal sketch is given below.
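
A toy sketch of deduplication: exact-hash removal of repeated paragraphs plus an n-gram overlap check against the test set. The 8-gram size and the 0.5 overlap threshold are arbitrary choices for illustration.

```python
import hashlib

def ngrams(text: str, n: int = 8):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def dedup(paragraphs, test_set_text="", overlap_threshold=0.5):
    """Drop exact duplicates and paragraphs overlapping the test set too much."""
    seen, kept = set(), []
    test_grams = ngrams(test_set_text)
    for p in paragraphs:
        h = hashlib.md5(p.encode()).hexdigest()      # exact-duplicate check
        if h in seen:
            continue
        grams = ngrams(p)
        if grams and test_grams:
            overlap = len(grams & test_grams) / len(grams)
            if overlap > overlap_threshold:          # likely test-set leakage
                continue
        seen.add(h)
        kept.append(p)
    return kept
```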

The third step is to remove users' private information (names, addresses, phone numbers, etc.), for example with simple rules like the sketch below.
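
A very rough sketch of rule-based PII scrubbing; the regular expressions here are only illustrative, and real pipelines use far more thorough detectors.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s\-]{7,}\d")

def scrub_pii(text: str) -> str:
    """Replace obvious e-mail addresses and phone numbers with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)

print(scrub_pii("Contact john.doe@example.com or +1 555 123 4567."))
```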

Finally, after these three cleaning steps, the text can be tokenized and made ready for training. There is no black technology in tokenization: either reuse an off-the-shelf tokenizer such as GPT-2's, or train a tokenizer on the training corpus with algorithms such as SentencePiece and Byte Pair Encoding (BPE), for example as sketched below.
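
A minimal sketch of training a BPE tokenizer on the cleaned corpus with SentencePiece. The file names and the 32k vocabulary size are assumptions chosen for illustration.

```python
import sentencepiece as spm

# "cleaned_corpus.txt" and vocab_size=32000 are illustrative assumptions.
spm.SentencePieceTrainer.train(
    input="cleaned_corpus.txt",
    model_prefix="llm_bpe",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,
)

sp = spm.SentencePieceProcessor(model_file="llm_bpe.model")
print(sp.encode("Large language models need good tokenizers.", out_type=str))
```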

Some Details Worth Noting

The characteristics of large models mean that a few special details deserve attention when preparing the pre-training corpus:

  • The mixing ratio of corpora from different sources needs to be tuned; it should not simply follow the raw size of each corpus. A balanced mixture helps the model generalize, while increasing the share of a particular type of corpus strengthens the corresponding specific ability.

  • The corpus size should match the parameter scale of the model. Experience suggests that, for a given compute budget, models perform relatively better when the number of training tokens is scaled in step with the number of model parameters. So do not blindly pursue an ever-larger corpus; controlling the scale, improving the quality, and training adequately are just as important.

  • Corpus quality matters (again). Experiments show that when training a large model, it is better to leave low-quality corpora out than to include them. Excessively repeated data can even derail training entirely (causing crashes or convergence to meaningless local optima).

Model Structure and Pre-training Tasks

Mainstream large language models are all based on the Transformer architecture. As the table below shows, most models adopt the causal-decoder structure, i.e., they use only a decoder (with unidirectional attention masking) to process both the input and the output. The editor's guess is that after GPT-3 demonstrated the strength of the causal decoder, combined with scaling-law studies of this structure, people largely lost interest in exploring the alternatives.

As for the other two architectures: the encoder-decoder structure resembles the original Transformer for machine translation, using two components with separate parameters to process the input and the output respectively. The prefix decoder is similar to the causal decoder, except that it drops the unidirectional attention mask on the input portion and allows bidirectional attention there; it is somewhat like an encoder-decoder with shared parameters.

[Table: model cards of existing large models — architecture category, layer normalization, activation function, position embedding, #L, #H, hidden size, MCL]

Beyond the choice of overall Transformer structure, the table above also records several model design details, specifically the following points:

  • Layer normalization is an important means of ensuring convergence and alleviating training crashes. The classic Pre-Norm applies layer normalization before every multi-head attention layer and feed-forward layer. Building on Pre-Norm, Pre-RMSNorm drops the mean-centering part of normalization and only rescales by the root mean square, which makes optimization smoother and is the current mainstream recommendation (see the sketch after this list). Adding an extra normalization right after the embedding layer also smooths optimization, but it noticeably hurts model performance, so it is generally no longer used.

  • For activation functions, the traditional ReLU is generally considered insufficient. SwiGLU and GeGLU are now believed to give better performance, though compared with activations such as GeLU they introduce extra parameters (a SwiGLU sketch also appears below).

  • For position encoding, the traditional options are learned absolute position embeddings (Learned) and relative position encodings based on relative distance (Relative); the latter extrapolates better to longer sequences at test time. Recently, RoPE has been widely adopted: it uses a kernel-like trick based on trigonometric rotations to encode the absolute positions of the query and key vectors, so that their inner product naturally contains terms expressing relative position.
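
To make the first two points concrete, here is a minimal PyTorch sketch of RMSNorm (layer normalization without mean-centering) and a SwiGLU feed-forward block; the module layout is a generic illustration rather than any particular model's implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Pre-RMSNorm: rescale by the root mean square, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit (extra gate projection)."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(torch.nn.functional.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)
print(SwiGLU(512, 2048)(RMSNorm(512)(x)).shape)  # torch.Size([2, 16, 512])
```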

In addition, the table above summarizes some hyperparameter information, such as the number of layers #L, the number of attention heads #H, the hidden size, and the maximum context length (MCL).

Compared with the model-structure details, the design of the pre-training task is very simple. The most common pre-training task is the autoregressive language model: the model predicts the next token, one token at a time, from the input history, and it is adopted by GPT-3 and most other language models. Models such as T5 and GLM-130B additionally introduce a denoising-autoencoding objective, in which the model has to recover masked spans in the input. A minimal sketch of the autoregressive objective follows.
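
A minimal sketch of the autoregressive (next-token prediction) objective as a shifted cross-entropy loss; the toy tensors are placeholders for real model outputs.

```python
import torch
import torch.nn.functional as F

def autoregressive_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Next-token prediction: position t is trained to predict token t+1."""
    # logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len)
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )

logits = torch.randn(2, 16, 1000)            # toy model output
input_ids = torch.randint(0, 1000, (2, 16))  # toy token ids
print(autoregressive_lm_loss(logits, input_ids))
```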

Optimization Settings and Tips

To make training large language models more efficient and stable, a series of "black technologies" are used in practice. Specifically, these techniques can: 1. improve the final performance of the model; 2. speed up convergence; 3. prevent the model from converging to a high-loss local optimum, or not converging at all; and 4. prevent the training process from collapsing. The optimization settings and tricks disclosed by existing large models are listed in the table below.

[Table: detailed optimization settings of existing large models — batch size, learning rate and schedule, optimizer, precision, weight decay, gradient clipping, etc.]

The batch size is generally set large in order to exploit the large-scale training data and to make training more stable, for example a batch size of 8,196 sequences (about 1.6M tokens per batch). GPT-3 instead dynamically increases the batch size, gradually growing the number of tokens per batch from 32K to 3.2M. A toy sketch of such a ramp is given below.
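
A toy sketch of a dynamic batch-size ramp in the spirit of GPT-3; the linear shape and the 10k-step ramp length are illustrative assumptions, not GPT-3's actual schedule.

```python
def tokens_per_batch(step: int, ramp_steps: int = 10_000,
                     start: int = 32_000, end: int = 3_200_000) -> int:
    """Linearly ramp the number of tokens per batch from `start` to `end`.
    ramp_steps=10_000 is an illustrative assumption."""
    if step >= ramp_steps:
        return end
    return int(start + (end - start) * step / ramp_steps)

print(tokens_per_batch(0), tokens_per_batch(5_000), tokens_per_batch(20_000))
```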

The learning rate is generally small and includes a warm-up phase to keep training smooth. For example, a linear learning-rate ramp is applied during the first 0.1%–0.5% of training steps. The peak learning rate is generally on the order of 1e-4 or lower; GPT-3, for example, uses a peak learning rate of 6e-5. After warm-up, a cosine decay schedule gradually reduces the learning rate to roughly 10% of its peak value before convergence. A sketch of such a schedule is shown below.
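
A sketch of a warm-up-plus-cosine-decay schedule. The 6e-5 peak and the decay to 10% of peak follow the text above; the 0.3% warm-up fraction is just one value inside the quoted range.

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 6e-5,
               warmup_frac: float = 0.003, min_lr_frac: float = 0.1) -> float:
    """Linear warm-up over the first ~0.3% of steps, then cosine decay to 10% of peak."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    min_lr = peak_lr * min_lr_frac
    return min_lr + (peak_lr - min_lr) * cosine

print(lr_at_step(100, 100_000), lr_at_step(50_000, 100_000), lr_at_step(100_000, 100_000))
```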

The optimizer is generally Adam, AdamW, or Adafactor, where Adafactor is a memory-saving variant of Adam.

Other techniques for stabilizing training include gradient clipping with a threshold of 1.0 and weight decay (similar to L2 regularization) with a rate of 0.1; a sketch combining both appears below. Even so, training runs for large models still crash frequently. PaLM and OPT report that when a crash occurs, training can be resumed from an earlier checkpoint while skipping the data that triggered the crash. GLM found that the embedding layer often produces abnormal gradients and needs to be adjusted appropriately.
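
A minimal sketch combining AdamW with weight decay 0.1 and gradient clipping at 1.0; the toy model and loss are placeholders, and the betas are a common choice rather than a value quoted from the text.

```python
import torch

model = torch.nn.Linear(1024, 1024)            # placeholder for a real LM
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5,
                              betas=(0.9, 0.95), weight_decay=0.1)

def training_step(batch: torch.Tensor) -> float:
    loss = model(batch).pow(2).mean()          # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

print(training_step(torch.randn(8, 1024)))
```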

Data parallelism is the most common multi-GPU training method: the training data are distributed across multiple graphics cards, each card computes its own forward and backward pass, the gradients are then aggregated, the parameters are updated, and the model replicas are kept in sync. This addresses the problem of the per-card batch being too small. A minimal PyTorch sketch follows.
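
A minimal PyTorch DistributedDataParallel sketch, assumed to be launched with `torchrun --nproc_per_node=<num_gpus> train.py` (one process per GPU); a real pipeline would also use a DistributedSampler to shard the data.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda(rank)   # placeholder for a real LM
    model = DDP(model, device_ids=[rank])            # gradients are all-reduced across cards
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)

    for _ in range(10):
        x = torch.randn(32, 1024, device=f"cuda:{rank}")  # each rank sees its own data shard
        loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

if __name__ == "__main__":
    main()
```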

Pipeline parallelism stores and computes only a few adjacent layers on each graphics card. To mitigate the inefficiency of cards waiting on each other in sequence, tools such as GPipe and PipeDream accumulate data from multiple micro-batches in the pipeline and update parameters asynchronously. This method helps when a single card cannot even run the model at a batch size of 1.

Tensor parallelism splits the parameter matrix of a large matrix multiplication. Taking Y = XA as an example, A can be split into two sub-matrices by column, A = [A1, A2], so that the operation becomes the concatenation of two smaller products, Y = [XA1, XA2]; the two smaller multiplications can then run on two different graphics cards. This approach is implemented by tools such as Megatron-LM and Colossal-AI; it alleviates the memory pressure of a single huge matrix multiplication, at the cost of some extra communication. The small sketch below verifies the equivalence.
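
A small sketch verifying the column-split equivalence on CPU; in real tensor parallelism the two halves of A would live on different GPUs.

```python
import torch

# Check that splitting A by columns reproduces the full product Y = X @ A.
X = torch.randn(4, 8)
A = torch.randn(8, 6)
A1, A2 = A[:, :3], A[:, 3:]        # in practice A1 and A2 are placed on different cards

Y_full = X @ A
Y_split = torch.cat([X @ A1, X @ A2], dim=-1)  # concatenate the partial results
print(torch.allclose(Y_full, Y_split, atol=1e-6))  # True
```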

Mixed-precision training uses half-precision floating-point numbers for part of the computation (especially the forward pass), which reduces memory usage and speeds up training. Graphics cards such as the A100 have dedicated support for half-precision computation, making mixed precision even more effective. More recently, Brain Floating Point (BF16) has been proposed to replace the traditional FP16, trading fewer significand bits for more exponent bits. However, although mixed-precision computation is significantly faster, experience shows that numerical precision and model performance can still suffer compared with full precision. A minimal sketch with PyTorch AMP is shown below.
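
A minimal sketch of mixed-precision training with PyTorch AMP, assuming a CUDA-capable GPU; the GradScaler provides the loss scaling needed for FP16 (it can be dropped when using BF16), and the toy model and loss are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()     # placeholder model, assumes a GPU
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)
scaler = torch.cuda.amp.GradScaler()           # loss scaling for FP16; not needed for BF16

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()          # forward pass runs in half precision
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
```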

[Figure: the mixed-precision training process and the memory it consumes beyond the model parameters]

ZeRO is a scheme proposed by DeepSpeed to further optimize data parallelism by sharding the memory that goes beyond the model parameters themselves. As the figure above makes clear, mixed-precision training consumes a lot of storage besides the parameters: a half-precision GPT-2 with 1.5B parameters needs only about 3GB for its weights, yet it cannot be trained on a 32GB graphics card, and this is the reason. The main idea of ZeRO is to partition gradients, momentum, and other update-related state across the cards, so that each card updates its own slice of the parameters and then synchronizes, and to free gradient memory as soon as the corresponding update is done. Since the method is fairly involved, we do not describe it in detail here. Both DeepSpeed and PyTorch's FSDP implement ZeRO-style sharding; a minimal FSDP sketch follows.
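
A minimal PyTorch FSDP sketch of ZeRO-style sharding, assumed to run under torchrun with one process per GPU; the toy model is a placeholder, and real setups add wrapping policies, mixed-precision settings, and checkpointing.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Parameters, gradients, and optimizer states are partitioned across ranks
# instead of being replicated on every card.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda(rank)
model = FSDP(model)                                  # shards states across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)

x = torch.randn(8, 4096, device=f"cuda:{rank}")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
```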

In practice, these optimization techniques are usually combined. For example, BLOOM was trained on 384 A100s using 8-way data parallelism, 4-way tensor parallelism, and 12-way pipeline parallelism, together with a BF16 mixed-precision training strategy. Open-source tools such as DeepSpeed, Colossal-AI, and Alpa support these parallelization features.

In addition, to reduce the cost of trial and error, GPT-4 proposes predictable scaling, which uses much smaller models to predict how a large-model configuration is likely to perform. PyTorch's FSDP can also offload part of the computation and storage to the CPU to share the load.

Conclusion

Training large language models is not only a scientific problem but also a complex engineering problem, and scientists and engineers must work together to push large models forward effectively. The various training tricks above help improve training efficiency and stability. However, papers can only scratch the surface of the relevant engineering details; to truly master them, you also need to read the code of the open-source projects carefully and try running it yourself.



Original source: blog.csdn.net/xixiaoyaoww/article/details/130212210