How to Continue Pretraining Better (Continual Pre-Training)

From: NLP Workstation


Preface

Pretraining is an extremely resource-intensive task, especially in the LLM era. With the open-sourcing of LLaMA 2, more and more people have begun trying to add Chinese capability on top of this strong English base model. But how do we ensure the model learns "Chinese knowledge" without losing its original "English knowledge"?

Today I bring you a write-up on Continual Pretraining (from Zhihu user @何志) covering the paper Continual Pre-Training of Large Language Models: How to (re)warm your model?

Zhihu: https://zhuanlan.zhihu.com/p/654463331
Paper: https://arxiv.org/pdf/2308.04014.pdf

1. Experimental settings

The authors take Pythia, a 410M-parameter model pre-trained on the Pile, and continue pretraining it on the downstream dataset SlimPajama.

The paper uses loss directly as the evaluation metric: the lower the loss on the upstream (or downstream) data, the better the performance on that task.

Pythia: https://huggingface.co/EleutherAI/pythia-410m-v0
Pile: https://huggingface.co/datasets/EleutherAI/pile
SlimPajama: https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama
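
As a quick reference, this is roughly how the checkpoint above can be loaded (a minimal sketch assuming the Hugging Face `transformers` library; the paper's actual training pipeline is not reproduced here):

```python
# Minimal sketch: load the Pythia-410M checkpoint used in the paper.
# Assumes the `transformers` library; training details (data pipeline,
# hyperparameters) from the paper are not reproduced here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-410m-v0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

print(f"Loaded {model_name}: {sum(p.numel() for p in model.parameters()):,} parameters")
```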

2. Key conclusions

2.1 The number of warmup steps does not affect the final performance

Warmup is a commonly used strategy in fine-tuning: the learning rate ramps up slowly from a very small value to its maximum. So how long should this "slow ramp-up" phase last?
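
To make this concrete, a linear warmup schedule can be written as follows (an illustrative sketch, not the paper's training code):

```python
def linear_warmup_lr(step: int, max_lr: float, warmup_steps: int) -> float:
    """Linearly ramp the learning rate from ~0 up to max_lr over warmup_steps."""
    if warmup_steps > 0 and step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    return max_lr  # after warmup, a decay schedule (e.g. cosine) takes over

# Warmup length is usually expressed as a fraction of total training steps,
# e.g. a 1% warmup over an (illustrative) 100k-step run:
total_steps = 100_000
warmup_steps = int(0.01 * total_steps)   # -> 1,000 warmup steps
```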

The authors ran experiments with four different warmup lengths: 0%, 0.5%, 1%, and 2% of the total training steps.

[Figure: upstream and downstream loss curves for the four warmup lengths]

As the figure shows: once the model is "fully" trained, the final performance is similar no matter how many warmup steps are used.

However, the premise is "full training". Looking only at the early stage of training, a longer warmup (yellow line) gives lower loss than the other settings on both the "upstream task" and the "downstream task" (downstream is learned faster, upstream is forgotten more slowly).

2.2 The larger the learning rate, the better the downstream tasks and the worse the upstream tasks

To explore the impact of the learning rate, the authors ran comparative experiments with four different maximum learning rates.

In addition, they compared against a model trained from scratch.

[Figure: upstream and downstream loss curves for the four maximum learning rates, with a from-scratch baseline]

As the figures show: after sufficient training, the larger the learning rate (purple), the better the final downstream performance and the worse the upstream performance (the most forgetting). Again looking at the early stage: although the purple line ends with the lowest loss, its loss spikes sharply early in training before coming back down.

PS: Why so much focus on the early stage of training? Because in real training we may never reach the 250B tokens shown in the figure, especially when the model is large. When resources do not allow full training, a smaller learning rate and a longer warmup may therefore be the better choice.
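
Putting the two knobs together, here is a sketch of the common linear-warmup-plus-cosine-decay setup using the `transformers` scheduler API (the learning-rate values are illustrative, not the paper's exact configuration):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Illustrative values; not the paper's exact configuration.
max_lr = 1.5e-4                          # the "maximum learning rate" being ablated
total_steps = 100_000
warmup_steps = int(0.01 * total_steps)   # 1% linear warmup

model = torch.nn.Linear(8, 8)            # stand-in; use the pretrained LLM in practice
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,      # cosine-decays from max_lr after warmup
)

for step in range(total_steps):
    ...                                  # forward pass, loss.backward(), optimizer.step()
    scheduler.step()                     # advance the warmup/cosine schedule
```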

The figure also shows that the model trained from scratch (blue) underperforms the pre-trained model on both upstream and downstream tasks.

This encourages us, when taking on a new training task today, to continue from a pre-trained model in order to exploit its prior knowledge.

2.3 Re-warming the learning rate on the original pre-training data hurts performance

Although the warmup strategy works better than a constant learning rate in both fine-tuning and continued pretraining, this holds only under the premise of "switching the training dataset (data distribution)".

The authors ran an experiment in which they did not switch datasets, but instead kept training on the original pre-training dataset (the Pile).

[Figure: loss on the Pile for re-warmed versus constant learning rates]

The results in the figure show that no matter how large the re-warmed learning rate is, the outcome is worse than simply keeping a constant learning rate.

This further shows that re-warming the learning rate while continuing on the original dataset damages performance: the larger the learning rate, the greater the damage, and the damage is not recovered by subsequent training.

PS: The takeaway is that if pre-training is interrupted and must be resumed, the learning rate should be restored to its pre-interruption state (both the value and the decay schedule) when training restarts.
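
In PyTorch, that means checkpointing and restoring the optimizer and scheduler state alongside the model weights (a minimal sketch continuing the variables from the training-loop sketch above; the file name is illustrative):

```python
import torch

# --- On interruption: save everything needed to resume the schedule ---
torch.save({
    "step": step,
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),  # includes current learning rates
    "scheduler": scheduler.state_dict(),  # includes schedule progress
}, "checkpoint.pt")

# --- On restart: restore the exact pre-interruption state ---
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])
start_step = ckpt["step"] + 1  # resume without re-warming the learning rate
```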

3. Experimental limitations

At the end of the paper, the authors note some limitations of the conclusions above.

3.1 Upstream and downstream data distributions are similar

Because the upstream dataset (the Pile) and the downstream dataset (SlimPajama) chosen for the experiments overlap to some extent, their distributions are relatively similar. In real training tasks, the gap between upstream and downstream data may be much larger.

3.2 The model is small in scale

The model used in the paper has only 410M parameters, far smaller than the 7B scale that most LLM work starts from today.

However, the team plans to rerun the experiments at the 3B and 7B scales in follow-up work, and their final conclusions are worth looking forward to.


