Continual Pre-Training of Large Language Models: How to (re)warm your model?

This article is part of a series on LLMs; it is a translation of the paper "Continual Pre-Training of Large Language Models: How to (re)warm your model?".

Continual pre-training of large language models: How to (re)warm your model?

Abstract

Large language models (LLMs) are typically pretrained on billions of tokens, only for the whole process to be restarted once new data becomes available. A cheaper and more efficient solution is to continually pretrain these models, i.e., to update the pretrained model with the new data instead of retraining it from scratch. However, the distribution shift introduced by new data usually degrades performance on past data. In this work, we study the effect of different warm-up strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warm-up phase of models pretrained on the Pile (upstream data, 300B tokens) as we continue to pretrain on SlimPajama (downstream data, 297B tokens), following a linear warm-up and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance via validation perplexity. We experiment with different pretraining checkpoints, different maximum learning rates, and different warm-up durations. Our results show that while re-warming a model initially increases the loss on both upstream and downstream data, in the long run it improves downstream performance, outperforming models trained from scratch, even for a large downstream dataset.
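To make the schedule concrete, below is a minimal sketch of the linear warm-up followed by cosine decay described above, as one might apply it when resuming optimization from a pretrained checkpoint. The function name, step counts, and learning-rate values are illustrative placeholders, not the paper's exact hyperparameters.

import math

def rewarmed_lr(step, total_steps, max_lr, min_lr=0.0, warmup_steps=1000):
    """Linear warm-up to max_lr, then cosine decay to min_lr.

    Hypothetical helper for illustration; the step counts and learning-rate
    values are placeholders, not the paper's exact settings.
    """
    if step < warmup_steps:
        # Re-increase the learning rate linearly from (near) zero up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: continuing pre-training on new data for 10,000 steps,
# re-warming to a maximum learning rate of 3e-4 over the first 1,000 steps.
for step in range(0, 10_000, 2_000):
    print(step, rewarmed_lr(step, total_steps=10_000, max_lr=3e-4))

Choosing the maximum learning rate in such a schedule is the trade-off the paper investigates: a higher value adapts faster to the new data, a lower value better preserves performance on the original data.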

1 Introduction

2 Settings

3 Related Work

4 Continual Warm-up

5 Discussion/Limitations

6 Conclusion

Our experiments show that warming up to a higher maximum learning rate helps models pretrained on the Pile adapt to SlimPajama, while a smaller maximum learning rate better preserves performance on the Pile. In both cases, however, the re-warmed model improves over a model trained from scratch. These results motivate continual pre-training on new datasets rather than retraining from scratch. Nevertheless, more research is needed to establish similar results for larger model sizes and different distribution shifts, and to verify that this strategy can be applied repeatedly as models are updated.

Source: blog.csdn.net/c_cpp_csharp/article/details/132888150