Continuous pre-training of large language models

Overview

The background of this paper is that pre-training of large language models usually starts from scratch, which is time-consuming and expensive. The authors explore continual pre-training, i.e. updating an already pre-trained model as new data arrives instead of retraining it. Past approaches either trained from scratch or relied on low-cost hyperparameter optimization, and did not address continually updating a pre-trained model. The method in this paper is to re-warm the model, gradually increasing the learning rate again so that training on the new data is computationally efficient. Concretely, the paper restarts training from a pre-trained checkpoint with a linear warmup followed by cosine decay, and continues pre-training while varying the starting checkpoint, the maximum learning rate, and the warmup length. Experiments are run on the Pythia 410M architecture, and performance is evaluated with validation perplexity. The results show that although re-warming initially increases the loss on both the upstream and downstream data, it improves downstream performance in the long run and surpasses a model trained from scratch on a large downstream dataset.
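To make the schedule concrete, here is a minimal sketch (not the authors' code) of a linear-warmup-plus-cosine-decay learning rate function of the kind used when re-warming a checkpoint; the values of max_lr, min_lr, warmup_steps, and total_steps are illustrative assumptions, not the paper's settings.

```python
import math

def rewarm_lr(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=1000, total_steps=10000):
    """Linear warmup followed by cosine decay, the schedule shape used to
    re-warm a pre-trained checkpoint before continuing pre-training.
    All default values here are illustrative, not the paper's settings."""
    if step < warmup_steps:
        # Linear warmup: ramp the learning rate from ~0 back up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay: anneal from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Shape check: tiny at step 0, peaks at the end of warmup, decays toward min_lr.
print(rewarm_lr(0), rewarm_lr(999), rewarm_lr(9999))
```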

Discussion of key questions

1. Why does adding new data to a pre-trained model degrade performance on the old data?

Answer: Adding new data changes the training distribution, and this distribution shift causes the model's performance on the old data to decline. As the parameters adapt to the new distribution, the features and representations learned during the original pre-training are partially overwritten, so performance on the old data is forgotten.

2. Why does the learning rate need to be increased again when training on new data?

Answer: Re-warming the learning rate improves computational efficiency. Because the new data comes from a different distribution than the old data, a higher learning rate lets the model adapt to the new data more quickly, reducing the overall training time and cost.

3. Why does re-increasing the learning rate improve downstream task performance over long-term training?

Answer: Re-increasing the learning rate may increase the loss on both upstream and downstream data at first, but it improves downstream performance over the course of longer training. As training progresses, the model gradually adapts to the new data distribution and learns better representations and features, which improves performance on downstream tasks.

4. How does the size of the new data affect the effectiveness of re-increasing the learning rate?

Answer: The size of the new dataset affects how effective re-warming is. A larger new dataset may call for a longer warmup phase so that the model can fully adapt to the new distribution, while a smaller new dataset may not need re-warming at all, because the model can adapt to it quickly; a rough sketch of this idea follows.
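As a rough illustration of this point, here is a hypothetical heuristic of my own (not a rule from the paper): set the warmup length as a fixed fraction of the optimizer steps the new data provides, so a larger dataset automatically gets a longer re-warming phase. The function name and the 1% fraction are assumptions.

```python
def warmup_steps_for(new_tokens, tokens_per_step, warmup_fraction=0.01):
    """Hypothetical heuristic: warm up for a fixed fraction (here 1%) of the
    optimizer steps available on the new dataset, so larger datasets get a
    longer re-warming phase. Not a prescription from the paper."""
    total_steps = new_tokens // tokens_per_step
    return max(1, int(warmup_fraction * total_steps))

# Example: 10B new tokens at 1M tokens per step -> 10,000 steps, 100 of them warmup.
print(warmup_steps_for(10_000_000_000, 1_000_000))
```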

5. What are the advantages of continuing to train a pre-trained model on new data compared to training from scratch?

Answer: Continuing to train a pre-trained model on new data can reach higher performance than training from scratch. The pre-trained model has already learned a large amount of linguistic knowledge and features, which provides a better initial representation, so it achieves good performance in fewer training steps. This significantly reduces training time and computational cost, and it also performs better on large-scale downstream data.
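As a concrete illustration, here is a minimal sketch of resuming pre-training from an existing checkpoint with a freshly re-warmed schedule instead of a random initialization. It assumes the Hugging Face transformers and torch libraries and the publicly released EleutherAI/pythia-410m checkpoint; the optimizer hyperparameters and step counts are illustrative, not the paper's.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, get_cosine_schedule_with_warmup

# Start from released pre-trained weights instead of a random initialization.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")

# Fresh optimizer and a re-warmed warmup + cosine schedule for the new data
# (learning rate, weight decay, and step counts are illustrative values).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1_000, num_training_steps=100_000
)

# Inside the training loop over batches of the *new* dataset:
#   loss = model(input_ids=batch, labels=batch).loss
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```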

Paper link: https://arxiv.org/abs/2308.04014

Source: blog.csdn.net/sinat_37574187/article/details/132207159