ChatGLM-6B (Tsinghua's ChatGPT-style model) tutorial: how to determine the most appropriate learning rate from training

When we fine-tune ChatGLM-6B, we always run into the same question: how should the learning rate be set? Let's first look at how the learning rate is set in the two training sh files that ship with ChatGLM.

1. How to choose the learning rate for the first run
[Screenshot in the original post: the LR setting in train.sh]
[Screenshot in the original post: the LR setting in train_chat.sh]
We can see that in the defaults shipped with ChatGLM, the learning rate for chat training is lower than the one for training on advertising copy. The difference is that the advertising-copy prompts overlap heavily with one another, while chat data is much more divergent.
So before training, you should judge for yourself how divergent your training data is. If it is very divergent, lower the learning rate; if the prompts are very concentrated, you can start with a relatively large learning rate.
Take 2e-2 as the baseline, and choose a starting value between 5e-3 and 5e-2.
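Given the suggested range above, one way to pick three candidate starting rates for comparison runs is to space them evenly on a log scale. This is a minimal sketch; `candidate_lrs` is an illustrative helper, not part of the ChatGLM repo:

```python
# Sketch: pick three candidate learning rates around the 2e-2 default,
# spaced evenly on a log scale inside the suggested 5e-3 .. 5e-2 range.
# Purely illustrative; not taken from the ChatGLM scripts.

def candidate_lrs(low=5e-3, high=5e-2, n=3):
    """Return n learning rates evenly spaced on a log scale."""
    ratio = (high / low) ** (1 / (n - 1))
    return [low * ratio ** i for i in range(n)]

lrs = candidate_lrs()
print(lrs)  # roughly [0.005, 0.0158, 0.05]
```

You would then run the training script once with each candidate and compare the resulting loss curves, as described in the next section.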

2. Finding a relatively good learning rate
We must first understand one thing: the learning rate is not a fixed or absolute value but a relative one. Because it is relative, it cannot be found in a single training run; it has to be determined through at least three runs.
After each run, print out the loss values. The loss curve should look like a decaying curve, falling roughly like an inverse function of the training step.
I plotted the results of the three runs:
[Figure in the original post: loss curves for the three learning rates]
The learning rate of the three runs decreases from top to bottom, i.e., blue > yellow > green.
There are two things to check. If the third run (green, the lowest learning rate) ends with the largest loss, increasing the learning rate may make the loss smaller. If the first run (blue, the highest learning rate) ends with the largest loss, reducing the learning rate may make the loss smaller. If the middle run (yellow) ends with the largest loss, the learning rate can be considered settled.
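The decision rule above can be sketched as a small helper. This is an illustrative function (not from the ChatGLM repo); the arguments are the final loss of each run, ordered by decreasing learning rate (blue > yellow > green):

```python
# Sketch of the post's rule: compare the final loss of three runs,
# ordered from the highest learning rate to the lowest, and suggest
# which way to move the learning rate next. Illustrative only.

def next_lr_direction(loss_high_lr, loss_mid_lr, loss_low_lr):
    """Return 'decrease', 'increase', or 'keep' for the learning rate."""
    worst = max(loss_high_lr, loss_mid_lr, loss_low_lr)
    if worst == loss_high_lr:   # highest-LR run is worst -> go lower
        return "decrease"
    if worst == loss_low_lr:    # lowest-LR run is worst -> go higher
        return "increase"
    return "keep"               # middle run is worst -> settle on mid LR

print(next_lr_direction(1.2, 0.9, 0.8))  # decrease
```

With `next_lr_direction(0.8, 0.9, 1.2)` the lowest-LR run is worst, so it returns `"increase"` instead.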
Looking at these three curves, their slopes do not change gradually: within the first 25 steps the slope suddenly jumps instead of easing off. Ideally, the slope's magnitude should shrink steadily as training goes on.
So for this model, the learning rate still needs further adjustment.

3. What final loss value is appropriate
The final loss is not "the lower the better": a value between 0 and 1 is good, and between 0.5 and 1 is best. If it gets close to 0, the model has probably overfit (if it overfits, you need to retrain and adjust the training and test sets); if it stays above 1, the result is unlikely to be ideal.
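These thresholds can be sketched as a simple check, interpreting the 0.5-1 range as the final training loss as described above. The function name and messages are illustrative:

```python
# Sketch of the post's rule of thumb for judging a run by its final
# loss: near 0 suggests overfitting, above 1 suggests a poor fit,
# and 0.5 .. 1 is the target band. Thresholds restate the post's
# numbers; this helper is illustrative only.

def judge_final_loss(loss):
    """Classify a run by its final loss value."""
    if loss < 0.5:
        return "possible overfitting - revisit train/test split"
    if loss <= 1.0:
        return "good range"
    return "poor fit - result may not be ideal"

print(judge_final_loss(0.7))  # good range
```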

Origin blog.csdn.net/miaoxingjundada/article/details/130355146