Understanding shuffle

Before training a model, we usually shuffle the data, i.e., put it in random order. Why do we do this? What happens if we don't? And when should we shuffle versus not shuffle?

Q1: Why do we need to shuffle?

A1: Both machine learning and deep learning rest on the assumption that the data is independent and identically distributed (i.i.d.): samples should appear in random order rather than arranged by some rule. This is the root reason for shuffling. In practice, we therefore shuffle the data at the start of each epoch.
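A minimal sketch of per-epoch shuffling, using NumPy (the array names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = np.arange(10).reshape(10, 1)  # toy features
y = np.arange(10)                 # toy labels, aligned with X

for epoch in range(3):
    # Draw a fresh permutation at the start of every epoch,
    # and apply the SAME permutation to features and labels.
    perm = rng.permutation(len(X))
    X_shuf, y_shuf = X[perm], y[perm]
    # ... iterate over mini-batches of X_shuf / y_shuf here ...
```

With PyTorch, `DataLoader(dataset, shuffle=True)` handles this automatically, re-shuffling at each epoch.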
 

Q2: What happens if we don't shuffle?

A2: Poor generalization ability.

  • ① The model may learn the ordering of the data itself rather than anything useful, so it generalizes poorly.
  • ② If the data is sorted, for example by class, the model overfits one class for a while, then another. On one hand, this makes the training loss oscillate periodically; on the other, at the end of training the model is always overfit to the most recently seen data, again hurting generalization.

For example, in formula recognition (converting an image of a formula to LaTeX), suppose the data is sorted by image aspect ratio, so the first and last images differ greatly. Training without shuffling then produces a periodic loss curve: at the start of each epoch the loss jumps up sharply, then gradually decreases, only to jump again when the next epoch begins, over and over.
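The "overfit whatever came last" effect can be reproduced in miniature. The sketch below (illustrative, not from the original post) runs plain SGD on a one-parameter mean-estimation problem: with class-sorted data the estimate gets dragged toward whichever class came last, while shuffled data keeps it near the overall mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two "classes": values around -1 and +1; the true overall mean is ~0.
data_sorted = np.concatenate([rng.normal(-1, 0.1, 500),
                              rng.normal(+1, 0.1, 500)])
data_shuffled = rng.permutation(data_sorted)

def sgd_mean(stream, lr=0.05):
    """Estimate the mean by SGD on squared error; the result depends on order."""
    w = 0.0
    for x in stream:
        w -= lr * (w - x)  # gradient of 0.5 * (w - x)**2
    return w

w_sorted = sgd_mean(data_sorted)      # dragged toward the last class (+1)
w_shuffled = sgd_mean(data_shuffled)  # stays near the overall mean (~0)
```

The update is an exponential moving average of recent samples, which is why the final estimate always leans toward the most recently seen data, exactly as described above.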
 

Q3: When should we shuffle, and when should we not?

A3: When training with an optimizer such as SGD, the model inevitably ends up performing better on the kind of data it learned most recently.

Therefore, ① if we want the model to generalize better, we should shuffle the data, so that the last samples the model sees are, to some extent, representative of the whole dataset. This is the usual case: normally we shuffle.

② If we want the model to learn a particular ordering, or to perform better on a particular part of the data, then we should arrange the data to suit that purpose, shuffling only locally or not at all. For example, with time-series data we predict the future from the past, and we want recent data to receive more of the model's attention. One approach is to place the recent data at the end of training, so that the samples seen most recently are the most recent data; the model then naturally emphasizes them when predicting the future.
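As a sketch of the time-series case (array sizes and the windowing scheme are illustrative assumptions), the data is split chronologically and windowed without shuffling, so training order follows time order and the most recent windows are seen last:

```python
import numpy as np

# Hypothetical daily series; for time series we keep chronological order.
series = np.sin(np.linspace(0, 20, 200)) \
    + np.random.default_rng(1).normal(0, 0.1, 200)

# Chronological split: no shuffling across the train/test boundary.
train, test = series[:160], series[160:]

# Build (window -> next value) supervised pairs, preserving order so
# the most recent windows come last during training.
window = 5
X = np.stack([train[i:i + window] for i in range(len(train) - window)])
y = train[window:]
```

If some randomness is still wanted, one can shuffle only within short local blocks, keeping the coarse chronological order intact.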

Therefore, whether or not to shuffle needs to be analyzed on a case-by-case basis.

Reference: https://blog.csdn.net/G_B_L/article/details/109600536


Origin: blog.csdn.net/weixin_43135178/article/details/114884377