The two most important parameters affecting model convergence: batch size and learning rate

From the perspective of optimization itself, batch size and learning rate (together with the learning-rate decay strategy) are the parameters in deep learning training that most strongly affect how well a model converges.
The learning rate directly affects the convergence state of the model, while the batch size mainly affects its generalization performance. The two are directly related, like numerator and denominator, and they also influence each other.
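To make the roles of the two hyperparameters explicit, here is the textbook minibatch SGD update (standard notation, not taken from the original experiments):

```latex
% Minibatch SGD: \eta is the learning rate, B the minibatch, |B| the batch size,
% \ell_i the loss on sample i.
\theta_{t+1} = \theta_t - \frac{\eta}{|B|} \sum_{i \in B} \nabla_{\theta}\, \ell_i(\theta_t)
```

The learning rate η sets the step length, while the batch size |B| controls how noisy the averaged gradient is; this is the sense in which the two act like a numerator and a denominator.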


1 The effect of batch size on training results (same number of epochs)

Here we use the GHIM-20 dataset: 20 categories with 500 images each, 10,000 images in total (9,000 for training, 1,000 for validation).
Training runs for 100 epochs in total (i.e., the 9,000 training images are traversed 100 times), and accuracy is evaluated periodically on the 1,000-image val_list.
AlexNet is used here, with train_batchsize = 32 and train_batchsize = 64 respectively.
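As a quick sanity check on the iteration counts quoted in the runs below, here is a small illustrative Python snippet; the numbers come directly from the setup above (9,000 training images, 100 epochs):

```python
# Iterations per epoch and in total for the two training batch sizes.
train_images, epochs = 9000, 100
for batch_size in (32, 64):
    iters_per_epoch = train_images / batch_size
    total_iters = epochs * iters_per_epoch
    print(f"batch_size={batch_size}: {iters_per_epoch:.0f} iters/epoch, "
          f"{total_iters:.0f} iterations in total")
# batch_size=32: 281 iters/epoch, 28125 iterations in total  (~28k, run 1)
# batch_size=64: 141 iters/epoch, 14062 iterations in total  (~14k, run 2)
```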

Comparative Results

1. AlexNet, 2080s, train_batchsize=32, val_batchsize=64, lr=0.01, GHIMyousan

  1. Training on the 2080s takes about 4.68 h, nearly 28k iterations (= 100 epochs x 9000 / 32); each training batch records a loss (dark blue curve).
  2. Each validation batch records a val-loss (light blue curve).
    Question: has the loss converged?
    The val-loss stays above the train-loss, which points to overfitting.
    Some answers found online:
    1. If in theory it should not converge, there is a problem with the network design itself, and this is the first thing to rule out: check whether gradients actually exist, i.e., whether backpropagation is broken somewhere (see the gradient-check sketch after this list);
    2. If in theory it should converge, possible causes are:
    1. The learning rate is set unreasonably (the most common case): too large a learning rate prevents convergence, too small a learning rate makes convergence very slow;

    2. The batch size is too large, so optimization falls into a poor local optimum instead of continuing toward a better one, and the loss stops improving;

    3. Insufficient network capacity: a shallow network asked to perform a complex task may simply stop reducing its loss. In general, the more layers and nodes a network has, the stronger its fitting ability; if there are too few, the network cannot fit complex data and will not converge.
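A minimal sketch of the gradient check mentioned in point 1 above, assuming a PyTorch model; `model`, `criterion` and the sample batch are placeholders, not the original training code:

```python
import torch

def check_gradients(model, criterion, x, y):
    """Run one backward pass and report, per parameter, whether a gradient exists
    and whether it is all-zero or NaN (signs that backpropagation is broken)."""
    model.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is None:
            print(f"{name}: NO gradient -- backprop may be cut here")
            continue
        g = p.grad.abs().mean().item()
        flag = "  <-- zero or NaN gradient" if (g == 0.0 or g != g) else ""
        print(f"{name}: mean |grad| = {g:.3e}{flag}")
```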
      In this run the learning-rate step decay is not too aggressive and batchsize = 32 is not large, so the model does appear to have converged (after all, a final loss of 0.0008 is quite small).
      [Figures: training/validation loss and accuracy curves for train_batchsize = 32]
      Judging from the accuracy curve, the model is basically fitted after about 15k iterations (i.e., epoch = 15000 x 32 / 9000 ≈ 53). Interestingly, this is exactly when the learning rate is stepped down from 0.01 to 0.001.
      Question: why does the validation accuracy stall at about 84%?
      The model performs well on the training set but noticeably worse on the validation set, i.e., it overfits; the learned features are not general enough.
      Which layers are learning features that fail to generalize?
      Dealing with overfitting:
      1. Causes:

      • Overfitting arises when the size of the training set does not match the complexity of the model;
      • in particular, the training set is too small relative to the model's capacity;
      • the feature distributions of the training set and test set are inconsistent;
      • there is noise in the training samples; ...

      2. Solutions
      (a simpler model structure, data augmentation, regularization, dropout, early stopping, ensembling, re-cleaning the data; a minimal sketch of several of these follows)
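A minimal PyTorch-style sketch of several of the measures above (data augmentation, weight decay as regularization, dropout via AlexNet's own classifier head, early stopping). The dataset paths, the 20-class setting and all hyperparameters are illustrative placeholders, not the exact configuration used in these experiments:

```python
import torch
import torch.nn as nn
from torchvision import datasets, transforms, models

# Data augmentation: random crops and flips enlarge the effective training set.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
val_tf = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor()])

train_set = datasets.ImageFolder("GHIM/train", transform=train_tf)  # hypothetical paths
val_set = datasets.ImageFolder("GHIM/val", transform=val_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=64)

# AlexNet already uses dropout in its classifier; weight_decay adds L2 regularization.
model = models.alexnet(num_classes=20)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()

@torch.no_grad()
def evaluate(net, loader):
    net.eval()
    correct = total = 0
    for x, y in loader:
        correct += (net(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

best_acc, patience, bad_epochs = 0.0, 10, 0
for epoch in range(100):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    acc = evaluate(model, val_loader)
    # Early stopping: quit once validation accuracy has not improved for `patience` epochs.
    if acc > best_acc:
        best_acc, bad_epochs = acc, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```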

2. AlexNet, 2080, train_batchsize=64, val_batchsize=64, lr=0.01, GHIM-me, 14k iterations

With train_batchsize = 64 on the 2080, the model is basically fitted by about 4.5k iterations (4500 x 64 / 9000 = 32 epochs); both the train and val curves look better, and the fit is good.
[Figures: training/validation loss and accuracy curves for train_batchsize = 64]

3. AlexNet, 2080s, train_batchsize=64, val_batchsize=64, lr=0.02, GHIM-yousan

On the 2080s: is this non-convergence, or overfitting? It is not clear.
[Figures: training/validation curves for lr = 0.02]

4 SqueezeNet, 2080s, train_bs=64, val_bs=64, lr=0.01, GHIMyousan

[Figures: training/validation curves for SqueezeNet]

4- Repeatability experiment: SqueezeNet, 2080s, train_bs=64, val_bs=64, lr=0.01, GHIM-me; the result still overfits, about 87% accuracy

[Figures: training/validation curves for the SqueezeNet repeat run]

5 MobileNetV1, 2080s, t-bs=64, v-bs=64, lr=0.01, GHIM-me: overfitting

[Figures: training/validation curves for MobileNetV1]

5 MobileNetV2, 2080s, t-bs=64, v-bs=64, lr=0.01, GHIM-me: overfitting

[Figures: training/validation curves for MobileNetV2]

6 MobileNetV1, 2080, t-bs=64, v-bs=64, lr=0.01, GHIM-me: fitted

[Figures: training/validation curves for MobileNetV1 on the 2080]

6 MobileNetV2, 2080, t-bs=64, v-bs=64, lr=0.01, GHIM-me: fitted

[Figures: training/validation curves for MobileNetV2 on the 2080, lr = 0.01]

Predecessors on batch size (reposted)

1 Within a certain range, the larger the Batch_Size, the more accurately it determines the descent direction, and the smaller the training oscillations.

[Figure: table comparing training with different Batch_Size values]

  • As Batch_Size increases, the number of epochs needed to reach the same accuracy grows (see the penultimate row of the table).
  • As Batch_Size increases, the same amount of data is processed faster.
  • Because these two factors pull in opposite directions, there is some Batch_Size at which the total time to reach a given accuracy is minimal.
  • Since the final convergence can land in different local minima, there is likewise some Batch_Size at which the final convergence accuracy is best.
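To illustrate point 1 (a larger batch gives a more accurate, less noisy estimate of the descent direction), here is a toy NumPy simulation on purely synthetic "per-sample gradients"; it is not tied to any of the experiments above:

```python
import numpy as np

# Fake per-sample gradient values with true mean 1.0 and lots of noise.
rng = np.random.default_rng(0)
per_sample_grads = rng.normal(loc=1.0, scale=5.0, size=100_000)

for batch_size in (8, 64, 512, 4096):
    usable = (len(per_sample_grads) // batch_size) * batch_size
    batch_means = per_sample_grads[:usable].reshape(-1, batch_size).mean(axis=1)
    print(f"batch_size={batch_size:5d}: std of the minibatch gradient = {batch_means.std():.3f}")
# The spread shrinks roughly like 1/sqrt(batch_size): larger batches point more
# consistently in the true descent direction, but also carry less of the noise
# that the later points say helps escape sharp minima.
```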

2 The performance drop seen with a large batchsize is because training is not run long enough; it is not inherently a batchsize problem. With the same number of epochs, fewer parameter updates are performed, so more iterations are needed.

[Figure: validation error vs. batch size]
The error rate rises once the batch size exceeds 8k.

3 A large batchsize tends to converge to sharp minima, while a small batchsize converges to flat minima, which generalize better.

[Figure: sharp vs. flat minima]
The difference between the two is how quickly the loss changes around the minimum, one sharp and one flat, as shown above. The main reason for this behaviour is that the gradient noise introduced by a small batchsize helps the optimizer escape sharp minima.

4 When the batchsize increases, the learning rate should generally increase with it

Usually, when we increase the batchsize to N times the original, the linear scaling rule says that to keep the weight update per sample equivalent, the learning rate should also be multiplied by N [5]. However, if the goal is to keep the variance of the weight gradient unchanged, the learning rate should instead be multiplied by sqrt(N) [7]. Both strategies have been studied, and the former is used more often.
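A minimal sketch of the two scaling rules just described; the base values are arbitrary and only for illustration:

```python
import math

base_lr, base_batch_size = 0.01, 64   # reference configuration (illustrative)
new_batch_size = 256
n = new_batch_size / base_batch_size  # N = 4

lr_linear = base_lr * n               # linear scaling rule [5]: 0.04
lr_sqrt = base_lr * math.sqrt(n)      # sqrt scaling keeps gradient variance [7]: 0.02
print(f"linear rule: {lr_linear}, sqrt rule: {lr_sqrt}")
```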

5 Increasing the batchsize is equivalent to decaying the learning rate

[Figure: increasing the batch size vs. decaying the learning rate]
In fact, the equivalence can be read off from the SGD weight-update formula, and the referenced article verifies it with extensive experiments.
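One way to see the equivalence from the SGD update itself (a paraphrase of the usual noise-scale argument, using the same notation as the update formula at the top of this post; N is the training-set size):

```latex
\theta_{t+1} = \theta_t - \frac{\eta}{|B|} \sum_{i \in B} \nabla_{\theta}\, \ell_i(\theta_t),
\qquad
g \;\approx\; \eta \,\frac{N}{|B|} \quad (|B| \ll N)
```

Here g is the scale of the gradient noise injected per update; halving η or doubling |B| reduces g by the same factor, which is the sense in which growing the batch size plays the role of learning-rate decay.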

Conclusions

1 If the learning rate is increased, the batch size should preferably be increased as well, so that convergence stays stable.

2 Prefer a relatively large learning rate, since many studies suggest a larger learning rate helps generalization. If the learning rate really has to be reduced, consider other means first, such as increasing the batch size. The learning rate has a large impact on convergence, so tune it carefully.

3 A drawback of using BN is that the batch size cannot be too small, otherwise the batch mean and variance estimates become too biased. In practice the batch size is therefore usually set as large as GPU memory allows. Also, when a model is actually used, the data distribution and preprocessing matter far more: if the data is poor, no amount of extra tricks will help.
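A quick toy check (synthetic data, not a real BN layer) of why a very small batch hurts BN: the per-batch mean and standard deviation that BN normalizes with become noisy, biased estimates:

```python
import numpy as np

# Fake activations with true mean 0 and true std 1.
rng = np.random.default_rng(0)
features = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

for batch_size in (2, 8, 64, 256):
    usable = (len(features) // batch_size) * batch_size
    batches = features[:usable].reshape(-1, batch_size)
    mean_spread = batches.mean(axis=1).std()    # how much batch means scatter around 0
    avg_batch_std = batches.std(axis=1).mean()  # how well each batch estimates std = 1
    print(f"batch_size={batch_size:4d}: spread of batch means = {mean_spread:.3f}, "
          f"average batch std = {avg_batch_std:.3f}")
# With batch_size=2 the statistics are noisy and the std is badly underestimated;
# by 64-256 they are close to the true values, consistent with the advice above.
```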

Reference reading
https://zhuanlan.zhihu.com/p/29247151
https://zhuanlan.zhihu.com/p/64864995

Origin blog.csdn.net/weixin_44523062/article/details/105457045