PaddleClas: Training Tips


1. Optimizer selection

The SGD optimizer with momentum has two disadvantages: it converges slowly, and choosing a good initial learning rate requires a lot of experience. However, if the initial learning rate is set appropriately and enough epochs are trained, it stands out among the many optimizers and achieves higher accuracy on the validation set.

Some adaptive learning rate optimizers, such as Adam, RMSProp, etc., tend to converge faster, but the final convergence accuracy will be slightly worse.

If you are pursuing faster convergence speed, we recommend using these adaptive learning rate optimizers. If you are pursuing higher convergence accuracy, we recommend using the SGD optimizer with momentum.
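As a minimal sketch of the two choices discussed above (assuming the Paddle 2.x API; the network and learning-rate values are only placeholders, not the settings used in the experiments below):

```python
import paddle

model = paddle.vision.models.resnet50()   # placeholder network

# SGD with momentum: slower to converge, usually the best final accuracy.
sgd = paddle.optimizer.Momentum(
    learning_rate=0.1,                    # typical initial LR for batch_size=256
    momentum=0.9,
    parameters=model.parameters())

# Adaptive optimizer: faster convergence, often slightly lower final accuracy.
adam = paddle.optimizer.Adam(
    learning_rate=0.001,
    parameters=model.parameters())
```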

2. Selection of learning rate and learning rate reduction strategy

The choice of learning rate is related to the optimizer, the data, and the task. Here we focus on how to choose the learning rate and the learning rate reduction strategy when training ImageNet-1k with SGD+momentum as the optimizer.

Learning rate reduction strategy:

In the early stage of training, the weights are in a randomly initialized state and the loss decreases easily under gradient descent, so a larger learning rate can be set.

In the later stage of training, the weights are already close to the optimum; a large learning rate keeps overshooting it, so a smaller learning rate is needed to make further progress.

Over the whole training process, many researchers use piecewise_decay, a stepwise reduction of the learning rate. For example, in the standard ResNet50 training we set the initial learning rate to 0.1 and drop it to 1/10 of its value every 30 epochs, training for 120 epochs in total. Besides piecewise_decay, other reduction schemes have been proposed, such as polynomial_decay (polynomial decrease), exponential_decay (exponential decrease), and cosine_decay (cosine decrease). Among them, cosine_decay needs no extra hyperparameters to tune and is relatively robust, so it has become the preferred learning rate reduction method for improving model accuracy.

The learning rate curves of cosine_decay and piecewise_decay are shown in the figure below. It is easy to see that cosine_decay keeps a relatively large learning rate throughout training, so it converges more slowly, but its final accuracy is better than that of piecewise_decay.

In addition, the figure shows that cosine_decay spends only a few epochs at a small learning rate, which affects the final accuracy. Therefore, to make cosine_decay perform well, it is recommended to train for more epochs, such as 200.
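Both schedules can be written down directly. The sketch below uses the standard formulas for stepwise and cosine decay in plain Python, with hyperparameter values matching the ResNet50 example above:

```python
import math

def piecewise_decay(epoch, base_lr=0.1, step=30, gamma=0.1):
    # Drop the learning rate to 1/10 of its value every `step` epochs.
    return base_lr * gamma ** (epoch // step)

def cosine_decay(epoch, base_lr=0.1, total_epochs=120):
    # Smoothly anneal the learning rate from base_lr down towards 0.
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))

for e in (0, 30, 60, 90, 119):
    print(e, piecewise_decay(e), round(cosine_decay(e), 5))
```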

Warmup strategy:

If you train the network with a large batch_size, we recommend using the warmup strategy. As the name suggests, warmup first "warms up" the learning rate: in the early stage of training we do not start directly at the maximum learning rate, but train with a gradually increasing one, and once it reaches its peak we decay it with one of the reduction methods described above. Experiments show that when batch_size is large, warmup can steadily improve the accuracy of the model. In experiments with a larger batch_size, such as training MobileNetV3, we set the warmup epochs to 5 by default, i.e. we first use 5 epochs to raise the learning rate from 0 to its maximum value and then apply the corresponding learning rate decay.
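A minimal, self-contained sketch of this combination in plain Python (a linear ramp over the first 5 epochs followed by the cosine schedule from the previous snippet; the values are the defaults mentioned above):

```python
import math

def lr_with_warmup(epoch, base_lr=0.1, warmup_epochs=5, total_epochs=120):
    # First `warmup_epochs` epochs: linearly ramp the LR from 0 up to base_lr.
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    # Afterwards: cosine decay over the remaining epochs.
    t = epoch - warmup_epochs
    T = total_epochs - warmup_epochs
    return 0.5 * base_lr * (1 + math.cos(math.pi * t / T))
```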

3. Selection of batch_size

Batch_size is an important hyperparameter when training neural networks: it determines how many samples are fed to the network in each step. In [1], the authors found through experiments that when the learning rate is scaled linearly with the batch_size, the convergence accuracy is almost unaffected. When training on ImageNet, most networks use an initial learning rate of 0.1 with a batch_size of 256. Therefore, depending on the model size and available GPU memory, you can set the batch_size to 256*k and the learning rate to 0.1*k.
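A small worked example of this linear scaling rule, using the 0.1/256 baseline above:

```python
def scaled_lr(batch_size, base_batch_size=256, base_lr=0.1):
    # Linear scaling rule from [1]: keep learning_rate / batch_size constant.
    return base_lr * batch_size / base_batch_size

print(scaled_lr(256))    # 0.1   (k = 1)
print(scaled_lr(1024))   # 0.4   (k = 4)
print(scaled_lr(64))     # 0.025 (k = 1/4)
```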

4. Selection of weight_decay

Overfitting is a common problem in machine learning: the model performs well on the training data but poorly on the test data, and convolutional neural networks are no exception. Many regularization methods have been proposed to avoid overfitting, and weight_decay is one of the most widely used. Weight_decay is equivalent to adding an L2 penalty to the final loss function; L2 regularization pushes the network weights toward smaller values, and as the parameters move closer to 0 the generalization of the model improves. In the major deep learning frameworks, this value is the coefficient of the L2 regularization term; in the Paddle framework it is called l2_decay, so that name is used below.

The larger the coefficient, the stronger the regularization and the more the model tends toward underfitting. When training on ImageNet, most networks set this value to 1e-4. For some small networks, such as the MobileNet series, it is set between 1e-5 and 4e-5 to avoid underfitting. The choice also depends on the dataset: when the dataset is large, the network tends to underfit and the value can be reduced appropriately; when the dataset is small, the network tends to overfit and the value can be increased. The following table shows the accuracy of MobileNetV1_x0_25 with different l2_decay values on ImageNet-1k. Since MobileNetV1_x0_25 is a very small network, an l2_decay that is too large makes it underfit, so for this network 3e-5 is a better choice than 1e-4.

Model L2_decay Train acc1/acc5 Test acc1/acc5
MobileNetV1_x0_25 1e-4 43.79%/67.61% 50.41%/74.70%
MobileNetV1_x0_25 3e-5 47.38%/70.83% 51.45%/75.45%

In addition, the choice of this value also depends on whether other regularization is used during training. If the data preprocessing is more complex, i.e. the training task becomes harder, the value can be reduced accordingly. The following table shows the accuracy of ResNet50 on ImageNet-1k with different l2_decay values when the RandAugment preprocessing method is used. It is easy to see that when the task becomes harder, a smaller l2_decay helps improve the accuracy of the model.

Model L2_decay Train acc1/acc5 Test acc1/acc5
ResNet50 1e-4 75.13%/90.42% 77.65%/93.79%
ResNet50 7e-5 75.56%/90.55% 78.04%/93.74%

To sum up, l2_decay should be adjusted according to the task and the model. In general, a larger l2_decay is recommended for simple tasks or larger models, and a smaller l2_decay for complex tasks or smaller models.
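As a sketch of how this coefficient is wired into training (assuming the Paddle 2.x API; the network here is only a placeholder), l2_decay is passed to the optimizer as the L2 regularization coefficient:

```python
import paddle

model = paddle.vision.models.resnet50()   # placeholder network

optimizer = paddle.optimizer.Momentum(
    learning_rate=0.1,
    momentum=0.9,
    parameters=model.parameters(),
    # l2_decay: coefficient of the L2 penalty added to the loss.
    # ~1e-4 for most ImageNet networks, ~3e-5 for very small ones
    # such as MobileNetV1_x0_25.
    weight_decay=paddle.regularizer.L2Decay(1e-4))
```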

5. Selection of label_smoothing

Label_smoothing is a regularization method in deep learning; its full name is Label Smoothing Regularization (LSR). In a traditional classification task, the loss is the cross entropy between the one-hot ground-truth label and the network output. With label_smoothing, the one-hot label is smoothed so that the network no longer learns a hard label but a soft label with probability values: the position of the true class has the largest probability and every other position has a very small one. For the exact formulation, please refer to [2].

Label_smoothing has a parameter epsilon that controls how much the label is softened: the larger the value, the smaller the probability of the true class in the smoothed label vector and the smoother the label; the smaller the value, the closer the label stays to a hard label. This value is usually set to 0.1 when training ImageNet-1k. In our ImageNet-1k experiments, models of ResNet50 size and above gain a steady accuracy improvement from label_smoothing. The following table shows the accuracy of ResNet50_vd before and after using label_smoothing.

Model Use_label_smoothing Test acc1
ResNet50_vd 0 77.9%
ResNet50_vd 1 78.4%
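A minimal NumPy sketch of the label-softening step described above, with epsilon = 0.1 and 1000 classes as in the ImageNet-1k experiments:

```python
import numpy as np

def smooth_label(y, num_classes=1000, epsilon=0.1):
    # Turn the hard label y into a soft distribution: the true class keeps
    # most of the probability mass, every class receives epsilon / num_classes.
    one_hot = np.eye(num_classes)[y]
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

soft = smooth_label(3)
print(soft[3], soft[0])   # 0.9001 vs 0.0001
```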

At the same time, since label_smoothing acts as a regularizer, the accuracy improvement on relatively small models is not obvious and the accuracy may even decrease. The following table shows the accuracy of ResNet18 on ImageNet-1k before and after using label_smoothing; the accuracy clearly drops after label_smoothing is applied.

Model Use_label_smoothing Train acc1/acc5 Test acc1/acc5
ResNet18 0 69.81%/87.70% 70.98%/89.92%
ResNet18 1 68.00%/86.56% 70.81%/89.89%

In summary, label_smoothing can effectively improve the accuracy of larger models, while on smaller models it may reduce accuracy. Therefore, evaluate the model size and the difficulty of the task before deciding whether to use label_smoothing.

6. Adjust the image crop area and stretch transformation for small models

In the standard preprocessing of ImageNet-1k data, the random_crop function defines two values, scale and ratio, which determine the size of the crop and the degree to which the image is stretched. The default range of scale is 0.08-1 (lower_scale-upper_scale), and the default range of ratio is 3/4-4/3 (lower_ratio-upper_ratio). For very small networks, this kind of data augmentation can cause underfitting and reduce accuracy. To improve accuracy, the augmentation can be weakened, i.e. the crop area can be increased or the stretch transformation weakened, by raising lower_scale or by narrowing the gap between lower_ratio and upper_ratio. The following table lists the accuracy of MobileNetV2_x0_25 trained with different lower_scale values; after increasing the crop area, both the training accuracy and the validation accuracy improve.

Model Scale value range Train_acc1/acc5 Test_acc1/acc5
MobileNetV2_x0_25 [0.08,1] 50.36%/72.98% 52.35%/75.65%
MobileNetV2_x0_25 [0.2,1] 54.39%/77.08% 53.18%/76.14%
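A sketch of the two crop settings compared in the table, assuming Paddle's RandomResizedCrop transform (the 224 output size is the usual ImageNet choice):

```python
from paddle.vision import transforms

# Standard ImageNet-1k augmentation: scale in [0.08, 1], ratio in [3/4, 4/3].
standard_crop = transforms.RandomResizedCrop(
    224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3))

# Weaker augmentation for a very small network: raise lower_scale to 0.2 so
# every crop keeps at least 20% of the original image area.
weaker_crop = transforms.RandomResizedCrop(
    224, scale=(0.2, 1.0), ratio=(3 / 4, 4 / 3))
```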

7. Use data augmentation to improve accuracy

Generally speaking, the size of the dataset is crucial to performance, but image annotation is expensive, so labeled images are often scarce; in this case data augmentation is especially important. The standard augmentation for training ImageNet-1k mainly uses two methods, random_crop and random_flip, but in recent years more and more augmentation methods have been proposed, such as Cutout, Mixup, CutMix, and AutoAugment. Experiments show that these methods can effectively improve the accuracy of the model. The following table lists the performance of ResNet50 with 8 different augmentation methods; all of them outperform the baseline, and CutMix is currently the most effective one. For more information on data augmentation, please refer to the Data Augmentation chapter.

Model Data augmentation methods Test top-1
ResNet50 standard transformation 77.31%
ResNet50 Auto-Augment 77.95%
ResNet50 mixup 78.28%
ResNet50 Cutmix 78.39%
ResNet50 Cutout 78.01%
ResNet50 Gridmask 77.85%
ResNet50 Random-Augment 77.70%
ResNet50 Random-Erasing 77.91%
ResNet50 Hide-and-Seek 77.43%
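As an illustration of one of these methods, the following is a minimal NumPy sketch of mixup; alpha = 0.2 is a common choice and not necessarily the value used for the table above:

```python
import numpy as np

def mixup(images, labels, alpha=0.2):
    # Mix every sample with a randomly chosen partner from the same batch.
    # `labels` are expected to be one-hot (or already smoothed) vectors.
    lam = np.random.beta(alpha, alpha)
    perm = np.random.permutation(len(images))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_images, mixed_labels

# Example with a random batch of 32 images and 1000 classes.
x = np.random.rand(32, 3, 224, 224).astype("float32")
y = np.eye(1000)[np.random.randint(0, 1000, size=32)]
mx, my = mixup(x, y)
```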

8. Determine the tuning strategy through train_acc and test_acc

During training, the training-set accuracy and validation-set accuracy of each epoch are usually printed; together they describe how the model performs on the two datasets. In general, it is a good sign when the training accuracy is slightly higher than, or about equal to, the validation accuracy. If the training accuracy is much higher than the validation accuracy, the task is overfitting and more regularization should be added to training: increase l2_decay, add more data augmentation, add label_smoothing, and so on. If the training accuracy is lower than the validation accuracy, the task may be underfitting and the regularization should be weakened: reduce l2_decay, use fewer data augmentation methods, increase the image crop area, weaken the stretch transformation, remove label_smoothing, and so on.

9. Improve the accuracy of your own data set through existing pre-trained models

In computer vision today, loading a pre-trained model to train your own task has become common practice. Compared with training from random initialization, loading a pre-trained model usually improves the accuracy of the specific task. Generally, the pre-trained models widely used in industry are trained on the ImageNet-1k dataset of 1.28 million images in 1,000 categories. The fc-layer weight of such a pre-trained model is a k*1000 matrix, where k is the number of neurons feeding into the fc layer; when loading the pre-trained weights, the fc-layer weight does not need to be loaded. Regarding the learning rate: if the dataset for your task is very small (e.g. fewer than 1,000 images), we recommend a small initial learning rate, such as 0.001 (for batch_size 256, the same below), so that a large learning rate does not destroy the pre-trained weights. If your training dataset is relatively large (more than 100,000 images), we suggest trying a larger initial learning rate, such as 0.01 or more.
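A minimal sketch of loading a pre-trained model while skipping the fc weights (assuming the Paddle 2.x API; the checkpoint file name and the "fc." parameter prefix are assumptions and may differ for your model):

```python
import paddle

# Build the model with your own number of classes; the fc layer is re-initialized.
model = paddle.vision.models.resnet50(num_classes=10)

# "ResNet50_pretrained.pdparams" is a hypothetical local checkpoint file.
state_dict = paddle.load("ResNet50_pretrained.pdparams")

# Drop the ImageNet fc weights (the k*1000 matrix mentioned above) so they
# are not copied into the new fc layer.
state_dict = {k: v for k, v in state_dict.items() if not k.startswith("fc.")}
model.set_state_dict(state_dict)
```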

If you find this document helpful, please star the project: PaddlePaddle/PaddleClas (https://github.com/PaddlePaddle/PaddleClas).

References

[1] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.

[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. CoRR, abs/1512.00567, 2015.

 

Source: https://paddleclas.readthedocs.io/zh_CN/latest/models/Tricks.html
