2020 is almost here. Are you still struggling with deep learning tuning?

Preface

I previously answered a question about deep learning tuning and compiled some of my experience from January to December 2019. I didn't expect the answer to take off over the past few days; it grew by nearly 300 within two days, so it seems many people care about this topic. Then last night, Zhihu user akkaze-郑安坤 also published his tuning tips, which are likewise very practical and to the point. So, with his permission, I am reprinting his answer in this post together with a summary of the techniques from my own work this year. I think these methods deserve to be seen by more people and may offer some help and inspiration, hence this article. It has two parts: akkaze's tuning tips and BBuf's tuning tips, all written up before New Year's Day 2020.

akkaze tuning skills

Original URL: https://www.zhihu.com/question/25097993/answer/951697614

  1. Make the network as deep as you can. While keeping the width relatively small, find ways to deepen the network; as it gets deeper, let it gradually widen.
  2. Use small convolution kernels. Small kernels make it easier to go deep and give better recognition robustness.
  3. Downsample more densely in the first few layers of the network (so that we trade as little accuracy as possible for speed), and less densely later on. The maximum amount of downsampling is bounded by the receptive field of that layer and by the size of the largest meaningful object in the dataset: naturally, the largest meaningful object should not be downsampled below a resolution of 1. The network can still work if that happens, but the last few layers are effectively wasted (I trust a CNN's ability to learn, because at worst it can learn a unit convolution, i.e. a kernel whose only nonzero element is the center). More precisely, this is a limit on the maximum receptive field: the maximum receptive field should just cover the largest meaningful object.
  4. So the usual approach is to downsample more often in the first few layers, less often in the middle, and add depth with layers that do not downsample at all.
  5. In other words, resolution should drop faster at the front of the network, and the middle should be deepened.
  6. For shortcut connections, if concat is not an option, make do with add, and vice versa.
  7. Training a large model first and then pruning it may work better than training a small model directly.
  8. The last part of the network can use dilated (atrous) convolution without downsampling, which is especially useful for segmentation.
  9. Don't downsample at the very end; consider using ASPP to fuse receptive fields of different scales (a minimal ASPP-style sketch appears after this list).
  10. Transposed convolution can be replaced by upsampling + convolution.
  11. Replace convolutions with separable convolutions wherever possible; generally everything except the first convolution can be replaced. After replacing, consider doubling the number of channels, because the parameters and computation of a separable convolution increase linearly, so there is still a speed gain. The same principle applies to 2+1 (spatially split) convolutions. A minimal separable-convolution sketch is given after this list.
  12. Computation grows linearly, so the number of channels and the depth multiplier are both easy to control, and the cost of doing so is small.
  13. Inception, shortcut connections, and dense connections are all effectively model ensembles. If you are weighing the options, shortcut connections are almost painless to add.
  14. A larger receptive field is not always better (a large receptive field means more semantic information). For some low-level tasks such as image denoising, the useful receptive field may only be the size of a local patch; put differently, downsampling matters less there, so use as little of it as possible.
  15. For low-level tasks, the receptive field does not need to be very large; this applies to tasks such as denoising, demosaicing, image enhancement, and keypoints.
  16. For detection, you do not have to place anchors on exactly three layers. The aspect ratios and sizes within a layer can be chosen according to your requirements; for example, BlazeFace concentrates all anchors on a single layer. Design this to match your needs. BiFPN is genuinely useful.
  17. Batchnorm must be used. If you train with multiple machines and multiple GPUs, also consider synchronized batchnorm.
  18. For metric learning (image comparison), a larger batch size generally gives better results, because a wider range of positive and negative pairs can be sampled.
  19. If your one-stage detection model does not give good classification results, consider a two-stage approach: detect first, then classify.
  20. A pre-trained model is useful in detection; at the very least it improves the discrimination between foreground and background.
  21. If your classification accuracy suffers because two or more categories are too similar, consider a different softmax, such as AM-Softmax (a sketch of AM-Softmax logits is included after this list).
  22. Modified softmax variants like this can better enlarge the margin between classes.
  23. If your classification accuracy suffers because the samples are imbalanced, consider using focal loss.
  24. Don't look only at accuracy; consider other metrics too, and always treat visual inspection after deployment as the final criterion.
  25. For landmarks, use fully convolutional networks as much as possible rather than direct FC regression; direct regression is really unstable.
  26. Always augment with image brightness transformations for better brightness robustness, but adapt the augmentation to the brightness distribution of your data.
  27. Finally, without changing the network backbone, try more of the newer losses; the engineering cost is small.
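Below are a few small sketches for some of the tips above. They are my own illustrations, written in PyTorch simply as an assumed framework (the original answer names no framework), so treat them as sketches rather than reference code.

For tip 11, a depthwise-separable stand-in for a regular 3*3 convolution could look like this; the channel doubling mentioned in the tip is left to the caller:

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution.

    A rough stand-in for nn.Conv2d(in_ch, out_ch, 3, padding=1) with far
    fewer parameters and FLOPs.
    """
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Quick shape check: a 64 -> 128 channel block on a 56x56 feature map.
x = torch.randn(1, 64, 56, 56)
print(SeparableConv2d(64, 128)(x).shape)  # torch.Size([1, 128, 56, 56])
```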
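For tips 8 and 9, a much-simplified ASPP-style module that fuses several dilation rates without any downsampling (the dilation rates here are placeholders of my own, not values from the answer):

```python
import torch
import torch.nn as nn

class MiniASPP(nn.Module):
    """Fuse several dilated 3x3 convolutions (different receptive fields)
    at full resolution, then merge them with a 1x1 convolution."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r, bias=False)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 256, 32, 32)
print(MiniASPP(256, 128)(x).shape)  # torch.Size([1, 128, 32, 32])
```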
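For tips 21 and 22, the core of AM-Softmax is a small change to the logits: normalize the features and the class weights, then subtract a margin m from the true-class cosine before scaling by s. The s and m values below are the usual defaults from the AM-Softmax paper, not numbers from this answer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxHead(nn.Module):
    """Produces s * (cos(theta_y) - m) logits for the true class and
    s * cos(theta_j) for the others; feed the result to CrossEntropyLoss."""
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, feats, labels):
        cos = F.linear(F.normalize(feats), F.normalize(self.weight))  # [N, C]
        margin = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), self.m)
        return self.s * (cos - margin)

feats, labels = torch.randn(8, 128), torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(AMSoftmaxHead(128, 10)(feats, labels), labels)
```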

BBuf tuning skills

Let me first say that I am just an engineer who has been in the industry for barely a year. My depth and breadth are naturally not as strong as akkaze's, so errors or gaps in the descriptions are inevitable. Please contact me if you find any problems.

For engineering work

  1. 3*3 convolution is the workhorse component of CNNs. Networks designed for classification and regression tasks basically set all their kernels to 3*3. For the reason, look at VGG16: stacking two 3*3 kernels gives the same receptive field as a single 5*5 kernel with fewer parameters, which is why it is so widely used and recommended.
  2. 1*3 and 3*1 convolutions can be used where appropriate. Why mention this? Because such convolutions reduce the amount of computation, and they emphasize the receptive field along one direction. That is, if you want to classify a rectangular object, you can use a pair of 1*3 and 3*1 kernels to give the long-side direction a larger receptive field, which may improve generalization.
  3. ACNet structure. This work is from ICCV 2019. On top of a 3*3 convolution you can add 1*3 and 3*1 bypass convolution kernels; in the inference stage, all three kernels are fused back into a single 3*3 kernel. This gives roughly a 1-point improvement on many classic CV tasks. You can read the write-up of this paper from the AI科技评论 (AI Technology Review) public account. 3*3 convolution + 1*3 convolution + 3*1 convolution = a free accuracy boost (a rough fusion sketch is given after this list).
  4. Convolution kernel weight initialization. For weights I generally use Xavier initialization; you can also try Kaiming He's He initialization. Biases are all initialized to 0. (A minimal initialization sketch follows this list.)
  5. Batch Normalization. This is a technique I use all the time; it greatly speeds up convergence. When building your own network, add BN wherever you can; if you already have BN, there is no need to add Dropout on the fully connected layers.
  6. Don't blindly remove the FPN structure in object detection. When adapting YOLOv3 to your own detection task, don't blindly cut down the FPN branches. Even if your analysis says the anchors of some branch are basically unlikely to match your targets, removing that branch outright is likely to cause missed detections.
  7. Choice of optimizer. I basically use SGD with momentum. If optimization stalls, you can try Adam (see the sketch after this list, together with tip 10).
  8. Activation function. Use ReLU for your first version; if you want to squeeze out a bit more accuracy, try replacing ReLU with PReLU. I prefer to just use ReLU.
  9. batch_size: the impact of batch_size differs across task types. You can read the article on this from the AI开发者 (AI Developer) public account: "How does Batch_size affect model performance".
  10. Initial learning rate. I generally start from 0.01. Personally I think the learning rate has to be considered together with the decay strategy; it should not be set too large or too small, and 0.01 and 0.1 are the most common choices. For decay I generally use multistep; the step_size depends on your max_iter. (A sketch combining tips 7 and 10 follows this list.)
  11. Data preprocessing: zero-centering. I first saw this term in the cs231n videos. There are two main steps: first subtract the mean, then divide by the standard deviation, so that the final input roughly falls in [-1,1]. Subtracting the mean is the most commonly used part; whether dividing by the standard deviation helps is something you may need to test yourself.
  12. Residual structures and dense connections, i.e. ResNet's residual structure and DenseNet's dense connections. When doing engineering, speed constraints mean you will almost never use the complete ResNet or DenseNet architecture as-is, but you can replace some modules of your own network with residual structures or dense connections. When doing so, the complexity of these structures can be reduced appropriately, for example halving the number of channels or keeping only half of the dense connections. Some ablation experiments are needed here to verify the accuracy improvement.
  13. About loss. A good loss generally improves a model's generalization, but using a new loss is rarely as simple as dropping it in as a direct replacement; you need to think carefully about the mathematics behind it and apply it in the right place for it to help. For example, on how to use Focal Loss in YOLOv3 to improve mAP, you can read this post: https://www.zhihu.com/question/293369755. (A minimal focal loss sketch follows this list.)
  14. Find a reliable evaluation metric for tuning. When tuning and training a model, you must pick the correct evaluation metric, and every time you change a parameter you should record the model's metrics such as accuracy, mAP value, mIoU value, and so on. When tuning, it is also recommended to combine the changed parameters and the test-set accuracy into a string to rename the model, which makes later review quick and convenient.
  15. Use a backbone with pre-trained weights, for example VGG16-SSD; fine-tuning is the recommended way to train. Training from scratch is not only time-consuming and laborious, it is also hard to get to converge.
  16. In segmentation experiments I found that upsampling followed by a 3*3 convolution gives smoother results than deconvolution, and the mIoU difference is small, so I think both are usable. (A side-by-side sketch follows this list.)
  17. To improve accuracy, some anchor-based detection algorithms assign boxes very aggressively. The AP does go up, but this also produces a lot of false positives (FP); these FPs are not regressed away and may not be filtered out at the NMS stage. Compared with raising AP, reducing FP is also very important in engineering. Gaussian YOLOv3 reduces FP by about 40% compared with YOLOv3. I have not worked much with anchor-free algorithms, so I don't understand them well.
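As with the previous list, here are a few sketches for some of the tips above, again using PyTorch as an assumed framework. For tip 3, ACNet's inference-time fusion works because convolution is linear: the 1*3 and 3*1 kernels can be zero-padded to 3*3 and summed into the main kernel. A rough sketch, ignoring the BN folding that the actual ACNet code also performs:

```python
import torch
import torch.nn.functional as F

def fuse_acnet_kernels(k3x3, k1x3, k3x1, b3x3, b1x3, b3x1):
    """Sum a 3x3, a 1x3 and a 3x1 kernel into a single 3x3 kernel.

    Shapes: k3x3 [out, in, 3, 3], k1x3 [out, in, 1, 3], k3x1 [out, in, 3, 1].
    F.pad order is (left, right, top, bottom) for the last two dims.
    """
    fused = k3x3 + F.pad(k1x3, (0, 0, 1, 1)) + F.pad(k3x1, (1, 1, 0, 0))
    return fused, b3x3 + b1x3 + b3x1

# Check: the fused kernel reproduces the sum of the three branches.
x = torch.randn(1, 4, 8, 8)
k3, k13, k31 = torch.randn(6, 4, 3, 3), torch.randn(6, 4, 1, 3), torch.randn(6, 4, 3, 1)
b3, b13, b31 = torch.randn(6), torch.randn(6), torch.randn(6)
branches = (F.conv2d(x, k3, b3, padding=1)
            + F.conv2d(x, k13, b13, padding=(0, 1))
            + F.conv2d(x, k31, b31, padding=(1, 0)))
fk, fb = fuse_acnet_kernels(k3, k13, k31, b3, b13, b31)
print(torch.allclose(branches, F.conv2d(x, fk, fb, padding=1), atol=1e-5))  # True
```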
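For tip 4, weight initialization with zeroed biases; torch.nn.init provides both the Xavier and He (Kaiming) variants:

```python
import torch.nn as nn

def init_weights(module, scheme="xavier"):
    """Xavier (or He/Kaiming) init for conv and linear weights, biases set to 0."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        if scheme == "xavier":
            nn.init.xavier_uniform_(module.weight)
        else:  # He initialization, usually paired with ReLU
            nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 32, 3, padding=1))
model.apply(init_weights)  # .apply() visits every submodule recursively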
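For tips 7 and 10 together: SGD with momentum, an initial learning rate of 0.01, and multistep decay whose milestones are set relative to max_iter. The 60%/80% milestones below are placeholders of my own, not values from the post:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3, padding=1)   # stand-in for the real network
max_iter = 90000                          # placeholder; set from your own schedule

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(0.6 * max_iter), int(0.8 * max_iter)], gamma=0.1)

for it in range(max_iter):
    # data loading, forward pass and loss.backward() go here in real training
    optimizer.step()
    scheduler.step()
```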
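For tip 13, a minimal binary focal loss on raw logits; alpha and gamma are the defaults from the focal loss paper, not values tuned for YOLOv3:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss for binary targets in {0, 1}; down-weights easy examples."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)            # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(16)
targets = torch.randint(0, 2, (16,)).float()
print(binary_focal_loss(logits, targets))
```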
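For tip 16, the two upsampling options side by side; both double the spatial resolution and map 128 channels to 64:

```python
import torch
import torch.nn as nn

in_ch, out_ch = 128, 64

# Option A: transposed convolution (deconvolution), 2x upsampling.
deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)

# Option B: bilinear upsampling followed by a 3x3 convolution, which the
# post found to give smoother results at a similar mIoU.
up_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
)

x = torch.randn(1, in_ch, 32, 32)
print(deconv(x).shape, up_conv(x).shape)  # both torch.Size([1, 64, 64, 64])
```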

For competitions

  1. Feature extraction. VGG16, VGG19, ResNet50, and Xception are all very useful feature extraction models. It is recommended to use a pre-trained classic model to extract feature vectors from the dataset and store them locally; this is more convenient to work with and greatly reduces memory consumption.
  2. Ensemble:
    1. You can train different models with different hyperparameters (such as learning rate, batch_size, optimizer) and then ensemble them.
    2. You can train models with different initialization methods and then ensemble them.
    3. You can combine feature vectors extracted by different classic networks. Suppose the feature vector extracted by VGG16 has dimension [N, c1], the one extracted by ResNet50 has dimension [N, c2], and the one extracted by Xception has dimension [N, c3]. We can then use three coefficients a, b, c to combine them into a feature of shape [N, a*c1+b*c2+c*c3], where the values of a, b, c indicate how much we rely on each model's features. If it is a classification or regression competition, we can attach a feature-processing network after this. By taking different values of a, b, c we get different features, and then do voting or soft voting on the resulting predictions; the results are generally not too bad. (A rough sketch of this pipeline follows.)
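A rough sketch of the feature-extraction-and-combination idea above, using torchvision backbones as an assumed stand-in (the post does not fix a framework). Here the coefficients a and b are interpreted simply as scaling factors on the two feature blocks before concatenation, and soft voting averages the probabilities of several heads:

```python
import torch
import torch.nn as nn
from torchvision import models

def extract_features(backbone, images):
    """Global-average-pooled features from a frozen pre-trained backbone."""
    backbone.eval()
    with torch.no_grad():
        return backbone(images).flatten(1)

# Pre-trained backbones used purely as fixed feature extractors
# (downloads ImageNet weights on first use).
vgg = nn.Sequential(models.vgg16(weights=models.VGG16_Weights.DEFAULT).features,
                    nn.AdaptiveAvgPool2d(1))
resnet = nn.Sequential(
    *list(models.resnet50(weights=models.ResNet50_Weights.DEFAULT).children())[:-1])

images = torch.randn(4, 3, 224, 224)
f1 = extract_features(vgg, images)      # [4, 512]  -> c1 = 512
f2 = extract_features(resnet, images)   # [4, 2048] -> c2 = 2048

# Weight each model's features before concatenation; a and b play the role
# of the mixing coefficients described above.
a, b = 1.0, 0.5
combined = torch.cat([a * f1, b * f2], dim=1)          # [4, 2560]

# Soft voting: average the class probabilities of several heads
# (in practice each head would be trained on the stored features).
heads = [nn.Sequential(nn.Linear(combined.shape[1], 10), nn.Softmax(dim=1))
         for _ in range(3)]
probs = torch.stack([h(combined) for h in heads]).mean(dim=0)   # [4, 10]
print(combined.shape, probs.shape)
```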

Original source: blog.csdn.net/qq_40716944/article/details/103731419