Deep learning tuning experience

Article source | JiLian AI Cloud (the most cost-effective shared computing platform to help your skills grow; first-time registration gets 100 hours of free GPU time! Official website: JiLian AI Cloud Platform home page)

Author | Chaser, winner of a 1,000-yuan cloud-coin voucher in the JiLian AI Cloud Technology Originals Reward Program

Original address | Tuning experience (official website forum); reproduced with authorization

Experience with verifying experimental results

When should you add augmentation, when should you add multi-scale training, and how should you set the learning rate?

Data augmentation

My personal experience: once the verification phase is over and you start running ablation studies, you must pay attention to whether your data setup is reasonable. Don't design experiments blindly; that wastes time, many runs will be for nothing, and you will have to start over later.

The first is augmentation. There are two situations. Start with a few small experiments to see how the proposed method's performance changes with and without data augmentation, and then pick the setting that looks better. For example, I took the approach of adding all the tricks at the end to squeeze out the final gains, and the earlier comparison experiments were not convincing on their own (the method worked, but the gains fluctuated). After adding data augmentation to enrich the data, the model's performance went up. The baseline improves too, of course, which tests whether your work is solid enough. If your framework is weak at handling small or homogeneous data, it is better to add augmentation.
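As a minimal sketch of such a with/without comparison (the dataset, transform choices, and training hook below are illustrative placeholders, not from the original post):

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

# Two pipelines: a plain baseline and an augmented variant.
base_tf = T.Compose([T.ToTensor()])
aug_tf = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomCrop(32, padding=4),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
])

def make_loader(use_aug):
    tf = aug_tf if use_aug else base_tf
    ds = CIFAR10(root="data", train=True, download=True, transform=tf)
    return DataLoader(ds, batch_size=128, shuffle=True, num_workers=4)

# Train both the baseline and the proposed method under each setting,
# then compare the four results before trusting the ablation.
for use_aug in (False, True):
    loader = make_loader(use_aug)
    # train_and_eval(model, loader)  # hypothetical training routine
```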

Multi-scale training

Multi-scale training gives a stable gain, but to show the model's real effect rather than muddy the waters, it is best added at the end: in the early stages augmentation is enough, and going all-in from the start is unwise. This also keeps you from inflating the baseline so much that your idea's increment becomes embarrassingly small.
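A minimal sketch of what multi-scale training can look like (YOLO-style random input resizing per batch; the scale list and stride are illustrative assumptions, not from the original post):

```python
import random
import torch
import torch.nn.functional as F

# Candidate input sizes, all multiples of the network stride (32 here).
SCALES = [320, 352, 384, 416, 448, 480, 512]

def random_rescale(images):
    """Resize a batch (N, C, H, W) to a randomly chosen square size."""
    size = random.choice(SCALES)
    return F.interpolate(images, size=(size, size),
                         mode="bilinear", align_corners=False)

# Inside the training loop (for detection, box targets must be
# rescaled by the same factor):
# for images, targets in loader:
#     images = random_rescale(images)
#     loss = model(images, targets)
```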

Learning rate

This part is very simple. Unless you are in a neck-and-neck fight with the state of the art (SOTA), using Adam without overthinking is enough: 0.001 or 0.0001, plus a step decay and warmup, and you are done. Genuinely solid work does not really hinge on this, although the basic rules still apply. (Datasets whose leaderboards have already been squeezed dry are another matter; when model performance cannot be pushed any further, the learning rate does become rather mystical.)
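A minimal sketch of that recipe (Adam at 1e-3 with a step decay plus a short linear warmup; the milestones and warmup length are illustrative, not from the original post):

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for your network
base_lr = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

# Step decay: divide the lr by 10 at the chosen epochs.
step_sched = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60], gamma=0.1)

WARMUP_ITERS = 500  # linear warmup over the first iterations

def warmup_lr(optimizer, it):
    if it < WARMUP_ITERS:
        for group in optimizer.param_groups:
            group["lr"] = base_lr * (it + 1) / WARMUP_ITERS

# In the loop: call warmup_lr(optimizer, it) every iteration, and
# step_sched.step() once per epoch after warmup has finished.
```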

Optimizer and learning rate optimization

· Adam converges faster and can be used to verify ideas and parameters quickly, but a carefully tuned SGD often ends up better.

· Cyclic and cosine learning-rate schedules: refer to gradual warmup + cosine (a sketch appears below): address.

· If the batch size is scaled up by a factor of k, the lr must be scaled by k accordingly (the linear scaling rule).

· Lr selection: the right lr is judged by its effect. First run a sweep with a gradually increasing lr to find the point where the loss changes abruptly; an lr around that point is suitable. Then set the schedule using it as the benchmark.

In principle, as long as the loss has no bugs, the lower the loss, the better the result. So use the largest learning rate at which the loss still decreases effectively, and avoid excessive oscillation. (Local minima are an exception; you can consider SGD with restarts, but cosine lr is generally enough.) Rule of thumb: use cosine lr when you want a lazy, near-optimal search; for day-to-day training, a stage-wise fixed learning rate is still the direct choice.
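For the gradual warmup + cosine schedule referenced in the list above, here is a minimal sketch using LambdaLR (the warmup length and epoch count are illustrative assumptions):

```python
import math
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

WARMUP_EPOCHS, TOTAL_EPOCHS = 5, 100

def warmup_cosine(epoch):
    """Multiplier on the base lr: linear ramp up, then cosine decay to 0."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
# Call scheduler.step() once per epoch, after optimizer.step().
```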

⊙ Loss function optimization

· Use functions such as exp and atan to make the optimization landscape smooth and wide, giving better and faster convergence.

· Pay attention to balancing the scales between and within loss terms: for example, the numerical scales of the predicted wh and dxdy are inconsistent (an IoU-style loss can compensate for this), the cls term suffers from class imbalance, and so on.
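A minimal sketch of balancing loss terms (the weights, the IoU-style box loss, and the tensor shapes are illustrative assumptions, not the original author's exact loss):

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def detection_loss(pred_box, gt_box, pred_cls, gt_cls, w_box=5.0, w_cls=1.0):
    # An IoU-style loss puts wh and dxdy on one comparable scale instead
    # of regressing them as separate terms of different magnitude.
    iou = box_iou(pred_box, gt_box).diagonal()  # matched pairs on the diagonal
    box_loss = (1.0 - iou).mean()
    # Class imbalance can be countered with per-class weights or a focal loss.
    cls_loss = F.cross_entropy(pred_cls, gt_cls)
    # Per-term weights keep any single term from dominating the total.
    return w_box * box_loss + w_cls * cls_loss

loss = detection_loss(torch.tensor([[0., 0., 10., 10.]]),
                      torch.tensor([[1., 1., 9., 9.]]),
                      torch.randn(1, 3), torch.tensor([0]))
```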

⊙ Batch size

Using a larger bs gives a better estimate of the optimal descent direction and helps avoid falling into a local minimum, though a small bs is sometimes more accurate; gradient accumulation lets a single card emulate a larger bs, but the BN statistics will then be estimated inaccurately. Note that when verifying the algorithm by overfitting, accumulate × bs should not exceed the total number of samples! Otherwise convergence is very slow. In the overfitting-verification stage, bs = 1 converges faster, while a larger bs converges more slowly. Extreme case: once the model has converged fairly smoothly, set bs = 1 to slowly squeeze out further performance.
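A minimal sketch of gradient accumulation (emulating an effective batch of accumulate × bs on one card; note the BN statistics are still computed on the small per-step batch):

```python
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
ACCUMULATE = 4  # effective batch = ACCUMULATE * per-step batch size

optimizer.zero_grad()
for step in range(100):  # stand-in for iterating over a DataLoader
    x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
    loss = criterion(model(x), y) / ACCUMULATE  # average over the window
    loss.backward()                             # gradients accumulate in .grad
    if (step + 1) % ACCUMULATE == 0:
        optimizer.step()
        optimizer.zero_grad()
```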

Speed optimization

apex mixed-precision training. Very easy to use: just a few lines of code, the speed increases severalfold, and accuracy drops only slightly.
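Those few lines look roughly like this (a sketch of the classic apex amp usage; newer PyTorch versions ship torch.cuda.amp built in, and a CUDA device is assumed):

```python
import torch
from apex import amp  # NVIDIA apex, installed as described below

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# opt_level "O1": mixed precision with automatic casting.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(8, 10).cuda()
loss = model(x).sum()
# Scale the loss so fp16 gradients do not underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```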

Installation method:

1. Do not simply pip install, because an unrelated package shares the name.

2. A plain setup install is not enough either; it compiles only the Python parts, so the acceleration is poor. Instead:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

3. If the pip install line above fails to compile, export the CUDA environment variable first and then rerun it:

export CUDA_HOME=/usr/local/cuda-10.0

If it still exits with a red error, roll back to a more stable commit:

1. Enter the cloned apex directory.
2. Create a branch at the older commit: git checkout -b f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0
3. Rerun the pip install command from above.
4. Don't worry about the pile of warnings; as long as you see "Successfully installed apex-0.1", it worked.

Precautions:

· It only speeds things up on cards at the 1080 Ti level and above, not below.

· Test in your actual scenario; if there is no speedup, or it is not obvious, you don't have to use it.

· Check whether half precision is actually in effect via the dtype attribute, e.g.: if a.dtype == torch.float16: xxx

· At inference time there can be an incompatibility between the model's torch.float16 and the target's torch.float32. Convert the ground truth to half precision with the half() method, e.g. a = a.half(). (See the sketch below.)
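A small sketch combining the last two points (checking dtype and matching precisions before a loss; the tensors are placeholders):

```python
import torch

pred = torch.randn(4, 10).half()  # model output in fp16 under amp
gt = torch.randn(4, 10)           # ground truth still in fp32

# Check whether a tensor is actually in half precision.
if pred.dtype == torch.float16:
    print("running in fp16")

# Match dtypes before mixing the tensors to avoid type errors.
if gt.dtype != pred.dtype:
    gt = gt.half()
loss = ((pred - gt) ** 2).mean()
```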

⊙ Verification and testing

Testing on the test set checks the model's performance so you can improve it; testing on the training set shows whether the algorithm works at all and whether there are bugs (for example, low recall causing missed detections even when overfitting).
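A minimal sketch of that training-set sanity check (deliberately overfitting a handful of samples to confirm the pipeline can drive the loss toward zero; the model and data are placeholders):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

# A tiny fixed subset of the training data (random stand-ins here).
x = torch.randn(8, 10)
y = torch.randint(0, 2, (8,))

for step in range(500):
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# If the loss does not approach zero (or recall stays low), suspect a
# bug in the data pipeline, the labels, or the loss itself.
print(float(loss))
```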

⊙ Data augmentation

Augmentation plus multi-scale training. You can use imgaug as an aid; see the notes for usage.
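A minimal imgaug sketch (the specific augmenters and their ranges are illustrative choices, not from the original notes):

```python
import numpy as np
import imgaug.augmenters as iaa

# A small pipeline: horizontal flip, mild affine, brightness jitter.
seq = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.Affine(scale=(0.9, 1.1), rotate=(-10, 10)),
    iaa.Multiply((0.8, 1.2)),
])

images = np.random.randint(0, 255, (4, 224, 224, 3), dtype=np.uint8)
images_aug = seq(images=images)  # the same call can also carry bounding boxes
```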

⊙ Experience

For single-stage detectors, the anchors designated for learning (after screening by IoU, etc.) should be kept a bit more numerous and well-matched, unlike two-stage detectors where only a few are ultimately regressed; this lets the model converge faster. (YOLO regresses only one anchor per ground truth, which is why it is harder to learn.)

Essence: this changes the distribution of positive and negative samples during training to alleviate extreme sample imbalance (see the explanation in the GHM paper). So this is not the only approach; with a well-designed loss, a model can learn well even with a small number of anchors.
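A minimal sketch of IoU-based anchor assignment of the kind described above (the threshold is an illustrative assumption; real detectors also add low-IoU negatives and force at least one match per ground truth):

```python
import torch
from torchvision.ops import box_iou

def assign_anchors(anchors, gt_boxes, pos_thresh=0.5):
    """Mark anchors whose best IoU with any gt exceeds the threshold as
    positives; assigning more positives per gt speeds up convergence."""
    iou = box_iou(anchors, gt_boxes)    # (num_anchors, num_gt)
    best_iou, best_gt = iou.max(dim=1)  # best-matching gt for each anchor
    positive = best_iou >= pos_thresh
    return positive, best_gt

anchors = torch.tensor([[0., 0., 10., 10.], [5., 5., 15., 15.]])
gt = torch.tensor([[4., 4., 14., 14.]])
pos, idx = assign_anchors(anchors, gt)
print(pos, idx)
```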

 

If you want to meet more deep learning friends and discuss more technical issues, follow the official account "JiLian AI Cloud" (a cost-effective shared computing platform; official website: JiLian AI Cloud Platform home page).
