Neural Network Tuning: A Recipe-Level Guide

This article was translated from Andrej Karpathy's "A Recipe for Training Neural Networks".

Two key observations:

  1. Neural network training is a leaky abstraction: swapping in your own data or model is not as plug-and-play as calling an off-the-shelf API such as an HTTP request library.
  2. Network training often fails silently: most of the time there is no obvious error, but you may forget to apply the corresponding transform to the labels when augmenting the data, or intend to clip gradients but accidentally clip the loss, or get the regularization, initialization, learning rate, etc. wrong. Any of these can make the network look like it is training normally while the results are a mess. In the author's experience, the qualities most closely associated with success in deep learning are patience and attention to detail.

The Neural Network Training Recipe

Based on these two observations, the author proposes a set of inspection steps for training neural networks. The basic principle is to add code incrementally, verify thoroughly at each step, and do everything possible to avoid introducing bugs. (This is a bit like the reading process Schopenhauer suggested: stick to biographies and documentary works and avoid fiction such as novels, so that you gradually build a true picture of the world.) If writing neural network code were itself like training a network, you would want to use a very small learning rate, guess, and then verify on the entire test set after every iteration.

The specific inspection steps are as follows:

1. Become one with the data

The first step of training a neural network does not involve touching any neural network code at all; it starts with a thorough inspection of the data. This means spending a lot of time browsing the samples and looking for problems such as corrupted examples, mismatched sample/label pairs, class imbalance, and so on.

In addition, since a trained neural network is effectively a compressed version of the dataset, you can later look at the network's (wrong) predictions and analyze where they come from (bad-case analysis). If the network's predictions contradict what you saw in the data, something is probably wrong.

Once you have a qualitative feel for the data, write some simple code to search/filter/sort by anything you can think of (e.g. label type, size of annotations, number of annotations) and visualize their distributions and the outliers along any axis; this almost always uncovers errors in data quality or preprocessing.
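As an illustration only, here is a minimal sketch of this kind of exploratory check, assuming a hypothetical annotations.json whose records carry "label", "width", and "height" fields; adapt the field names to your own data.

```python
# Exploratory data check (sketch): class balance, annotation sizes, outliers.
import json
from collections import Counter

import matplotlib.pyplot as plt

# Hypothetical annotation file: a list of dicts with "label", "width", "height".
with open("annotations.json") as f:
    records = json.load(f)

# Class balance: a heavily skewed histogram is an early warning sign.
label_counts = Counter(r["label"] for r in records)
print(label_counts.most_common(10))

# Annotation sizes: sorting surfaces degenerate or absurdly large boxes.
areas = sorted(r["width"] * r["height"] for r in records)
print("smallest:", areas[:5], "largest:", areas[-5:])

# Plot the distribution and eyeball it for long tails or a second mode.
plt.hist(areas, bins=50)
plt.xlabel("annotation area (px^2)")
plt.ylabel("count")
plt.show()
```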

2. Set up the end-to-end training/evaluation skeleton and get baseline results

After checking the data, can we jump straight to fancy models such as ASPP or FPN? No; if you do, you are on the road to suffering (original text: "That is the road to suffering."). At this stage you should build an end-to-end training and evaluation pipeline around a small model that you are confident you cannot get wrong (a simple ConvNet, or a well-established baseline for your task), and make sure the loss, the model outputs, and so on all look sane. It also helps to run a few ablation experiments whose expected outcomes are clear.
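For concreteness, here is a minimal sketch of such a skeleton in PyTorch. The tiny model, the random placeholder data, and the hyperparameters are all stand-ins rather than recommendations; the point is that the loop trains, evaluates, and prints a reproducible number end to end.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # fixed random seed: reproducible results keep you sane

# A deliberately tiny, boring model; the goal is correctness, not accuracy.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
)

# Placeholder data shaped like MNIST; swap in your real train/val loaders.
x = torch.randn(512, 1, 28, 28)
y = torch.randint(0, 10, (512,))
train_loader = DataLoader(TensorDataset(x[:400], y[:400]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(x[400:], y[400:]), batch_size=32)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def evaluate(loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for xb, yb in loader:
            correct += (model(xb).argmax(dim=1) == yb).sum().item()
            total += yb.numel()
    return correct / total

for epoch in range(3):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: val acc {evaluate(val_loader):.3f}")
```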

The tricks at this stage are:

  • Fix the random seed: it gives you reproducible results and helps keep you sane.
  • Simplify: disable any unnecessary fanciness at this stage, such as data augmentation.
  • Add meaningful evaluation metrics: report more than just the loss when evaluating.
  • Verify the loss at initialization: the loss of a freshly initialized network can be computed in advance (see the initialization-loss sketch after this list).
  • Initialize well: initialize the weights and bias of the last layer correctly, according to the known distribution or range of the outputs.
  • Human baseline: monitor metrics other than the loss that a human can interpret and check (e.g. accuracy, concrete examples).
  • Input-independent baseline: set the inputs to all zeros and train a baseline, to see how much the model "learns" without looking at the input at all.
  • Overfit a single batch: verify that the model can fit a handful of examples, with the loss eventually converging to essentially zero; if it cannot, there is a bug (see the single-batch sketch after this list).
  • Verify that training loss drops as model capacity grows: since only a small model has been used so far, training on the full dataset should be underfitting, so adding parameters should improve the training loss; if it does not, there is a bug.
  • Visualize the data right before it enters the network: make sure what is actually fed to the network is what you expect; this catches bugs in preprocessing and data augmentation.
  • Visualize prediction dynamics during training: watching how the predictions on a fixed batch evolve gives you an intuitive feel for the training process, e.g. instability, or a learning rate that is too large or too small.
  • Use backpropagation to chart dependencies: some tensor operations, such as broadcasting, can accidentally mix information across the batch. Set the loss to depend only on input i, backpropagate, and check that the gradient is nonzero only for input i and exactly zero on every unrelated path; the same approach can verify that your network's dependency structure is as expected (see the dependency-check sketch after this list).
  • Generalize from a special case: people often introduce bugs when writing a fully general function from scratch. The author recommends first writing a very explicit version, making sure it is correct, and only then generalizing it while checking it still matches the previous version. For example, when vectorizing code, first write the fully looped version, then remove one loop at a time ("slow is fast", Charlie Munger).
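Initialization-loss sketch: for a 10-class softmax classifier with near-uniform initial outputs, cross-entropy should be close to -ln(1/10), about 2.303. The small linear head below is only a stand-in for whatever classifier you have just built.

```python
# Sanity-check the loss at initialization against the value you can compute by hand.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_classes = 10
model = nn.Linear(32, num_classes)  # stand-in for any freshly built classifier head
nn.init.zeros_(model.bias)          # initialize the last-layer bias to match the expected output

x = torch.randn(256, 32)
y = torch.randint(0, num_classes, (256,))
loss = F.cross_entropy(model(x), y)
print(f"initial loss {loss.item():.3f}, expected about {math.log(num_classes):.3f}")
```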
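Single-batch sketch: a minimal version of the overfit-one-batch check, using a throwaway MLP and random data as placeholders.

```python
# Overfit a single tiny batch: if a model with enough capacity cannot drive the
# loss toward zero on a handful of examples, there is a bug somewhere.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
xb = torch.randn(8, 32)               # one small, fixed batch
yb = torch.randint(0, 10, (8,))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(500):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(xb), yb)
    loss.backward()
    optimizer.step()

print(f"final loss {loss.item():.6f}")                # should be close to zero
print((model(xb).argmax(dim=1) == yb).all().item())   # predictions should match the labels
```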
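Dependency-check sketch: make the loss depend only on sample i, backpropagate, and confirm the input gradient is exactly zero for every other sample; any accidental mixing across the batch shows up immediately. The model here is an arbitrary placeholder.

```python
# Dependency check via backprop: if the loss only touches sample i, the gradient
# with respect to every other sample in the batch must be exactly zero.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(8, 16, requires_grad=True)

i = 3
loss = model(x)[i].sum()  # the loss depends on sample i only
loss.backward()

grad_norms = x.grad.norm(dim=1)
print(grad_norms)
assert grad_norms[i] > 0, "sample i should receive a gradient"
others = torch.arange(x.shape[0]) != i
assert torch.all(grad_norms[others] == 0), "information is leaking across the batch!"
```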

By the end of this phase we should have a good understanding of the data and a complete train/evaluate pipeline in which, for any given model, we can obtain a reliable and reproducible result. We should also have some reference results: the input-independent baseline, a few dumb baselines, and a rough sense of human performance. With that in place, we can iterate on the model over the next two stages.

The overall strategy is to first build a model large enough to overfit (drive the training loss as low as possible while ignoring the test loss), and then add regularization (trade away some training loss to improve the validation loss). The reason for this two-stage approach is that if we cannot overfit with any model of any size, there is probably a bug.

3. Overfitting

The tricks at this stage are:

  • Picking the model: don't be a hero. Don't pile up fancy modules like Lego blocks, especially early in a project; resist that temptation firmly. The author's advice is to find the most closely related papers and copy their simplest architecture that achieves good results, e.g. a plain ResNet-50.
  • Adam is safe: choosing Adam as the optimizer saves a lot of early hyperparameter fiddling. Again, don't be a hero: follow whatever the most related papers do (a minimal setup is sketched after this list).
  • Complicate one thing at a time: if you have several modules to try, add them one at a time and make sure each one actually improves performance.
  • Don't trust default learning-rate decay schedules: learning-rate decay should be tuned at the very end of the project, and decay schedules do not transfer between different models, datasets, or training setups, so use a fixed learning rate early on.
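A minimal sketch of the boring-but-safe setup at this stage; the model is a placeholder, and lr=3e-4 is only a commonly used default, not a value recommended by this text.

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 10)  # placeholder for whatever baseline model you copied

# Plain Adam with a constant learning rate and no scheduler at this stage.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# Deliberately no learning-rate decay yet: that belongs at the very end of the project.
```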

4. Regularization

The tricks at this stage are:

  • Get more data: the simplest and most reliable way to regularize a model is to add more real training data.
  • Data augmentation: the next best thing after real data is half-real data; try various kinds of augmentation, guided by your understanding of the data and the task.
  • Creative data augmentation: domain randomization, simulation, clever mixing of data, GANs, and so on.
  • Pretraining: even if you have plenty of data, it almost never hurts to start from a pretrained model.
  • Stick with supervised learning: don't get over-excited about unsupervised pretraining; it may be more useful in NLP, probably because language data has a higher signal-to-noise ratio.
  • Smaller input dimensionality: (for recognition tasks) remove inputs that may carry spurious signals; any extra information increases the chance of overfitting. Likewise, try lower-resolution images if low-level details are not important.
  • Smaller model: in many cases domain knowledge can be used to constrain the model and cut its parameter count, e.g. replacing the fully connected layers at the end of an ImageNet classifier with average pooling.
  • Smaller batch size: with batch normalization in the network, a smaller batch size corresponds to stronger regularization, because the batch statistics are noisier.
  • Dropout: use with caution, because dropout does not play well with batch normalization.
  • Weight decay: increase the weight-decay penalty.
  • Early stopping: monitor the validation loss and stop training just as the model begins to overfit (a minimal sketch follows this list).
  • Try a larger model: this comes last because, combined with early stopping, a larger model usually ends up performing better than a smaller one.
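A minimal sketch of early stopping on the validation loss, keeping the best checkpoint and stopping after a fixed patience; `train_one_epoch` and `validation_loss` are hypothetical callables standing in for your own pipeline.

```python
# Early stopping (sketch): keep the best checkpoint by validation loss and stop
# after `patience` epochs without improvement.
import copy
import math

def fit(model, train_one_epoch, validation_loss, max_epochs=100, patience=10):
    best_loss, best_state, bad_epochs = math.inf, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validation_loss(model)
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation loss stopped improving: stop before overfitting further
    if best_state is not None:
        model.load_state_dict(best_state)  # roll back to the best checkpoint
    return model
```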

Finally, to build more trust in the trained model, visualize the first-layer weights of the network and make sure you see something meaningful (such as edge-detector-like kernels); if the visualization looks like noise, something is off. The network's internal activations can likewise sometimes hint at problems.
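As a sketch, here is one way to look at a first conv layer in PyTorch; the pretrained torchvision ResNet-18 is used purely as a stand-in for "your trained model".

```python
# Visualize the first conv layer's filters as an image grid: healthy filters often
# look like oriented edges or color blobs, while pure noise suggests a problem.
import matplotlib.pyplot as plt
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # stand-in for your trained model
w = model.conv1.weight.detach().clone()        # shape (64, 3, 7, 7)
w = (w - w.min()) / (w.max() - w.min())        # rescale to [0, 1] for display
grid = torchvision.utils.make_grid(w, nrow=8, padding=1)

plt.imshow(grid.permute(1, 2, 0))
plt.axis("off")
plt.show()
```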

5. Tune

At last we arrive at the hyperparameter-tuning stage. Tricks:

  • Random search instead of grid search: neural networks are far more sensitive to some hyperparameters than to others, so random search covers the important ones more effectively, and the search range should be wider for the more sensitive parameters (a sketch follows this list).
  • On hyperparameter optimization: there are plenty of fancy Bayesian optimization toolkits, but in the author's experience the state-of-the-art tuning method is to use an intern :)
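A minimal sketch of random search, sampling sensitive parameters such as the learning rate on a log scale; `train_and_evaluate` and the ranges shown are hypothetical placeholders for your own pipeline.

```python
# Random search (sketch): sample hyperparameters at random instead of walking a grid.
import math
import random

def sample_config():
    return {
        "lr": 10 ** random.uniform(-5, -2),            # log-uniform over [1e-5, 1e-2]
        "weight_decay": 10 ** random.uniform(-6, -2),  # log-uniform over [1e-6, 1e-2]
        "dropout": random.uniform(0.0, 0.5),
    }

def random_search(train_and_evaluate, num_trials=30):
    best_score, best_config = -math.inf, None
    for _ in range(num_trials):
        config = sample_config()
        score = train_and_evaluate(config)   # returns a validation score to maximize
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```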

6. Squeeze out the last drops of juice

Once you have settled on the best architecture and hyperparameters, there are still a couple of tricks left for squeezing extra performance out of the system.

  • Ensembles: model ensembling is a fairly reliable way to gain roughly 2% accuracy, at the cost of slower inference. If inference speed and compute matter to you, consider knowledge distillation instead (recommended reading: Knowledge Distillation: A Survey). A simple prediction-averaging sketch follows this list.
  • Leave it training: networks can keep improving for a surprisingly long time. The author once forgot to stop a training run, and on returning from the winter break found the model had reached state-of-the-art performance. :)
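A minimal sketch of prediction averaging across an ensemble; the untrained stand-in models and shapes below are placeholders for independently trained checkpoints.

```python
# Ensembling (sketch): average the softmax probabilities of several models.
import torch
import torch.nn as nn

torch.manual_seed(0)
models = [nn.Linear(32, 10) for _ in range(3)]  # stand-ins for trained models

def ensemble_predict(models, x):
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=-1) for m in models])
    return probs.mean(dim=0).argmax(dim=-1)  # average probabilities, then pick the top class

x = torch.randn(4, 32)
print(ensemble_predict(models, x))
```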

Summary

By this point you have all the ingredients for success: a deep understanding of the technology, the data, and the task; a complete training and evaluation pipeline that lets you fully trust your experimental results; and a system whose complexity you have increased step by step while verifying the expected gains each time. You are now well prepared to read more papers, run more experiments, and chase your SOTA. Good luck!

Source: blog.csdn.net/daimaliang/article/details/129287488