Machine learning tuning skills

This article is reproduced from someone else's post, but I can't find the original source...

Tuning skills

Drawing

It is a good habit to plot learning curves. Generally, after each full pass over the training data (each epoch), output the accuracy on the training set and on the validation set, and plot both on the same graph. If the model has not converged after training for a while, you can stop the run and try other parameters to save time. If both training and validation accuracy are still low at the end of training, the model may be underfitting, and subsequent parameter adjustments should aim to strengthen the model's fitting capacity: for example, increase the number of layers, increase the number of nodes per layer, reduce the dropout rate, reduce the L2 regularization strength, and so on. If training accuracy is high but validation accuracy is noticeably lower, the model may be overfitting, and parameters should instead be adjusted to improve the model's generalization ability.
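
For example, a minimal plotting sketch with matplotlib, assuming `train_acc` and `val_acc` are lists of per-epoch accuracies collected during your own training loop:

```python
import matplotlib.pyplot as plt

def plot_curves(train_acc, val_acc):
    """Plot per-epoch training and validation accuracy on one graph."""
    epochs = range(1, len(train_acc) + 1)
    plt.plot(epochs, train_acc, label="train accuracy")
    plt.plot(epochs, val_acc, label="validation accuracy")
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
```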

Tuning parameters from coarse to fine-grained

In practice, a coarse search over a wide range is generally performed first, and then the search is narrowed and refined around wherever good results appear.

  1. It is recommended to consult relevant papers first and use the parameters they report as initial values. At the very least, the parameters from a published paper should not give a terrible result.
  2. If you can't find a reference, you have to experiment yourself. Start with the parameters that have the greatest impact on the results, keeping the other parameters fixed; once you get a reasonable result, adjust the remaining parameters on top of it. For example, the learning rate is generally more important than the regularization strength or the dropout rate: if the learning rate is set badly, not only may the result deteriorate, the model may even fail to converge (a sketch of this coarse-then-fine process follows this list).
  3. If you really can't find any set of parameters that makes the model converge, check whether there is a problem elsewhere, such as in the model implementation or the data. You can refer to guides on debugging deep learning networks.
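
A hypothetical sketch of the coarse-then-fine idea, searching the learning rate with other parameters held fixed; `train_and_eval` is a made-up stand-in for your real training routine and is assumed to return validation accuracy:

```python
def train_and_eval(lr, dropout=0.5, l2=1.0):
    """Placeholder: train with these hyperparameters and return
    validation accuracy. Replace with your own training loop."""
    return 0.0  # dummy value so the sketch runs end to end

# Coarse pass: powers of 10.
coarse = [1.0, 0.1, 0.01, 0.001]
results = {lr: train_and_eval(lr) for lr in coarse}
best = max(results, key=results.get)

# Fine pass: refine around the best coarse value.
fine = [best * 3, best, best / 3]
results.update({lr: train_and_eval(lr) for lr in fine})
```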

Accelerate

Tuning is only about finding suitable parameters, not producing the final model. Parameters that work well on a small dataset are generally not too bad on the full dataset, so you can shrink the data to improve speed and try more parameter settings in the limited time available. Two ways to shrink the problem (both are sketched in code after this list):

  • Subsample the training data. For example, first sample 1,000,000 examples down to 10,000 and run the experiment on that.
  • Reduce the number of classes. For example, handwritten digit recognition originally has 10 classes; train on 2 classes first and see how it goes.
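
A minimal numpy sketch of both tricks; `X` and `y` here are randomly generated stand-ins for real features and labels:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 10))      # stand-in features
y = rng.integers(0, 10, size=1_000_000)   # stand-in labels, 10 classes

# 1) Subsample: 1,000,000 examples down to 10,000.
idx = rng.choice(len(X), size=10_000, replace=False)
X_small, y_small = X[idx], y[idx]

# 2) Reduce classes: keep only classes 0 and 1 out of 10.
mask = np.isin(y_small, [0, 1])
X_small, y_small = X_small[mask], y_small[mask]
```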

Hyperparameter ranges

It is recommended to search hyperparameters on a logarithmic scale first. Typical examples are the learning rate and the regularization strength: try 0.001, 0.01, 0.1, 1, 10, i.e. steps of a factor of 10, because their effect on training is multiplicative. For some parameters, however, it is better to search on the original linear scale, such as the dropout rate: 0.3, 0.5, 0.7.
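
A minimal sketch of the two kinds of scales, using only the standard library:

```python
import random

# Log scale: an exponent uniform in [-3, 1] covers 0.001 ... 10,
# because the effect of these parameters is multiplicative.
learning_rate = 10 ** random.uniform(-3, 1)
l2_strength = 10 ** random.uniform(-3, 1)

# Original (linear) scale for dropout.
dropout = random.uniform(0.3, 0.7)
```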

Empirical parameters

Here are empirical values for some parameters, so that you are not completely clueless when you start tuning.

  • Learning rate: 1, 0.1, 0.01, 0.001; generally start from 1 and work down. Learning rates greater than 10 are rarely seen. The learning rate is usually decayed during training, with a decay factor of typically 0.5; decay can be triggered when validation accuracy stops improving, or after a fixed number of epochs (see the first sketch after this list). However, adaptive gradient methods such as Adam, AdaDelta, and RMSProp are more recommended; using the default values from the corresponding papers generally avoids having to tune the learning rate at all. For RNNs, one rule of thumb: if the sequences to be processed are long, or the RNN has many layers, a smaller learning rate tends to work better; otherwise training may fail to converge, or even produce NaNs.
  • Number of network layers: start with 1 layer.
  • Number of nodes per layer: 16, 32, 128; more than 1,000 is relatively rare, and more than 10,000 is almost never seen.
  • Batch size: start around 128. Increasing the batch size does improve training speed, but the converged result may be worse. If memory allows, consider starting from a larger value, because an overly large batch size generally does not hurt the result much, while an overly small one can make results noticeably worse.
  • Gradient clipping (clip c): limit the maximum gradient norm. Concretely, compute value = sqrt(w1² + w2² + …) over all gradient components; if value exceeds the threshold, multiply the gradients by a scaling coefficient so that the norm equals the threshold. Typical thresholds: 5, 10, 15 (see the second sketch after this list).
  • Dropout: 0.5
  • L2 regularization: 1.0; values above 10 are rare.
  • Word vector embedding size: 128, 256
  • Positive/negative sample ratio: this is often neglected, but in many classification problems it is a very important parameter. Many people habitually use the default class ratio of the training data; when the data is very imbalanced, the model is likely to be biased towards the majority class, which hurts the final result. Besides trying the default ratio, it is recommended to oversample the minority class, e.g. by replication, to increase its proportion and see how that works (see the third sketch after this list). This also applies to multi-class problems: when training with mini-batches, try to balance the class proportions within each batch, which is very important in multi-class tasks such as image recognition.
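
First sketch: a minimal "decay when validation accuracy stops improving" rule with decay factor 0.5, as suggested in the learning-rate item above; the function and its arguments are hypothetical names:

```python
def maybe_decay(lr, val_acc_history, patience=3, factor=0.5):
    """Halve the learning rate if the best validation accuracy of the
    last `patience` epochs is no better than the best before them."""
    if len(val_acc_history) > patience and \
            max(val_acc_history[-patience:]) <= max(val_acc_history[:-patience]):
        return lr * factor
    return lr
```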
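
Second sketch: gradient clipping by global norm in numpy, implementing value = sqrt(w1² + w2² + …) and rescaling when it exceeds the threshold:

```python
import numpy as np

def clip_gradients(grads, threshold=5.0):
    """grads: list of numpy arrays. If the global norm exceeds the
    threshold (e.g. 5, 10, 15), rescale so the norm equals it."""
    value = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if value > threshold:
        grads = [g * (threshold / value) for g in grads]
    return grads
```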
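
Third sketch: rebalancing by replicating minority-class samples, assuming numpy arrays `X` (features) and `y` (integer labels); the helper name and `target_ratio` parameter are made up for illustration:

```python
import numpy as np

def oversample_minority(X, y, minority_label, target_ratio=1.0):
    """Replicate minority samples until minority/majority ≈ target_ratio."""
    rng = np.random.default_rng(0)
    min_idx = np.flatnonzero(y == minority_label)
    maj_count = int(np.sum(y != minority_label))
    n_needed = int(target_ratio * maj_count) - len(min_idx)
    if n_needed > 0:
        extra = rng.choice(min_idx, size=n_needed, replace=True)
        X = np.concatenate([X, X[extra]])
        y = np.concatenate([y, y[extra]])
    return X, y
```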

Auto tuning

Manually babysitting experiments is simply too tiring. There is currently a lot of research on automatic hyperparameter tuning. Here are a few of the more practical approaches:

  • Grid Search. This is the most common method. For each parameter, pick a few candidate values, then traverse all combinations of parameter values like a grid. The advantage is that it is simple and brute-force, and if the grid can be traversed completely, the result is fairly reliable. The disadvantage is that it is very time-consuming; for neural networks in particular, you generally cannot afford to try many parameter combinations.
  • Random Search. Bergstra and Bengio showed in Random Search for Hyper-Parameter Optimization that random search is more effective than grid search. In practice, the grid of candidate parameter values is generally built first, as in grid search, and then random combinations are sampled from it for each training run.
  • Bayesian Optimization. Bayesian optimization takes the results of previously tried parameter settings into account when choosing what to try next, so it uses its trial budget far more efficiently; compared with grid search, the difference is like an ox cart versus a sports car. For the underlying principles, see the paper Practical Bayesian Optimization of Machine Learning Algorithms. There are also off-the-shelf Python libraries implementing Bayesian hyperparameter tuning that can be used immediately.
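
A minimal sketch of random search as described above: build the candidate grid first, then sample random combinations from it; `train_and_eval` is again a made-up stand-in for a real training routine:

```python
import random

grid = {
    "lr": [1.0, 0.1, 0.01, 0.001],
    "dropout": [0.3, 0.5, 0.7],
    "l2": [0.1, 1.0, 10.0],
}

def sample_config(grid):
    """Pick one value per hyperparameter, uniformly at random."""
    return {name: random.choice(values) for name, values in grid.items()}

for trial in range(20):  # budget of 20 random trials
    config = sample_config(grid)
    # score = train_and_eval(**config)  # plug in your training routine
```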

Summary

  • Run sanity checks first; make sure there is nothing wrong with the model, the data, or anything else.
  • While training, keep track of the loss and of the training and validation accuracy.
  • Use Random Search to find good hyperparameters, searching in stages from coarse (wide hyperparameter ranges, few training epochs) to fine (narrow ranges, longer training).

References

The papers cited above, for further reading if you have time:

  • Bergstra, J. and Bengio, Y. Random Search for Hyper-Parameter Optimization. JMLR, 2012.
  • Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian Optimization of Machine Learning Algorithms. NIPS, 2012.


Origin: blog.csdn.net/zyl_wjl_1413/article/details/127989640