Summary of some debugging techniques for deep learning

1. What should I do if I encounter NaN?

  1. Division by zero. There are actually two possibilities here: either the numerator is already infinity or NaN, or the denominator is 0. A NaN or 0 produced earlier can be propagated downstream, turning everything after it into NaN. Check the divisions in the network, such as the softmax layer, and then carefully check the data. I once helped someone debug code and found that some values in the training data file were themselves NaN; once such a sample was read in, everything after it became NaN during training. You can add some logging to output the intermediate results of the network and see at which step NaN first appears (see the sketch after this list). How to handle this in Theano will be introduced later.
  2. The gradient is too large, so the updated parameters become NaN. RNNs in particular are prone to gradient explosion when the sequence is long. The usual remedies are:
    1. Gradient clipping: limit the maximum gradient norm. Compute value = sqrt(w1^2 + w2^2 + ...); if value exceeds a threshold, rescale the gradient by an attenuation coefficient so that its norm equals the threshold. Common thresholds are 5, 10, or 15.
    2. Lower the learning rate. An initial learning rate that is too large can also cause this problem. Note that even with an adaptive learning-rate algorithm such as Adam you can still run into a learning rate that is too large; these algorithms also have a learning-rate hyperparameter, which can be set to a smaller value.
  3. Initial parameter values that are too large can also cause the NaN problem. It is best to normalize the input and output values.
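
A minimal PyTorch sketch of the checks above, assuming a toy model and random data: `torch.isfinite` flags NaN/Inf in the input data or intermediate outputs (point 1), and `clip_grad_norm_` implements the norm-based clipping described in point 2.1.

```python
import torch
import torch.nn as nn

def check_finite(name, tensor):
    # Log the first place where NaN or Inf appears.
    if not torch.isfinite(tensor).all():
        print(f"non-finite values detected in {name}")

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(64, 10), torch.randn(64, 1)
check_finite("input batch", x)        # NaN already present in the data file?

pred = model(x)
check_finite("predictions", pred)     # NaN produced inside the network?

loss = nn.functional.mse_loss(pred, y)
optimizer.zero_grad()
loss.backward()

# Gradient clipping: rescale gradients so the global norm sqrt(sum w_i^2) <= threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```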

2. What if the neural network doesn't learn anything?

  1. Print the cost on the training set and the cost on the test set and watch how they change. Normally the training cost keeps decreasing and eventually flattens out or fluctuates within a small range, while the test cost first decreases and then starts to oscillate or rise slowly. If the training cost does not decrease, there may be a bug in the code, a problem with the data (bad labels, faulty data processing, etc.), or unreasonable hyperparameter settings (network size, number of layers, learning rate, etc.).
    Manually construct 10 samples and train the network on them repeatedly to see whether the cost decreases. If it does not, there is probably a bug in the network code that needs to be checked carefully. If the cost does drop, run predictions on those 10 samples and see whether the results match expectations; if so, the network itself is very likely fine, and you can then check whether the hyperparameters or the data are the problem.
  2. If you implemented the network code yourself, a gradient check is strongly recommended, to make sure the gradient computation is correct (see the sketch after this list).
  3. Start the experiment with the simplest possible network. Don't just look at the cost; also look at what the network actually predicts, to make sure it can produce the expected results. For example, for a language-model experiment, first use a single-layer RNN; if one RNN layer works, try an LSTM, and then multi-layer LSTMs.
  4. If possible, feed in one specific input, compute the correct output of every step by hand, and check whether each step of the network produces the same result.
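
A minimal numerical gradient check, as a sketch for point 2 above, assuming a toy linear-regression loss: the analytic gradient is compared against a centered finite-difference estimate.

```python
import numpy as np

def loss_fn(w, x, y):
    # Toy model: linear regression with squared error.
    pred = x @ w
    return 0.5 * np.mean((pred - y) ** 2)

def analytic_grad(w, x, y):
    pred = x @ w
    return x.T @ (pred - y) / len(y)

def numerical_grad(w, x, y, eps=1e-5):
    # Centered finite difference for each parameter.
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus, x, y) - loss_fn(w_minus, x, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x, y = rng.normal(size=(20, 5)), rng.normal(size=20)
w = rng.normal(size=5)

diff = np.max(np.abs(analytic_grad(w, x, y) - numerical_grad(w, x, y)))
print("max difference:", diff)  # should be tiny (around 1e-8 or smaller)
```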

 Steps to improve the model: (1) first analyze whether the preprocessing of the training data is reasonable; (2) determine whether the model has high bias or high variance: high bias means the model is underfitting, and switching to a better gradient-descent algorithm, lowering the learning rate, weakening the regularization, increasing the number of hidden nodes, or adding layers can all reduce bias; (3) high variance means overfitting, and you can add more training data and strengthen the regularization; (4) with a bit more brute force, run a hyperparameter search; (5) more crudely, switch to a better network model, which may bring a qualitative leap.

3. Estimate the Bayesian optimal error rate based on human performance.

The Bayes optimal error rate is the best error rate that is theoretically achievable: no function from x to y can achieve a lower error rate. For example, some pictures in a cat-vs-dog training set are genuinely so blurry that neither humans nor machines can determine their category, so the optimal error rate cannot be zero. The purpose of estimating human performance on a dataset is to understand the upper limit of achievable accuracy on that data, and thus to judge how much room the model still has to improve.

4. During training, record the training-set error rate and the validation-set error rate every fixed number of steps, and keep training until the error rate on the training set no longer drops, then stop training.

5. Compute the difference between the training error rate and the Bayes error rate; this difference is the model's bias. Compute the difference between the validation error rate and the training error rate; this difference is the model's variance. Plot the training-set and validation-set error rates recorded during training as curves to analyze whether bias or variance should be reduced next. For example, if the bias is 10% and the variance is 3%, reduce the bias first, and only tackle the variance once the bias is small (a worked sketch follows).
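
A small worked sketch of the bias/variance decision in points 4 and 5, using the example numbers above and an assumed Bayes error estimate:

```python
bayes_error = 0.01   # estimated from human performance (assumed value)
train_error = 0.11   # recorded training-set error rate
val_error   = 0.14   # recorded validation-set error rate

bias = train_error - bayes_error      # avoidable bias: 0.10 (10%)
variance = val_error - train_error    # variance: 0.03 (3%)

if bias > variance:
    print("reduce bias first (bigger network, better optimizer, etc.)")
else:
    print("reduce variance first (more data, stronger regularization)")
```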

6. The methods to reduce bias are (the model may be underfitting):

(1) Better optimization algorithms, such as momentum, RMSProp, Adam;

(2) Better hyperparameters, such as lowering the learning rate and weakening the regularization;

(3) Change the activation function;

(4) Increase the number of hidden nodes;

(5) Increase the number of layers;

(6) Use a new network architecture.
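
As an illustration only (model sizes and the learning rate are arbitrary), a PyTorch sketch of several of the bias-reduction knobs above: more hidden nodes and an extra layer, Adam as the optimizer, a lower learning rate, and weakened regularization.

```python
import torch
import torch.nn as nn

# A wider and deeper network: more hidden units, one more layer.
model = nn.Sequential(
    nn.Linear(100, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

# Better optimizer (Adam), smaller learning rate, regularization weakened
# (weight_decay set to 0).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0)
```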

7. The methods to reduce variance are (the model may be overfitting):

(1) Use more training data or data augmentation;

(2) Use regularization, such as L1, L2, dropout;

(3) Hyperparameter search;

(4) Use a new network architecture.
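
Similarly, a hedged sketch of some of the variance-reduction knobs above: dropout, L2 regularization via weight_decay, and image data augmentation (torchvision is assumed to be available; all values are placeholders).

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Dropout between layers plus L2 regularization through weight_decay.
model = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Data augmentation for images, applied to the training set only.
augment = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomCrop(32, padding=4),
    T.ToTensor(),
])
```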

The debugging options that reduce bias and those that reduce variance are summarized in points 6 and 7 above.

 8. Once the error rate on the training set is acceptable and the validation error rate is not much higher, you can move on to analyzing the errors on the test set. It is best to inspect each mispredicted example, summarize the reasons for the model's errors, group them by error type, and then decide on the next step of the debugging plan (a small tallying sketch follows).
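
A tiny sketch of the error-type tally described above; the categories are hypothetical labels assigned during manual inspection of mispredicted test examples.

```python
from collections import Counter

# One entry per mispredicted example, labeled by hand during inspection.
error_types = ["blurry image", "mislabeled", "small object",
               "blurry image", "occlusion", "blurry image"]

summary = Counter(error_types)
for category, count in summary.most_common():
    print(f"{category}: {count} ({count / len(error_types):.0%})")
```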

9. When debugging the code, start by training on a small dataset. Even though the model overfits, the fact that it can overfit shows that the network can still extract some features and that the design idea is basically sound. If the network cannot even overfit the small dataset, there is a problem with the network design (it may be unable to extract useful features), with the training parameter settings, or with the overall scheme. Once the small dataset can be overfitted, the rest is easy: enlarge the dataset and increase the number of training iterations, and you are basically done (a minimal sketch follows).
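
A minimal sketch of the small-dataset overfitting check, assuming a toy classifier and 16 random samples (in practice, use ~10 real examples from your dataset): if the network and training loop are sound, the final loss should approach zero.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A handful of samples to overfit on purpose.
x = torch.randn(16, 20)
y = torch.randint(0, 2, (16,))

for step in range(500):
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final loss:", loss.item())  # should be close to 0 if the network can overfit
```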

10. For the same receptive field, a stack of small convolution kernels performs better than a single large kernel, because the stacked small kernels provide more nonlinearity and are more conducive to feature sharing.
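
For instance, three stacked 3x3 convolutions cover the same 7x7 receptive field as a single 7x7 convolution, but with fewer parameters and two extra nonlinearities; a quick PyTorch comparison (the channel count is arbitrary):

```python
import torch.nn as nn

c = 64  # arbitrary channel count

stacked = nn.Sequential(
    nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
    nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
    nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
)
single = nn.Conv2d(c, c, 7, padding=3)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print("three 3x3 convs:", n_params(stacked))  # ~110k parameters
print("one 7x7 conv:   ", n_params(single))   # ~200k parameters
```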

11. Resolution is very important; try not to lose it. To preserve resolution, make sure a layer has a sufficient receptive field before applying downsampling. The receptive field meant here is a relative one: the receptive field of this downsampling layer with respect to the previous downsampling layer. If the layers between two downsampling operations are viewed as a sub-network, that sub-network must have a certain receptive field so that spatial information can be encoded and passed to the network below; the exact relative receptive field required can only be determined by experiment. Generally, the layers closest to the input have the most spatially redundant information, so the relative receptive field there can be smaller. At the same time, near the input a large convolution kernel can be used to reduce computation, because the per-layer convolution cost at the input end is very large. In addition, the relative receptive field should change gradually rather than abruptly.
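
A hypothetical stem illustrating the idea (layer sizes are placeholders): a large strided kernel near the input reduces computation where spatial redundancy is highest, and a few 3x3 layers then build up receptive field before the next downsampling step.

```python
import torch.nn as nn

stem = nn.Sequential(
    # Large kernel with stride near the input to cut computation early.
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
    # 3x3 layers accumulate receptive field before the next downsampling.
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),  # next downsampling step
)
```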


Reference: https://www.jianshu.com/p/e9e6d8db9f6f


Original article: blog.csdn.net/wzhrsh/article/details/110948287