Reasons for non-convergence or training failure during neural network training

When facing a model that does not converge, first make sure it has been trained for enough iterations. During training the loss does not decrease monotonically and the accuracy does not improve monotonically; there will be some oscillation, and that is fine as long as the overall trend is toward convergence. If the number of iterations is already sufficient (typically thousands or tens of thousands of iterations, or dozens of epochs) and the model still does not converge, then consider the measures below.
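
As a minimal illustration of "judge the overall trend, not individual steps", the sketch below smooths a recorded loss history with a moving average before deciding whether training has plateaued. This is not from the original post; the window size and the 1% improvement threshold are arbitrary choices for illustration.

```python
import numpy as np

def loss_trend(loss_history, window=100):
    """Smooth a noisy loss curve with a moving average so the overall
    trend is visible despite step-to-step oscillation."""
    losses = np.asarray(loss_history, dtype=float)
    if len(losses) < 2 * window:
        return "not enough iterations to judge the trend yet"
    kernel = np.ones(window) / window
    smoothed = np.convolve(losses, kernel, mode="valid")
    # Compare the smoothed loss at the start and at the end of training.
    if smoothed[-1] < smoothed[0] * 0.99:
        return "overall trend is still decreasing"
    return "loss has plateaued; consider the measures below"
```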

1. Data and labels

  1. The data was not preprocessed. Are the class labels accurate? Is the data clean?
  2. The data was not normalized. Different features often have different scales and units, which affects the results of data analysis. To eliminate the influence of scale between features, the data needs to be standardized so that the features become comparable. After standardization, all features are on the same order of magnitude, which makes them suitable for joint evaluation. In addition, most parts of a neural network pipeline assume that inputs and outputs are distributed around 0, from weight initialization to activation functions, and from training to the optimization algorithms used to train the network. Subtract the mean from the data and divide by the standard deviation (see the standardization sketch after this list).
  3. The samples carry too much information, so the network cannot fit the entire sample space. Too few samples would only cause an overfitting problem. Check whether the loss on your training set has converged; if only the validation loss fails to converge, that indicates overfitting. In that case consider the usual anti-overfitting tricks, such as dropout, SGD, increasing the mini-batch size, reducing the number of nodes in the fully connected layers, momentum, finetuning, etc.
  4. Are the labels set correctly?
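
A minimal sketch of the standardization described in point 2 (subtract the mean, divide by the standard deviation), assuming the data is a NumPy array with one row per sample; statistics are computed on the training set only and reused for the other splits. The array shapes and values are stand-ins.

```python
import numpy as np

def standardize(train_x, other_x=None, eps=1e-8):
    """Zero-mean, unit-variance scaling per feature.

    Statistics are estimated on the training set and applied unchanged
    to any other split, so no information leaks from validation/test data.
    """
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    train_scaled = (train_x - mean) / (std + eps)
    if other_x is None:
        return train_scaled
    return train_scaled, (other_x - mean) / (std + eps)

# Example usage with random stand-in data on an arbitrary scale.
train = np.random.rand(1000, 16) * 50 + 10
val = np.random.rand(200, 16) * 50 + 10
train_n, val_n = standardize(train, val)
```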

2. Model

  1. The network architecture is unreasonable. If you use a very shallow network for a very complex classification task, training may have difficulty converging. Choose an appropriate architecture or try deepening your current network. Generally speaking, a deeper network helps, but start by building a network of 3 to 8 layers; once that network works well, you can experiment with deeper networks to improve accuracy. Starting with a small network also means training is faster, and you can vary different parameters and observe their impact on the network rather than simply stacking more layers.
  2. The learning rate is inappropriate. If it is too large, training will not converge; if it is too small, convergence will be very slow. When training a new network yourself, you can start from 0.1; if the loss does not decrease, divide by 10 and try 0.01. Generally 0.01 will converge; if not, use 0.001. A learning rate that is too high easily causes oscillation, but it is also not recommended to set the learning rate too small at the very beginning of training, or the loss will take too long to come down. My approach is to try it gradually, from 0.1, 0.08, 0.06, 0.05... reducing step by step until training behaves normally. A learning rate that is too low can sometimes also leave the model underfit. Increasing the momentum also helps, as does appropriately increasing the mini-batch size so that the loss does not fluctuate as much. If the learning rate is set too high, the loss can blow up (it suddenly stays very large). This is the most common situation for novices: a network that seemed to be converging suddenly blows up. The most likely reason is that you used ReLU as the activation function and also used softmax, or another function containing exp, in the loss of the classification layer. When, in some iteration, a node in the last layer is over-activated (for example to 100), exp(100) = Inf overflows, all the weights become NaN after backpropagation, and from then on they stay NaN and the loss skyrockets. If the learning rate is set too high, training runs away and never comes back; if you stop at that point and inspect the weights of any layer, they are very likely all NaN. In this situation it is recommended to do a binary search over the range 0.1 to 0.0001; the optimal learning rate differs between models and tasks (see the learning-rate sweep sketch after this list).
  3. The number of hidden-layer neurons is wrong. Using too many or too few neurons can make the network difficult to train in some cases. Too few neurons lack the capacity to express the task, while too many make training slow and make it hard for the network to filter out noise. The number of hidden-layer neurons can start from 256 to 1024, and then you can look at the numbers used by researchers in published work as a reference; if they use numbers very different from this, think about why that might work. Before deciding on the number of hidden units, it is critical to consider the minimum number of values the network actually needs in order to pass the information through, and then slowly increase this number. For a regression task, consider using 2 to 3 times the number of input or output variables as the number of neurons. In fact, compared with other factors, the number of hidden units usually has a fairly small impact on the performance of a neural network, and in many cases increasing it simply slows down training.
  4. The network parameters are initialized incorrectly. If the network weights are not properly initialized, the network will not train. Commonly used weight initialization methods are 'he', 'lecun', and 'xavier'; in practice these methods perform very well, and the biases are usually initialized to 0. Choose the initialization method that best suits your task (see the initialization sketch after this list).
  5. There is no regularization. Typical regularization methods are dropout, adding noise, etc. Even if the amount of data is large or you think the network is unlikely to overfit, it is still necessary to regularize the network. Dropout usually starts with the parameter set to 0.75 or 0.9 (i.e., the probability of keeping a unit); adjust it depending on how likely you think the network is to overfit, and if you are sure the network will not overfit you can set it to 0.99. Regularization not only prevents overfitting; because of its stochastic nature it can also speed up training, help handle outliers in the data, and prevent extreme weight configurations. Data augmentation can also achieve a regularization effect. The best way to avoid overfitting is still to have a large amount of training data (see the dropout sketch after this list).
  6. The batch size is too large. Setting the batch size too high will reduce the accuracy of the network because it reduces the randomness of gradient descent. In addition, under otherwise equal conditions, the larger the batch size, the more training epochs are usually required to achieve the same accuracy. You can try smaller batch sizes such as 16, 8, or even 1. Using a smaller batch size allows more weight updates per epoch, which has two advantages: first, it can jump out of local minima; second, it can give better generalization performance.
  7. The learning rate is set incorrectly. Many deep learning frameworks enable gradient clipping by default. It can handle the gradient explosion problem, which is useful, but it also makes it difficult to find the optimal learning rate by default. If you clean the data correctly, remove outliers, and set an appropriate learning rate, you generally do not need gradient clipping. If you occasionally run into a gradient explosion problem, you can turn gradient clipping on, but this kind of problem usually indicates that something else is wrong with the data, and clipping is only a temporary fix (see the gradient clipping sketch after this list).
  8. The activation function of the last layer is wrong. Using the wrong activation function in the last layer will prevent the network from outputting the range of values you expect. The most common mistake is to use ReLU in the last layer, whose output can only be non-negative. For a regression task, in most cases there is no need for an activation function at the output, unless you know something specific about the values you expect. Think about what your data values actually represent and what their range is after normalization; most likely the outputs are unbounded positive and negative numbers, in which case the last layer should not use an activation function. If your output values are only meaningful within a certain range, such as probabilities in the range 0 to 1, then the last layer can use the sigmoid function (see the output-layer sketch after this list).
  9. The network has bad gradients. If the error does not change after training for several epochs, you may be suffering from dead ReLUs; try changing the activation function to leaky ReLU. The ReLU activation has a gradient of 1 for positive inputs and 0 for negative inputs, so the slope of the cost function with respect to some network weights will be 0. In this case we say the network is "dead", because those weights can no longer be updated.
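
Learning-rate sweep (point 2). A minimal sketch, not from the original post, of trying learning rates from large to small as suggested above; the PyTorch model, random data, and candidate rates are stand-ins.

```python
import torch
import torch.nn as nn

# Toy regression data; in practice substitute your own task.
x = torch.randn(256, 10)
y = torch.randn(256, 1)

def short_trial(lr, steps=200):
    """Train briefly with a given learning rate and report the final loss."""
    torch.manual_seed(0)  # identical initialization for a fair comparison
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

# Try rates from large to small, as the text suggests (0.1, 0.01, 0.001, ...).
for lr in [0.1, 0.01, 0.001, 0.0001]:
    print(f"lr={lr}: final loss {short_trial(lr):.4f}")
```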
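
Initialization sketch (point 4). A minimal PyTorch example, assumed rather than taken from the post, of applying He ('kaiming') initialization with zero biases; Xavier is shown in a comment as the common alternative for tanh/sigmoid networks. The layer sizes are arbitrary.

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 10))

def init_weights(m):
    """He initialization for layers followed by ReLU, biases set to 0."""
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)
        # For tanh/sigmoid networks, Xavier is the usual choice instead:
        # nn.init.xavier_uniform_(m.weight)

model.apply(init_weights)
```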
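
Dropout sketch (point 5). A minimal PyTorch example of adding dropout to a hidden layer. Note an assumption about conventions: the 0.75/0.9/0.99 values in the text read as keep probabilities, while PyTorch's nn.Dropout takes the probability of zeroing a unit, so the value is converted. The layer sizes are arbitrary.

```python
import torch.nn as nn

# A keep probability of 0.9 corresponds to a drop probability of 0.1
# in nn.Dropout, which takes the probability of zeroing a unit.
keep_prob = 0.9

model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Dropout(p=1.0 - keep_prob),   # regularize the hidden layer
    nn.Linear(512, 10),
)

model.train()   # dropout is active during training
model.eval()    # dropout is disabled automatically at evaluation time
```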
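
Gradient clipping sketch (point 7). A minimal PyTorch example of clipping the gradient norm before each optimizer step; the model, random data, and max_norm value are arbitrary stand-ins.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
x, y = torch.randn(64, 10), torch.randn(64, 1)

for _ in range(100):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    # Cap the gradient norm before the update; max_norm=1.0 is an arbitrary choice.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
```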
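
Output-layer sketch (point 8). A minimal illustration, under the same assumptions as above, of the two cases described: an unbounded regression output with no activation on the last layer, and an output constrained to 0 to 1 using sigmoid. The layer sizes are arbitrary.

```python
import torch.nn as nn

# Regression with unbounded targets: no activation on the last layer.
regressor = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                          nn.Linear(64, 1))          # raw, unbounded output

# Output only meaningful in [0, 1] (e.g. a probability): sigmoid on the last layer.
prob_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                           nn.Linear(64, 1), nn.Sigmoid())
```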

How to analyze the current status of the network through train loss and test loss?

The train loss continues to decrease, and the test loss continues to decrease, indicating that the network is still learning;

The train loss continues to decrease, and the test loss tends to remain unchanged, indicating that the network is overfitting;

The train loss tends to remain unchanged and the test loss continues to decrease, indicating that there is definitely a problem with the data set;

The train loss tends to remain unchanged and the test loss tends to remain unchanged, indicating that learning has encountered a bottleneck and the learning rate or batch size needs to be reduced;

The train loss keeps rising and the test loss keeps rising, indicating that the network structure is improperly designed, the training hyperparameters are improperly set, or the data set has not been properly cleaned.
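
A minimal sketch of recording both losses each epoch so the four patterns above can be read off the curves; this is an assumed setup, with a placeholder PyTorch model, random data, and MSE loss standing in for your own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model, data, and loss; replace with your own.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
train_dl = DataLoader(TensorDataset(torch.randn(512, 10), torch.randn(512, 1)), batch_size=32)
test_dl = DataLoader(TensorDataset(torch.randn(128, 10), torch.randn(128, 1)), batch_size=32)

def mean_loss(dl, train=False):
    """One pass over a loader; updates weights only when train=True."""
    model.train(train)
    total, n = 0.0, 0
    for xb, yb in dl:
        if train:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
        else:
            with torch.no_grad():
                loss = loss_fn(model(xb), yb)
        total += loss.item() * len(xb)
        n += len(xb)
    return total / n

for epoch in range(20):
    tr = mean_loss(train_dl, train=True)    # train loss for this epoch
    te = mean_loss(test_dl, train=False)    # test loss for this epoch
    print(f"epoch {epoch}: train {tr:.4f}  test {te:.4f}")
```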

Source: blog.csdn.net/qq_15719613/article/details/134968713