Reasons for non-convergence or training failure during neural network training

This article summarizes the reasons for non-convergence or training failure during neural network training from three aspects: data and labels, the model, and how to analyze the current state of the network through the train loss and test loss.

When faced with a model that does not converge, first make sure that the amount of training is sufficient. During training, the loss will not decrease monotonically and the accuracy will not improve monotonically; there will be some oscillation, which is fine as long as the general trend is toward convergence. If the network still has not converged after enough training (generally thousands or tens of thousands of iterations, or dozens of epochs), then consider the measures below.

Data and Labels

  1. The data was not preprocessed. Are the classification labels accurate? Is the data clean?

  2. The data was not normalized. Different evaluation indicators often have different dimensions and units, and this affects the results of data analysis. To eliminate the influence of dimension between indicators, the data needs to be standardized so that the indicators become comparable. After standardization, all indicators are on the same order of magnitude, which makes comprehensive comparison and evaluation possible. In addition, most of the neural network pipeline assumes that the inputs and outputs are distributed around 0, from weight initialization to activation functions to the optimization algorithms used for training. So subtract the mean from the data and divide by the standard deviation.
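A minimal sketch of per-feature standardization with NumPy; the arrays and the train/test split here are hypothetical placeholders:

```python
import numpy as np

# Hypothetical training and test feature matrices (rows = samples, columns = features).
X_train = np.random.rand(1000, 20) * 50.0 + 10.0
X_test = np.random.rand(200, 20) * 50.0 + 10.0

# Compute statistics on the training set only, then apply them to both splits,
# so every feature ends up roughly zero-mean and unit-variance.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8  # epsilon avoids division by zero for constant features

X_train_norm = (X_train - mean) / std
X_test_norm = (X_test - mean) / std
```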

  3. The samples carry too much information, so the network does not have enough capacity to fit the entire sample space. (Having too few samples would only cause an overfitting problem.) Check whether the loss on your training set has converged. If only the validation set fails to converge, that indicates overfitting; in this case consider the usual anti-overfitting tricks, such as dropout, SGD, increasing the number of mini-batches, reducing the number of nodes in the fully connected layer, momentum, fine-tuning, etc. (a minimal dropout sketch follows below).
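A hedged sketch of two of the tricks above in PyTorch, keeping the fully connected layer modest and adding dropout before the output; the layer sizes and drop probability are illustrative, not prescriptive:

```python
import torch.nn as nn

# A small classifier with a modest fully connected layer and dropout to combat overfitting.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),   # fewer fc nodes than a very wide layer
    nn.ReLU(),
    nn.Dropout(p=0.5),         # randomly zeroes activations during training
    nn.Linear(256, 10),
)
```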

  4. Check whether the labels are set correctly.

Model

  1. The network settings are unreasonable. If you are doing a very complex classification task but only use a very shallow network, training may struggle to converge. Choose an appropriate network, or try deepening the current one. Generally speaking, the deeper the network, the better, but you can start by building a 3 to 8 layer network; once that is working well, you can experiment with deeper networks to improve accuracy. Starting from a small network means training is faster, and you can vary different parameters and observe their effect on the network instead of simply stacking more layers.

  2. The learning rate is inappropriate. If it is too large, training will not converge; if it is too small, convergence will be very slow. When training a new network yourself, you can start from 0.1; if the loss shows no sign of decreasing, divide by 10 and try 0.01. Generally 0.01 will converge; if not, use 0.001. A learning rate that is too large oscillates easily. However, it is also not recommended to set the learning rate too small at the very beginning of training, or the loss will not come down. My approach is to decrease it gradually, trying 0.1, 0.08, 0.06, 0.05... until training behaves normally. The harm of too low a learning rate should not be underestimated either; appropriately increasing the momentum, or increasing the mini-batch size so that updates fluctuate less, can also help.

  If the learning rate is set too high, the loss can "run away" (it suddenly becomes very large). This is the most common situation for novices: why does the network suddenly blow up just as it looked like it was converging? The most likely reason is that you use ReLU as the activation function, and softmax or another function containing exp as the loss of the classification layer. When, in some training step, a node in the last layer becomes over-activated (say 100), then exp(100) = Inf, an overflow occurs, all weights become NaN after backprop and stay NaN from then on, so the loss explodes (a small overflow sketch follows below). If the learning rate is set too high, the loss runs away and never comes back; if you stop at that point and inspect the weights of a randomly chosen layer, they are very likely all NaN. In this case it is recommended to binary-search the learning rate in the range 0.1 to 0.0001. Different models and different tasks have different optimal learning rates.
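A minimal NumPy illustration of the overflow described above: one over-activated logit makes a naive softmax produce inf and then NaN, which is exactly what then propagates through the weights once backprop runs. The logit values are made up for illustration:

```python
import numpy as np

def naive_softmax(logits):
    # Exponentiating a large logit directly overflows float32 (exp(100) -> inf).
    e = np.exp(logits)
    return e / e.sum()

def stable_softmax(logits):
    # Subtracting the max keeps every exponent <= 0, so nothing overflows.
    shifted = logits - logits.max()
    e = np.exp(shifted)
    return e / e.sum()

# One over-activated node with value 100, in float32 as typically used during training.
logits = np.array([1.0, 2.0, 100.0], dtype=np.float32)
print(naive_softmax(logits))   # overflow -> [ 0.  0. nan] plus runtime warnings
print(stable_softmax(logits))  # well-behaved -> roughly [0. 0. 1.]
```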

  3. The number of neurons in the hidden layer is wrong. In some cases, using too many or too few neurons can make the network difficult to train. Too few neurons lack the capacity to express the task, while too many slow training down and make it hard for the network to filter out noise. The number of hidden neurons can be set starting from 256 to 1024, and then you can look at the numbers used by other researchers as a reference; if they used values very different from this, consider why that might be. Before deciding how many units to use in the hidden layer, the most important thing is to consider the minimum number of values the network actually needs in order to pass the information through, and then slowly increase this number. For regression tasks, consider using 2 to 3 times as many neurons as the number of input or output variables. In practice, the number of hidden units usually has a fairly small impact on the performance of a neural network compared with other factors, and in many cases increasing it simply slows down training.
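A hedged sketch of an MLP where the hidden width is a single knob you can sweep; the starting value of 256, the depth, and the dimensions are illustrative only:

```python
import torch.nn as nn

def make_mlp(in_dim, out_dim, hidden_units=256, depth=3):
    """Build a simple MLP whose hidden width can be swept (e.g. 256, 512, 1024)."""
    layers, width = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(width, hidden_units), nn.ReLU()]
        width = hidden_units
    layers.append(nn.Linear(width, out_dim))
    return nn.Sequential(*layers)

# Try a few widths and compare validation performance before committing to one.
for h in (256, 512, 1024):
    model = make_mlp(in_dim=128, out_dim=10, hidden_units=h)
```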

  4. The network parameters are initialized incorrectly. If the network weights are not properly initialized, the network will not be able to train. Commonly used weight initialization methods are 'he', 'lecun', and 'xavier'; in practice these methods perform very well, and the network bias is usually initialized to 0. You can choose whichever initialization method is most suitable for your task.
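A hedged PyTorch sketch of applying 'he' (Kaiming) or 'xavier' initialization to the linear layers and zeroing the biases; the choice of normal vs. uniform variants here is just one common option:

```python
import torch.nn as nn

def init_weights(module, scheme="he"):
    """Initialize Linear layers with He or Xavier weights and zero biases."""
    if isinstance(module, nn.Linear):
        if scheme == "he":
            nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        elif scheme == "xavier":
            nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.apply(lambda m: init_weights(m, scheme="he"))
```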

  5. There is no regularization. Regularization typically means dropout, adding noise, and so on. Even if the amount of data is large, or you think the network is unlikely to overfit, it is still worth regularizing the network. Dropout usually starts with a parameter of 0.75 or 0.9, adjusted according to how likely you think the network is to overfit; if you are sure the network will not overfit, you can set the parameter to 0.99. Regularization not only prevents overfitting; because of its stochastic nature, it can also speed up training, help deal with outliers in the data, and prevent extreme weight configurations of the network. Data augmentation can achieve a regularization effect as well, and the best way to avoid overfitting is to have a large amount of training data.
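Note that the 0.75/0.9/0.99 values above read like keep probabilities; PyTorch's `nn.Dropout` takes the drop probability `p` instead, so a keep probability of 0.9 corresponds to `p=0.1`. A hedged sketch combining dropout with weight decay (an L2 penalty), another standard regularizer; all hyperparameter values are illustrative:

```python
import torch
import torch.nn as nn

# Keep probability 0.9  ->  drop probability p = 1 - 0.9 = 0.1
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Dropout(p=0.1),
    nn.Linear(256, 10),
)

# weight_decay adds an L2 penalty on the weights during optimization.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
```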

  6. The batch size is too large. Setting the batch size too large reduces the accuracy of the network because it reduces the randomness of gradient descent. In addition, all else being equal, the larger the batch size, the more training epochs are usually required to reach the same accuracy. You can try smaller batch sizes such as 16, 8, or even 1. With a smaller batch size, more weight updates are performed in each epoch. This has two advantages: first, it can help jump out of local minima; second, it can show better generalization performance.
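A hedged sketch of sweeping small batch sizes with a PyTorch DataLoader; the dataset here is a random placeholder:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 1,000 samples of 20 features with 10 classes.
dataset = TensorDataset(torch.randn(1000, 20), torch.randint(0, 10, (1000,)))

# Smaller batches mean more (noisier) weight updates per epoch.
for batch_size in (16, 8, 1):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    print(batch_size, "->", len(loader), "updates per epoch")
```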

  7. The learning rate is set incorrectly. Many deep learning frameworks enable gradient clipping by default. It can handle the gradient explosion problem and is very useful, but it also makes it difficult to find the optimal learning rate by default. If you clean the data correctly, remove outliers, and set a proper learning rate, you do not actually need gradient clipping; if you occasionally run into gradient explosions, you can then turn it on. However, this kind of problem usually indicates that something else is wrong with the data, and gradient clipping is only a temporary fix.
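A hedged sketch of enabling gradient clipping explicitly in a PyTorch training step; the max_norm value, model, and data are illustrative, and as noted above this should be treated as a stop-gap rather than a fix for bad data or a bad learning rate:

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x, y = torch.randn(32, 20), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
# Rescale gradients so their global norm does not exceed max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```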

  8. The activation function of the last layer is wrong. Using the wrong activation function in the last layer prevents the network from outputting values in the range you expect. The most common mistake is using ReLU in the last layer, whose output has no negative values. For a regression task, you do not need an activation function in most cases, unless you know something specific about the range of values you expect as output. Think about what your data values actually represent and what their range is after de-normalization; most likely the output should be unbounded positive and negative numbers, in which case the last layer should not use an activation function. If your output is only meaningful within a certain range, for example probabilities in the range 0~1, then the last layer can use the sigmoid function.
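A hedged sketch of the two cases described above: an unbounded regression head with no final activation, versus a sigmoid head for outputs that must lie in 0~1; the layer sizes are illustrative:

```python
import torch.nn as nn

# Regression: the target can be any real number, so no activation on the last layer.
regressor = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

# Output constrained to (0, 1), e.g. a probability: squash with sigmoid.
prob_head = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
```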

  9. The network has bad gradients. If you have trained for several epochs and the error has not changed, you may be using ReLU; try replacing the activation function with leaky ReLU. The ReLU activation function has a gradient of 1 for positive values and 0 for negative values, so the slope of the cost function with respect to some network weights becomes 0. In this case we say the network is "dead", because it can no longer be updated.
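A hedged sketch of swapping ReLU for leaky ReLU so that negative pre-activations keep a small nonzero gradient; the slope 0.01 is PyTorch's default, shown explicitly here:

```python
import torch.nn as nn

# ReLU: gradient is 0 for negative inputs, so a unit can "die".
dead_prone = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# LeakyReLU keeps a small slope on the negative side, so gradients can still
# flow and a dead unit has a chance to recover.
leaky = nn.Sequential(nn.Linear(128, 64), nn.LeakyReLU(negative_slope=0.01), nn.Linear(64, 10))
```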

How to analyze the current status of the network through train loss and test loss?

The train loss keeps decreasing, and the test loss keeps decreasing, indicating that the network is still learning;

The train loss continues to decline, and the test loss tends to remain unchanged, indicating that the network is overfitting;

The train loss tends to remain unchanged, and the test loss keeps decreasing, indicating that there is definitely a problem with the data set;

The train loss tends to remain unchanged, and the test loss tends to remain unchanged, indicating that learning has hit a bottleneck and the learning rate or batch size needs to be reduced;

The train loss keeps rising, and the test loss keeps rising, indicating problems such as an improperly designed network structure, improperly set training hyperparameters, or a data set that has not been cleaned.
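A hedged sketch of recording the train loss and test loss each epoch so the trends above can actually be inspected; the model, data, and number of epochs are placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(20, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Placeholder regression data split into train and test sets.
train_loader = DataLoader(TensorDataset(torch.randn(800, 20), torch.randn(800, 1)), batch_size=32)
test_loader = DataLoader(TensorDataset(torch.randn(200, 20), torch.randn(200, 1)), batch_size=32)

for epoch in range(20):
    model.train()
    train_loss = 0.0
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * x.size(0)

    model.eval()
    test_loss = 0.0
    with torch.no_grad():
        for x, y in test_loader:
            test_loss += criterion(model(x), y).item() * x.size(0)

    # Compare the two curves against the cases listed above.
    print(f"epoch {epoch}: train {train_loss / 800:.4f}, test {test_loss / 200:.4f}")
```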

Origin blog.csdn.net/qq_29788741/article/details/132240907