(Reprint) [Deep Learning] Solutions for when the loss does not drop

Original Address https://blog.csdn.net/zongza/article/details/89185852 

When training a neural network, we often run into a frustrating problem: the loss does not drop, so training stalls, or we simply cannot get a good model. There are many possible causes. Broadly speaking, a loss that does not drop falls into three cases: the loss does not drop on the training set, the loss does not drop on the validation set, and the loss does not drop on the test set. In what follows, we assume the reader already understands the concepts of overfitting and underfitting.

Training set loss does not drop
When the training-set loss refuses to drop during training, it is usually caused by one of the following issues.

1. Problems with the model structure or feature engineering

If the model structure itself is flawed, the model will be hard to train. A network architecture you "independently invent" from scratch often fails to fit the practical problem well; it is usually faster and better to start from an architecture and feature-engineering scheme that others have already designed, implemented, and tested, then improve and adapt it to your task. When the model structure is bad or too small, or the feature engineering is flawed, the model simply lacks the capacity to fit the data. This is the first big problem many people hit when starting new research or an engineering application.

For example, when I built WaveNet, I wired the output of a res_block incorrectly, which made the network very difficult to train.

2. Problems with the weight initialization scheme

Before training a neural network we must give its weights initial values, and choosing them well means consulting the relevant literature and picking the most suitable initialization scheme. Common schemes include all-zero initialization, random normal initialization, and random uniform initialization. The right scheme matters a great deal; with the wrong one, training can be painful to watch. I once trained a model with a poorly chosen initialization scheme: after half a day of training nothing moved and the loss stayed high; after switching the initialization scheme, the loss dropped off a cliff.

As a no-brainer default, Xavier normal or He normal initialization is recommended.
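As a rough illustration, assuming a PyTorch model (the helper name init_weights is just an example), applying He (Kaiming) initialization to the layers that feed into ReLU might look like this:

```python
import torch.nn as nn

def init_weights(module):
    """Apply He (Kaiming) initialization to conv/linear layers feeding into ReLU."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.apply(init_weights)  # recursively applies init_weights to every submodule
# For tanh/sigmoid layers, nn.init.xavier_normal_(module.weight) is the usual choice.
```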

3. Too much regularization

L1, L2, and Dropout are all used to prevent overfitting. When the training-set loss will not come down, consider whether you have over-regularized and pushed the model into underfitting. Usually there is no need to add regularization at the beginning; add it and tune it only after overfitting actually appears. If you stack these techniques on from the start, it becomes hard to tell whether the current architecture is designed correctly, and debugging gets much harder.

Batch normalization (BN) is recommended; it also has some ability to prevent overfitting.
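For illustration only, here is one possible PyTorch-style sketch (make_block is a hypothetical helper) of a building block where dropout stays off while debugging and is turned on only after overfitting appears, with batch normalization kept in as suggested above:

```python
import torch.nn as nn

def make_block(in_dim, out_dim, dropout_p=0.0):
    """A fully connected block; keep dropout_p=0.0 while debugging so excess
    regularization does not push the model into underfitting."""
    layers = [nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim), nn.ReLU()]
    if dropout_p > 0:
        layers.append(nn.Dropout(dropout_p))
    return nn.Sequential(*layers)

model = nn.Sequential(make_block(128, 64), make_block(64, 64), nn.Linear(64, 10))
```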

4. Choose appropriate activation and loss functions

Like initialization, the activation function and the loss function of a neural network should be chosen to match the type of task.

For example, in a convolutional neural network, the convolutional layers generally use ReLU as the activation function, because it effectively avoids vanishing gradients and, being piecewise linear, has a computational advantage. In recurrent neural networks, the recurrent layers generally use tanh or ReLU; fully connected layers also usually use ReLU. Only in the output layer of a classification network, a fully connected layer, is the softmax activation used.

As for the loss function: classification tasks usually use cross-entropy loss, regression tasks use mean squared error, automatic alignment tasks use CTC loss, and so on. The loss function is the model's measure of goodness of fit, and we want this value to be as small as possible. A well-chosen loss function lets optimization produce better model parameters.
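As a minimal sketch, assuming PyTorch, the loss functions mentioned above map to built-in criteria roughly as follows:

```python
import torch.nn as nn

# Classification: cross-entropy (expects raw logits, so no softmax layer before it).
cls_criterion = nn.CrossEntropyLoss()

# Regression: mean squared error.
reg_criterion = nn.MSELoss()

# Automatic alignment tasks such as speech recognition: CTC loss.
ctc_criterion = nn.CTCLoss(blank=0)
```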

5. Choose an appropriate optimizer and learning rate

Adam is generally the default choice of optimizer, but in some cases Adam is hard to train with, and then another optimizer such as SGD is needed. The learning rate determines how fast the network trains, but bigger is not better: as the network approaches convergence, a smaller learning rate is better for finding the optimum. So we need to adjust the learning rate by hand: first pick a suitable initial value, and when training stalls, lower the learning rate slightly and train for another while, by which point the model is usually close to fully converged. The usual adjustment is to multiply or divide by a factor of 10. There are now schemes that adjust the learning rate automatically, but we should still know how to tune it manually.
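For example, assuming PyTorch (the model below is only a stand-in), the manual "divide by 10 when training stalls" rule can be approximated with a plateau-based scheduler:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(128, 10)  # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # try SGD if Adam is hard to train
# Multiply the learning rate by 0.1 when the monitored loss stops improving.
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)

val_loss = 0.42           # validation loss from the current epoch (dummy value here)
scheduler.step(val_loss)  # call this once per epoch after evaluating
```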

6. Not enough training time

I sometimes get asked: why has the loss barely dropped, or why has the model not converged, after several hours of training? More haste, less speed! Different deep learning tasks have very different computational costs; when the required computation is large, a few hours is simply not enough, especially if you are training the model on your own PC's CPU. The general solution is faster hardware such as a GPU; for computer vision tasks the speedup is dramatic, mainly because of the convolutions. When hardware acceleration is not available, the only remaining option is to wait.
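As a small sketch (again assuming PyTorch, with a stand-in model), moving the model and each batch onto a GPU when one is available looks roughly like this:

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(128, 10).to(device)    # stand-in model, moved to the GPU if present
batch = torch.randn(32, 128).to(device)  # inputs must live on the same device as the model
output = model(batch)
```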

7. Training bottlenecks

Bottlenecks here typically include vanishing gradients, large numbers of dead neurons, exploding or diffusing gradients, and a learning rate that is too big or too small.

When gradients vanish, the loss struggles to drop, like walking on a plateau where almost everywhere is high ground; you can check which state the model is in by inspecting its current gradients. Sometimes there is also simply a bug in the backpropagation or gradient-update code.

With the ReLU activation, if a neuron's input X is always negative, its output stays at 0 and the neuron dies: its gradient is 0 and it can never recover. One solution is LeakyReLU, which gives the part of the curve to the left of the y-axis a small nonzero slope, so that the neuron can recover after some time. But LeakyReLU is not that commonly used, because a certain amount of neuron inactivation does not hurt the results; on the contrary, the zero outputs can be useful. Because ReLU outputs 0 for negative inputs, it can cleanly ignore the negative part of a convolution's output while keeping the positive part.
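If dead neurons do turn out to be the problem, swapping ReLU for LeakyReLU is a one-line change; a minimal PyTorch-style sketch:

```python
import torch.nn as nn

# LeakyReLU keeps a small slope (0.01 by default) for negative inputs,
# so a neuron whose input is always negative still receives some gradient.
block = nn.Sequential(
    nn.Linear(128, 64),
    nn.LeakyReLU(negative_slope=0.01),
    nn.Linear(64, 10),
)
```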

The root cause of exploding and diffusing gradients is that, by the chain rule, gradients in deep networks accumulate multiplicatively layer by layer: 1.1 to the n-th power goes to infinity, while 0.9 to the n-th power goes to zero. Some layers produce outputs that are too large and cause gradient explosion; in that case, put an upper bound on the output, for example with a max-norm constraint.
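As a rough sketch, assuming PyTorch and a toy stand-in network, inspecting gradient norms (to tell vanishing from exploding) and clipping the global gradient norm might look like this; gradient-norm clipping is one common way of bounding the update, alongside the max-norm constraint mentioned above:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))  # toy stand-in network
loss = model(torch.randn(4, 8)).mean()
loss.backward()

# Inspect per-parameter gradient norms: values near zero suggest vanishing
# gradients, very large values suggest explosion.
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: {param.grad.norm().item():.3e}")

# Cap the global gradient norm before optimizer.step() to contain explosions.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```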

The learning rate can be adjusted as described in point 5 above.

 

8. Batch size is too large

If the batch size is too small, the model will oscillate late in training and struggle to converge; if it is too large, early convergence is slow because the gradient is averaged over many samples. A batch size of 32 or 16 is a common choice, and for some tasks such as NLP, 8 samples per batch can work. Sometimes, however, the batch size is pushed to a very large value to reduce the ratio of communication overhead to computation, especially in parallel and distributed training.

9. The data set is not shuffled

If the data set is not shuffled, the network will learn with a certain bias. For example, if Zhang San and Li Si always appear in the same batch, the network will effectively "think of" Li Si whenever it sees Zhang San: when the gradient is updated, Zhang San's and Li Si's gradients are always averaged together, so the combined gradient keeps pointing in a fixed direction and the effective richness of the data set is reduced. After shuffling, Zhang San is sometimes averaged with Wang Wu and sometimes with Li Si, the gradients become more varied, and the network can better learn the features that are actually useful across the whole data set.
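For illustration, assuming PyTorch with some dummy tensors, both shuffling and the batch size are set on the data loader:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.randn(1000, 20)       # dummy data for illustration
labels = torch.randint(0, 2, (1000,))
dataset = TensorDataset(features, labels)

# shuffle=True reshuffles the samples every epoch, so the same examples are not
# always averaged together in one batch; batch_size matches the values
# discussed in point 8.
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```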

10. Problems with the data set

When the data set contains too much noise, or a large fraction of the labels are wrong, the network will find it hard to learn anything useful and the loss will oscillate. It is like being told by one person that 1 + 1 = 2 and by another that 1 + 1 = 3: you end up confused. Errors can also be introduced when the data is read, which has the same effect as mislabeled data. In addition, class imbalance makes it hard to learn the essential features of under-represented classes, because there is too little information about them.

11. Features are not normalized

Unnormalized features lead to imbalanced scales, like 1 km versus 1 cm, which makes the error blow up; or, at the same learning rate, the model takes five-centimetre steps, swinging left and right, while it needs to move forward a kilometre. Sometimes the imbalance simply comes from different units of measurement, such as kg and m: 1 kg and 1 m are not comparable even though both numbers are 1. We can therefore rescale the features so that their numeric ranges are closer to each other.

(Figure: comparison of feature values before and after normalization)
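As a minimal sketch using plain tensors (no particular data library assumed), standardizing each feature column to zero mean and unit variance removes this kind of scale imbalance:

```python
import torch

features = torch.tensor([[1000.0, 0.01],   # two features on wildly different
                         [2000.0, 0.02],   # scales, e.g. metres vs. kilograms
                         [1500.0, 0.03]])

# Standardize each column so no single unit of measurement dominates the error.
mean = features.mean(dim=0, keepdim=True)
std = features.std(dim=0, keepdim=True)
normalized = (features - mean) / (std + 1e-8)  # epsilon guards against zero variance
```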

12. Problems with feature selection and feature engineering

Choosing unreasonable data features is much like mislabeling the data: it makes it hard for the network to learn the data's essential characteristics. The essence of machine learning is feature engineering, plus cleaning the data (only half joking).

Validation set loss does not drop
A validation-set loss that does not drop falls into two cases. In the first, the training-set loss is not dropping either, so the problem is really with the training set and the solutions above apply. In the second, the training-set loss drops but the validation-set loss does not; that is the case we discuss here.

Since the validation set is carved out of the same batch of data as the training set, data-set problems generally do not apply here; the issue is mostly overfitting. Solving overfitting is not conceptually complicated; there are only a few standard approaches, though they place higher demands on the engineer's resources and conditions.

1. Appropriate regularization and dimensionality reduction

Regularization is a very important tool for fixing overfitting. For example, add a regularization term with a hand-chosen coefficient lambda and apply weight decay: the weights of features that contribute little are decayed toward zero, which is equivalent to removing those features. This is similar in spirit to dimensionality reduction: it reduces the effective feature dimension, removes largely irrelevant dimensions, and keeps the model from overfitting to them. Adding Dropout or normalization layers between layers of the network also helps suppress overfitting.
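In PyTorch-style code, for example, the L2 coefficient lambda described above usually appears as the optimizer's weight_decay argument (the model here is just a stand-in):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # stand-in for the real network
# weight_decay is the L2 regularization coefficient (the lambda above); it
# shrinks the weights of uninformative features toward zero at every update.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```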

2. Appropriately reduce the model size

A very important cause of overfitting is that the model is too complex, like planting a single wheat stalk in an acre of land: the empty space grows nothing but weeds, which means the model ends up fitting noise. So besides regularization, appropriately shrinking the model also matters: try to match the hypothesis space of the network structure to the amount of information the model is actually expected to store.

3. Get more data

This is the ultimate solution: deep learning is built on large amounts of data. The three pillars of deep learning are data, models, and hardware. Models can often be reused directly, hardware can be bought, but data has to be collected bit by bit, and many problems can only be solved with lots of it; without data you can do nothing.

 

Test set loss does not drop
The test set usually contains data the model never saw during training, or real data from the target application scenario. Cases where the training or validation loss also fails to drop belong to the previous two sections, so here we assume both of those behave normally. If the test-set loss is high, or the accuracy low, it is generally because the distribution of the training data does not match that of the application scenario or the test data.

1. The application scenario does not match the training data

For example, if a speech recognition model's training data consists entirely of recordings of female voices, it will not recognize male voices well. This is a real case I ran into when doing speech recognition; the solution was to add a large amount of male speech to the training data.

 

2. Noise problems

Noise is a frequent problem in real application scenarios. A simple example: in speech recognition, the training data may be collected in a quiet environment, but in practice recordings always contain some noise, so we need to handle it specially, for example by denoising, or by adding noise to the training data. For image recognition, consider occlusion, haze, rotation, mirroring, and changes in scale or distance.
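For the image case, one possible sketch using torchvision transforms (the exact transforms and parameter values are just examples) covers rotation, mirroring, scale changes, lighting shifts, and partial occlusion:

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomRotation(degrees=15),                 # rotation
    T.RandomHorizontalFlip(p=0.5),                # mirroring
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),   # near/far scale changes
    T.ColorJitter(brightness=0.3, contrast=0.3),  # rough stand-in for haze/lighting shifts
    T.ToTensor(),
    T.RandomErasing(p=0.25),                      # random occlusion (applied to the tensor)
])
```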
--------------------- 
Author: sz Sun Xiaochuan 
Source: CSDN 
Original: https://blog.csdn.net/zongza/article/details/89185852 
Copyright: This is an original article by the blogger. If you repost it, please include a link to the original post.
