Several solutions to overfitting problems in deep learning

Today, let's talk about how the overfitting problem is commonly tackled in everyday deep learning practice.

Recall that high bias corresponds to underfitting, while high variance corresponds to overfitting.

Next, let's look at how to address overfitting.

1.Regularization

Let's review the L2 regularization introduced in logistic regression. Its cost function is:

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m}\|w\|_2^2, \qquad \|w\|_2^2 = \sum_{j=1}^{n_x} w_j^2$$

There is also an L1 regularization method, whose cost function is:

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m}\|w\|_1, \qquad \|w\|_1 = \sum_{j=1}^{n_x} |w_j|$$

Compared with L2 regularization, the w obtained with L1 regularization is sparser: many of its components are exactly zero. This saves storage space, since most of w is 0. In practice, however, L1 regularization is not more effective than L2 regularization at reducing high variance, and its derivative is more awkward to handle (the penalty is not differentiable at 0). Therefore, L2 regularization is generally the more common choice.
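
As a quick illustration, here is a minimal NumPy sketch (the example weight vector and the `lambd` and `m` values are made up) showing that the two penalties differ only in the norm applied to w:

```python
import numpy as np

def l2_penalty(w, lambd, m):
    """(lambda / 2m) * ||w||_2^2"""
    return (lambd / (2 * m)) * np.sum(np.square(w))

def l1_penalty(w, lambd, m):
    """(lambda / 2m) * ||w||_1"""
    return (lambd / (2 * m)) * np.sum(np.abs(w))

# A sparse weight vector of the kind L1 regularization tends to produce.
w = np.array([0.0, 0.0, 0.3, -1.2, 0.0])
print(l2_penalty(w, lambd=0.7, m=100), l1_penalty(w, lambd=0.7, m=100))
```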

In a deep learning model, the L2-regularized cost function is:

$$J\big(w^{[1]},b^{[1]},\dots,w^{[L]},b^{[L]}\big) = \frac{1}{m}\sum_{i=1}^{m} L\big(\hat{y}^{(i)}, y^{(i)}\big) + \frac{\lambda}{2m}\sum_{l=1}^{L}\big\|w^{[l]}\big\|_F^2$$

where $\|w^{[l]}\|_F^2 = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}\big(w_{ij}^{[l]}\big)^2$ is the squared Frobenius norm.

L2 regularization is also called weight decay. Because of the added regularization term, $dw^{[l]}$ gains an extra term $\frac{\lambda}{m}w^{[l]}$. When $w^{[l]}$ is updated, this extra term is subtracted, so $w^{[l]}$ ends up smaller than it would be without regularization:

$$dw^{[l]} = dw^{[l]}_{\text{backprop}} + \frac{\lambda}{m}w^{[l]}, \qquad w^{[l]} := w^{[l]} - \alpha\, dw^{[l]} = \Big(1 - \frac{\alpha\lambda}{m}\Big)w^{[l]} - \alpha\, dw^{[l]}_{\text{backprop}}$$

As the iterations continue, $w^{[l]}$ keeps shrinking.
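
To make the weight-decay effect concrete, here is a minimal NumPy sketch under the assumption of a plain gradient-descent update; the `dW_backprop` argument stands in for whatever gradient backpropagation produced, and `lambd`, `m`, and `learning_rate` are hypothetical hyperparameters:

```python
import numpy as np

def l2_regularized_cost(base_cost, weights, lambd, m):
    """Add the penalty (lambda / 2m) * sum of ||W[l]||_F^2 to the unregularized cost."""
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return base_cost + l2_penalty

def update_with_weight_decay(W, dW_backprop, lambd, m, learning_rate):
    """One gradient step: the extra (lambda / m) * W term shrinks W at every iteration."""
    dW = dW_backprop + (lambd / m) * W
    return W - learning_rate * dW
```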

2.Dropout Regularization

In addition to L2 regularization, there is another effective method to prevent overfitting: dropout regularization (randomly deactivating neurons during training).

Dropout means that during the training process of the deep learning network, neurons in each layer are temporarily dropped from the network according to a certain probability. In other words, during each training, some neurons in each layer do not work, which can simplify the complex network model and avoid overfitting.

For m samples, in a single training iteration, a certain proportion of the hidden-layer neurons is randomly deleted; the weights w and bias terms b are then updated by forward and backward propagation over the remaining neurons. In the next iteration, the previously deleted neurons are restored, another set of neurons is randomly deleted, and w and b are updated again in the forward and backward passes. This process is repeated until training is complete.

It is worth noting that after training with dropout, there is no need to apply dropout or randomly delete neurons when testing or actually using the model: all neurons are active.
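
As a sketch of how this is often implemented, here is a minimal NumPy version of the commonly used inverted-dropout variant for a single layer's activations (the activation matrix `A` and the keep_prob values are assumptions for this example). Dividing by keep_prob keeps the expected activation unchanged, which is one reason nothing extra needs to be done at test time:

```python
import numpy as np

def inverted_dropout(A, keep_prob, training=True):
    """Apply inverted dropout to one layer's activations A (shape: units x examples)."""
    if not training:
        # At test time all neurons stay active and no scaling is needed.
        return A
    D = np.random.rand(*A.shape) < keep_prob   # keep each neuron with probability keep_prob
    A = A * D                                  # drop the selected neurons
    A = A / keep_prob                          # rescale so the expected activation is unchanged
    return A
```

During forward propagation each hidden layer would call this with its own keep_prob, for example `A = inverted_dropout(A, keep_prob=0.8)` for a layer whose keep_prob is 0.8.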

To sum up, for the same training data, training several different neural networks and averaging their outputs can reduce overfitting. Dropout exploits this principle: by dropping a different subset of hidden-layer neurons each time, it is effectively training a different network at every step. This reduces the dependence between neurons, meaning that no neuron can rely on any particular set of other neurons (the neurons it connects to in adjacent layers), which forces the network to learn more robust features.

When using dropout, there are several points to note. First, the dropout coefficient keep_prob can be different for different hidden layers. Generally speaking, hidden layers with more neurons can use a smaller keep_prob, such as 0.5, while hidden layers with fewer neurons can use a larger keep_prob, such as 0.8 or even 1. In addition, in practice it is not recommended to apply dropout to the input layer; if the input dimension is very large, as with images, dropout can be used there, but keep_prob should be set large, such as 0.8 or 0.9. In general, the more prone a hidden layer is to overfitting, the smaller its keep_prob should be. There is no precise, fixed rule; the choice is usually made based on validation performance.

Dropout is widely used in computer vision (CV), because the input dimension is large and there are often not enough samples. It is worth noting that dropout is a regularization technique for preventing overfitting, so it is best used only when regularization is actually needed.

When using dropout, you can check whether it is implemented correctly by plotting the cost function. The usual approach is to set keep_prob of every layer to 1, i.e., keep all neurons active, and plot the cost function to confirm that J decreases monotonically. Once that check passes, keep_prob is set back to its intended values for subsequent training.
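
A rough sketch of this sanity check follows; the `train_iteration` stub only simulates a falling cost, standing in for a real training step that would run forward and backward propagation with dropout masks drawn according to `keep_probs`:

```python
import matplotlib.pyplot as plt

def train_iteration(keep_probs):
    """Stand-in for one training iteration returning the cost J.
    A real implementation would use keep_probs to draw dropout masks;
    this stub just simulates a decreasing cost so the sketch runs."""
    train_iteration.J = getattr(train_iteration, "J", 1.0) * 0.995
    return train_iteration.J

# For the check, set keep_prob = 1.0 in every layer so no neurons are dropped.
keep_probs = {1: 1.0, 2: 1.0, 3: 1.0}

costs = [train_iteration(keep_probs) for _ in range(500)]

plt.plot(costs)
plt.xlabel("iteration")
plt.ylabel("cost J")
plt.title("J with all keep_prob = 1 (should decrease monotonically)")
plt.show()
```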

3.Other regularization methods

In addition to L2 regularization and dropout regularization, there are other ways to reduce overfitting.

One way is to increase the number of training samples. However, collecting additional samples is usually expensive and difficult. Instead, we can process the existing training samples to "manufacture" more of them, which is called data augmentation. For example, in image recognition, existing pictures can be flipped horizontally or vertically, rotated, zoomed in or out, and so on. These operations "create" new training samples. Although they are derived from the original samples, they still help to increase the effective number of training samples at essentially no extra cost, and they help prevent overfitting.

In digit recognition, the original digit images can also be rotated or distorted, or have some noise added.
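
As a rough NumPy/SciPy sketch of such augmentations (the image array layout, the rotation range, and the noise level are assumptions for this example):

```python
import numpy as np
from scipy.ndimage import rotate

def augment(img, rng=None):
    """Produce simple variants of one image (an H x W x C float array in [0, 1])."""
    rng = rng if rng is not None else np.random.default_rng()
    samples = [img]
    samples.append(img[:, ::-1, :])                                    # horizontal flip
    samples.append(img[::-1, :, :])                                    # vertical flip
    angle = rng.uniform(-15, 15)                                       # small random rotation
    samples.append(rotate(img, angle, reshape=False, mode="nearest"))
    noisy = np.clip(img + rng.normal(0, 0.02, img.shape), 0.0, 1.0)    # add mild noise
    samples.append(noisy)
    return samples
```
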
There is another method to prevent overfitting: early stopping. As the number of training iterations of a neural network increases, the training set error generally decreases monotonically, while the dev set error first decreases and then increases. In other words, with too many training iterations the model fits the training samples better and better, but its performance on the dev (validation) set gradually gets worse, i.e., overfitting occurs. Therefore, more training iterations are not always better. The training set error and dev set error can be used together to select an appropriate number of iterations; this is early stopping.
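
A minimal sketch of early stopping as described above; the `train_one_epoch` and `dev_error` callables and the `patience` parameter are assumptions for this example:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, dev_error,
                              max_epochs=200, patience=10):
    """Stop training once the dev-set error has not improved for `patience` epochs.

    `train_one_epoch(model)` runs one epoch of training in place;
    `dev_error(model)` returns the current dev-set error. Both are assumed helpers.
    """
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = dev_error(model)
        if err < best_error:
            best_error = err
            best_model = copy.deepcopy(model)   # remember the best parameters so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                           # dev error has stopped improving: stop early
    return best_model, best_error
```
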
However, early stopping has its own drawback. Generally speaking, training a machine learning model has two goals: one is to optimize the cost function and make J as small as possible; the other is to prevent overfitting. These two goals can pull in opposite directions: driving J down further may cause overfitting, and vice versa. Ideally we would treat them as separate problems handled by separate tools, an idea known as orthogonalization. As mentioned at the beginning, in deep learning we would like to reduce bias and variance at the same time to build the best neural network model. Early stopping, however, prevents overfitting by cutting training short, so J never becomes small enough. In other words, early stopping ties the two goals together and tries to address them at once, so it loses the "divide and conquer" effect.

Compared with early stopping, L2 regularization does achieve this "divide and conquer" effect: training can run for as many iterations as needed to drive J down, while the regularization term still effectively prevents overfitting. One disadvantage of L2 regularization is that choosing a good regularization parameter λ takes more work; in this respect early stopping is simpler. In general, L2 regularization is more commonly used.


Origin blog.csdn.net/weixin_49005845/article/details/110857179