[Deep learning] Effective ways to prevent overfitting

Making the hypothesis excessively complex just to fit the training data perfectly is called overfitting. An overfit model performs well on the training set but poorly on the test set; in other words, its generalization ability is weak.

Measures to prevent overfitting

1 Data Augmentation

In object recognition, data augmentation has become a particularly effective technique. The position, pose, scale, and illumination of an object in an image do not change its class, so the training set can be expanded many times over by translating, flipping, scaling, and cropping the images. In speech recognition, adding noise to the audio is likewise regarded as a form of data augmentation.
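As a rough illustration, the sketch below builds an augmentation pipeline with torchvision; the specific transforms and their parameters are illustrative assumptions, not a fixed recipe:

```python
from torchvision import transforms

# Random transforms are applied on the fly, so every epoch sees slightly
# different versions of the same images, effectively enlarging the training set.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                     # flipping
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),        # scaling + cropping
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # translation
    transforms.ColorJitter(brightness=0.2),                     # illumination change
    transforms.ToTensor(),
])

# e.g. datasets.ImageFolder("train/", transform=train_transform)
```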

2 Early Stopping

During training, record the best validation accuracy seen so far. If the validation accuracy has not exceeded this best value for 10 consecutive epochs, the accuracy is considered to have stopped improving and training can be stopped at that point.
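A framework-agnostic sketch of this rule follows; `train_one_epoch` and `evaluate` are hypothetical helpers assumed to exist in your training code, and the patience of 10 epochs matches the description above:

```python
max_epochs = 100
patience = 10          # stop after 10 epochs without a new best validation accuracy
best_acc = 0.0
bad_epochs = 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)        # hypothetical training step
    val_acc = evaluate(model, val_loader)       # hypothetical validation step

    if val_acc > best_acc:
        best_acc = val_acc
        bad_epochs = 0
        # save_checkpoint(model)                # keep the best weights so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Early stopping at epoch {epoch}: no improvement for {patience} epochs")
            break
```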

3 Regularization

Adding a regularization term to the loss is another way to prevent overfitting. Loss functions can be divided into the empirical risk loss and the structural risk loss: the structural risk loss is the empirical loss plus a regularization term that represents the complexity of the model. The regularization term is usually the L1 or L2 norm of the weights. Minimizing the structural risk loss can effectively prevent overfitting.

L1 regularization refers to the sum of the absolute values of the elements of the weight vector w, usually written as the 1-norm of w. L1 regularization produces a sparse weight matrix, i.e. a sparse model that can be used for feature selection, and to a certain extent it also prevents overfitting.
Sparse parameters (L1): sparsity in the parameters enables feature selection to a certain extent. A sparse matrix is one in which most elements are 0 and only a few are non-zero. In general, only a small subset of the features contributes to the model, while most features contribute little or nothing. Introducing an L1 penalty drives the parameters of such features to 0, so the useless features can be eliminated, achieving feature selection.
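A minimal PyTorch-style sketch of adding an L1 penalty to the empirical loss is shown below; `model`, `criterion`, `outputs`, and `targets` are assumed to be defined elsewhere, and the coefficient 1e-4 is purely illustrative:

```python
lambda_l1 = 1e-4                                   # illustrative regularization strength

ce_loss = criterion(outputs, targets)              # empirical risk
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = ce_loss + lambda_l1 * l1_penalty            # structural risk = empirical loss + L1 term
loss.backward()
```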

L2 regularization refers to the sum of the squares of the elements of the weight vector w (the squared 2-norm of w). L2 regularization can prevent the model from overfitting.
Smaller parameters (L2): the more complex a model is, the harder it tries to fit every sample, which produces large fluctuations over small intervals; large fluctuations mean large derivatives, and large derivatives require large parameters. For a model with large parameters, a small shift in the input data can change the output dramatically; if the parameters are small, the same shift has little effect on the result, so the model adapts better to different data sets, i.e. it has stronger generalization ability, and overfitting is avoided to a certain extent.
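In PyTorch, a common way to apply an L2 penalty is the optimizer's weight_decay argument; the sketch below assumes `model` is already defined, and the value 1e-4 is only illustrative:

```python
import torch

# weight_decay adds a penalty proportional to the squared 2-norm of the weights
# to every parameter update, which is a standard way to apply L2 regularization.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```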

4 Dropout

In a neural network, Dropout prevents overfitting by modifying the number of active neurons in the hidden layers, that is, by modifying the deep network itself.
When each batch of data is trained, Dropout randomly drops neurons with a given probability p, and only the parameters of the retained neurons are updated. Because a different subset of neurons is dropped for each batch, the network acquires a certain sparsity, which reduces the co-adaptation between different features, and only part of the network's parameters are updated at a time. Dropping neurons weakens the joint adaptation between them and strengthens the generalization ability and robustness of the network. The dropout probability p is a hyperparameter used only during training; no neurons are dropped at test time.
Dropout is widely used in fully connected layers. In convolutional layers, because of the sparsity of convolution itself and the use of the ReLU activation function, Dropout is used much less often.
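A minimal sketch of Dropout in a fully connected PyTorch network is given below; the layer sizes and the drop probability p=0.5 are illustrative assumptions:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden unit is zeroed with probability 0.5 during training
    nn.Linear(256, 10),
)

model.train()   # training mode: Dropout is active
model.eval()    # evaluation mode: Dropout is disabled, all units are kept
```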

5 Reduce the number of features

Removing redundant or uninformative input features directly reduces the complexity of the model, which also helps prevent overfitting.

Origin: blog.csdn.net/ao1886/article/details/109511266