Deep learning: solving the problem of model overfitting

Article Directory
1. Overfitting
   1. What is overfitting
   2. Why does the model produce overfitting
   3. The purpose of tuning the training model
   4. Explanation of underfitting
2. How to reduce overfitting
   1. Obtain more training data
   2. Reduce network size
   3. Add weight regularization
   4. Add dropout regularization (randomly discard some output features)
3. Practical steps to reduce overfitting
   1. Reduce network size
   2. Add weight regularization
   3. Add dropout regularization
1. Overfitting
1. What is overfitting
Does the performance of a model you have just trained always reach its highest point on the held-out validation data after only a few epochs, and then begin to decline?
Typically, the accuracy of the model on the training set keeps increasing, while its accuracy on the validation set peaks after a few epochs and then begins to decline. At this point the model has started to overfit the training data: it begins to learn patterns that are specific to the training data but that are misleading or irrelevant for new data.

2. Why does the model produce overfitting?
This happens because:
(1) Much of the data contains noise. The model tries to fit the noisy points as well as possible, which leads to overfitting.
(2) There is not enough training data, so the model over-extracts features that have nothing to do with what it is trying to predict.
(3) The model is too complex and extracts features of the training data too aggressively, so they do not carry over to the test set.

3. The purpose of tuning the training model
The purpose of training a model is to obtain a good neural network, i.e. one that predicts with high accuracy on the data. This requires two things:
1. Adjusting the model to get the best possible performance on the training data, i.e. optimization.
2. At the same time, making the trained model perform well on data it has never seen before, i.e. good generalization. Tuning the model on the training data is ultimately done to improve its generalization ability.
Deep learning models are usually very good at fitting the training data; the real challenge is generalization, not fitting.

4. Explanation of underfitting
If a trained model performs poorly on the training set, it will also perform poorly on the test set; this may be caused by underfitting. Underfitting means the model does not fit the data well: the data points are far from the fitted curve, or the model has not captured the characteristics of the data. As long as lowering the loss on the training data also lowers the loss on the test data, the model is still underfitting: the network is too small and has not yet modeled all the relevant patterns in the training data. Appropriately increasing the number of hidden units or the number of layers usually solves the problem, as in the sketch below.
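A minimal sketch of that fix (my own illustration, not from the original article, assuming the same 10,000-dimensional input used in the IMDB example later in this post): go from a network that is too small to one with more units and an extra hidden layer.

from keras import models, layers

# A tiny network that is likely to underfit
underfit_model = models.Sequential()
underfit_model.add(layers.Dense(2, activation='relu', input_shape=(10000,)))
underfit_model.add(layers.Dense(1, activation='sigmoid'))

# More units per layer and one extra hidden layer to increase capacity
larger_model = models.Sequential()
larger_model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
larger_model.add(layers.Dense(16, activation='relu'))
larger_model.add(layers.Dense(1, activation='sigmoid'))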

2. How to reduce overfitting
1. Obtain more training data.
In order to prevent the model from learning wrong or irrelevant patterns from the training data, the optimal solution is to obtain more training data. The more training data the model has, the better the generalization ability will be.

If no more data is available, a sub-optimal solution is to regulate the amount of information the model is allowed to store, or to impose constraints on the information the model is allowed to store. If a network can only remember a few patterns, the optimization process forces the model to focus on learning the most important patterns, which are more likely to generalize well.
This method of reducing overfitting is called regularization.


2. Reduce network size
The simplest way to prevent overfitting is to reduce the size of the model, that is, to reduce the number of learnable parameters (which is determined by the number of layers and the number of units per layer). At the same time, the model must keep enough parameters to avoid underfitting; this is a balance you need to find.
To find this balance, you need to evaluate a series of different network architectures (evaluated on the validation set) in order to find the best model size for the data. The general workflow is to start with relatively few layers and parameters, and then gradually increase the size of the layers or add new layers until the increases have little effect on the validation loss, as in the sketch below.
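A schematic sketch of that workflow (my own illustration, not from the original article): loop over a few candidate hidden-layer sizes, train each candidate, and keep the size with the lowest validation loss. It assumes vectorized arrays partial_x_train, partial_y_train, x_val and y_val like the ones built in section 3 below.

from keras import models, layers

results = {}
for units in [4, 16, 64]:                      # candidate layer sizes
    model = models.Sequential()
    model.add(layers.Dense(units, activation='relu', input_shape=(10000,)))
    model.add(layers.Dense(units, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
    history = model.fit(partial_x_train, partial_y_train,
                        epochs=20, batch_size=512,
                        validation_data=(x_val, y_val), verbose=0)
    results[units] = min(history.history['val_loss'])
print(results)  # choose the size whose validation loss is lowest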

3. Add weight regularization
Simple models are less prone to overfitting than complex models; a simple model here means a model with fewer parameters. Another common way to reduce overfitting is to force the model's weights to take only small values, which limits the complexity of the model and makes the distribution of weight values more regular. This method is called weight regularization.
It is implemented by adding to the network's loss function a cost associated with having large weights. This cost takes two forms (illustrated in the sketch after this list):
L1 regularization: the added cost is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).
L2 regularization: the added cost is proportional to the square of the weight coefficients (the squared L2 norm of the weights). L2 regularization of neural networks is also called weight decay.
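As a rough illustration (my own sketch, not part of the original article), the two penalties can be written out directly; weights is a hypothetical array holding one layer's weight coefficients, and the factor 0.001 matches the regularization factor used in the Keras example later on.

import numpy as np

def l1_penalty(weights, l1=0.001):
    # Cost proportional to the sum of absolute weight values (L1 norm)
    return l1 * np.sum(np.abs(weights))

def l2_penalty(weights, l2=0.001):
    # Cost proportional to the sum of squared weight values (squared L2 norm)
    return l2 * np.sum(np.square(weights))

# The regularized loss is then: total_loss = data_loss + penalty(weights),
# and the penalty is only added at training time.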

4. Add dropout regularization (randomly discard some output features)
Dropout is one of the most effective and most commonly used regularization methods for neural networks. Applying dropout to a layer means randomly discarding (setting to 0) some of the layer's output features during training. Suppose that during training a given layer would normally return the vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample. After applying dropout, several randomly chosen elements of this vector become 0, for example [0, 0.5, 1.3, 0, 1.1]. The dropout rate is the fraction of features that are set to 0, usually between 0.2 and 0.5. At test time no units are dropped; instead, the layer's output values are scaled down by a factor equal to the dropout rate, to balance the fact that more units are active than at training time. The sketch below illustrates this with plain NumPy.
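A toy sketch (assumed example values, not Keras internals) of the mechanism described above, with a dropout rate of 0.5:

import numpy as np

# One layer's output for a single sample (the example vector from the text)
layer_output = np.array([0.2, 0.5, 1.3, 0.8, 1.1])

# Training time: randomly zero out roughly half of the output features
mask = np.random.randint(0, high=2, size=layer_output.shape)
training_output = layer_output * mask

# Test time: keep every unit but scale the output down by the dropout rate
test_output = layer_output * 0.5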

3. Practical steps to reduce overfitting
The following uses the IMDB dataset as a hands-on example to study methods for reducing model overfitting. It contains 50,000 highly polarized reviews from the Internet Movie Database (IMDB). The dataset is built into the Keras library and has already been preprocessed: reviews (sequences of words) have been transformed into sequences of integers, where each integer represents a word in a dictionary.

import numpy as np
from keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# Multi-hot encode each review (a sequence of word indices) into a 10,000-dimensional 0/1 vector
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels).astype('float32')  # vectorize the labels
y_test = np.asarray(test_labels).astype('float32')


To monitor the model's accuracy on previously unseen data during training and to tune the model accordingly, 10,000 samples are set aside from the original training data as a validation set. While tuning the model, nothing must be learned from the test set until the final model is chosen.

x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]


Using mini-batches of 512 samples, the model is trained for 20 epochs while monitoring loss and accuracy on the held-out set of 10,000 samples.

Train a reference network:
from keras import models
from keras import layers
original_model = models.Sequential()
original_model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
original_model.add(layers.Dense(16, activation='relu'))
original_model.add(layers.Dense(1, activation='sigmoid'))
original_model.compile(optimizer='rmsprop',
                       loss='binary_crossentropy',
                       metrics=['acc'])
original_hist = original_model.fit(partial_x_train,
                                   partial_y_train,
                                   epochs=20,
                                   batch_size=512,
                                   validation_data=(x_val, y_val))

It can be seen that the training loss decreases with every epoch, but the validation loss does not: it reaches its minimum around the fourth epoch. After the fifth epoch the model is over-optimizing on the training data, and the representations it learns are specific to the training data and do not generalize to data outside the training set.
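The original post shows this as a training/validation loss plot; a minimal sketch to reproduce such a curve from the original_hist object above (assuming matplotlib is installed):

import matplotlib.pyplot as plt

history = original_hist.history
epochs = range(1, len(history['loss']) + 1)
plt.plot(epochs, history['loss'], 'bo', label='Training loss')
plt.plot(epochs, history['val_loss'], 'b+', label='Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()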

1. Reduce network size
Try replacing it with this smaller network below.

smaller_model = models.Sequential()
smaller_model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
smaller_model.add(layers.Dense(4, activation='relu'))
smaller_model.add(layers.Dense(1, activation='sigmoid'))
smaller_model.compile(optimizer='rmsprop',
                      loss='binary_crossentropy',
                      metrics=['acc'])
smaller_model_hist = smaller_model.fit(partial_x_train,
                                       partial_y_train,
                                       epochs=20,
                                       batch_size=512,
                                       validation_data=(x_val, y_val))

epochs = range(1, 21)
original_val_loss = original_hist.history['val_loss']
smaller_model_val_loss = smaller_model_hist.history['val_loss']
import matplotlib.pyplot as plt
plt.plot(epochs, original_val_loss, 'b+', label='Original model')
plt.plot(epochs, smaller_model_val_loss, 'bo', label='Smaller model')
plt.xlabel('Epochs')
plt.ylabel('Validation loss')
plt.legend()
plt.show()

The smaller network starts to overfit later than the reference network, and once it does start to overfit, its performance degrades more slowly.
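For contrast, a much higher-capacity network can be run through the same comparison (a sketch; the 512-unit size is my own choice for illustration, not from the original post). Larger networks generally start overfitting sooner, and their validation loss tends to be noisier.

bigger_model = models.Sequential()
bigger_model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
bigger_model.add(layers.Dense(512, activation='relu'))
bigger_model.add(layers.Dense(1, activation='sigmoid'))
bigger_model.compile(optimizer='rmsprop',
                     loss='binary_crossentropy',
                     metrics=['acc'])
bigger_model_hist = bigger_model.fit(partial_x_train,
                                     partial_y_train,
                                     epochs=20,
                                     batch_size=512,
                                     validation_data=(x_val, y_val))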

2. Add weight regularization
Add the kernel_regularizer argument to the Dense layers:

from keras import regularizers
l2_model = models.Sequential()
l2_model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                          activation='relu', input_shape=(10000,)))
l2_model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                          activation='relu'))
l2_model.add(layers.Dense(1, activation='sigmoid'))
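To compare it against the reference network, the L2-regularized model still has to be compiled and trained; a sketch that mirrors the earlier calls (keras.regularizers also provides l1() and l1_l2() for the L1 and combined penalties):

l2_model.compile(optimizer='rmsprop',
                 loss='binary_crossentropy',
                 metrics=['acc'])
l2_model_hist = l2_model.fit(partial_x_train,
                             partial_y_train,
                             epochs=20,
                             batch_size=512,
                             validation_data=(x_val, y_val))

# Alternative penalties:
# regularizers.l1(0.001)                     # L1 regularization
# regularizers.l1_l2(l1=0.001, l2=0.001)     # simultaneous L1 and L2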

It can be seen that even though both models have the same number of parameters, the model with L2 regularization (dots) is much less prone to overfitting than the reference model (crosses). Here l2(0.001) means that every coefficient in the layer's weight matrices adds 0.001 * weight_coefficient_value ** 2 to the total loss of the network; this penalty is only added at training time.

3. Add dropout regularization
Add a layers.Dropout(0.5) layer after each hidden Dense layer in the network:

dpt_model = models.Sequential()
dpt_model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
dpt_model.add(layers.Dropout(0.5))
dpt_model.add(layers.Dense(16, activation='relu'))
dpt_model.add(layers.Dropout(0.5))
dpt_model.add(layers.Dense(1, activation='sigmoid'))

dpt_model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['acc'])
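
As with the previous models, the dropout network still needs to be trained before its validation curve can be compared; a sketch using the same training settings:

dpt_model_hist = dpt_model.fit(partial_x_train,
                               partial_y_train,
                               epochs=20,
                               batch_size=512,
                               validation_data=(x_val, y_val))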

Again, this method gives a clear improvement over the reference network.
The above are the most common methods for reducing overfitting.
————————————————
Copyright statement: This article is an original article by CSDN blogger "Mario w", licensed under CC 4.0 BY-SA. Please attach the original source link and this statement when reposting.
Original link: https://blog.csdn.net/weixin_45949840/article/details/124251900
