The difference between BN and Dropout in training and testing

Batch Normalization (BN)
Batch Normalization (BN) keeps the inputs to each layer of a deep neural network in a similar distribution throughout training.

Are BN's parameters the same during training and testing?
During training, BN normalizes each batch of training data using that batch's own mean and variance.

At test time, for example when predicting a single sample, there is no notion of a batch. The mean and variance used are therefore estimates for the full training data, which are obtained with a moving (exponential) average during training.

For BN, once the model is trained all of its parameters are fixed, including the mean and variance as well as gamma and beta.
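
A minimal NumPy sketch of this behaviour (the class name, momentum value, and parameter names are illustrative, not taken from any particular framework): batch statistics are used and folded into running averages during training, while the stored running statistics are used at test time.

```python
import numpy as np

class SimpleBatchNorm:
    """Illustrative BN layer: batch statistics in training, moving averages in testing."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)        # learned scale
        self.beta = np.zeros(num_features)        # learned shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training=True):
        if training:
            mean = x.mean(axis=0)                 # mean of the current batch
            var = x.var(axis=0)                   # variance of the current batch
            # moving-average estimate of the full-training-set statistics
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```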

Why not use the mean and variance of the full training set during BN training?

Because during the first complete epoch of training, the full-training-set mean and variance of any layer other than the input layer simply cannot be obtained; only the statistics of the current batch are available during the forward pass. Could the full-dataset mean and variance then be used once a complete epoch has finished?

For BN, each batch of data is normalized toward the same distribution, yet the mean and variance of each batch differ slightly from batch to batch instead of being fixed values. This batch-to-batch variation actually increases the robustness of the model and reduces overfitting to a certain extent.

However, if the mean and variance of a batch differ too much from those of the full data, they no longer represent the training-set distribution well. For this reason BN generally requires the training set to be fully shuffled and a reasonably large batch size to be used, which reduces the gap between the batch statistics and the full-data statistics.
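
To make the last point concrete, here is a tiny NumPy experiment with synthetic data (the distribution and batch sizes are chosen only for illustration): the larger the batch, the closer each batch's mean stays to the full-data mean.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=3.0, size=100_000)   # stand-in for a shuffled training set

print("full-data mean:", data.mean())
for batch_size in (8, 64, 1024):
    batch_means = [rng.choice(data, size=batch_size).mean() for _ in range(200)]
    # the spread of batch means around the full-data mean shrinks as the batch grows
    print(batch_size, np.std(batch_means))
```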

Dropout
Dropout deactivates neurons with a certain probability during training, i.e. sets their outputs to 0, in order to improve the generalization ability of the model and reduce overfitting.

Is Dropout needed for both training and testing?

Dropout is used during training to reduce the dependence of each neuron on particular neurons in the previous layer; the effect is similar to integrating many models with different network structures, which reduces the risk of overfitting.

At test time the entire trained model should be used, so Dropout is not applied.

How does Dropout reconcile the difference between training and testing?

Dropout deactivates neurons with a certain probability during training, which in practice means setting the outputs of those neurons to 0.

Assume the deactivation probability is p, i.e. each neuron in the layer is deactivated with probability p. In the three-layer network shown in the figure below, with a deactivation probability of 0.5, on average 3 of the 6 hidden neurons are deactivated at each training step, so each neuron in the output layer receives only 3 inputs. At test time there is no dropout, and each output neuron receives all 6 inputs. Without any correction, the expected input to each output-layer neuron during training would therefore be only about (1 - p) times its value at test time.

Therefore, during training the outputs of the second layer are divided by (1 - p) before being passed to the output-layer neurons, as compensation for the deactivated neurons, so that the input to each layer has approximately the same expectation during training and testing.

[Figure: a three-layer network, showing the hidden layer with dropout applied during training and fully active at test time]
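
A minimal NumPy sketch of this "inverted dropout" scaling (the function name and signature are illustrative):

```python
import numpy as np

def dropout(x, p, training=True, rng=None):
    """Drop each unit with probability p during training; identity at test time."""
    if not training or p == 0.0:
        return x                               # test time: the whole network is used, no scaling
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p            # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)                # divide by (1 - p) so the expectation matches test time
```

Most modern frameworks implement this inverted form, which is why their test-time forward pass needs no extra scaling.
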
Problems when BN and Dropout are used together

Used alone, both BN and Dropout reduce overfitting and speed up training, but using them together does not produce a 1 + 1 > 2 effect; on the contrary, the result can be worse than using either one alone.

The authors of the paper that studies this conflict found that the key to understanding it is the inconsistent behaviour of neural variance when the network switches state. Consider a neural response X (their Figure 1): when the network moves from training to testing, Dropout scales the response according to its retention rate (the probability of keeping a unit), changing the variance that the neuron produced during learning, while BN still keeps the sliding (moving-average) variance of X accumulated during training. This variance mismatch leads to numerical instability (the red curve in the figure below).

As the network gets deeper, the numerical bias in the final predictions accumulates and degrades the performance of the system. For brevity, the authors name this phenomenon "variance shift". Without Dropout, the variance of the neurons in the actual forward pass stays very close to the moving variance accumulated by BN (the blue curve in the figure below), which is what guarantees high test accuracy.
[Figure: neuron variance in the forward pass vs. BN's accumulated moving variance, with Dropout (red curve) and without Dropout (blue curve)]
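
A small NumPy calculation illustrates the mismatch, using the inverted-dropout form from the sketch above (the zero-mean unit-variance X and the drop rate are synthetic, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # drop probability used for this illustration
x = rng.normal(size=1_000_000)            # a neural response X with mean 0, variance 1

train_out = x * (rng.random(x.size) >= p) / (1 - p)   # training: dropout active
test_out = x                                          # testing: dropout switched off

print(train_out.var())   # ~ var(X) / (1 - p) = 2.0: what a BN layer after the dropout records while training
print(test_out.var())    # ~ 1.0: the variance that same layer actually sees at test time
```
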
The authors explore two strategies for getting around this limitation. One is to apply Dropout only after all BN layers; the other is to modify the Dropout formula so that it is less sensitive to variance, in the spirit of Gaussian Dropout.

The first solution is straightforward: place Dropout behind all BN layers and the variance-shift problem disappears, although this feels more like sidestepping the problem than solving it.
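
As an arrangement sketch (PyTorch-style, with hypothetical layer sizes), "Dropout after all BN layers" simply means something like:

```python
import torch.nn as nn

# The layer sizes are made up; the point is only the ordering:
# every BN layer comes first, and the single Dropout sits after the last BN block.
model = nn.Sequential(
    nn.Linear(784, 256), nn.BatchNorm1d(256), nn.ReLU(),
    nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(),
    nn.Dropout(p=0.5),            # applied only after all BN layers
    nn.Linear(128, 10),
)
```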

The second solution builds on the Gaussian Dropout mentioned in the original Dropout paper, which is an extension of the standard Dropout form. The authors extend Gaussian Dropout further and propose a uniformly distributed Dropout, called "Uout". The advantage is that this form of Dropout is less sensitive to variance deviation, so the overall variance shift is much weaker.
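
For reference, a sketch of both noise-based variants. The Gaussian form follows the original Dropout paper's multiplicative noise with mean 1 and variance p / (1 - p), with p the drop probability as elsewhere in this post; the exact parameterisation of the uniform variant shown here, x * (1 + r) with r ~ U(-beta, beta), is an assumption for illustration rather than the paper's exact formula.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_dropout(x, p, training=True):
    """Multiplicative Gaussian noise with mean 1 and variance p / (1 - p); p is the drop probability."""
    if not training:
        return x
    noise = rng.normal(loc=1.0, scale=np.sqrt(p / (1.0 - p)), size=x.shape)
    return x * noise

def uout(x, beta, training=True):
    """Uniform-noise variant ("Uout"); this parameterisation is an illustrative assumption."""
    if not training:
        return x
    noise = rng.uniform(-beta, beta, size=x.shape)
    return x * (1.0 + noise)
```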

Original post: blog.csdn.net/Adam897/article/details/129349888