[Deep learning] The difference between BN and Dropout in training and testing

Batch Normalization

BN (Batch Normalization) keeps the inputs to each layer of a deep neural network in a similar distribution throughout training.
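For reference, the standard BN transform normalizes each activation with the statistics of the current mini-batch and then applies a learned affine transform: y = gamma * (x - mu) / sqrt(var + eps) + beta, where mu and var are the batch mean and variance, eps is a small constant for numerical stability, and gamma and beta are learned parameters.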

Are the parameters used by BN the same during training and testing?

During training, BN normalizes each batch of training data with that batch's own mean and variance.
At test time, for example when predicting a single sample, there is no notion of a batch. The mean and variance used are therefore those of the full training data, which are obtained with a moving average maintained during training.
Once a model is trained, all of BN's parameters are fixed, including the mean, the variance, and the learned gamma and beta.
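Below is a minimal NumPy sketch of this train/test difference, assuming a simple moving-average update for the running statistics (the attribute names and the momentum value are illustrative, not taken from any particular framework):

```python
import numpy as np

class SimpleBatchNorm:
    """Simplified BN for illustration: batch statistics during training,
    running (moving-average) statistics at test time."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)          # learned scale
        self.beta = np.zeros(num_features)          # learned shift
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training):
        if training:
            # Normalize with the current mini-batch's own statistics...
            mean, var = x.mean(axis=0), x.var(axis=0)
            # ...and update the moving averages used later at test time.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # No batch at test time: fall back to the accumulated statistics.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```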
Why not use the mean and variance of the full training set during BN training?
Because during the first complete epoch, the full-training-set mean and variance of every layer beyond the input layer are not yet available; only the statistics of each trained batch can be obtained during forward propagation. Could the full-dataset mean and variance then be used after one complete epoch?
They could, but normalizing each batch with its own statistics is actually useful: the mean and variance differ slightly from batch to batch, and this variation, rather than a fixed value, increases the model's robustness and reduces overfitting to some extent.
However, if a batch's mean and variance differ too much from those of the full dataset, they no longer represent the training distribution. BN therefore generally requires the training set to be fully shuffled and a fairly large batch size, which reduces the gap between batch statistics and full-data statistics.

Dropout

Dropout deactivates neurons with a certain probability during training, i.e., sets their outputs to 0, in order to improve the model's generalization ability and reduce overfitting.

Is Dropout needed for both training and testing?

Dropout is applied during training to reduce each neuron's reliance on particular upstream neurons; the effect is similar to ensembling many models with different network structures, which lowers the risk of overfitting.
At test time the entire trained model should be used, so Dropout is disabled.
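As a quick illustration, frameworks such as PyTorch tie Dropout to the module's training mode, so switching to eval mode turns the layer into an identity (the printed values are just one possible random draw):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 6)

drop.train()       # training mode: roughly half the inputs are zeroed (the rest are rescaled)
print(drop(x))     # e.g. tensor([[0., 2., 2., 0., 0., 2.]])  (random each call)

drop.eval()        # test mode: Dropout does nothing, the whole trained model is used
print(drop(x))     # tensor([[1., 1., 1., 1., 1., 1.]])
```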

How does Dropout balance the difference between training and testing?

Dropout deactivates a neuron with a certain probability during training, which simply means setting that neuron's output to 0.
Suppose the deactivation probability is p, so every neuron in the layer is dropped with probability p. In the three-layer network shown in the figure below, the hidden layer has 6 neurons; with a deactivation probability of 0.5, on average 3 of them are dropped in each training step, so each output neuron receives only 3 non-zero inputs. At test time there is no Dropout, and each output neuron receives all 6 inputs, so the inputs (and their expected values) at the output layer would differ in magnitude between training and testing.
Therefore, during training the outputs of the second (hidden) layer are divided by (1 - p) before being passed to the output neurons, as compensation for the dropped neurons; this keeps the expected input to each layer roughly the same during training and testing (see the numerical check after the figure below).

[Figure: a three-layer network with Dropout applied to the 6-neuron hidden layer, training vs. testing]
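The compensation described above can be checked numerically. A small sketch (NumPy, with p as the drop probability): dividing the surviving activations by (1 - p) during training keeps the expected total input to each output neuron the same as at test time, when all 6 inputs are present:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
x = np.ones((100_000, 6))                 # pretend hidden-layer outputs, all equal to 1

mask = rng.random(x.shape) >= p           # keep each unit with probability 1 - p
train_out = x * mask / (1 - p)            # zero out dropped units, rescale the survivors
test_out = x                              # no dropout at test time

print(train_out.sum(axis=1).mean())       # ≈ 6: expected total input during training
print(test_out.sum(axis=1).mean())        # = 6: total input seen by each output neuron at test time
```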

Part of the Dropout material above is based on: https://blog.csdn.net/program_developer/article/details/80737724

Problems when using BN and Dropout together

Used separately, BN and Dropout each reduce overfitting and speed up training, but combining them does not give a 1 + 1 > 2 effect; on the contrary, it can produce worse results than using either one alone.

Related research reference papers:
"Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift"
https://arxiv.org/abs/1801.05134

The authors of this paper find that the key to the conflict between Dropout and BN is the inconsistent behavior of neural variance when the network switches state. Consider a neural response X as in Figure 1: when the network switches from training to testing, Dropout rescales the response by its retention rate, changing the variance that the neuron exhibited during learning, while BN still maintains the moving (sliding) variance of X accumulated during training. This variance mismatch leads to numerical instability (the red curve in the figure below).

As the network gets deeper, the resulting prediction errors accumulate and degrade the system's performance. For simplicity, the authors name this phenomenon "variance shift". Without Dropout, the actual feed-forward variance of each neuron stays very close to BN's accumulated moving variance (the blue curve in the figure below), which is what allows high test accuracy.

[Figure: test-time neural variance vs. BN's moving variance; red curve: with Dropout (mismatch), blue curve: without Dropout (match)]
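The mismatch is easy to reproduce numerically. The following is only an illustrative sketch of the idea (not the paper's derivation): the variance that a subsequent BN layer accumulates under training-time Dropout differs from the variance it actually sees once Dropout is switched off:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # drop probability
x = rng.normal(0.0, 1.0, size=1_000_000)   # a unit's pre-dropout response, variance ≈ 1

mask = rng.random(x.size) >= p
train_out = x * mask / (1 - p)             # inverted dropout, as applied during training
test_out = x                               # dropout disabled at test time

print(train_out.var())                     # ≈ 2: the variance a following BN layer "learned"
print(test_out.var())                      # ≈ 1: the variance that BN layer sees at test time
```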

The authors explore two strategies to break this limitation: apply Dropout only after all BN layers, or modify the form of Dropout so that it is less sensitive to variance, as in Gaussian Dropout.
The first solution is straightforward: place every Dropout layer after the last BN layer, so the variance-shift problem never arises, although this somewhat sidesteps the issue rather than solving it (a sketch of the placement follows).
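A sketch of this placement in PyTorch (layer sizes are arbitrary, only the ordering matters): every Dropout sits after the last BN layer, typically just before the classifier, so BN never has to normalize dropout-perturbed activations:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # Dropout only after all BN layers
    nn.Linear(32, 10),
)
```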
The second solution builds on the Gaussian Dropout mentioned in the original Dropout paper, which is a generalization of the standard form. The authors extend it further and propose a uniformly distributed Dropout, nicknamed "Uout". Its advantage is that this form of Dropout is much less sensitive to variance shift, so overall the train/test variance mismatch is no longer as severe.
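A rough sketch of these multiplicative-noise variants (the exact parameterizations should be checked against the cited papers; here p is the drop probability and beta the half-width of the uniform noise): instead of zeroing units, each activation is multiplied by noise with mean 1, which softens the train/test variance mismatch:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_dropout(x, p):
    # Gaussian Dropout: multiply by noise ~ N(1, p / (1 - p))
    return x * rng.normal(1.0, np.sqrt(p / (1 - p)), size=x.shape)

def uout(x, beta):
    # Uniform variant ("Uout"): multiply by noise ~ U(1 - beta, 1 + beta)
    return x * rng.uniform(1.0 - beta, 1.0 + beta, size=x.shape)

x = rng.normal(size=(4, 8))
print(gaussian_dropout(x, p=0.5).shape)
print(uout(x, beta=0.1).shape)
```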

Origin: https://blog.csdn.net/m0_37882192/article/details/109896588