Implementation details of Batch Normalization in CNN

Batch Norm comes up constantly: it is hard to find a CNN paper that does without it. BN keeps the input distribution of each layer relatively stable and speeds up the convergence of model training. But how is the BN operation actually implemented in a CNN?

1. Implementation steps of BN in MLP

First, a quick review of what BN looks like in an MLP. The steps are as follows:
[Figure: the BN algorithm steps, from the original BN paper]

In one sentence: for each feature, compute the mean and variance over the batch. Then subtract the mean from the feature, divide by the standard deviation, and apply a linear scaling with the learnable parameters (that is, multiply by γ and add β). Essentially, the input is normalized and rescaled so that the distribution of input values at each layer stays relatively stable.
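As a rough sketch in NumPy (the function name, argument layout, and eps value are my own choices, not taken from the paper), per-feature BN for one MLP layer might look like this:

```python
import numpy as np

def batch_norm_mlp(x, gamma, beta, eps=1e-5):
    """x: (N, D) batch of activations; gamma, beta: (D,) learnable parameters."""
    mu = x.mean(axis=0)                      # one mean per feature, shape (D,)
    var = x.var(axis=0)                      # one (biased) variance per feature, shape (D,)
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize each feature over the batch
    return gamma * x_hat + beta              # learnable scale and shift
```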

For a more concrete MLP implementation, you can refer to the original BN paper, or read the write-up "Batch Normalization principle and practice", which explains it very clearly.

2. Implementation details of BN in CNN

In essence, the same idea applies: each layer's activations are normalized and linearly rescaled so that its data distribution stays relatively stable. However, since a layer in an MLP is a 1D vector while a layer's activations in a CNN form a 3D tensor, the implementation details differ.
The overall process still follows the same steps as above.

2.1 Training process

The difference is that we do not compute the mean and variance over a batch for each feature; instead, we compute them over a batch for each feature map.

Put another way: an MLP computes the mean and variance along the feature dimension (that is, per neuron), while a CNN computes them along the channel dimension (that is, per feature map).

To state it a bit more clearly:

[Figure: one layer of a CNN, of shape (C, H, W), highlighted by a red box]
The red box in the figure above marks one layer of a CNN, with shape (C, H, W), where C is the number of channels (that is, the number of feature maps) and H and W are the height and width of each feature map. We can think of it this way: one CNN layer is a 3D tensor made up of C 2D feature maps.

If a batch contains N images, the activations circled by the red box have shape (N, C, H, W), that is, N×C×H×W values in total.

During training, BN computes, within the batch, the mean and variance of all values belonging to each feature map of the layer (the second dimension C, so C feature maps in total). That is, the mean μ_i and variance σ_i are computed over N×H×W values (each feature map has H×W values, and there are N images in the batch). So for the i-th feature map there is a corresponding μ_i and σ_i.

Because there are C feature maps in one layer, there are C values of μ_i and σ_i in total, with i ∈ [1, C].

The rest of the process is the same as in the MLP: subtract the corresponding mean μ_i from each element of the i-th feature map, divide by the standard deviation, and then apply the linear scaling with the learnable parameters (that is, multiply by γ_i and add β_i, which are also per channel).
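A minimal NumPy sketch of this training-time computation (the function name and the eps value are illustrative, not from the paper):

```python
import numpy as np

def batch_norm_cnn_train(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W) activations; gamma, beta: (C,) learnable parameters."""
    # One mean/variance per feature map: reduce over N, H, W (N*H*W values per channel).
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1), biased estimate
    x_hat = (x - mu) / np.sqrt(var + eps)         # normalize each channel
    # gamma and beta are also per channel; broadcast them over N, H, W.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```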

2.2 Forward inference process

As in the MLP case, each batch during training produces C means and variances (one mean and one variance per feature map).

During forward inference, the per-feature-map means and variances collected over the training batches are averaged. In the paper, the population mean is taken as the expectation of the batch means, and the population variance as the unbiased estimate computed from the batch variances:

μ_i_test = E_B[μ_i],    σ²_i_test = m/(m-1) · E_B[σ²_i]

where m = N×H×W is the number of values per feature map in one training batch. This gives, for each feature map, an inference-time mean μ_i_test and variance σ_i_test. Each element of the i-th feature map then has μ_i_test subtracted from it and is divided by the corresponding standard deviation, followed by the linear scaling (multiply by γ and add β), where γ and β were already learned during training.
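A sketch of the inference-time computation, following the paper's scheme of averaging statistics saved from training batches (in practice most frameworks keep exponential running averages instead; all names here are illustrative):

```python
import numpy as np

def batch_norm_cnn_infer(x, gamma, beta, batch_means, batch_vars, m, eps=1e-5):
    """x: (N, C, H, W); batch_means, batch_vars: lists of per-channel (C,) statistics
    saved from training batches; m: values per feature map in a training batch (N*H*W)."""
    mu_test = np.mean(batch_means, axis=0)                 # E_B[mu_i], shape (C,)
    var_test = m / (m - 1) * np.mean(batch_vars, axis=0)   # unbiased: m/(m-1) * E_B[sigma_i^2]
    mu_test = mu_test.reshape(1, -1, 1, 1)
    var_test = var_test.reshape(1, -1, 1, 1)
    x_hat = (x - mu_test) / np.sqrt(var_test + eps)        # normalize with fixed statistics
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```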

In a word: an MLP computes the mean and variance of a batch along the feature dimension (i.e., per neuron), while a CNN computes the mean and variance of a batch along the channel dimension (i.e., per feature map).
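A quick sanity check of this claim, assuming PyTorch is available (the tensor sizes are arbitrary): manually normalizing over the (N, H, W) dimensions reproduces what nn.BatchNorm2d computes in training mode.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32)                    # N=8 images, C=3 feature maps of 32x32
bn = nn.BatchNorm2d(3, affine=False)             # no gamma/beta, to isolate the normalization
bn.train()                                       # training mode: uses batch statistics

mu = x.mean(dim=(0, 2, 3), keepdim=True)                     # one mean per channel
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)     # biased variance, as BN uses in training
manual = (x - mu) / torch.sqrt(var + bn.eps)

print(torch.allclose(bn(x), manual, atol=1e-5))  # expected: True
```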

END:)

I always feel that words alone cannot fully express these ideas clearly. The best way to learn is to read the original papers and then think them through yourself.

References:
1. [Original paper] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
2. Batch Normalization principle and practice
3. [Mu Li's course] Batch normalization, Dive into Deep Learning v2
