batch norm, layer norm, instance norm, group norm

1. Batch Norm

Suppose the input is a tensor of shape (N, C, H, W), where N is the batch size, C the number of channels, H the height, and W the width. Batch Norm computes one mean and one variance per channel over the whole batch (that is, over the N, H, and W dimensions) and normalizes each channel with them. The calculation process is:
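\mu = \frac{1}{m} \sum_{i=1}^{m} x_i
\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
y_i = \gamma \hat{x}_i + \beta

where, for a given channel, m = N × H × W is the number of values of that channel in the batch, and y_i is the normalized value after scaling by γ and shifting by β.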

It should be emphasized again that these calculations are done per channel: with C channels, C means and C variances are computed. The γ and β above are parameters the model can learn, because not every layer needs to be normalized: when γ = \sqrt{\sigma^2 + \epsilon} and β = μ, the output is restored to the un-normalized data. ϵ is a small constant added in case the denominator would otherwise be 0.

Generally, during training, running estimates of the mean and variance are accumulated over the training set with a moving average. At test time these statistics are not recomputed; the pre-computed running values are used directly. However, when the training and test data come from different distributions, the statistics estimated on the training set no longer represent the test data, which leads to an inconsistency between the training, validation, and test stages.
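A minimal PyTorch sketch of this train/test behavior (the shapes and module below are only illustrative, not taken from the original post):

import torch
import torch.nn as nn

x = torch.randn(8, 16, 32, 32)           # (N, C, H, W)
bn = nn.BatchNorm2d(16)                  # one mean/var (and gamma/beta) per channel

bn.train()                               # training mode: normalize with batch statistics
y_train = bn(x)                          # and update running_mean / running_var by moving average

bn.eval()                                # test mode: reuse the pre-computed running statistics
y_test = bn(x)                           # no batch statistics are recalculated here

print(bn.running_mean.shape, bn.running_var.shape)   # torch.Size([16]) torch.Size([16])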

On how to choose the batch size, see the CSDN blog post "How to set the batch_size?".

2. Layer Norm

Layer Norm computes the mean and variance over all channels of a single sample, i.e., over the C, H, and W dimensions, so each sample is normalized independently of the rest of the batch.
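A minimal NumPy sketch of this, with illustrative shapes (not from the original post):

import numpy as np

x = np.random.randn(8, 16, 32, 32)             # (N, C, H, W)
eps = 1e-5

# one mean/var per sample, taken over all channels and spatial positions (C, H, W)
mean = x.mean(axis=(1, 2, 3), keepdims=True)   # shape (N, 1, 1, 1)
var = x.var(axis=(1, 2, 3), keepdims=True)
y = (x - mean) / np.sqrt(var + eps)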

3. Instance Norm

Instance Norm computes the mean and variance over a single channel of a single sample, i.e., over the H and W dimensions only.
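The same kind of NumPy sketch for Instance Norm only changes the axes the statistics are taken over (shapes again illustrative):

import numpy as np

x = np.random.randn(8, 16, 32, 32)             # (N, C, H, W)
eps = 1e-5

# one mean/var per sample and per channel, taken over spatial positions (H, W) only
mean = x.mean(axis=(2, 3), keepdims=True)      # shape (N, C, 1, 1)
var = x.var(axis=(2, 3), keepdims=True)
y = (x - mean) / np.sqrt(var + eps)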

4. Group Norm

Group Norm splits the channels of a single sample into groups and computes the mean and variance within each group, i.e., over the channels of that group together with the H and W dimensions.

When #G = 1, GN becomes Layer Norm.
When #G = C (the number of channels), GN becomes Instance Norm.
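A minimal NumPy sketch of the grouping (G and the shapes are illustrative; the two special cases above correspond to setting G = 1 or G = C):

import numpy as np

x = np.random.randn(8, 16, 32, 32)                  # (N, C, H, W)
N, C, H, W = x.shape
G = 4                                               # number of groups (hyperparameter)
eps = 1e-5

# split the C channels into G groups, one mean/var per sample and per group
xg = x.reshape(N, G, C // G, H, W)
mean = xg.mean(axis=(2, 3, 4), keepdims=True)       # shape (N, G, 1, 1, 1)
var = xg.var(axis=(2, 3, 4), keepdims=True)
y = ((xg - mean) / np.sqrt(var + eps)).reshape(N, C, H, W)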

5. Comparison

At the same batch size (32), BN and GN perform best among the four methods.

Comparing BN and GN across different batch sizes, GN is much less affected by the batch size, while BN degrades as the batch size shrinks.

Looking at the performance of GN with different numbers of groups, #G = 32 works best.

Although Group Norm solves the problem of poor performance when the per-GPU batch size is small, it introduces an extra hyperparameter, the group size, that has to be tuned, which means running more experiments.

References:

Can batch normalization be used when the batch size is small? (CSDN blog)

Paper Reading: Group Normalization — zjuPeco's blog (CSDN)
