A personal understanding of several normalization methods

A few classic normalization methods:
1. BN normalizes over (H×W×N), i.e. one mean and standard deviation per channel, computed across the whole mini-batch.
The principle is simple: compute the mean and standard deviation of the mini-batch, normalize with them, and finally apply two learnable scale and shift parameters, so that the normalized values are not confined to the near-linear region of the activation, which would reduce the expressive power of the network.
Many posts online say that BN "forcibly pulls the input back to a normal distribution", which I find confusing. I think the emphasis should be that the normalization brings the whole mini-batch back into the effective range of the activation function, not on normality itself. In practice, without intervention many pre-activation values may already sit in the saturated regions of the activation: the inputs differ a lot, but after activation the outputs are almost identical, so the difference cannot be expressed through the gradient. After BN, differences between inputs are mapped effectively to differences in the activation outputs, and the gradients become more informative. The larger the batch, the better the gradients produced by different inputs after BN represent the differences in the real data.
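To make the "(H×W×N)" statement concrete, here is a minimal sketch of BN in training mode on an NCHW tensor. The names (batch_norm, gamma, beta) and shapes are my own illustration, not from the original post; the cross-check against torch.nn.BatchNorm2d only holds in training mode with the default affine initialization.

```python
# Minimal BN sketch (training mode, NCHW input); illustrative names only.
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W). Statistics are computed over N, H, W for each channel,
    # so every channel shares one mean/std across the whole mini-batch.
    mean = x.mean(dim=(0, 2, 3), keepdim=True)                    # (1, C, 1, 1)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)      # biased variance
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # Learnable scale and shift restore expressive power after normalization.
    return gamma * x_hat + beta

x = torch.randn(8, 16, 32, 32)
gamma = torch.ones(1, 16, 1, 1)
beta = torch.zeros(1, 16, 1, 1)
y = batch_norm(x, gamma, beta)
# Cross-check against torch.nn.BatchNorm2d (fresh module is in training mode,
# with gamma initialized to 1 and beta to 0).
ref = torch.nn.BatchNorm2d(16)(x)
print(torch.allclose(y, ref, atol=1e-5))
```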
2. LN normalizes over (H×W×C), i.e. one mean and standard deviation per sample.
It is commonly used with RNNs, so I probably won't need it myself (a quick sketch of both LN and IN appears after item 3).
3. IN normalizes over (H×W), i.e. per sample and per channel.
It is said that IN is often used for image stylization; I don't know much about it. Both IN and LN were proposed as answers to BN: they compute no statistics across the batch, so they avoid the problems that appear when the batch size is too small.
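Since LN and IN are just different choices of which axes to average over, a quick sketch makes the contrast clearer. This is my own illustrative code with the affine parameters omitted; the comparison against torch.nn.LayerNorm / torch.nn.InstanceNorm2d assumes the affine transforms are disabled.

```python
# LN vs IN on an NCHW tensor; affine parameters omitted for brevity.
import torch

x = torch.randn(8, 16, 32, 32)
eps = 1e-5

# LN: one mean/std per sample, computed over (C, H, W).
ln_mean = x.mean(dim=(1, 2, 3), keepdim=True)                     # (N, 1, 1, 1)
ln_var = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)
x_ln = (x - ln_mean) / torch.sqrt(ln_var + eps)

# IN: one mean/std per sample and per channel, computed over (H, W).
in_mean = x.mean(dim=(2, 3), keepdim=True)                        # (N, C, 1, 1)
in_var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
x_in = (x - in_mean) / torch.sqrt(in_var + eps)

# Cross-check against the built-in modules with affine disabled.
print(torch.allclose(x_ln, torch.nn.LayerNorm([16, 32, 32], elementwise_affine=False)(x), atol=1e-5))
print(torch.allclose(x_in, torch.nn.InstanceNorm2d(16, affine=False)(x), atol=1e-5))
```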
4. GN normalizes over (G×H×W), i.e. the channels are split into G groups and each group (its C/G channels together with the spatial dimensions) is normalized separately for each sample.
GN is another result from the great Kaiming He, and experiments show it works better than the first three methods, especially when the batch is small. The channels are divided into 32 groups by default, each group is normalized separately, and the computation is no longer restricted by the batch size. However, the features of different channels are partly similar and partly different, and I still don't understand why splitting them into 32 groups and normalizing each group separately gives better results.
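Mechanically the grouping is just a reshape. Below is a minimal sketch of GN with 32 groups, assuming the channel count is divisible by the number of groups; the function name and shapes are illustrative, and the check against torch.nn.GroupNorm assumes affine is disabled.

```python
# Minimal GN sketch: reshape channels into groups, normalize each group per sample.
import torch

def group_norm(x, num_groups=32, eps=1e-5):
    N, C, H, W = x.shape
    # Each group holds C // num_groups channels; normalize over that group
    # together with the spatial dimensions, independently of the batch axis.
    xg = x.view(N, num_groups, C // num_groups, H, W)
    mean = xg.mean(dim=(2, 3, 4), keepdim=True)
    var = xg.var(dim=(2, 3, 4), unbiased=False, keepdim=True)
    xg = (xg - mean) / torch.sqrt(var + eps)
    return xg.view(N, C, H, W)

x = torch.randn(4, 64, 16, 16)   # a batch of 4 is fine: GN ignores the batch axis
y = group_norm(x)
# Cross-check against torch.nn.GroupNorm with affine disabled.
print(torch.allclose(y, torch.nn.GroupNorm(32, 64, affine=False)(x), atol=1e-5))
```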

Origin blog.csdn.net/qq_41872271/article/details/105416354