Principles of BN/LN/GN normalization in neural networks

1. Principle of BN layer

torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)

  • Why use BN?
    It accelerates training. Without it, training is slow because as training proceeds the distribution of layer inputs gradually drifts toward the saturated ends of the nonlinearity's input range (think of the sigmoid at large positive or negative values), and the chain rule then makes the gradients in the lower layers vanish. BN forcibly pulls this increasingly skewed distribution back toward a (standard) normal distribution, so that the activations fall in the region where the nonlinearity is sensitive to its input. A small change in the input then produces a larger change in the loss, keeping gradients large and avoiding the vanishing-gradient problem (see the toy illustration below).
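
    To make the saturation argument concrete, here is a tiny illustration (not from the original post; the input values are arbitrary) showing how the sigmoid gradient collapses for large inputs:

```python
# Toy illustration of sigmoid saturation: the derivative of sigmoid is
# sigmoid(x) * (1 - sigmoid(x)), which shrinks toward zero as |x| grows.
import torch

x = torch.tensor([0.0, 2.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()

# Gradients are roughly 0.25, 0.10, 0.0066, 0.000045: inputs pushed into the
# saturated region contribute almost no gradient through the chain rule.
print(x.grad)
```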

  • Why is the normalized output multiplied by the learnable parameters $\gamma$ and $\beta$?
    If the activations were forcibly normalized to the standard normal distribution, the distribution the layer had learned so far would lose information. These two reconstruction parameters are introduced so that the network can learn to restore the feature distribution that the original network would have learned.

  • Form (torch):
    $y = \frac{x - E[x]}{\sqrt{Var[x]+\epsilon}} \times \gamma + \beta$
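
    A minimal sketch of the formula above, assuming a random input and the default BatchNorm2d settings: in training mode the layer's output should match a manual per-channel computation over $(N, H, W)$ with the biased variance.

```python
# Sketch: compare nn.BatchNorm2d (training mode) with the formula
# y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta,
# where the statistics are taken per channel over (N, H, W).
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3, 8, 8)                                  # (N, C, H, W)

bn = nn.BatchNorm2d(num_features=3, eps=1e-5)
bn.train()
y_bn = bn(x)

mean = x.mean(dim=(0, 2, 3), keepdim=True)                   # E[x] per channel
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)     # biased Var[x] (1/m)
gamma = bn.weight.view(1, -1, 1, 1)                          # gamma, initialized to 1
beta = bn.bias.view(1, -1, 1, 1)                             # beta, initialized to 0
y_manual = (x - mean) / torch.sqrt(var + bn.eps) * gamma + beta

print(torch.allclose(y_bn, y_manual, atol=1e-6))             # True
```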

  • Backpropagation (the original post illustrated this with a derivation figure).

  • The mean and variance are calculated per dimension (per channel, for BatchNorm2d) over the mini-batch.

  • Because BN is applied per channel (the C dimension), the statistics are computed over the $(N, H, W)$ slices, i.e. the mean and variance of each channel are taken over $(N, H, W)$; this is why it is also called Spatial Batch Normalization.

  • $\gamma$ and $\beta$ are learnable parameters of size C (the number of input channels). By default $\gamma$ is initialized to 1 and $\beta$ to 0.

  • The variance is the biased estimate, the same as torch.var(input, unbiased=False), i.e. it divides by $\frac{1}{m}$ rather than $\frac{1}{m-1}$.

  • During training, the layer keeps running estimates of the mean and variance, which are then used for normalization at evaluation time (see the sketch below).
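
    A rough sketch of this behaviour, assuming the default momentum=0.1 and a freshly initialized layer: one forward pass in training mode updates running_mean and running_var, and eval mode then normalizes with those running estimates.

```python
# Sketch: how BatchNorm2d updates its running statistics during training.
# running_stat <- (1 - momentum) * running_stat + momentum * batch_stat
# (the running variance uses the unbiased batch estimate).
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)           # running_mean starts at 0, running_var at 1
x = torch.randn(4, 3, 8, 8)

bn.train()
bn(x)                            # one training-mode forward pass updates the stats

batch_mean = x.mean(dim=(0, 2, 3))
batch_var = x.var(dim=(0, 2, 3), unbiased=True)    # unbiased estimate for running_var
expected_mean = (1 - bn.momentum) * 0.0 + bn.momentum * batch_mean
expected_var = (1 - bn.momentum) * 1.0 + bn.momentum * batch_var
print(torch.allclose(bn.running_mean, expected_mean, atol=1e-6))   # True
print(torch.allclose(bn.running_var, expected_var, atol=1e-6))     # True

bn.eval()                        # eval mode normalizes with running_mean / running_var
y_eval = bn(x)
```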

  • Inference phase:

    • $\gamma$ and $\beta$ are the trained values and are used directly.
    • The mean and variance use unbiased estimates over the training set: the per-batch statistics $\mu_\mathcal{B}$ and $\sigma_\mathcal{B}^2$ collected during training are averaged,
      $E[x] \leftarrow E_\mathcal{B}[\mu_\mathcal{B}]$
      $Var[x] \leftarrow \frac{m}{m-1} E_\mathcal{B}[\sigma_\mathcal{B}^2]$
      Finally (just substituting the mean and variance; the derivation is a small step):
      $y = \frac{\gamma}{\sqrt{Var[x]+\epsilon}}\, x + \left(\beta - \frac{\gamma\, E[x]}{\sqrt{Var[x]+\epsilon}}\right)$
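
    The folded form above can be checked numerically. The sketch below (random inputs and default settings, purely illustrative) builds the scale and shift from the running statistics and compares against the layer in eval mode.

```python
# Sketch: at inference time BN is an affine map
# y = gamma / sqrt(Var[x] + eps) * x + (beta - gamma * E[x] / sqrt(Var[x] + eps)).
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)
bn.train()
for _ in range(10):                      # accumulate running statistics
    bn(torch.randn(8, 3, 16, 16))
bn.eval()

x = torch.randn(2, 3, 16, 16)
scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sqrt(Var[x] + eps)
shift = bn.bias - bn.running_mean * scale                 # beta - gamma * E[x] / sqrt(...)
y_folded = x * scale.view(1, -1, 1, 1) + shift.view(1, -1, 1, 1)

print(torch.allclose(bn(x), y_folded, atol=1e-5))         # True
```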
  • Summary of BN advantages:

      1. Greatly improves training speed and accelerates convergence.
      2. Improves the generalization ability of the network; one explanation is that it acts as a regularization method similar to dropout, preventing overfitting, so dropout can sometimes be dropped.
      3. Makes tuning easier: the requirements on initialization are less strict, and a larger learning rate can be used.
      4. Allows the order of training samples to be shuffled, which can improve accuracy.
      5. BN is essentially a normalization layer and can replace the local response normalization (LRN) layer.
  • Why is the BN layer generally placed after linear and convolutional layers rather than after the nonlinearity?
    Because the shape of the output distribution of a nonlinear unit changes during training, normalization cannot eliminate its variance shift. In contrast, the outputs of fully connected and convolutional layers are usually symmetric, non-sparse distributions that are closer to Gaussian, so normalizing them produces a more stable distribution. Think about an activation like ReLU: if the input is Gaussian, the transformed output is no longer very Gaussian, since everything below zero is suppressed to exactly 0 (a minimal Conv+BN+ReLU block is sketched below).
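
    A minimal block following this ordering (the layer sizes are hypothetical, chosen only for illustration):

```python
# Conv -> BN -> ReLU: the BN layer sits after the convolution
# and before the nonlinearity, as discussed above.
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),      # normalize the roughly symmetric conv output
    nn.ReLU(inplace=True),   # nonlinearity applied after normalization
)
```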

  • Disadvantages:
    BN requires a sufficiently large batch size; a small batch size makes the batch statistics inaccurate and significantly increases the model's error rate. In other words, BN is strongly affected by the batch, which matters for tasks such as detection and segmentation where memory constraints force small batches.

2. LayerNorm principle (LN)

torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, device=None, dtype=None)

  • Official formula:
    $y = \frac{x - E[x]}{\sqrt{Var[x]+\epsilon}} \times \gamma + \beta$

  • The mean and variance are computed much as in BN, but the dimensions are determined by normalized_shape; for example, with normalized_shape=(3, 5) the statistics are taken over the last two dimensions.

  • An example with an image tensor: the normalization is over (C, H, W).

In CV, LayerNorm normalizes along $(C, H, W)$, i.e. each sample is normalized over all of its channels and spatial positions (see the sketch below).
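
A small sketch of this, assuming a random image-shaped input: nn.LayerNorm with normalized_shape=(C, H, W) should match a manual per-sample normalization (with the default $\gamma = 1$, $\beta = 0$).

```python
# Sketch: LayerNorm over (C, H, W) normalizes each sample independently;
# batch statistics are never used.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3, 8, 8)                                # (N, C, H, W)

ln = nn.LayerNorm(normalized_shape=(3, 8, 8), eps=1e-5)
y_ln = ln(x)

mean = x.mean(dim=(1, 2, 3), keepdim=True)                 # per-sample mean
var = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)   # per-sample biased variance
y_manual = (x - mean) / torch.sqrt(var + ln.eps)           # gamma=1, beta=0 by default

print(torch.allclose(y_ln, y_manual, atol=1e-6))           # True
```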

3. Instance Normalization (IN) and Group Normalization (GN)

  • IN: statistics are computed along the $(H, W)$ axes; each sample and each channel is normalized separately.
  • GN: the channels are split into groups; $C/G$ is the number of channels per group, and statistics are computed along $(C/G, H, W)$.
    • When $G=1$, GN becomes LN. GN is less restrictive than LN, because each group of channels (rather than all channels) is assumed to share a mean and variance; the model still has the flexibility to learn a different distribution for each group. This gives GN more representational power than LN.
    • When $G=C$, GN becomes IN. But IN can only rely on the spatial dimensions to compute the mean and variance, and it misses the chance to exploit channel dependence. (Both special cases are checked in the sketch below.)
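
A quick numerical check of both special cases, assuming a random input and default eps; the equivalences hold here because all affine parameters are at their identity initialization.

```python
# Sketch: GroupNorm with 1 group matches LayerNorm over (C, H, W),
# and GroupNorm with C groups matches InstanceNorm2d.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 6, 8, 8)                                 # (N, C, H, W)

gn_as_ln = nn.GroupNorm(num_groups=1, num_channels=6)       # G = 1
ln = nn.LayerNorm(normalized_shape=(6, 8, 8))
print(torch.allclose(gn_as_ln(x), ln(x), atol=1e-5))        # True

gn_as_in = nn.GroupNorm(num_groups=6, num_channels=6)       # G = C
inorm = nn.InstanceNorm2d(num_features=6, affine=True)
print(torch.allclose(gn_as_in(x), inorm(x), atol=1e-5))     # True
```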

4. Summary

  • BN: normalizes along the batch direction; the mean is computed over $N*H*W$
  • LN: normalizes along the channel direction; the mean is computed over $C*H*W$
  • IN: normalizes within a single channel; the mean is computed over $H*W$
  • GN: first splits the channel direction into groups, then normalizes within each group; the mean is computed over $(C//G)*H*W$
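
As a compact restatement of this summary, the sketch below (illustrative shapes only) computes the reduction axes for each normalization on an $(N, C, H, W)$ tensor.

```python
# Which axes each normalization reduces over, for an (N, C, H, W) tensor.
import torch

N, C, H, W, G = 4, 6, 8, 8, 2
x = torch.randn(N, C, H, W)

bn_mean = x.mean(dim=(0, 2, 3))                               # BN: over N*H*W      -> shape (C,)
ln_mean = x.mean(dim=(1, 2, 3))                               # LN: over C*H*W      -> shape (N,)
in_mean = x.mean(dim=(2, 3))                                  # IN: over H*W        -> shape (N, C)
gn_mean = x.view(N, G, C // G, H, W).mean(dim=(2, 3, 4))      # GN: over (C//G)*H*W -> shape (N, G)

print(bn_mean.shape, ln_mean.shape, in_mean.shape, gn_mean.shape)
```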


Origin blog.csdn.net/mathlxj/article/details/131791177