Normalization (BN and LN) in NN


Batch Normalization

Batch Normalization: "Batch" refers to a batch of data, usually a mini-batch; "Normalization" means that the processed data follows an $N(0,1)$ normal distribution. During training, the data passes through a multi-layer network. If the scale of the data changes during forward propagation, the gradients may explode or vanish, making it difficult for the model to converge.

Suppose the input mini-batch is $B = \{x_1, \dots, x_m\}$ and the learnable parameters of Batch Normalization are $\gamma$ and $\beta$. The steps are as follows (see the code sketch after this list):

  • Compute the mini-batch mean: $\mu_B \gets \frac{1}{m}\sum_{i=1}^{m} x_i$
  • Compute the mini-batch variance: $\sigma_B^2 \gets \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$
  • Normalize: $\widehat{x_i} \gets \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, where $\epsilon$ is a small number that prevents the denominator from being 0.
  • Affine transform (scale and shift): $y_i \gets \gamma \widehat{x_i} + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$. This operation enhances the capacity of the model: it lets the model decide whether to normalize the data and by how much. If $\gamma = \sqrt{\sigma_B^2}$ and $\beta = \mu_B$, the identity mapping is recovered (the first three steps perform the normalization, and this step is its inverse transform).
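These four steps can be written out directly with tensor operations. Below is a minimal sketch of the computation; the tensor `x`, its shape (4 samples, 2 features), and the initial values of `gamma` and `beta` are illustrative choices, not from the original post:

```python
import torch

# Mini-batch of m = 4 samples with 2 features each (illustrative shape)
x = torch.randn(4, 2)
gamma = torch.ones(2)   # learnable scale, initialized to 1
beta = torch.zeros(2)   # learnable shift, initialized to 0
eps = 1e-5

mu_B = x.mean(dim=0)                          # step 1: mini-batch mean
var_B = x.var(dim=0, unbiased=False)          # step 2: mini-batch variance
x_hat = (x - mu_B) / torch.sqrt(var_B + eps)  # step 3: normalize
y = gamma * x_hat + beta                      # step 4: affine transform (scale and shift)

# y now has roughly zero mean and unit variance per feature, matching
# nn.BatchNorm1d(2) in training mode up to floating-point error.
print(y.mean(dim=0), y.var(dim=0, unbiased=False))
```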

The Batch Normalization layer is generally placed right before the activation function.
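As a small illustration of this placement (the layer sizes 64 and 128 are arbitrary), a fully connected block might look like:

```python
import torch.nn as nn

# BatchNorm sits between the linear layer and the activation
block = nn.Sequential(
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
)
```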

In PyTorch, there are 3 Batch Normalization classes:

  • nn.BatchNorm1d(): the input shape is $B \times C \times L$ (1D feature, where $L$ is the length)
  • nn.BatchNorm2d(): the input shape is $B \times C \times H \times W$ (2D feature, where $H$ and $W$ are height and width)
  • nn.BatchNorm3d(): the input shape is $B \times C \times T \times H \times W$ (3D feature, where $T$, $H$, $W$ are time, height, and width)
torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

num_features: the feature dimension C of a sample; this is the most important parameter
eps: the denominator correction term used during normalization
momentum: used for the exponentially weighted moving-average estimate of the running mean and variance
affine: whether to apply the affine transform; the default is True
track_running_stats: True corresponds to the training state, where the mean and variance are re-estimated for each mini-batch; False corresponds to the test state, where the mean and variance are fixed
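A short sketch of these arguments in use; the feature count of 5 and the batch size of 3 are illustrative:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=5, eps=1e-5, momentum=0.1,
                    affine=True, track_running_stats=True)

x = torch.randn(3, 5)   # B x C: 3 samples, 5 features each
bn.train()
y = bn(x)               # normalizes with batch statistics and updates the running estimates
print(bn.running_mean, bn.running_var)

bn.eval()
y_eval = bn(x)          # uses the fixed running_mean / running_var instead
```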

For example, if the shape of the input data is $B \times C \times \text{3D feature}$, e.g. (3, 2, 2, 2, 3), then a mini-batch has 3 samples, each sample has 2 features, and each feature has dimension 2 x 2 x 3. Two means and two variances will be computed, one pair per feature dimension. Suppose momentum is set to 0.3, the initial running mean and variance default to 0 and 1, and two mini-batches of data are fed in.
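A sketch of this setup; the constant-valued inputs are an illustrative choice so that the running-statistics update is easy to follow:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm3d(num_features=2, momentum=0.3)
print(bn.running_mean, bn.running_var)   # start at 0 and 1 for each of the 2 features

# Two mini-batches of shape (3, 2, 2, 2, 3): 3 samples, 2 features, each feature 2 x 2 x 3
for i in range(2):
    x = torch.full((3, 2, 2, 2, 3), float(i + 1))
    bn(x)
    # running stat <- (1 - momentum) * running stat + momentum * current batch stat
    print(bn.running_mean, bn.running_var)
```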

Layer Normalization

Why it was proposed: Batch Normalization is not suitable for variable-length networks, such as RNNs.

Idea: compute the mean and variance within each network layer (per sample rather than per batch); $\gamma$ and $\beta$ are element-wise learnable parameters.
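A minimal sketch of this idea, normalizing each sample over its own features (the shape 8 x 6 is illustrative):

```python
import torch

x = torch.randn(8, 6)                     # 8 samples, 6 features per sample
mu = x.mean(dim=1, keepdim=True)          # one mean per sample (BN would average over the batch dim)
var = x.var(dim=1, unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)

gamma = torch.ones(6)                     # element-wise learnable scale
beta = torch.zeros(6)                     # element-wise learnable shift
y = gamma * x_hat + beta
```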

torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True)

normalized_shape: the shape of the per-sample features in this layer, which can be $C \times H \times W$, $H \times W$, or $W$
eps: the denominator correction term used during normalization
elementwise_affine: whether to apply an element-wise affine transform

For example, if the shape of the input data is $B \times C \times \text{feature}$, e.g. (8, 2, 3, 4), then a mini-batch has 8 samples, each sample has 2 features, and each feature has dimension 3 x 4. Then 8 means and 8 variances are computed, one pair per sample.
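A sketch of this example:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 2, 3, 4)   # 8 samples, 2 features, each feature 3 x 4
ln = nn.LayerNorm(normalized_shape=[2, 3, 4], eps=1e-5, elementwise_affine=True)
y = ln(x)

# Each sample is normalized over its own 2 x 3 x 4 values:
# 8 means and 8 variances in total, one pair per sample.
print(y[0].mean(), y[0].var(unbiased=False))   # roughly 0 and 1
```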


Origin: blog.csdn.net/weixin_54338498/article/details/131957445