Normalization in PyTorch: BatchNorm, LayerNorm and GroupNorm

1 Overview of normalization

Training deep neural networks is a challenging task. Over the years, researchers have proposed different methods to speed up and stabilize the learning process. Normalization is a technique that has proven to be very effective in this regard.

1.1 Why normalization is necessary

Normalizing data is a basic step in data processing. In many practical problems, the samples we obtain are multi-dimensional, that is, each sample is represented by multiple features, and different features may have very different scales, which can distort the results of data analysis. Data normalization solves this problem: after the raw data are normalized, every feature is on the same order of magnitude, which makes the features suitable for comprehensive comparison and evaluation.

For example, suppose we build a simple neural network model with two features: age, ranging from 0 to 65, and salary, ranging from 0 to 10,000. We feed these features to the model and compute the gradients.

Inputs of very different magnitudes lead to very different weight updates and uneven optimization steps toward the minimum. They also make the loss surface disproportionately elongated. In this case, a lower learning rate is needed to avoid overshooting, which in turn means slower training.

So our solution is to normalize the input, shrinking the features by subtracting the mean (centering) and dividing by the standard deviation.

This process, also known as "whitening", transforms all values to have zero mean and unit variance, which provides faster convergence and more stable training.
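As a minimal sketch of this centering-and-scaling step on the two features above (the sample values below are made up for illustration):

import numpy as np

# Hypothetical samples: each row is (age, salary)
X = np.array([[25, 3000],
              [42, 8000],
              [61, 4500],
              [33, 9500]], dtype=np.float32)

# Center and scale each feature (column) independently
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.mean(axis=0))  # ~0 for both features
print(X_norm.std(axis=0))   # ~1 for both features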

1.2 The role of normalization

In deep learning, data normalization is a key preprocessing step used to optimize the training process and the performance of neural network models. Normalization helps mitigate the vanishing and exploding gradient problems, speeds up model convergence, and improves the robustness and generalization ability of the model. In detail:

  • Vanishing and exploding gradient problems: Vanishing and exploding gradients are common problems in deep neural networks. Data normalization can alleviate them, keeping gradients within a reasonable range as they propagate, which helps the model train better.

  • Inconsistent feature scales: Deep learning models are very sensitive to the scale of their features. If different features have different ranges, some features may dominate training while the influence of others is ignored. Normalization brings all features to the same range, allowing the model to treat them in a balanced way and avoiding bias caused by inconsistent scales.

  • Model convergence speed: Data normalization speeds up model convergence. When the data are normalized to a smaller range, the model can find suitable parameter values faster, with fewer oscillations and less instability during training. This saves training time and improves efficiency.

  • Robustness and generalization ability: With normalized data, the model adapts better to different data distributions and noise. Normalization increases the robustness of the model, making it more tolerant of changes and perturbations in the input. It also helps the model generalize better, so that it performs well on unseen data.

1.3 Normalization steps

Normalization adjusts the distribution of the input data over specific dimensions so that it has zero mean and unit variance. The input is generally normalized through the following steps:

  • For the given input data, compute its mean and variance over the given dimensions.

  • Standardize the input using the computed mean and variance, giving it zero mean and unit variance.

  • Apply scale and shift operations to the standardized data, with adjustments made through learnable parameters, to restore the model's ability to express the data.

Furthermore, normalization introduces learnable parameters through the scaling and shifting operations, namely the scale parameter and the shift parameter. These parameters apply a linear transformation to the normalized data to restore the expressive power of the model.

Specifically, in each feature dimension, if the normalized data is \hat{x}, the final output is computed by the following formula: y=\gamma \hat{x}+\beta. Here y is the final output, \gamma is the scale parameter (scale), and \beta is the shift parameter (shift). Both parameters are learnable: they are updated through backpropagation and optimization algorithms such as stochastic gradient descent. During training, the model adjusts these parameters via gradient descent, allowing it to adaptively scale and shift different data distributions. In this way, the model can freely adjust the importance and bias of each feature according to the actual situation, and thus better fit different data distributions.
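Putting the steps together, here is a minimal sketch of the whole normalize-then-affine computation; the module name SimpleNorm and the tensor shapes are illustrative assumptions (over the last dimension, this is essentially what nn.LayerNorm computes):

import torch
import torch.nn as nn

class SimpleNorm(nn.Module):
    """Normalize over the last dimension, then apply learnable scale/shift."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))   # scale parameter
        self.beta = nn.Parameter(torch.zeros(dim))   # shift parameter

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)  # zero mean, unit variance
        return self.gamma * x_hat + self.beta            # y = gamma * x_hat + beta

x = torch.randn(2, 4, 8)           # hypothetical input
print(SimpleNorm(8)(x).shape)      # torch.Size([2, 4, 8])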

2 Normalization in PyTorch

BatchNorm, LayerNorm and GroupNorm are all commonly used normalization methods in deep learning. They mitigate vanishing and exploding gradients and improve the model's generalization ability by normalizing the input into a distribution with mean 0 and variance 1.

2.1 BatchNorm

In CNNs, a convolutional layer is generally followed by a BatchNorm layer to mitigate vanishing and exploding gradients and improve the stability of the model, as in the sketch below.
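As an illustration of this pattern (the layer sizes here are arbitrary, not from any particular model):

import torch
import torch.nn as nn

# A typical conv -> BatchNorm -> activation block
conv_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(num_features=16),  # normalizes each of the 16 channels over (N, H, W)
    nn.ReLU(inplace=True),
)

x = torch.randn(8, 3, 32, 32)  # (batch, channels, height, width)
print(conv_block(x).shape)     # torch.Size([8, 16, 32, 32])

The conv layer's bias is disabled because BatchNorm's own shift parameter \beta makes it redundant.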

In PyTorch, you can use the batch normalization layers torch.nn.BatchNorm1d, torch.nn.BatchNorm2d or torch.nn.BatchNorm3d to implement the batch normalization transformation.

Example code:

import torch
import torch.nn as nn
import numpy as np

# Input of shape (batch_size=2, channels=4, height=2, width=2)
feature_array = np.array([[[[1, 0],  [0, 2]],
                           [[3, 4],  [1, 2]],
                           [[-2, 9], [7, 5]],
                           [[2, 3],  [4, 2]]],

                          [[[1, 2],  [-1, 0]],
                           [[1, 2],  [3, 5]],
                           [[4, 7],  [-6, 4]],
                           [[1, 4],  [1, 5]]]], dtype=np.float32)

feature_tensor = torch.tensor(feature_array.copy(), dtype=torch.float32)

# BatchNorm2d computes one mean/variance per channel over (N, H, W)
bn_out = nn.BatchNorm2d(num_features=4, eps=1e-5)(feature_tensor)
print(bn_out)

# Manual verification: normalize each channel across the whole batch
for i in range(feature_array.shape[1]):
    channel = feature_array[:, i, :, :]
    mean = channel.mean()
    var = channel.var()
    print(mean)
    print(var)

    feature_array[:, i, :, :] = (channel - mean) / np.sqrt(var + 1e-5)
print(feature_array)

The running results show:

tensor([[[[ 0.3780, -0.6299],
          [-0.6299,  1.3859]],

         [[ 0.2847,  1.0441],
          [-1.2339, -0.4746]],

         [[-1.1660,  1.1660],
          [ 0.7420,  0.3180]],

         [[-0.5388,  0.1796],
          [ 0.8980, -0.5388]]],


        [[[ 0.3780,  1.3859],
          [-1.6378, -0.6299]],

         [[-1.2339, -0.4746],
          [ 0.2847,  1.8034]],

         [[ 0.1060,  0.7420],
          [-2.0140,  0.1060]],

         [[-1.2572,  0.8980],
          [-1.2572,  1.6164]]]], grad_fn=<NativeBatchNormBackward0>)
0.625
0.984375
2.625
1.734375
3.5
22.25
2.75
1.9375
[[[[ 0.37796253 -0.6299376 ]
   [-0.6299376   1.3858627 ]]

  [[ 0.28474656  1.0440707 ]
   [-1.2339017  -0.4745776 ]]

  [[-1.1659975   1.1659975 ]
   [ 0.7419984   0.3179993 ]]

  [[-0.53881454  0.17960484]
   [ 0.8980242  -0.53881454]]]


 [[[ 0.37796253  1.3858627 ]
   [-1.6378376  -0.6299376 ]]

  [[-1.2339017  -0.4745776 ]
   [ 0.28474656  1.8033949 ]]

  [[ 0.10599977  0.7419984 ]
   [-2.0139956   0.10599977]]

  [[-1.2572339   0.8980242 ]
   [-1.2572339   1.6164436 ]]]]

2.2 LayerNorm

LayerNorm is used in Transformer blocks. The input generally has shape (batch_size, token_num, dim), and normalization is done over the last dimension: nn.LayerNorm(dim).

Different from batch normalization, layer normalization is computed over the feature dimension of a single sample. Its purpose is to normalize the features of each sample within a single layer, enhance the independence between features, and provide a more stable feature representation. It has the following advantages:

  • Suitable for processing a single sample: Compared with batch normalization, layer normalization is computed over the feature dimension of a single sample within a single layer and does not rely on the statistics of a mini-batch. This makes it suitable when individual samples are processed, for example in a recurrent neural network (RNN), where the input at each time step can be viewed as a separate sample.

  • Suitable for dynamic computation graphs and sequence data: Since layer normalization does not rely on mini-batch statistics, it is better suited to dynamic computation graphs and sequence data. It can provide better performance and results when processing variable-length sequences or when using dynamic computation graphs.

In addition, the Transformer model uses layer normalization (Layer Normalization) mainly because of its per-position, feature-wise normalization: the core of the Transformer is the self-attention mechanism, in which every position attends to all positions in the input sequence. Since the feature dimensions at each position can be viewed as independent, normalizing each position separately provides a more stable feature representation and reduces the coupling between features, which helps the model learn the dependencies between positions and improves its representation ability. Moreover, it reduces internal covariate shift between features, which helps alleviate the vanishing and exploding gradient problems common in deep neural networks and improves training quality and convergence speed.

Layer normalization (Layer Normalization) is generally applied before the activation function: this keeps the input of the activation function within the normalized range and avoids inputs that are too large or too small. This ordering is similar to how batch normalization is applied.

For example, in a traditional RNN, the input sequence is usually linearly transformed and then passed through an activation function (such as tanh or ReLU) for a nonlinear transformation. Such operations may lead to vanishing or exploding gradients, and the inputs at different time steps may vary widely. Applying layer normalization before the activation function addresses these problems, as in the sketch below.
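Here is a minimal sketch of this idea: a hand-written cell with hypothetical sizes (not PyTorch's built-in nn.RNN), applying LayerNorm to the pre-activation before tanh:

import torch
import torch.nn as nn

class LayerNormRNNCell(nn.Module):
    """RNN cell that normalizes the pre-activation before tanh."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.i2h = nn.Linear(input_size, hidden_size)
        self.h2h = nn.Linear(hidden_size, hidden_size)
        self.ln = nn.LayerNorm(hidden_size)  # applied before the activation

    def forward(self, x, h):
        pre_activation = self.i2h(x) + self.h2h(h)
        return torch.tanh(self.ln(pre_activation))

cell = LayerNormRNNCell(input_size=10, hidden_size=20)
h = torch.zeros(4, 20)             # (batch, hidden)
for t in range(5):                 # unroll over 5 time steps
    h = cell(torch.randn(4, 10), h)
print(h.shape)                     # torch.Size([4, 20])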

Sample code verifying nn.LayerNorm against a manual computation:

import torch
import torch.nn as nn
import numpy as np

# Input of shape (batch_size=2, channels=3, height=2, width=2)
feature_array = np.array([[[[1, 0],  [0, 2]],
                           [[3, 4],  [1, 2]],
                           [[2, 3],  [4, 2]]],

                          [[[1, 2],  [-1, 0]],
                           [[1, 2],  [3, 5]],
                           [[1, 4],  [1, 5]]]], dtype=np.float32)

# Rearrange to (batch_size=2, token_num=4, dim=3), the usual Transformer layout
feature_array = feature_array.reshape((2, 3, -1)).transpose(0, 2, 1)
feature_tensor = torch.tensor(feature_array.copy(), dtype=torch.float32)

# LayerNorm normalizes over the last dimension (dim=3)
ln_out = nn.LayerNorm(normalized_shape=3)(feature_tensor)
print(ln_out)

# Manual verification: one mean/variance per token over its feature dimension
b, token_num, dim = feature_array.shape
feature_array = feature_array.reshape((-1, dim))
for i in range(b * token_num):
    mean = feature_array[i, :].mean()
    var = feature_array[i, :].var()
    print(mean)
    print(var)

    feature_array[i, :] = (feature_array[i, :] - mean) / np.sqrt(var + 1e-5)
print(feature_array.reshape(b, token_num, dim))

Running the code shows:

tensor([[[-1.2247,  1.2247,  0.0000],
         [-1.3728,  0.9806,  0.3922],
         [-0.9806, -0.3922,  1.3728],
         [ 0.0000,  0.0000,  0.0000]],

        [[ 0.0000,  0.0000,  0.0000],
         [-0.7071, -0.7071,  1.4142],
         [-1.2247,  1.2247,  0.0000],
         [-1.4142,  0.7071,  0.7071]]], grad_fn=<NativeLayerNormBackward0>)
2.0
0.6666667
2.3333333
2.888889
1.6666666
2.888889
2.0
0.0
1.0
0.0
2.6666667
0.88888884
1.0
2.6666667
3.3333333
5.555556
[[[-1.2247357   1.2247357   0.        ]
  [-1.3728105   0.980579    0.3922316 ]
  [-0.98057896 -0.39223155  1.3728106 ]
  [ 0.          0.          0.        ]]

 [[ 0.          0.          0.        ]
  [-0.70710295 -0.70710295  1.4142056 ]
  [-1.2247427   1.2247427   0.        ]
  [-1.4142123   0.7071062   0.7071062 ]]]

2.3 GroupNorm

If the batch size is too large or too small, BatchNorm is not a good choice; use GroupNorm instead.

(1) When the batch size is too large, BN normalizes all the data toward the same mean and variance, which can make training very unstable and convergence difficult.

(2) When the batch size is too small, BN may not be able to learn reliable statistics from the data.

For example, GroupNorm is used in Deformable DETR, as in the illustration below.
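As a sketch of this pattern (the channel and group counts here are illustrative assumptions, not taken from the Deformable DETR code):

import torch
import torch.nn as nn

# A 1x1 projection followed by GroupNorm, a pattern used when batch
# statistics are unreliable (e.g., detection models with small batches)
proj = nn.Sequential(
    nn.Conv2d(in_channels=2048, out_channels=256, kernel_size=1),
    nn.GroupNorm(num_groups=32, num_channels=256),  # 32 groups of 8 channels
)

x = torch.randn(2, 2048, 16, 16)  # small batch of backbone features
print(proj(x).shape)              # torch.Size([2, 256, 16, 16])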

Sample code verifying nn.GroupNorm against a manual computation:

import torch
import torch.nn as nn
import numpy as np

# Input of shape (batch_size=2, channels=4, height=2, width=2)
feature_array = np.array([[[[1, 0],  [0, 2]],
                           [[3, 4],  [1, 2]],
                           [[-2, 9], [7, 5]],
                           [[2, 3],  [4, 2]]],

                          [[[1, 2],  [-1, 0]],
                           [[1, 2],  [3, 5]],
                           [[4, 7],  [-6, 4]],
                           [[1, 4],  [1, 5]]]], dtype=np.float32)

feature_tensor = torch.tensor(feature_array.copy(), dtype=torch.float32)

# GroupNorm splits the 4 channels into 2 groups of 2 channels each
gn_out = nn.GroupNorm(num_groups=2, num_channels=4)(feature_tensor)
print(gn_out)

# Manual verification: fold (batch, group) into one axis, giving 4 units,
# each holding 2 channels of 2x2 spatial values
feature_array = feature_array.reshape((2, 2, 2, 2, 2)).reshape((4, 2, 2, 2))

for i in range(feature_array.shape[0]):
    group = feature_array[i, :, :, :]
    mean = group.mean()
    var = group.var()
    print(mean)
    print(var)

    feature_array[i, :, :, :] = (group - mean) / np.sqrt(var + 1e-5)

# Restore the original (batch, channel, height, width) layout
feature_array = feature_array.reshape((2, 2, 2, 2, 2)).reshape((2, 4, 2, 2))
print(feature_array)

The running results show:

tensor([[[[-0.4746, -1.2339],
          [-1.2339,  0.2847]],

         [[ 1.0441,  1.8034],
          [-0.4746,  0.2847]],

         [[-1.8240,  1.6654],
          [ 1.0310,  0.3965]],

         [[-0.5551, -0.2379],
          [ 0.0793, -0.5551]]],


        [[[-0.3618,  0.2171],
          [-1.5195, -0.9406]],

         [[-0.3618,  0.2171],
          [ 0.7959,  1.9536]],

         [[ 0.4045,  1.2136],
          [-2.2923,  0.4045]],

         [[-0.4045,  0.4045],
          [-0.4045,  0.6742]]]], grad_fn=<NativeGroupNormBackward0>)
1.625
1.734375
3.75
9.9375
1.625
2.984375
2.5
13.75
[[[[-0.4745776  -1.2339017 ]
   [-1.2339017   0.28474656]]

  [[ 1.0440707   1.8033949 ]
   [-0.4745776   0.28474656]]

  [[-1.8240178   1.6654075 ]
   [ 1.0309665   0.3965256 ]]

  [[-0.55513585 -0.23791535]
   [ 0.07930512 -0.55513585]]]


 [[[-0.3617867   0.21707201]
   [-1.5195041  -0.9406454 ]]

  [[-0.3617867   0.21707201]
   [ 0.79593074  1.9536481 ]]

  [[ 0.40451977  1.2135593 ]
   [-2.2922788   0.40451977]]

  [[-0.40451977  0.40451977]
   [-0.40451977  0.67419964]]]]
