Comparison of LN and BN

LN paper: https://arxiv.org/pdf/1607.06450.pdf

Two different normalization methods are used in CNN and Transformer architectures: CNNs typically use Batch Normalization (BN), while Transformers use Layer Normalization (LN). Understanding the difference between the two deepens our understanding of deep learning models.

1. The difference between the two methods

LN: Layer Normalization. LN is "horizontal": for each individual sample, it normalizes across the different neurons (features) within a layer.

BN: Batch Normalization. BN is "vertical": each feature dimension is normalized across the samples in a batch, so it depends on the batch size.

The purpose of both is to speed up model convergence and reduce training time.
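As a rough sketch of this "horizontal vs. vertical" picture (a minimal NumPy illustration, assuming a 2-D activation matrix of shape (batch, features); the shapes and variable names are only for this example, and the learnable scale/shift parameters γ and β used by real implementations are omitted):

```python
import numpy as np

eps = 1e-5
x = np.random.randn(8, 16)   # (batch, features): 8 samples, 16 neurons

# BN ("vertical"): one mean/variance per feature, computed across the batch
bn_mean = x.mean(axis=0, keepdims=True)   # shape (1, 16)
bn_var = x.var(axis=0, keepdims=True)
x_bn = (x - bn_mean) / np.sqrt(bn_var + eps)

# LN ("horizontal"): one mean/variance per sample, computed across its own features
ln_mean = x.mean(axis=1, keepdims=True)   # shape (8, 1)
ln_var = x.var(axis=1, keepdims=True)
x_ln = (x - ln_mean) / np.sqrt(ln_var + eps)
```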

2. BN solves the Covariate Shift problem in the network

Batch Normalization (BN) was born to overcome the difficulty of training deep neural networks. As the depth of a network increases, training becomes harder and harder and convergence becomes very slow, often leading to the vanishing gradient problem.

There is a classic assumption in statistical machine learning: the data distributions of the source domain and the target domain are consistent. In other words, the training data and the test data follow the same distribution. This is the basic guarantee that a model trained on the training data will perform well on the test set.
Covariate shift refers to the case where the distribution of the training samples and the distribution of the target samples are inconsistent, so that the trained model fails to generalize well. It is a sub-problem of the inconsistent-distribution setting: the conditional probabilities of the source domain and the target domain are the same, but their marginal probabilities differ. For a neural network, the output distribution of each layer, after the in-layer computation, differs from the distribution of its input signal, and this difference grows as the network gets deeper; the label each layer is ultimately fitting, however, remains unchanged.
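Written out in symbols (using P_source for the training distribution and P_target for the target distribution), covariate shift says the conditional distribution is shared while the input distribution is not:

P_source(y | x) = P_target(y | x),  but  P_source(x) ≠ P_target(x)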

Solution: In general, a correction is applied to the training samples according to the ratio between the training and target distributions. Inside a network, this is done by introducing Batch Normalization to standardize the inputs of some or all layers, so that the mean and variance of each layer's inputs are held fixed.

Method: Batch Normalization is usually applied before the nonlinear mapping (the activation function). It standardizes x = Wu + b so that every dimension of the result has mean 0 and variance 1. Giving each layer an input with a stable distribution makes the network easier to train.
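A minimal PyTorch sketch of this placement (the layer sizes are arbitrary and only for illustration):

```python
import torch.nn as nn

# BN sits between the affine map x = Wu + b and the nonlinearity, so the
# activation function receives inputs with roughly zero mean and unit variance.
block = nn.Sequential(
    nn.Linear(128, 64),   # x = Wu + b
    nn.BatchNorm1d(64),   # standardize each of the 64 output dimensions over the batch
    nn.ReLU(),            # nonlinear mapping applied to the normalized input
)
```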

Advantages: Through standardization, Batch Normalization keeps the inputs to the activation function in its roughly linear (non-saturated) region. As a result the gradients become larger and gradient descent can take bolder steps. It has the following advantages:

  • A larger search step size, which speeds up convergence;
  • It is easier to jump out of local minima;
  • It perturbs the original data distribution, which alleviates overfitting to some extent.

Therefore, you can try Batch Normalization when a neural network converges very slowly or cannot be trained because of exploding gradients (Gradient Explosion).

3. Defects of BN

1. BN standardizes each dimension across the samples in a batch, so the larger the batch size, the more reliable the estimates of μ and σ used for normalization; BN therefore depends heavily on the batch size.
2. During training the model is fed in batches, but at prediction time there may be only one sample, or very few samples, to run inference on; using batch statistics in that case is clearly biased, for example in online learning scenarios (see the sketch after this list).
3. An RNN is a dynamic network: sequence lengths vary from sample to sample, so the time dimension cannot be aligned across samples, and BN is not well suited to it.
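As a small illustration of points 1 and 2 (a minimal PyTorch sketch; the shapes and companion samples are made up for this example): in training mode BN normalizes with the statistics of the current batch, so the very same sample is transformed differently depending on which other samples happen to share its batch, and with tiny batches those statistics are unreliable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)                 # training mode: uses the current batch's statistics

x = torch.randn(1, 4)                  # the sample we care about
mate_a = torch.randn(1, 4)             # an ordinary companion sample
mate_b = torch.randn(1, 4) * 10.0      # a very different companion sample

out_a = bn(torch.cat([x, mate_a]))[0]  # x normalized with one set of batch statistics
out_b = bn(torch.cat([x, mate_b]))[0]  # the same x, normalized quite differently
print(out_a)
print(out_b)
```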

4. Advantages of LN

1. Layer Normalization standardizes within each individual sample, so it is independent of the batch size and unaffected by it (see the sketch after this list).
2. LN inside an RNN is likewise unaffected, because every sample (and every time step) is standardized on its own, so LN has a wider range of applications.
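A minimal PyTorch sketch of this independence (the shapes are arbitrary): LN draws its statistics from inside each sample, so a batch of one and variable-length sequences are both unproblematic.

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(4)          # normalizes over the last dimension (the 4 features)

x = torch.randn(1, 4)         # a single sample is fine: its statistics come from its own features
print(ln(x))

seq = torch.randn(1, 7, 4)    # (batch, time, features): any sequence length works,
print(ln(seq).shape)          # each time step is normalized independently
```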


Origin blog.csdn.net/m0_53675977/article/details/129894259