Comparing Normalization in deep learning: BN/LN/WN

0. Description

I've been running into normalization these past few days. Is it necessary to add a BN layer after every layer?

Note: the purpose of BN and LN is to keep the distribution of the inputs flowing between layers stable. Do not confuse this with the norm (energy) of the vector a layer outputs; that is unrelated.

Use https://zhuanlan.zhihu.com/p/33173246 to get up to speed quickly.

Thanks to the author~ (to express my thanks, I left this note under the original author's post; here~)

1. Purpose

The question of whether, and why, we need normalization at all.

1.1 Independent and identically distributed (i.i.d.) data, and whitening

We would like the input of a NN to be i.i.d., so data preprocessing involves two steps, which Juliuszh calls whitening.

I find it easier to understand by splitting this into two cases.

 

The i.i.d. between data points:

  • Remove the correlation between features  —> independence; my guess: de-correlation between the dimensions
  • Give all features the same mean and variance  —> the same distribution; my guess: assume that x, under the different conditions present in the data (different people/languages), follows a normal distribution, so simply matching the mean and variance already makes them identically distributed, removing the effect of the condition (the person)

The above is easy to understand: try to keep the distribution of each batch consistent during training, and keep the distribution at prediction time the same as well.

The i.i.d. between the values within a single data point:

  • Remove the correlation between features  —> independence;
  • Give all features the same mean and variance  —> the same distribution;

This one is less intuitive: it means keeping the numbers in every dimension roughly N(0, 1), mutually independent and uncorrelated, so that they are treated more equally by the corresponding neuron weights w.
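As a small illustration of the "same mean and variance" step, here is a minimal NumPy sketch (my own, not from the original article); note it only matches the first two moments of each dimension, while removing the correlation between dimensions is the extra step that whitening adds:

```python
import numpy as np

# Toy data: 1000 samples, 3 feature dimensions with very different scales.
X = np.random.randn(1000, 3) * np.array([1.0, 10.0, 100.0]) + np.array([0.0, 5.0, -20.0])

# Standardize each dimension to zero mean, unit variance (statistics over the samples).
mean = X.mean(axis=0)
std = X.std(axis=0) + 1e-8          # small epsilon avoids division by zero
X_std = (X - mean) / std

print(X_std.mean(axis=0))           # ~[0, 0, 0]
print(X_std.std(axis=0))            # ~[1, 1, 1]
```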

 

 

For example, one could use  PCA Whitening , but I have never seen a neural network that first preprocesses its input data with PCA.

http://ufldl.stanford.edu/tutorial/unsupervised/PCAWhitening/
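Since the page above describes the procedure, here is a rough NumPy sketch of PCA whitening as I understand it (a sketch under the assumption that samples are stored as rows; eps is a small constant for numerical stability):

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    """PCA whitening: rotate into the eigenbasis of the covariance, then
    rescale each component to unit variance. X has shape (n_samples, n_features)."""
    X = X - X.mean(axis=0)                      # zero-center each feature
    cov = np.cov(X, rowvar=False)               # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric matrix -> eigh
    X_rot = X @ eigvecs                         # decorrelate (rotate)
    return X_rot / np.sqrt(eigvals + eps)       # unit variance per component

X = np.random.randn(500, 4) @ np.random.randn(4, 4)   # correlated toy data
Xw = pca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))    # ~ identity matrix
```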

The author also said:

The theoretically best approach would be to whiten the data at every layer. However, standard whitening is expensive, and we also want the whitening operation to be differentiable, so that gradients can flow through it during backpropagation;

Hence Normalization was introduced.

 

1.2 Internal Covariate Shift in Deep Learning

An important and harmful phenomenon across the layers is Internal Covariate Shift (ICS):

  • Why are deep neural network models hard to train? Because deep networks stack many layers on top of each other
  • Layers close to the input are called the lower layers; layers close to the output are called the higher layers
  • A parameter update in any layer changes the input distribution of the layers above it; stacked across many layers, the input distribution of the high layers can change drastically
  • So the high layers must constantly re-adapt to the parameter updates of the low layers
  • To train the model at all, we have to set the learning rate, weight initialization, and parameter-update strategy very carefully

In short, the input to each neuron is no longer "independent and identically distributed", which causes two problems:

  • First, the upper layers must constantly adapt to the new input distribution, which slows down learning; every layer's update affects the other layers, so each layer's parameter-update strategy has to be as cautious as possible
  • Second, drift in the lower layers can push the upper layers' inputs toward very large or very small values, driving them into the saturation zone so that learning stops prematurely. TODO: I don't fully understand this yet, go ask (see the small sketch after this list)
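About the "saturation zone" point, a tiny sketch of my own (not from the original article): with a saturating activation such as sigmoid, once the pre-activations drift to large magnitudes the gradient is almost zero, so that layer effectively stops learning.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Pre-activations near 0 vs. drifted into the saturation zone.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.4f}  grad={sigmoid_grad(z):.6f}")
# The gradient shrinks from 0.25 toward ~0 as |z| grows: gradients vanish, learning stalls.
```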

2.  The general framework and basic ideas of Normalization

See this part when you have time

Pay close attention to g and b! They shift and scale the value again after normalization, and they are defined differently in the different methods.
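As a reminder to myself (this is my paraphrase of the linked article, not a quote), all of the methods below fit one unified form:

h = f( g ⊙ (x − μ) / σ + b )

where μ and σ are the normalization statistics (what differs between BN/LN/WN is the set over which they are computed), and g and b are the learned re-scaling and re-shifting parameters.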

2.1. Basic idea

(Omitted.)

2.2. Batch Normalization

BN normalizes each neuron individually: during training, the data in a mini-batch is used to compute that neuron's mean and variance, hence the name Batch Normalization.

Therefore:

Each mini-batch should be roughly identically distributed with the other mini-batches and with the overall data. When the gap between mini-batch distributions is small, the normalization effectively injects a little noise into training, which can make the model more robust; but if the mini-batches originally have very different distributions, different mini-batches will undergo different transformations, which makes training harder.

 

This requires:

  • A reasonably large mini-batch size
  • Data distributions that are fairly close across mini-batches
  • A full shuffle of the data before training

During training, compute the first-order statistic (mean) and second-order statistic (variance) of each mini-batch on the fly, and also keep running versions of these two values to use when is_training=False.

Pay attention to g and b! They shift and scale the normalized value again; in BN every neuron has its own g and b, used to pull the values back into a range where training still gets useful gradients. So within one vector, different dimensions end up on different scales; but within a batch, across samples, the same neuron (dimension) shares one scale.
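A minimal NumPy sketch of the BN forward pass (my own illustration, not the author's code; gamma and beta play the role of g and b): statistics are computed per neuron over the mini-batch during training, and running averages are kept for is_training=False.

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.1, eps=1e-5):
    """x: (batch, features). gamma/beta: one g and one b per neuron (feature)."""
    if training:
        mu = x.mean(axis=0)                       # per-neuron mean over the batch
        var = x.var(axis=0)                       # per-neuron variance over the batch
        # running statistics, saved for is_training=False
        running_mean[:] = (1 - momentum) * running_mean + momentum * mu
        running_var[:]  = (1 - momentum) * running_var  + momentum * var
    else:
        mu, var = running_mean, running_var       # use the saved statistics at inference
    x_hat = (x - mu) / np.sqrt(var + eps)         # normalize each neuron
    return gamma * x_hat + beta                   # re-scale and re-shift (g, b)

# Usage: 4 neurons, batch of 32.
gamma, beta = np.ones(4), np.zeros(4)
rm, rv = np.zeros(4), np.ones(4)
out = batch_norm(np.random.randn(32, 4) * 3 + 7, gamma, beta, rm, rv, training=True)
print(out.mean(axis=0), out.std(axis=0))          # ~0 and ~1 per neuron
```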

Pay particular attention to the case where the distributions differ across mini-batches.

2.3. Layer Normalization

I still don't understand why Layer Normalization can be used for RNNs while Batch Normalization cannot.

Pay attention to g and b! Shift and scale again: in LN all neurons in the layer share a single mean and variance (computed across the whole layer for one sample), which pull the values back into a trainable range, so the numbers in the different dimensions of the vector end up on the same scale. (The gain g and bias b themselves are still per neuron in the original LN paper.)
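For contrast, a minimal NumPy sketch of the LN forward pass (again my own illustration): the mean and variance are computed across all neurons of a single sample, so nothing depends on the batch; I keep the gain/bias per neuron, following the original LN paper.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features). One mean/variance per sample, computed across
    the features (the whole layer), so nothing depends on the batch."""
    mu = x.mean(axis=1, keepdims=True)            # per-sample mean over all neurons
    var = x.var(axis=1, keepdims=True)            # per-sample variance over all neurons
    x_hat = (x - mu) / np.sqrt(var + eps)
    # per-neuron gain/bias, following the original LN paper (Ba et al. 2016)
    return gamma * x_hat + beta

gamma, beta = np.ones(4), np.zeros(4)
x = np.random.randn(2, 4) * np.array([1.0, 10.0, 100.0, 1000.0])  # wildly different scales
out = layer_norm(x, gamma, beta)
print(out.mean(axis=1), out.std(axis=1))          # ~0 and ~1 for every single sample
```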

Note that LN does not depend on the batch at all, so the concern about differing mini-batch distributions goes away here.

My feeling is that LN is not that great... I'll come back to this when I actually run into it.

2.4. Other Normalizations

(Omitted.)

Origin blog.csdn.net/u013625492/article/details/112527816