Original link: What is the difference between Batch/Layer normalization?
Hello everyone, I am Tai Ge. Before training a model, we usually need to normalize the data to speed up convergence. This article introduces the usage scenarios of batch normalization (BN) and layer normalization (LN).
1 Why is BN used more in ML?
Suppose we have a batch of personnel data with three features: age, height, and weight. We want to predict gender from these three features, and before prediction we must first normalize the data.
ML & batch normalization
BN normalizes each column (feature) separately: it computes one mean per feature, e.g. one mean over all the ages, one over all the heights, and one over all the weights.
BN is therefore a kind of "column normalization": the same dimension is normalized across all the data in the batch, so with 3 features there are 3 means.
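To make this concrete, here is a minimal numpy sketch of the column-wise normalization; all feature values are made-up numbers for illustration:

```python
import numpy as np

# Toy batch: 3 people x 3 features (age, height in cm, weight in kg).
# All values are made up for illustration.
X = np.array([
    [25.0, 170.0, 65.0],
    [32.0, 180.0, 80.0],
    [28.0, 160.0, 55.0],
])

# BN-style "column normalization": one mean and variance per feature column.
col_mean = X.mean(axis=0)                # shape (3,): one mean per feature
col_std = X.std(axis=0)
X_bn = (X - col_mean) / (col_std + 1e-5)

print(col_mean)  # 3 means: mean age, mean height, mean weight
print(X_bn)      # every column now has ~zero mean and ~unit variance
```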
ML & layer normalization
LN, by contrast, normalizes each row of the data: it looks at a single sample and computes the mean over all of that sample's features.
LN is a kind of "row normalization": it normalizes all dimensions of a single sample.
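For contrast, here is the row-wise version of the same sketch; note what the per-row mean actually averages:

```python
import numpy as np

# Same toy batch: 3 people x (age, height in cm, weight in kg).
X = np.array([
    [25.0, 170.0, 65.0],
    [32.0, 180.0, 80.0],
    [28.0, 160.0, 55.0],
])

# LN-style "row normalization": one mean and variance per sample (row).
# Note that each mean mixes incompatible units (years, cm, kg).
row_mean = X.mean(axis=1, keepdims=True)   # shape (3, 1): one mean per person
row_std = X.std(axis=1, keepdims=True)
X_ln = (X - row_mean) / (row_std + 1e-5)

print(row_mean.ravel())  # e.g. mean of [25, 170, 65] is ~86.7 -- hard to interpret
print(X_ln)
```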
Why is BN used in ML?
Here you can see the problem with LN: taking the mean of a person's age, height, and weight and normalizing by it is neither reasonable nor interpretable. BN does not have this problem, because every value in a column shares the same unit and attribute.
In machine learning tasks, each column of the data is one feature, the processed data should remain interpretable, and different columns have different units and attributes, so BN is used more in machine learning tasks.
2 Why is LN used more in NLP?
Now suppose 4 pieces of text form one batch, and assume each word embedding has dimension 1, so the batch is a grid with 4 rows (sentences) and one column per token position.
NLP & batch normalization
With BN, each column of features is normalized: the words at the same position in the four texts are normalized together, for example the first characters of the four sentences: Tian, Gong, Yao, and Ying.
Doing so destroys a word's original meaning within its own sentence.
NLP & layer normalization
LN, by contrast, normalizes each sentence on its own.
After normalization, the word embeddings within a sentence follow the same distribution.
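A minimal numpy sketch of the two choices on this toy setup (4 sentences, embedding size 1; the token values are made up):

```python
import numpy as np

# Toy NLP batch: 4 sentences x 6 token positions, embedding size 1,
# so each entry stands for one word's (made-up) scalar embedding.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 6))

# BN-style: normalize each column, i.e. the words at the same position
# across the 4 sentences (like Tian / Gong / Yao / Ying in column 0).
bn_out = (tokens - tokens.mean(axis=0)) / (tokens.std(axis=0) + 1e-5)

# LN-style: normalize each row, i.e. all the words within one sentence.
ln_out = (tokens - tokens.mean(axis=1, keepdims=True)) / (
    tokens.std(axis=1, keepdims=True) + 1e-5
)

print(bn_out.mean(axis=0))  # ~0 per position: words mixed across sentences
print(ln_out.mean(axis=1))  # ~0 per sentence: each sentence handled on its own
```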
3 Root causes
In ML, the input data is generally a matrix in which every value in a column has the same attribute, so BN is used more often.
In NLP, the data generally has shape [batch_size, seq_len, dim_size], and what we ultimately want to normalize are the word vectors within a sentence, so LN is used more.
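As a sketch of how this looks in practice with PyTorch (shapes are made up; nn.LayerNorm(dim_size) normalizes each word vector over its last dimension):

```python
import torch
import torch.nn as nn

batch_size, seq_len, dim_size = 4, 6, 8   # made-up shapes
x = torch.randn(batch_size, seq_len, dim_size)

# LN as used in Transformers: normalize each word vector over dim_size.
ln = nn.LayerNorm(dim_size)
y = ln(x)
print(y.shape)                      # torch.Size([4, 6, 8])
print(y.mean(dim=-1).abs().max())   # ~0: every word vector has zero mean

# BN on the same tensor would instead need BatchNorm1d(dim_size) on a
# transposed view, normalizing each embedding dimension across the batch:
bn = nn.BatchNorm1d(dim_size)
z = bn(x.transpose(1, 2)).transpose(1, 2)
print(z.shape)                      # torch.Size([4, 6, 8])
```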
4 Summary
In terms of the operation, BN acts on all the data in the same batch, while LN acts on a single sample.
From the feature dimension: BN normalizes the same dimension across all the data in a batch, so there are as many means and variances as there are dimensions; LN normalizes all dimensions of a single sample, so each sample in a batch has its own mean and variance, i.e. batch_size means and variances in total.
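A quick numpy check of this counting claim (shapes are made up):

```python
import numpy as np

batch_size, dim = 4, 3                     # made-up shapes
X = np.random.randn(batch_size, dim)

bn_means = X.mean(axis=0)   # shape (3,): as many means as there are dimensions
ln_means = X.mean(axis=1)   # shape (4,): one mean per sample, batch_size total
print(bn_means.shape, ln_means.shape)      # (3,) (4,)
```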
More practical AI content is available on the official account [AI has temperature].