What is the difference between Batch Normalization and Layer Normalization?


Hello everyone, I am Tai Ge. Before training a model, we usually need to normalize the data to speed up convergence. This article introduces the usage scenarios of batch normalization (BN) and layer normalization (LN).

1 Why is BN used more in ML?


Suppose we have a batch of internal personnel data with three features: age, height, and weight. We want to predict gender from these three features, and before making predictions we must first normalize the data.

ML & batch normalization

BN normalizes each column of features, computing one mean (and variance) per column:

BN is a kind of "column normalization": the same dimension is normalized across every sample in the batch, so with three features there are three means.
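A minimal NumPy sketch of this column normalization (the numbers are made up for illustration):

```python
import numpy as np

# A hypothetical batch of 3 samples with features [age, height(cm), weight(kg)]
x = np.array([[20.0, 170.0, 60.0],
              [25.0, 180.0, 75.0],
              [30.0, 160.0, 55.0]])

# BN-style "column normalization": one mean/std per feature (axis=0)
mean = x.mean(axis=0)            # shape (3,): one mean per column
std = x.std(axis=0)              # shape (3,): one std per column
x_bn = (x - mean) / (std + 1e-5)

print(mean)  # 3 means, one per feature dimension
```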

ML & layer normalization

LN, by contrast, normalizes each row of the data. That is, it looks at one sample at a time and computes the mean over all of that sample's features:

LN is a kind of "row normalization": it normalizes all dimensions of a single sample.
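For contrast, a sketch of this row normalization on the same made-up matrix, with one mean/std per sample:

```python
import numpy as np

x = np.array([[20.0, 170.0, 60.0],
              [25.0, 180.0, 75.0],
              [30.0, 160.0, 55.0]])

# LN-style "row normalization": one mean/std per sample (axis=1)
mean = x.mean(axis=1, keepdims=True)   # shape (3, 1): one mean per row
std = x.std(axis=1, keepdims=True)
x_ln = (x - mean) / (std + 1e-5)

print(mean.ravel())  # one mean per sample; note it mixes age, height, weight
```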

Why does ML use BN?

Here you can see that LN makes no sense and is not interpretable: averaging a person's age, height, and weight and normalizing across them mixes incompatible units. BN has no such problem, because within each column the unit and attribute are the same.

In machine learning tasks, each column of the data usually represents one feature, the processed features are usually interpretable, and the units and attributes differ between columns, so BN is used more in machine learning tasks.

2 Why is LN used more in NLP?

[Figure: a batch of 4 text sequences, one embedding value per word]

The figure above shows a batch made up of 4 pieces of text data, and we assume each word's embedding dimension is 1.

NLP & batch normalization

With BN, each column of features is normalized, meaning the words at the same position across the four texts are normalized together, for example: Tian, Gong, Yao, and Ying.

[Figure: BN normalizing words at the same position across the four sentences]

Doing so destroys the meaning a word originally carries in its own sentence, since words at the same position in different sentences have nothing to do with each other.

NLP & layer normalization

LN, by contrast, normalizes within each sentence.
[Figure: LN normalizing all words within each sentence]

After normalization, the word embeddings within a sentence follow the same distribution.

3 Root causes

The data fed into ML models is generally a matrix in which each column shares the same attribute, so BN is used more often.

In NLP, the data shape is generally [batch_size, seq_len, dim_size], and what we ultimately want is to normalize the word vectors within a sentence, so LN is used more.
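A PyTorch sketch of this shape difference (batch_size=4, seq_len=5, dim_size=8 are assumed for illustration): nn.LayerNorm(dim_size) normalizes each word vector over the last dimension, while nn.BatchNorm1d expects the channel dimension second, so the tensor must be transposed first:

```python
import torch
import torch.nn as nn

batch_size, seq_len, dim_size = 4, 5, 8
x = torch.randn(batch_size, seq_len, dim_size)

# LN: normalizes over the last dim, i.e. each word vector separately
ln = nn.LayerNorm(dim_size)
out_ln = ln(x)                                   # shape (4, 5, 8)

# BN: nn.BatchNorm1d expects (N, C, L), so move dim_size to the channel axis;
# it then normalizes each embedding channel across the batch and all positions
bn = nn.BatchNorm1d(dim_size)
out_bn = bn(x.transpose(1, 2)).transpose(1, 2)   # back to (4, 5, 8)
```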

4 Summary

In terms of the operation itself, BN acts across all the data in the same batch, while LN acts on a single sample.

From the feature-dimension view, BN normalizes the same dimension across the whole batch, so there are as many means and variances as there are dimensions; LN normalizes all dimensions of a single sample, so one batch has batch_size means and variances.
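A tiny NumPy check of that count (shapes assumed for illustration):

```python
import numpy as np

batch_size, dim = 4, 3
x = np.random.randn(batch_size, dim)

print(x.mean(axis=0).shape)  # (3,) -> BN: one mean per dimension
print(x.mean(axis=1).shape)  # (4,) -> LN: one mean per sample (batch_size)
```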

More practical AI content is available on the official account [AI has temperature].


Origin: blog.csdn.net/Antai_ZHU/article/details/121272709