What information does Batch Normalization learn about the data?

Hello everyone, I am Tai Ge. The previous section introduced how data normalization developed. In this section we will walk through how Batch Normalization is computed, see what makes it effective, and why it keeps the model's gradients stable.

1 Normalization does not change the data distribution

The essence of any normalization is to translate and scale the data.

  • Translation: add the same number to (or subtract it from) every value in a column of the data set
  • Scaling: multiply (or divide) every value in a column of the data set by the same number
import torch
import matplotlib.pyplot as plt

# Generate 200 "yellow" points, labeled 1
yellow = torch.normal(mean = 2, std = 2, size=(200, 2))
yellow_label = torch.ones(size = (yellow.shape[0], 1))

# Generate 200 "purple" points, labeled 0
purple = torch.normal(mean = 8, std = 2, size=(200, 2))
purple_label = torch.zeros(size = (purple.shape[0], 1))

# Concatenate the two data sets
data = torch.cat((yellow, purple), dim = 0)
label = torch.cat((yellow_label, purple_label), dim = 0)

# Plot the distribution
plt.scatter(data[:, 0], data[:, 1], c = label)

Translating and scaling the data does not change the distribution of the data's features.

Next, we normalize it with Z-score and then compare the distribution of the data set before and after normalization:

def z_score(t):
    # subtract each column's mean and divide by its standard deviation
    std = t.std(0)
    mean = t.mean(0)
    ans = (t - mean) / std
    return ans

zs_data = z_score(data)

# Compare the data distributions before and after normalization
plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.scatter(data[:, 0], data[:, 1], c = label)
plt.title('data distribution')
plt.subplot(122)
plt.scatter(zs_data[:, 0], zs_data[:, 1], c = label)
plt.title('zs_data distribution')

It can be seen that the coordinate values themselves change, but the shape of the data distribution is the same before and after normalization.

2 Normalization is actually an affine transformation

An affine transformation of the data is written in matrix form as $\hat x = x * w + b$, where $x$ is the original data, $w$ is the parameter matrix, $b$ is the intercept, and $\hat x$ is the transformed data. Taking Z-score as an example, let us see how the normalization operation can be converted into an affine transformation.

During the normalization operation, we performed the following computation:

$$\frac{x - mean(x)}{std(x)}$$

With a little transformation, it can be written as the following expression:

$$\frac{x - mean(x)}{std(x)} = \frac{x}{std(x)} - \frac{mean(x)}{std(x)} = x \cdot \frac{1}{std(x)} - \frac{mean(x)}{std(x)}$$

This has exactly the same form as $\hat x = x * w + b$, but in BN it is written as $x \otimes \gamma + \beta$, where $\otimes$ denotes element-wise multiplication and $\gamma$ and $\beta$ are the corresponding parameters.
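
To make this concrete, here is a quick check, reusing the `data` tensor and the `z_score` function defined above (the variable names `gamma` and `beta` are just for illustration): Z-score really is an element-wise affine transform with $\gamma = \frac{1}{std(x)}$ and $\beta = -\frac{mean(x)}{std(x)}$.

gamma = 1 / data.std(0)               # per-column scaling factor, 1 / std(x)
beta = -data.mean(0) / data.std(0)    # per-column shift, -mean(x) / std(x)

affine = data * gamma + beta          # x ⊗ γ + β, broadcast over the rows
print(torch.allclose(affine, z_score(data)))   # True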

3 Set translation and scaling as parameters

Then, within the normalization operation, the scaling part can be carried out as a matrix multiplication, while the translation part is even simpler: it amounts to adding to the data set a vector made up of each column's shift.
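
As a small sketch of that idea (again reusing `data` and `z_score` from above), the per-column scaling can be packed into a diagonal matrix and the translation into a vector:

W = torch.diag(1 / data.std(0))       # 2x2 diagonal "weight" matrix holding 1/std per column
b = -data.mean(0) / data.std(0)       # translation vector, one entry per column

affine = data @ W + b                 # x * w + b, the same result as the Z-score
print(torch.allclose(affine, z_score(data)))   # True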

In practice, we treat the normalization operation as a special kind of linear layer, which greatly expands where normalization can appear. In classical machine learning, normalization is applied only once, when the data is fed in; during the subsequent iterations the data gradually loses the good properties that this initial processing gave it. If instead we can insert a normalization layer before or after any hidden layer, just as we do with linear layers, we can avoid the gradient instability caused by the data gradually drifting during training.
$$grad_3 = \frac{\partial loss}{\partial \hat y} \cdot F(F(X * w_1) * w_2)$$
In fact, normalizing to a mean of 0 and a variance of 1 is not necessarily the optimal choice. From gradient formulas like the one for $grad_3$ above, it is not hard to see that the best normalization is not an absolute zero-mean, unit-variance transform, but whatever keeps the gradients balanced after the variables involved (inputs, parameters, activation functions) are multiplied together.
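
As a sketch of what inserting a normalization layer next to each hidden layer might look like in PyTorch (the layer sizes and the sigmoid activation are chosen purely for illustration; `data` is the 400 x 2 tensor built in section 1):

import torch.nn as nn

# a small network in which a BN layer follows each hidden linear layer,
# so the data is re-normalized as it flows through the model
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.BatchNorm1d(16),   # normalization used like a special linear layer
    nn.Sigmoid(),
    nn.Linear(16, 16),
    nn.BatchNorm1d(16),
    nn.Sigmoid(),
    nn.Linear(16, 1),
)

out = model(data)          # forward pass over the data set from section 1
print(out.shape)           # torch.Size([400, 1])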

4 BN Actual Process

The translation and scaling in BN are carried out in two stages:

  • In the first stage, the data is processed with Z-score.
  • In the second stage, a parameterized translation is applied to the data's mean and a parameterized scaling to its variance on top of that.

In fact, the two stages do the same thing: translate and scale using parameters. We can think of the first stage as using the data's own mean and variance as the initial values, and of the second stage as continually updating those parameters through backpropagation.

By the end of training, the distribution of the data moves closer to one with variance $\gamma$ and mean $\beta$.
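
A minimal, simplified sketch of those two stages (training mode only; it ignores the running statistics that a real BatchNorm layer also keeps, and `SimpleBN` is just an illustrative name):

import torch.nn as nn

class SimpleBN(nn.Module):
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        # stage 2 parameters: start as the identity, then updated by backpropagation
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.eps = eps

    def forward(self, x):
        # stage 1: Z-score the current batch
        x_hat = (x - x.mean(0)) / torch.sqrt(x.var(0, unbiased=False) + self.eps)
        # stage 2: scale by gamma and shift by beta
        return x_hat * self.gamma + self.beta

bn = SimpleBN(2)
out = bn(data)
print(out.mean(0), out.std(0))   # roughly 0 mean and unit std while gamma=1, beta=0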

5 Variants derived from BN

After BN appeared, a number of methods based on the same principle were derived for different scenarios, such as Layer Normalization, Instance Normalization, and Group Normalization.

They simply choose different dimensions of the data over which to scale and translate; otherwise they follow the same principle as BN.
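
A quick illustration of what "different dimensions" means, using PyTorch's built-in layers on a made-up batch of 4 samples with 3 features:

import torch
import torch.nn as nn

x = torch.randn(4, 3) * 5 + 10   # hypothetical (batch, features) tensor

bn = nn.BatchNorm1d(3)           # normalizes each feature across the batch
ln = nn.LayerNorm(3)             # normalizes each sample across its features

print(bn(x).mean(0))             # per-feature (column-wise) means, roughly 0
print(ln(x).mean(1))             # per-sample (row-wise) means, roughly 0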

Original article: blog.csdn.net/Antai_ZHU/article/details/124397667