Understanding the role of the BatchNormalization layer




Preface

Batch Normalization (BN), one of the important recent achievements in deep learning, has been widely proven to be effective and important. Although its theoretical underpinnings have not been fully explained, practice shows that it simply works well and is easy to use. Keep in mind that deep learning has been a largely empirical discipline ever since Hinton introduced pre-training for deep networks: its practice runs ahead of its theoretical analysis. This article is an introduction to the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".
There is a very important assumption in machine learning: the IID (independent and identically distributed) assumption, which says that the training data and the test data are drawn from the same distribution. This is the basic guarantee that a model trained on the training data will also perform well on the test set. So what does BatchNorm do? During the training of a deep neural network, BatchNorm keeps the input distribution of each layer the same.

Next, let's understand step by step what BN is.

Why do deep neural networks become harder to train and slower to converge as the network gets deeper? This is a good question that touches the essence of deep learning, and many lines of work address it, such as the ReLU activation function and Residual Networks. BN essentially explains and tackles this problem from a different perspective.

1. “Internal Covariate Shift” problem

As can be seen from the paper's title, BN is meant to solve the "Internal Covariate Shift" problem, so first we must understand what "Internal Covariate Shift" is.

The paper first explains two advantages of Mini-Batch SGD over single-example SGD: the gradient update direction is more accurate, and parallel computation is fast. (Why mention this? Because BatchNorm is built on Mini-Batch SGD, the authors naturally praise it first — and it is also simply true.) It then complains about a shortcoming of SGD training: the hyperparameters are very troublesome to tune. (The implication is that many of SGD's shortcomings can be alleviated by using BN.)

Then the concept of covariate shift is introduced: if the distribution of the input X to an ML system changes, the model no longer matches the assumptions made at training time, and the system has to keep learning in order to adapt to the shifted distribution. For deep learning, where the network contains many hidden layers, the parameters of every layer keep changing during training, so each hidden layer faces a covariate shift problem of its own: the distribution of a hidden layer's inputs keeps changing throughout training. This is the so-called "Internal Covariate Shift" — "internal" refers to the hidden layers of a deep network, i.e. the shift happens inside the network, not only at the input layer.

This leads to the basic idea of BatchNorm: can the activation input distribution of each hidden-layer node be fixed? Doing so would avoid the "Internal Covariate Shift" problem.

BN is not a good idea that appeared out of thin air; it has a source of inspiration. Earlier research in image processing showed that if the input images are whitened — whitening means transforming the input data to have zero mean and unit variance — the neural network converges faster. The BN authors then reasoned: an image is the input layer of a deep network, and whitening it speeds up convergence; but in a deep network, the neurons of any hidden layer are the inputs of the next layer, so every hidden layer is effectively an input layer relative to the layer after it. Can we then whiten the inputs to every hidden layer as well? This is the original idea that inspired BN, and it is exactly what BN does: it can be understood as a simplified whitening operation applied to the activation input of every hidden-layer neuron of a deep network.

2. The essential idea of BatchNorm

The basic idea of BN is actually quite intuitive. The distribution of a deep network's activation inputs before the nonlinearity (that is, x = WU + B, where U is the layer's input) gradually shifts as the network gets deeper and as training proceeds. Training converges slowly mainly because this overall distribution drifts toward the upper and lower saturation regions of the nonlinearity's input range (for the sigmoid function, this means WU + B takes large negative or large positive values), which makes the gradients of the lower layers vanish during backpropagation. This is the essential reason why training of deep networks converges more and more slowly. BN uses a normalization step to forcibly pull the input distribution of every neuron in every layer back to a standard normal distribution with mean 0 and variance 1 — in effect, forcing the increasingly skewed distribution back to a standard one, so that the activation inputs fall into a region where the nonlinearity is sensitive to its input. Small changes in the input then produce larger changes in the loss, the gradients become larger, the vanishing-gradient problem is avoided, and larger gradients mean faster convergence, which greatly speeds up training.

THAT'S IT. In one sentence: for each hidden-layer neuron, the input distribution that has gradually drifted toward the saturation regions of the nonlinearity is forced back to a standard normal distribution with mean 0 and variance 1, so that the nonlinearity's input falls into a region that is sensitive to changes, avoiding the vanishing-gradient problem. Because the gradients stay relatively large, the network's parameters update efficiently — large steps toward the optimum of the loss function — which means fast convergence. In the final analysis, BN is just such a mechanism: the method is very simple, and the principle is quite profound.
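This core pull-back to zero mean and unit variance can be sketched in a few lines of NumPy (the drifted distribution and batch size below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated pre-activations x = WU + B for one neuron over a mini-batch,
# whose distribution has drifted to mean -2, standard deviation 0.5.
x = rng.normal(loc=-2.0, scale=0.5, size=256)

# BN's core step: standardize with the mini-batch statistics.
eps = 1e-5  # small constant for numerical stability
x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)

print(abs(x_hat.mean()) < 1e-9, abs(x_hat.std() - 1.0) < 1e-3)  # True True
```

After the transform, the neuron's activation inputs have (approximately) mean 0 and standard deviation 1, regardless of where the distribution had drifted.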

The above is still abstract; the following illustrates more vividly what this adjustment means.

[Figure 1: Several normal distributions]

Assume the original activation input x of some hidden-layer neuron follows a normal distribution with mean −2 and variance 0.5, corresponding to the leftmost light-blue curve in the figure above. After BN, it is converted to a normal distribution with mean 0 and variance 1 (the dark-blue curve). What does that mean? The distribution of x has shifted to the right by 2 (the change in mean), and the curve has become flatter (the change in variance). In other words, for each hidden-layer neuron whose activation input distribution has drifted away from mean 0 and variance 1, BN shifts the mean and compresses or stretches the sharpness of the curve to adjust it back to a normal distribution with mean 0 and variance 1.

So what is the use of adjusting the activation input x to this distribution? First, let's look at what a standard normal distribution with mean 0 and variance 1 means:

[Figure 2: The standard normal distribution with mean 0 and variance 1]

This means that within one standard deviation — that is, with about 68% probability — the value of x falls in the range [−1, 1], and within two standard deviations — with about 95% probability — it falls in [−2, 2]. What does this imply? We know the activation input is x = WU + B, where U is the actual input and x is the pre-activation of a neuron. Assuming the nonlinearity is the sigmoid, look at the graph of sigmoid(x):
[Figure 3: sigmoid(x)]
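The one- and two-standard-deviation probabilities quoted above (about 68% within [−1, 1], about 95% within [−2, 2]) can be checked with the standard library's error function:

```python
import math

def prob_within(k: float) -> float:
    # P(|Z| < k) for a standard normal variable Z.
    return math.erf(k / math.sqrt(2.0))

print(round(prob_within(1), 4))  # 0.6827
print(round(prob_within(2), 4))  # 0.9545
```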

The derivative of sigmoid(x) is G′ = f(x)·(1 − f(x)). Because f(x) = sigmoid(x) lies between 0 and 1, G′ lies between 0 and 0.25. The corresponding graph is as follows:

[Figure 4: The derivative of sigmoid(x)]

Suppose that before BN, x follows a normal distribution with mean −6 and variance 1, so 95% of its values fall in [−8, −4]. The corresponding sigmoid(x) values are then clearly close to 0 — a typical gradient-saturation region, where the gradient changes very slowly. Why is it a saturation region? Look at the value of sigmoid(x): when it is close to 0 or close to 1, the corresponding derivative is close to 0, which means the gradient is tiny or vanishes entirely. After BN, with mean 0 and variance 1, 95% of the x values fall in the [−2, 2] interval, which is exactly the region where sigmoid(x) is close to a linear transformation. There, a small change in x produces a large change in the output of the nonlinearity, i.e. a large gradient; in the derivative graph this is the region where the derivative is significantly greater than 0 — the unsaturated region.
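The saturation argument can be made concrete with a small sketch comparing the sigmoid derivative at a saturated input (around −6, as in the pre-BN example) with a centered one:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)  # G' = f(x) * (1 - f(x))

print(sigmoid_grad(-6.0))  # ~0.0025: deep in the saturation region
print(sigmoid_grad(0.0))   # 0.25: the derivative's maximum, at the center
```

The gradient at the center is two orders of magnitude larger than in the saturated region, which is exactly why pulling activations back toward 0 helps backpropagation.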

From the figures above you should be able to see what BN is doing: it pulls the hidden-layer neuron's activation input x = WU + B from whatever skewed normal distribution it has drifted to back to a normal distribution with mean 0 and variance 1 — that is, it shifts the center of the original distribution left or right to mean 0, and stretches or compresses the curve so that the variance becomes 1. What does this mean? After BN, most activation values fall into the near-linear region of the nonlinearity, where the derivative is far from the saturation region, thereby accelerating training convergence.

But at this point, any reader with even a little knowledge of neural networks will naturally raise a question: if every layer goes through BN, isn't that equivalent to replacing the nonlinear function with a linear one? What would that mean? We know that a stack of linear transformations makes depth meaningless, because a multi-layer linear network is equivalent to a single-layer linear network.

The paper gives an example: in its middle segment, the sigmoid activation function is approximately linear (as shown in the figure below). After BN, the normalized data would use only this linear part.

[Figure: the near-linear middle segment of the sigmoid]

It can be seen that over the output range [0.2, 0.8] the sigmoid grows essentially linearly, and even over [0.1, 0.9] it is close to a linear function. If only this segment were used, wouldn't the network become a linear network? That is obviously not what anyone wants.

This means the network's expressive power declines, and the meaning of depth is lost. Therefore, to preserve nonlinearity, BN applies a scale-and-shift operation (y = scale·x + shift) to the transformed x that has mean 0 and variance 1: each neuron gains two parameters, scale and shift, both learned during training. Through scale and shift, the value is moved left or right from the standard normal distribution and made fatter or thinner, to a different degree for each neuron; this is equivalent to moving the nonlinearity's input away from the linear region around the center, back toward the nonlinear region. The core idea is presumably to find a good balance point between linearity and nonlinearity: enjoy the strong expressive power of the nonlinearity, while not getting so close to its two saturated ends that convergence slows down. Of course, this is my interpretation; the paper's authors do not say so explicitly. But the scale and shift operations are clearly debatable, because in the extreme the learned scale and shift could adjust the transformed x right back to its untransformed state — wouldn't that circle back to the original "Internal Covariate Shift" problem? I feel the authors never clearly explain the theoretical justification for the scale and shift operations.
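The observation that scale and shift can, in the extreme, undo the normalization entirely is easy to verify (the values below are illustrative; γ and β would normally be learned):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=512)  # drifted activations
eps = 1e-5

mu, var = x.mean(), x.var()
x_hat = (x - mu) / np.sqrt(var + eps)         # normalize

# Learnable scale and shift: y = gamma * x_hat + beta.
# In the extreme case gamma = sqrt(var + eps) and beta = mu,
# BN reduces to the identity and the original distribution comes back:
gamma, beta = np.sqrt(var + eps), mu
y = gamma * x_hat + beta

print(np.allclose(y, x))  # True
```

In practice the learned γ and β land somewhere between the identity and the fully standardized distribution, which matches the balance-point reading above.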

3. How to do BatchNorm during the training phase

The above is an abstract analysis and explanation of BN. How is BN actually done under Mini-Batch SGD? This part of the paper is in fact very clear and easy to understand; for completeness, a brief explanation follows.

Suppose a deep neural network contains the following two-layer structure:
[Figure 5: Two layers of a DNN]

To apply BN to the activation input of each hidden-layer neuron, imagine that each hidden layer gains an extra BN layer, placed after the linear activation X = WU + B is computed and before the nonlinear transformation, as shown below:
[Figure 6: The BN operation inserted before the nonlinearity]

For Mini-Batch SGD, one training step involves m training instances. The specific BN operation applies the following transformation to the activation input of each neuron in a hidden layer:
x̂(k) = (x(k) − E[x(k)]) / √(Var[x(k)] + ε)

(where E and Var are taken over the mini-batch, and ε is a small constant for numerical stability)
Note that x(k) for a neuron in layer t does not refer to the original network input, nor to the output of layer t−1's neurons after the nonlinearity; it is the linear activation x = WU + B of the layer-t neuron, where U is the output of layer t−1. The transformation means: for a given neuron, subtract from its activation x the mean E[x] of the m activations obtained from the m instances in the mini-batch, and divide by the standard deviation obtained from their variance Var[x].

As mentioned above, after this transformation the activation x of a given neuron follows a distribution with mean 0 and variance 1. The purpose is to pull the values toward the near-linear region of the subsequent nonlinearity, increase the derivative, improve the flow of information in backpropagation, and speed up convergence. However, this reduces the network's expressive power. To compensate, each neuron gains two adjustable parameters, scale and shift, learned during training, which partially invert the normalization and restore expressive power. That is, the following scale-and-shift operation — effectively the inverse of the normalization — is applied to the transformed activation:
y(k) = γ(k)·x̂(k) + β(k)
The specific BN procedure is described in the paper as follows:

[Figure: Algorithm 1 (the Batch Normalizing Transform) from the paper]
The procedure is very clear — a step-by-step statement of the formulas above — so it is not explained further here; you should be able to follow it directly.
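The training-phase transform of this section can be sketched as a per-feature NumPy function (the mini-batch shape and ε value are illustrative assumptions):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BN over a mini-batch x of shape (m, n_features),
    using per-feature mini-batch statistics as in the paper's Algorithm 1."""
    mu = x.mean(axis=0)                      # mini-batch mean, per feature
    var = x.var(axis=0)                      # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    y = gamma * x_hat + beta                 # scale and shift
    return y, mu, var

rng = np.random.default_rng(2)
x = rng.normal(loc=-2.0, scale=3.0, size=(64, 4))  # m=64 instances, 4 neurons
gamma, beta = np.ones(4), np.zeros(4)

y, mu, var = batchnorm_forward(x, gamma, beta)
print(np.abs(y.mean(axis=0)).max() < 1e-9)       # per-feature means ~0
print(np.abs(y.std(axis=0) - 1.0).max() < 1e-3)  # per-feature stds ~1
```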

4. The inference process of BatchNorm

During training, BN can adjust the activation values using the several training instances in a mini-batch. During inference, however, the input is obviously a single instance; there are no other mini-batch instances to look at, so how should BN be applied? A single instance clearly cannot provide the mean and variance of an instance set. What to do?

Since no statistics can be obtained from mini-batch data at inference time, we need another way to obtain the required statistics, namely the mean and variance. We can use statistics computed over all training instances in place of the mini-batch statistics: global statistics were what we really wanted all along, and mini-batch statistics were only a simplification adopted because computing global ones at every step is too expensive, so at inference time we can use the global statistics directly.

That settles the range of data from which to obtain the statistics; the next question is how to compute the global mean and variance. It is very simple: every mini-batch training step already produces the mean and variance of its m training instances, so to get global statistics we only need to record the per-mini-batch means and variances and then take the corresponding mathematical expectation over them, i.e.:
E[x] = E_B[μ_B],  Var[x] = (m / (m − 1)) · E_B[σ_B²]
(At test time, the mean and variance used are those of the entire training set; in practice, these values are usually computed with a moving average maintained during training.)

For how to compute the moving average, see: Batch Normalization in Deep Learning_whitesilence's Blog-CSDN Blog
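A sketch of maintaining such running (moving-average) estimates during training — the momentum value and the simulated batch statistics are illustrative assumptions, not prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
momentum = 0.9                  # a typical default; frameworks differ
running_mean, running_var = 0.0, 1.0

# Each mini-batch's statistics update the running estimates:
for _ in range(500):
    batch = rng.normal(loc=-2.0, scale=0.5, size=32)  # illustrative stream
    running_mean = momentum * running_mean + (1 - momentum) * batch.mean()
    running_var = momentum * running_var + (1 - momentum) * batch.var()

# The estimates converge toward the population mean (-2) and variance (0.25),
# and are the statistics used at inference time.
print(running_mean, running_var)
```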

With the mean and variance available, and with each hidden-layer neuron's trained scale and shift parameters, the activation of each neuron can be transformed at inference time using the following formula:
y = (γ / √(Var[x] + ε)) · x + (β − γ·E[x] / √(Var[x] + ε))
This formula is in fact equivalent to the training-time pair of transforms

x̂(k) = (x(k) − E[x(k)]) / √(Var[x(k)] + ε),  y(k) = γ(k)·x̂(k) + β(k),

which can be verified by simple algebraic substitution. So why write it in this rearranged form? Presumably the authors' intent is that, in actual deployment, this variant reduces computation. Why? Because for each hidden-layer node, the two quantities
γ / √(Var[x] + ε)  and  β − γ·E[x] / √(Var[x] + ε)

are fixed values, so they can be computed once and saved in advance, then used directly at inference; this removes the per-activation subtraction and division of the original formula. For a single node the saving looks negligible at first glance, but when the number of hidden-layer nodes is large, the total computation saved becomes significant.

This is consistent with training: the training-set mean is subtracted first and the result divided by the standard deviation. Substituting the training-time x̂(k) into the formula makes the test-time result easy to verify.
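The precomputation argument can be sketched as follows (all statistics and parameters here are illustrative values, not from the paper):

```python
import numpy as np

# Illustrative per-neuron statistics and trained parameters (assumed values).
E_x, Var_x = -2.0, 0.25
gamma, beta, eps = 1.5, 0.3, 1e-5

# Precompute the two fixed quantities once, after training:
a = gamma / np.sqrt(Var_x + eps)   # fixed multiplier
b = beta - a * E_x                 # fixed offset

def bn_inference(x):
    # One multiply-add per activation; no subtraction/division at run time.
    return a * x + b

# Equivalent to normalizing first and then applying gamma and beta:
x = np.array([-3.0, -2.0, -1.0, 0.0])
y_ref = gamma * (x - E_x) / np.sqrt(Var_x + eps) + beta
print(np.allclose(bn_inference(x), y_ref))  # True
```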

5. Benefits of BatchNorm

Why is BatchNorm so celebrated? The key is that it works well. ① It greatly improves training speed and accelerates the convergence process. ② It can also improve classification accuracy; one explanation is that it acts as a regularizer similar to Dropout, helping prevent overfitting, so good results can be achieved even without Dropout. ③ In addition, tuning becomes much simpler: initialization matters less and a large learning rate can be used. All in all, such a simple transformation brings many benefits, which is why BN has become so popular.

6. What is the difference between mini-batch and batch in machine learning?

In machine learning and deep learning, "mini-batch" and "batch" are two commonly used terms, and there are some differences between them.

Mini-batch: A mini-batch is a smaller subset of data selected from the training data set. When training a model, the entire training data set is usually divided into multiple mini-batches, each containing a certain number of training samples — usually a power of 2, such as 32, 64, or 128. The model uses the samples in each mini-batch to perform forward propagation, compute the loss, and backpropagate, then updates its parameters based on the gradients from those samples. The main purpose of using mini-batches is to reduce computational overhead and memory usage and to improve training efficiency.

Batch: Batch refers to treating the entire training data set as one large batch. At each iteration, the model performs forward propagation, computes the loss, and backpropagates over the whole training set, then updates its parameters based on those gradients. Compared with mini-batch training, batch training may consume more memory and computing resources, because the entire data set has to be processed at once.

Therefore, the difference between mini-batch and batch lies in the amount of data processed per update: a mini-batch is a relatively small subset used for iterative updates during training, while a batch processes the entire training set at once. The choice between them depends on the size of the data set, computing-resource constraints, and training-efficiency requirements. In practice, mini-batch is by far the more common training method.
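The split into shuffled mini-batches described above can be sketched in a few lines (the array contents and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
dataset = np.arange(100)      # stand-in for 100 training examples
batch_size = 32               # a typical power-of-2 mini-batch size

perm = rng.permutation(len(dataset))          # reshuffle each epoch
batches = [dataset[perm[i:i + batch_size]]
           for i in range(0, len(dataset), batch_size)]

# "Batch" (full-batch) training would instead use all 100 examples per update.
print([len(b) for b in batches])  # [32, 32, 32, 4]
```

Note the last mini-batch is smaller when the data set size is not a multiple of the batch size; frameworks typically either keep it or drop it.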

Origin blog.csdn.net/zyq880625/article/details/134734915