Deep understanding of Batch Normalization

Reprinted: https://www.cnblogs.com/guoyaohua/p/8724433.html

Batch Normalization, as an important recent achievement of DL, has had its effectiveness and importance widely demonstrated. Although the theory behind some of its details has still not been fully explained, practice has shown that it simply works well; do not forget that DL, ever since Hinton's pre-training of deep networks, has been a field where practical experience runs ahead of theoretical analysis. This article is an introduction to the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".

  There is a very important assumption in the field of machine learning: the IID (independent and identically distributed) assumption, namely that the training data and the test data follow the same distribution. This is the basic guarantee that a model trained on the training data will perform well on the test set. What, then, is the role of BatchNorm? BatchNorm keeps the input of each layer of a deep neural network in the same distribution throughout training.

  The next step is to understand what BN is.

  Why does a deep neural network become harder to train and slower to converge as the network deepens? This is a good question that gets close to the essence of the DL field. Many works address this problem, such as the ReLU activation function and Residual Networks; BN essentially explains and solves it from a different perspective.

1. "Internal Covariate Shift" problem

  As can be seen from the title of the paper, BN is meant to solve the "Internal Covariate Shift" problem, so we must first understand what "Internal Covariate Shift" is.

  The paper first explains two advantages of Mini-Batch SGD over one-example SGD: the gradient update direction is more accurate, and parallel computation is faster. (Why mention this? Because BatchNorm is built on top of Mini-Batch SGD, so Mini-Batch SGD gets praised first, which is also simply true.) It then complains about a disadvantage of SGD training: tuning the hyperparameters is very troublesome. (The author is implying that using BN can fix many of SGD's shortcomings.)

  The paper then introduces the concept of covariate shift: if the distribution of the input values X in the instance set <X,Y> of an ML system keeps changing, this violates the IID assumption, and it becomes hard for the network model to learn a stable regularity; without bringing in transfer learning this can hardly be handled, because our ML system would have to keep adapting to the changing distribution. For a deep learning network with many hidden layers, the parameters of each layer keep changing during training, so every hidden layer faces this covariate shift problem: the distribution of each hidden layer's input keeps changing as training proceeds. This is the so-called "Internal Covariate Shift". "Internal" refers to the hidden layers of the deep network: the shift happens inside the network, rather than being the covariate shift problem that occurs only at the input layer.

  This leads to the basic idea of BatchNorm: can the distribution of the activation inputs of each hidden-layer node be fixed? That would avoid the "Internal Covariate Shift" problem.

  BN was not simply dreamed up out of thin air; it has an instructive source. Earlier work in image processing had shown that if the input images are whitened - that is, transformed so that the data follow a normal distribution with zero mean and unit variance - then the neural network converges faster. The BN authors then reasoned: the image is the input layer of the deep neural network, and whitening it speeds up convergence; but in a deep network, the neurons of one hidden layer are the input to the next layer, which means every hidden layer is itself an "input layer", only relative to the layer that follows it. So can we whiten every hidden layer as well? This is the original idea that inspired BN, and BN does exactly that: it can be understood as a simplified whitening operation applied to the activation value of each hidden-layer neuron in the deep network.

2. The essential idea of BatchNorm

  The basic idea of BN is actually quite intuitive. The activation input value of a deep neural network before the nonlinear transformation (that is, x = WU + B, where U is the input) gradually shifts or changes its distribution as the network deepens and as training proceeds. The reason training converges slowly is that the overall distribution gradually drifts toward the upper and lower ends of the value range of the nonlinear function (for the Sigmoid function, this means the activation input WU + B takes large negative or large positive values), which causes the gradients of the lower layers to vanish during backpropagation. This is the essential reason why training a deep neural network converges more and more slowly. BN uses a standardization step to force the input distribution of every neuron in every layer back to a standard normal distribution with mean 0 and variance 1; in effect, the increasingly skewed distribution is pulled back to a more standard one, so that the activation input falls into the region where the nonlinear function is sensitive to its input. Small changes in the input then lead to larger changes in the loss function, meaning larger gradients, which avoids the vanishing-gradient problem; and larger gradients mean faster learning and convergence, which greatly speeds up training.

  THAT'S IT. In one sentence: for each hidden-layer neuron, the input distribution, which has gradually been pushed toward the saturated ends of the nonlinear function's value range, is forced back to a roughly standard normal distribution with mean 0 and variance 1, so that the input to the nonlinear transformation falls into the region that is sensitive to the input, avoiding the vanishing-gradient problem. Because the gradient can always be kept relatively large, the parameters of the network are adjusted efficiently: the steps toward the optimum of the loss function are large, and convergence is fast. BN is, in the end, just such a mechanism. The method is very simple, and the reasoning is quite profound.
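  To make the standardization step concrete, here is a minimal NumPy sketch (the batch size, layer width, and the skewed input distribution are made-up illustration values, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are the pre-activations x = WU + B of one hidden layer
# for a mini-batch of 32 examples and 10 neurons: badly shifted and scaled.
x = rng.normal(loc=-6.0, scale=3.0, size=(32, 10))

# Standardize each neuron (each column) across the mini-batch.
x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

print(x.mean(), x.std())          # roughly -6 and 3
print(x_hat.mean(), x_hat.std())  # roughly 0 and 1
```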

  The above is still abstract, and the following is a more vivid expression of what this adjustment means.

  Figure 1 Several normal distributions

  Assume the original activation input x of some hidden-layer neuron follows a normal distribution with mean -2 and variance 0.5, corresponding to the light blue curve at the far left of the figure above. After BN it is converted to a normal distribution with mean 0 and variance 1 (the dark blue curve in the figure). What does this mean? It means the distribution of the input x is shifted to the right by 2 (the change in the mean) and its curve becomes flatter (the increase in variance). The point of this figure is that BN shifts the mean and compresses or stretches the sharpness of the curve, turning the activation input distribution of each hidden-layer neuron into a normal distribution with mean 0 and variance 1.

  So what is the use of adjusting the activation input x to this normal distribution? First, let's look at what the standard normal distribution with a mean of 0 and a variance of 1 means:

 

Figure 2 The standard normal distribution with mean 0 and variance 1.

 

  It means that with about 68% probability the value of x falls within one standard deviation, that is, within the range [-1,1], and with about 95% probability it falls within two standard deviations, that is, within the range [-2,2]. So what does that imply? We know the activation value is x = WU + B, where U is the actual input and x is the pre-activation of a certain neuron. Assuming the nonlinearity is sigmoid, look at the graph of sigmoid(x):

Figure 3. Sigmoid(x)

The derivative of sigmoid(x) is G' = f(x)*(1-f(x)); because f(x) = sigmoid(x) lies between 0 and 1, G' lies between 0 and 0.25. The corresponding graph is as follows:

Figure 4 Sigmoid(x) derivative graph

  Suppose that without BN the original distribution of x is normal with mean -6 and variance 1. Then 95% of its values fall within [-8,-4], and the corresponding value of sigmoid(x) is clearly close to 0. This is a typical gradient saturation region, where the gradient changes very slowly. Why is it a saturation region? Looking at the derivative, whenever sigmoid(x) is close to 0 or close to 1, the derivative is close to 0, which means the gradient is tiny or vanishes altogether. After BN, the mean is 0 and the variance is 1, so 95% of the x values fall within the interval [-2,2]. This is clearly the region where sigmoid(x) is close to a linear transformation: a small change in x causes a noticeably larger change in the value of the nonlinear function, i.e. a larger gradient, and in the derivative graph this is the region where the derivative is clearly greater than 0, the unsaturated region.
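  A quick numerical check of this saturation argument (a small sketch using the sigmoid derivative formula above; the sample points are chosen to match the [-8,-4] and [-2,2] intervals discussed here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # G' = f(x) * (1 - f(x)), at most 0.25

# Pre-activations drawn from N(-6, 1) sit deep in the saturation zone ...
print(sigmoid_grad(np.array([-8.0, -6.0, -4.0])))  # ~[3e-4, 2.5e-3, 1.8e-2]
# ... while values in [-2, 2] keep the derivative close to its maximum of 0.25.
print(sigmoid_grad(np.array([-2.0, 0.0, 2.0])))    # ~[0.105, 0.25, 0.105]
```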

  From the pictures above, what is BN actually doing? It takes the hidden-layer neuron's activation input x = WU + B, whose distribution may have drifted anywhere, and pulls it back to a normal distribution with mean 0 and variance 1: the center of the original distribution is shifted left or right so that the mean becomes 0, and the shape is stretched or compressed so that the variance becomes 1. In other words, after BN most activation values fall into the near-linear region of the nonlinear function, where the derivative is far from the saturation region, which accelerates training convergence.

  However, a reader with even a little knowledge of neural networks will immediately ask: if everything goes through BN, isn't that the same as replacing the nonlinear function with a linear one? What would that mean? We know that stacking linear transformations makes depth pointless: a multi-layer linear network is equivalent to a single-layer linear network. The expressive power of the network would decline, and the meaning of depth would be lost. Therefore, in order to preserve nonlinearity, BN applies a scale-and-shift operation (y = scale*x + shift) to the transformed x that has mean 0 and variance 1. Each neuron gets two extra parameters, scale and shift, which are learned during training; they move the value away from the standard normal distribution, shifting it left or right and making the distribution wider or narrower, by an amount that differs from neuron to neuron. This is equivalent to moving the input of the nonlinear function from the linear region around the center somewhat toward the nonlinear region. The core idea is presumably to find a better balance between linearity and nonlinearity: enjoy the strong expressive power of the nonlinearity, while avoiding pushing values too far into the two saturated ends of the nonlinear region, which would make the network converge too slowly. Of course, this is my own understanding; the authors of the paper do not say so explicitly. But the scale and shift operations are clearly debatable, because in the extreme case described in the paper, the transformed x could be adjusted by scale and shift right back to its untransformed state. Wouldn't that mean we have gone in a circle and are back at the original "Internal Covariate Shift" problem? I feel the authors never clearly explain the theoretical justification for the scale and shift operations.
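  The following toy sketch illustrates the scale-and-shift step and the "going in a circle" concern: the per-neuron values for scale and shift are made up, and in the extreme case where the network learned scale = std(x) and shift = mean(x), the normalization would be undone exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=-2.0, scale=0.5, size=(32, 4))             # raw pre-activations
x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)  # normalized

# Learnable per-neuron parameters (fixed made-up numbers here, for illustration only).
scale = np.full(4, 1.5)
shift = np.full(4, 0.3)
y = scale * x_hat + shift                                     # y = scale*x + shift

# Extreme case: if training drove scale -> std(x) and shift -> mean(x),
# the normalization would be undone and x recovered.
y_identity = np.sqrt(x.var(axis=0) + 1e-5) * x_hat + x.mean(axis=0)
print(np.allclose(y_identity, x))                             # True
```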

3. How to do BatchNorm during training

  The above is an abstract analysis and explanation of BN. How is BN actually done under Mini-Batch SGD? This part of the paper is in fact very clear and easy to understand; for completeness, here is a brief walk-through.

  Assume that in a deep neural network two adjacent layers have the following structure:

  Figure 5 Two layers of DNN

  To apply BN to the activation value of each hidden-layer neuron, you can imagine that each hidden layer gets an extra BN operation layer, placed after the linear activation X = WU + B has been computed and before the nonlinear transformation, as illustrated below:

  Figure 6. BN operation

  For Mini-Batch SGD, one training step uses m training examples. The specific BN operation applies the following transformation to the activation value of each neuron in the hidden layer:

  x̂(k) = ( x(k) − E[x(k)] ) / √Var[x(k)]

  Note that x(k) for a neuron in layer t does not refer to the original input; that is, it is not the output of the neurons in layer t-1, but the linear activation x = WU + B of this neuron in layer t, where U is the output of the neurons in layer t-1. The transformation means: for a certain neuron, take its original activation x, subtract the mean E(x) of the m activations of that neuron obtained from the m instances in the mini-batch, and divide by the square root of the variance Var(x) obtained from the same m activations.

  As mentioned above, after this transformation the activation x of a certain neuron follows a normal distribution with mean 0 and variance 1. The purpose is to pull the values into the near-linear region of the subsequent nonlinear transformation, enlarge the derivative, enhance the flow of information in backpropagation, and speed up convergence. But this reduces the expressive power of the network. To prevent that, each neuron adds two adjustable parameters, scale and shift, which are learned during training and used to transform the normalized activation back, restoring the network's expressive power; that is, the following scale-and-shift operation, effectively the inverse of the normalization, is applied to the transformed activation:

  y(k) = scale(k) · x̂(k) + shift(k)

  The specific operation process of BN is as described in the paper:

 

  The process is very clear: it is simply a step-by-step description of the formulas above, so it should be possible to follow it directly without further explanation.
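  As a rough stand-in for the paper's Batch Normalizing Transform (Algorithm 1), here is a minimal NumPy sketch of the training-time forward pass only; epsilon, the shapes, and the random test data are assumptions for illustration:

```python
import numpy as np

def batchnorm_forward(x, scale, shift, eps=1e-5):
    """Batch Normalizing Transform for one layer's pre-activations.

    x     : (m, d) mini-batch of linear activations x = WU + B
    scale : (d,)  learnable per-neuron scale  (gamma in the paper)
    shift : (d,)  learnable per-neuron shift  (beta in the paper)
    """
    mu = x.mean(axis=0)                      # mini-batch mean, per neuron
    var = x.var(axis=0)                      # mini-batch variance, per neuron
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to mean 0, variance 1
    y = scale * x_hat + shift                # scale and shift
    return y, mu, var

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 100))               # m = 64 examples, 100 neurons
y, mu, var = batchnorm_forward(x, scale=np.ones(100), shift=np.zeros(100))
print(y.mean(axis=0)[:3], y.var(axis=0)[:3]) # close to 0 and 1 per neuron
```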

4. The inference process of BatchNorm

  During training, BN can adjust the activation values based on the several training examples in a Mini-Batch, but at inference time there is obviously only a single input instance and no Mini-Batch of other instances to look at. How do we apply BN to the input then? A single instance clearly cannot provide the mean and variance of an instance set, so what can be done?

  Since the statistics cannot be obtained from Mini-Batch data at inference time, we have to get them - the mean and the variance - some other way. We can use the statistics computed over all training instances in place of the statistics computed from the m training examples of a Mini-Batch. After all, the original intent was to use global statistics; the Mini-Batch statistics were only a simplification adopted because the full computation is too expensive, so at inference time the global statistics can simply be used directly.

  Having settled on where the statistics come from, the next question is how to obtain the mean and variance. It is very simple: every Mini-Batch training step already computes the mean and variance of its m training examples. Since we now want global statistics, we just record the mean and variance of each Mini-Batch and then take the expectation over them (with the usual unbiased correction for the variance) to obtain the global statistics, namely:

  E[x(k)] ← E_B[ μ_B(k) ],    Var[x(k)] ← ( m / (m−1) ) · E_B[ σ_B²(k) ]
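  A small sketch of this averaging step, assuming the per-mini-batch means and (biased) variances of one layer were recorded during training (note that many frameworks instead keep an exponential moving average, which is an implementation choice rather than something the paper requires):

```python
import numpy as np

def global_statistics(batch_means, batch_vars, m):
    """Combine recorded per-mini-batch statistics into global statistics.

    batch_means, batch_vars : lists of (d,) arrays, one pair per mini-batch
    m                       : mini-batch size used during training
    """
    mus = np.stack(batch_means)        # (num_batches, d)
    sigmas2 = np.stack(batch_vars)     # (num_batches, d)
    e_x = mus.mean(axis=0)                         # E[x]  = E_B[mu_B]
    var_x = (m / (m - 1)) * sigmas2.mean(axis=0)   # Var[x] = m/(m-1) * E_B[sigma_B^2]
    return e_x, var_x

# Example with made-up statistics from 100 mini-batches of size m = 64, d = 10 neurons.
rng = np.random.default_rng(3)
means = [rng.normal(size=10) for _ in range(100)]
vars_ = [rng.uniform(0.5, 2.0, size=10) for _ in range(100)]
e_x, var_x = global_statistics(means, vars_, m=64)
```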

  With the global mean and variance in hand, and with the trained scale and shift parameters already available for every hidden-layer neuron, each activation can be transformed at inference time. BN uses the following form:

  y(k) = ( scale(k) / √( Var[x(k)] + ε ) ) · x(k) + ( shift(k) − scale(k) · E[x(k)] / √( Var[x(k)] + ε ) )

  This formula is in fact equivalent to

  y(k) = scale(k) · ( x(k) − E[x(k)] ) / √( Var[x(k)] + ε ) + shift(k)

and the equivalence can be checked by a simple rearrangement. So why write it in the transformed form? My guess at the author's intent is that, in actual operation, the variant form reduces the amount of computation. Why? Because for each hidden-layer node the two quantities

  scale(k) / √( Var[x(k)] + ε )    and    shift(k) − scale(k) · E[x(k)] / √( Var[x(k)] + ε )

  are both fixed values once training has finished, so they can be computed and stored in advance and used directly during inference; each activation then needs only one multiplication and one addition, with no subtraction or division. The saving per node looks small at first glance, but when the number of hidden-layer nodes is large, the total amount of computation saved adds up.
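  A small sketch of this pre-computation, with made-up statistics and parameters, confirming that the folded form matches the direct form:

```python
import numpy as np

def fold_bn_params(scale, shift, e_x, var_x, eps=1e-5):
    """Pre-compute the two per-neuron constants used at inference time."""
    a = scale / np.sqrt(var_x + eps)   # multiplies x
    b = shift - a * e_x                # added afterwards
    return a, b

rng = np.random.default_rng(2)
d = 8
scale, shift = rng.normal(size=d), rng.normal(size=d)
e_x, var_x = rng.normal(size=d), rng.uniform(0.5, 2.0, size=d)
a, b = fold_bn_params(scale, shift, e_x, var_x)

x = rng.normal(size=(5, d))
y_folded = a * x + b                                            # variant form
y_direct = scale * (x - e_x) / np.sqrt(var_x + 1e-5) + shift    # original form
print(np.allclose(y_folded, y_direct))                          # True
```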

5. The benefits of BatchNorm

  Why is BatchNorm so celebrated? The key is that it works. ① It greatly improves training speed and greatly accelerates convergence; ② it can also improve classification performance - one explanation is that it acts as a regularizer, similar in spirit to Dropout, preventing overfitting, so that a comparable effect can be achieved even without Dropout; ③ in addition, hyperparameter tuning becomes much simpler, the requirements on initialization are far less strict, and a large learning rate can be used. All in all, such a simple transformation brings so many benefits, which is why BN became popular so quickly.
