Batch Normalization (BN) in deep learning

     BN, as an important recent achievement in deep learning, has had its effectiveness and importance widely demonstrated. Although theory cannot yet fully explain why it works, practice has shown that it is easy to use and genuinely effective.

First, what is BN

     Machine learning relies on an important assumption: the IID (independent and identically distributed) hypothesis, namely that the training data and the test data follow the same distribution. This is a basic guarantee that a model trained on the training data will perform well on the test set. BN makes the input distribution of each layer of a deep neural network stay the same throughout training.

Second, why use BN

     According to the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", BN is mainly intended to solve the problem of Internal Covariate Shift. So what is Internal Covariate Shift? It can be explained as follows: if the distribution of the input values X in the instance set <X, Y> of an ML system keeps changing, the IID assumption no longer holds, and it becomes hard for the model to learn a stable rule. For a deep network with many hidden layers, the parameters of every layer keep changing during training, so each hidden layer faces its own covariate shift: the distribution of its inputs keeps drifting as training proceeds. This is called "Internal Covariate Shift"; "internal" emphasizes that it happens inside the hidden layers of the deep network, rather than being a covariate shift problem that only occurs at the input layer. Hence BN was proposed: fix the distribution of the activation input of every hidden unit.

      Image whitening transforms the input data to a normal distribution with mean 0 and variance 1. Adding a whitening operation at the input layer of a neural network speeds up convergence, which suggests a further idea: if we apply something similar at every layer of the network, shouldn't convergence become even faster? This is the intuition from which BN was first proposed.
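As a small sketch of the standardization part of whitening (plain NumPy; the helper name `whiten` and the `eps` constant are my own additions, and full whitening would additionally decorrelate the features, while the per-feature standardization below is the part BN builds on):

```python
import numpy as np

def whiten(X, eps=1e-5):
    """Standardize each input feature to zero mean and unit variance.

    X: array of shape (num_examples, num_features).
    eps avoids division by zero for near-constant features.
    """
    mean = X.mean(axis=0)
    var = X.var(axis=0)
    return (X - mean) / np.sqrt(var + eps)

# Example: raw inputs with very different scales become comparably distributed.
X = np.array([[100.0, 0.01],
              [120.0, 0.03],
              [ 80.0, 0.02]])
print(whiten(X))
```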

       The basic idea of BN: as a deep neural network gets deeper, or as training proceeds, the distribution of the values fed into each nonlinear activation (that is, x = WU + B, where U is the output of the previous layer) gradually drifts or shifts. The reason training converges slowly is that this distribution gradually moves toward the two ends of the value range of the nonlinear function, close to its upper and lower limits.

       For the sigmoid activation, if the mean of the activation input WU + B drifts toward large negative or large positive values, the gradients flowing backward through the network become tiny; this vanishing gradient is, in essence, the reason deep networks converge slowly. What BN does is use a normalization step to force the distribution of each neuron's activation input back to a standard normal distribution with mean 0 and variance 1. In effect, it pulls an increasingly skewed distribution back toward a standard one, so that the activation input falls in the region where the nonlinear function is sensitive to its input. There, a small change in the input produces a large change in the loss, which means the gradient becomes larger and the vanishing-gradient problem is avoided; a larger gradient means faster learning and much faster convergence. In the final analysis, the BN mechanism is very simple, yet the underlying idea is quite profound.
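A minimal sketch of just this normalization step (NumPy; per-neuron statistics computed over the mini-batch, with a small `eps` assumed for numerical stability; the learnable scale and shift discussed later are left out here):

```python
import numpy as np

def bn_normalize(x, eps=1e-5):
    """Normalize a mini-batch of pre-activations x = WU + B per neuron.

    x: shape (batch_size, num_neurons). Each column (one neuron's
    activation input over the batch) is pulled back to mean 0, variance 1.
    """
    mu = x.mean(axis=0)      # per-neuron batch mean
    var = x.var(axis=0)      # per-neuron batch variance
    return (x - mu) / np.sqrt(var + eps)

# A skewed batch of pre-activations for one neuron:
x = np.array([[-7.2], [-5.8], [-6.5], [-4.9]])
x_hat = bn_normalize(x)
print(x_hat.mean(), x_hat.var())   # approximately 0 and 1
```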

      What is the use of adjusting the activation input x back to this normal distribution? First, look at what a standard normal distribution with mean 0 and variance 1 means:

Figure 1: the standard normal distribution (mean 0, variance 1)

  This means that x falls within one standard deviation of the mean, i.e. in the interval [-1, 1], with about 68% probability, and within two standard deviations, i.e. in [-2, 2], with about 95% probability. So what does this imply? Recall that the activation input is x = WU + B, where U is the real input and x is the value fed into the neuron's activation. Assuming the nonlinearity is the sigmoid, look at the graph of sigmoid(x):
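A quick numerical check of these ranges (plain Python, using the identity P(|X| < k) = erf(k / sqrt(2)) for a standard normal X):

```python
import math

def prob_within(k):
    """P(|X| < k) for X ~ N(0, 1), via the error function."""
    return math.erf(k / math.sqrt(2))

print(prob_within(1))   # ~0.6827: about 68% within one standard deviation
print(prob_within(2))   # ~0.9545: about 95% within two standard deviations
```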

Figure 2: sigmoid(x)

The derivative of sigmoid(x) is f'(x) = f(x) * (1 - f(x)). Since f(x) = sigmoid(x) lies between 0 and 1, f'(x) lies between 0 and 0.25; its graph is shown below:
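To make this concrete, a small sketch (NumPy) that evaluates sigmoid(x) and its derivative at a few points:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """f'(x) = f(x) * (1 - f(x)); peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    print(x, sigmoid(x), sigmoid_grad(x))
# Near x = 0 the derivative is ~0.25; at |x| = 6 it is already ~0.0025.
```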

Figure 3: the derivative of sigmoid(x)

  Suppose that before any BN adjustment the original normal distribution of x has mean -6 and variance 1. Then 95% of the values fall in the interval [-8, -4], and the corresponding sigmoid(x) values are all close to 0. This is the typical gradient saturation region, where the gradient changes very slowly. Why is it the saturation region? Looking at sigmoid(x), when its value is close to 0 or close to 1, the corresponding derivative is close to 0, which means the gradient is tiny or vanishes altogether. Suppose that after BN the mean becomes 0 and the variance 1; then 95% of the values of x fall in [-2, 2]. Clearly, over this interval sigmoid(x) is close to a linear transformation, so a small change in x produces a large change in the nonlinear function's value, i.e. the gradient is large; in the derivative plot this corresponds to the region where the derivative is clearly greater than 0, the non-saturated region.
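As an illustrative simulation of this before/after comparison (my own example, not from the original post), assuming pre-activations drawn from N(-6, 1):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Pre-activations drawn from N(-6, 1): deep in the saturation region.
rng = np.random.default_rng(0)
x = rng.normal(loc=-6.0, scale=1.0, size=10000)
grad_before = sigmoid(x) * (1 - sigmoid(x))

# After normalizing to mean 0 and variance 1, most values land in the
# near-linear region of the sigmoid.
x_hat = (x - x.mean()) / x.std()
grad_after = sigmoid(x_hat) * (1 - sigmoid(x_hat))

print(grad_before.mean())   # tiny: the gradient has nearly vanished
print(grad_after.mean())    # roughly 0.2: the gradient is usable again
```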

  From the figures above you should be able to see what BN is doing: it takes a hidden neuron's activation input x = WU + B, whose distribution has drifted, and pulls it back to a normal distribution with mean 0 and variance 1. That is, it shifts a normal distribution whose center has moved left or right back to mean 0, and stretches or squeezes its shape back to variance 1. What does this mean? It means that after BN, most activation input values fall in the near-linear region of the nonlinear function, far away from the saturation region of its derivative, which accelerates the convergence of the training process.

  But reading this far, anyone with a little understanding of neural networks will raise an obvious question: if everything is normalized by BN in this way, isn't that almost the same as replacing the nonlinear function with a linear one? What would that mean? If each layer's transformation is effectively linear, depth becomes meaningless, because a multi-layer linear network is equivalent to a single-layer one; the network's expressive power collapses. Therefore, to preserve some nonlinearity, after transforming x to mean 0 and variance 1, BN also applies a scale and shift operation (y = scale * x + shift). Each neuron adds two parameters, scale and shift, which are learned during training. Their meaning is that the normalized value is moved a little to the left or right of the standard normal and made a little fatter or thinner, and the amount of movement is different for every neuron. This is equivalent to pushing the value from the purely linear region around the center of the nonlinear function partway back toward the nonlinear region. The core idea is to find a good balance point between linearity and nonlinearity: keep the benefit of the nonlinearity's expressive power, while avoiding being pushed so far toward the two saturated ends that the network converges too slowly.
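Putting the normalization and the learnable scale and shift together, a training-time forward pass might look like the following sketch (NumPy; the names `batch_norm_forward`, `gamma`, and `beta` are my own, and the running statistics needed at inference time are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """BN forward pass for one layer during training.

    x:     (batch_size, num_neurons) pre-activations
    gamma: (num_neurons,) learnable scale
    beta:  (num_neurons,) learnable shift
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # standardize per neuron
    return gamma * x_hat + beta             # y = scale * x_hat + shift

# gamma and beta start at 1 and 0 and are then learned by backpropagation,
# letting the network move each neuron's distribution away from the purely
# linear center if that helps.
x = np.random.randn(32, 4) * 3.0 - 6.0
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))
```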

     

      


Origin www.cnblogs.com/jimchen1218/p/12186652.html