Understanding Batch Normalization

Disclaimer: This is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/Oscar6280868/article/details/89516185

First, each layer of a neural network has an activation function. Taking the common sigmoid function as an example, its graph is shown in the figure below:
(Figure: the sigmoid activation function)
Imagine that the input to a neuron is 4: after passing through the sigmoid function, the output is y = 0.982. If the input is 20, the output after the sigmoid is y = 0.999. Clearly the output is no longer sensitive to the input: the input grows from 4 to 20, yet the output increases by less than 0.02, which is hardly any change.
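As a quick numerical check, here is a minimal sketch (assuming NumPy; the `sigmoid` helper is written here purely for illustration and is not code from the original post):

```python
import numpy as np

def sigmoid(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(4.0))   # ~0.982
print(sigmoid(20.0))  # ~0.999999998, essentially 1 (rounded to 0.999 above)
```

We can see a neuron's sensitivity to its input even more intuitively through the derivative of the sigmoid function, shown in the figure below: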
(Figure: the derivative of the sigmoid function)
From the figure we can see that the gradient of the sigmoid function is almost zero over most of its domain, and the region where it is sensitive to its input is concentrated in $[-2, 2]$. We therefore want the inputs to the activation function to fall into this sensitive region, and to achieve that we add a batch normalization step between the fully connected layer and the activation function to adjust the distribution of the input data. In other words, for each hidden-layer neuron, BN takes an input distribution that has gradually drifted toward the saturation region of the nonlinearity and pulls it back to a normal distribution with mean 0 and variance 1, so that the inputs fall into the region where the neuron is sensitive to the data and the vanishing-gradient problem is avoided. With BN the gradients of the network stay relatively large, and the network can maintain a good convergence rate.
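To make the sensitive region $[-2, 2]$ concrete, here is a small sketch of the sigmoid's derivative, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ (again an illustrative snippet assuming NumPy and the `sigmoid` helper above, not code from the original post):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 4.0, 10.0]:
    print(x, sigmoid_grad(x))
# ~0.25 at x=0, ~0.10 at x=2, ~0.018 at x=4, ~4.5e-5 at x=10:
# outside roughly [-2, 2] the gradient quickly approaches zero.
```

With that motivation, let's look at what the batch normalization procedure looks like: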

Input:

Input data distribution over a mini-batch: $\mathrm{B} = \{x_1, \dots, x_m\}$; parameters to be learned: $\gamma, \beta$.
Here $\gamma$ denotes the scaling factor of the data distribution and $\beta$ denotes its offset.

Output:

The data distribution after BN: $\mathrm{BN}_{\gamma,\beta}(x_i)$

Process:

1. Calculate the mean value of the input data (mini-batch mean):

$$\mu_{\mathrm{B}} \leftarrow \frac{1}{m}\sum_{i = 1}^{m} x_i$$

2. Calculate the variance of the input data (mini-batch variance):

$$\sigma_{\mathrm{B}}^2 \leftarrow \frac{1}{m}\sum_{i = 1}^{m} (x_i - \mu_{\mathrm{B}})^2$$

3. Standardization (normalization):

$$\hat{x}_i \leftarrow \frac{x_i - \mu_{\mathrm{B}}}{\sqrt{\sigma_{\mathrm{B}}^2 + \varepsilon}}$$
Here the data are standardized by subtracting the mini-batch mean and dividing by the square root of the variance; the small constant $\varepsilon$ in the denominator is there to avoid division by zero.

4. Scaling and shifting (scale and shift):

$$y_i \leftarrow \gamma \hat{x}_i + \beta = \mathrm{BN}_{\gamma,\beta}(x_i)$$
After the first three steps, the pre-activation input $x$ of each neuron has been transformed into a normal distribution with mean 0 and variance 1, so that the inputs fall into the sensitive region of the activation function and training of the neural network converges faster. Doing only this, however, would weaken the expressive power of the network, so the fourth step scales and shifts the normalized values using two learnable parameters, $\gamma$ and $\beta$. They apply an inverse transform to the normalized activations and restore the representational capacity of the network; written per dimension $k$, this inverse-transform operation is:
$$y^{(k)} = \gamma^{(k)} \hat{x}_i^{(k)} + \beta^{(k)}$$
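Putting the four steps together, here is a minimal NumPy sketch of the forward pass described above (the function name `bn_forward` and the value of `eps` are illustrative choices, not taken from the original post):

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization forward pass for a mini-batch x of shape (m, d).

    Follows the four steps above: mini-batch mean, mini-batch variance,
    normalization, then scale and shift with the learnable gamma and beta.
    """
    mu = x.mean(axis=0)                    # step 1: mini-batch mean
    var = x.var(axis=0)                    # step 2: mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize
    y = gamma * x_hat + beta               # step 4: scale and shift
    return y

# Usage: a batch whose inputs sit deep in the sigmoid's saturation region.
x = np.random.randn(32, 4) * 5.0 + 10.0
gamma = np.ones(4)
beta = np.zeros(4)
y = bn_forward(x, gamma, beta)
print(y.mean(axis=0))  # ~0 per dimension
print(y.std(axis=0))   # ~1 per dimension (with gamma = 1, beta = 0)
```

With $\gamma = 1$ and $\beta = 0$ the output is just the standardized $\hat{x}_i$; during training these two parameters are learned together with the other network weights.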
That completes the batch normalization procedure. With BN, a neural network has the following advantages:

  1. It greatly speeds up the convergence of the neural network.
  2. It can improve classification performance and helps prevent over-fitting.
  3. Parameter tuning becomes simpler: the network is less sensitive to weight initialization, and a larger learning rate can be used.

The above is a brief summary of BN, which is why BN is now so often added to neural networks. I hope this blog post helps everyone deepen their understanding of BN. Thank you.
