Learning Batch Normalization by writing code

1. Why normalization

Normalization refers to converting data of different types and value ranges into a common range according to certain rules, so that comparing and processing data on the same scale becomes more reasonable and meaningful. Normalization removes differences in scale (dimension) between features, so that every feature contributes comparably when weights are computed; it also speeds up the convergence of the algorithm and can improve the prediction accuracy of the model.

Two common methods of normalization:

1. Min-max ([0, 1]) normalization, which maps the result values into the range [0, 1].

x^{*}=\frac{x-\min}{\max-\min}

2. Z-score standardization (normalization toward a standard normal distribution): the original data are standardized using their mean and standard deviation. The processed data follow the standard normal distribution, that is, mean 0 and standard deviation 1.

x^{*}=\frac{x-\mu}{\sigma}, where \mu is the mean and \sigma is the standard deviation. A short NumPy sketch of both methods follows below.
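
A minimal NumPy sketch of both methods, using a made-up example array:

import numpy as np

x = np.array([1.0, 2.0, 5.0, 10.0])  # made-up example data

# Min-max normalization: map values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: zero mean, unit standard deviation
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)  # [0.    0.111 0.444 1.   ] (approximately)
print(x_zscore)  # mean ~0, standard deviation ~1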

2. The problem solved by BN: Internal Covariate Shift

Once a network starts training, its parameters are updated. Except for the input layer (whose data we have already normalized sample by sample), the distribution of the inputs to every subsequent layer keeps changing: during training, an update of the previous layer's parameters changes the distribution of the next layer's inputs. Take the second layer as an example: its input is computed from the first layer's parameters and the original input, and since the first layer's parameters keep changing throughout training, the input distribution of every following layer inevitably changes as well (you can think of it as a chain reaction). This change of the data distribution in the intermediate layers during training is called "Internal Covariate Shift". BN was proposed to address exactly this problem.

3. Understanding the principle of Batch Normalization

A standard normalization step subtracts the mean and divides by the standard deviation. What effect does this operation have? Consider the following figure.

In the left part of the figure the input data are unprocessed, and the curve is a sigmoid function. If the data lie in a region where the gradient is small, learning will be very slow or even stall for a long time. After subtracting the mean and dividing by the standard deviation, the data are moved toward the central region, as shown on the right. For most activation functions this is the region where the gradient is largest, or at least nonzero (as with ReLU), so normalization can be seen as an effective way to fight vanishing gradients. That is the case for one layer; if we do this for the data of every layer, the distributions always stay in the region the activation is sensitive to, which is equivalent to no longer having to worry about shifting data distributions, and training becomes more efficient.
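
To make the "sensitive region" argument concrete, here is a minimal sketch comparing sigmoid gradients before and after standardization, using a made-up, badly scaled input:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # derivative of the sigmoid

# Made-up inputs that sit far from 0, i.e. in the saturated region
x = np.array([6.0, 7.5, 9.0, 12.0])
x_norm = (x - x.mean()) / x.std()  # subtract mean, divide by standard deviation

print(sigmoid_grad(x))       # tiny gradients (saturated region)
print(sigmoid_grad(x_norm))  # much larger gradients, near the 0.25 maximum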

So why is there a fourth step? Isn't subtracting the mean and dividing by the standard deviation enough to get the desired effect? Think about it: the distribution obtained this way is (approximately) a standard normal distribution. Can we assume that this is the best distribution, or the one that best reflects the feature distribution of our training samples? No. The data themselves may be strongly asymmetric, or the activation function may not work best on data with variance 1. With the sigmoid, for example, the function is nearly linear between -1 and 1, so the nonlinear transformation can hardly take effect; in other words, simply subtracting the mean and dividing by the standard deviation may actually weaken the expressive power of the network. To handle this situation, step 4 is added after the first three steps to complete real batch normalization.

The essence of BN is to use learnable parameters to adjust the variance and the mean of the normalized data, so that the new distribution better matches the real distribution of the data and the nonlinear expressive power of the model is preserved. In the extreme case, when γ equals the standard deviation of the mini-batch and β equals its mean, the data after batch normalization are exactly the same as the input; in general, of course, they are not.
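
A minimal sketch of this extreme case, with made-up data: choosing γ equal to the batch standard deviation and β equal to the batch mean makes the BN output recover the original input.

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # made-up mini-batch
eps = 1e-5

mu, var = x.mean(), x.var()
x_hat = (x - mu) / np.sqrt(var + eps)  # steps 1-3: normalize

gamma, beta = np.sqrt(var + eps), mu   # extreme choice of the learnable parameters
y = gamma * x_hat + beta               # step 4: scale and shift

print(np.allclose(y, x))               # True: BN has undone the normalization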

4. What does BN do?

Algorithm process:

1. Compute the mean μ of each mini-batch along the channel dimension.
2. Compute the variance σ² of each mini-batch along the channel dimension.
3. Normalize x: x' = (x - μ) / √(σ² + ε).
4. Apply learnable scaling and shifting parameters γ and β to the normalized values: y = γx' + β.

The scaling and shifting parameters are added so that the features learned before normalization can still be preserved after each batch is normalized, while the normalization itself speeds up training. γ and β are parameters learned during training.
 

import numpy as np

def Batchnorm(x, gamma, beta, bn_param):
    # x_shape: [B, C, H, W]
    running_mean = bn_param['running_mean']
    running_var = bn_param['running_var']
    momentum = bn_param.get('momentum', 0.9)  # momentum for the running estimates (0.9 assumed if not provided)
    eps = 1e-5

    # Per-channel mean and variance over the batch and spatial dimensions
    x_mean = np.mean(x, axis=(0, 2, 3), keepdims=True)
    x_var = np.var(x, axis=(0, 2, 3), keepdims=True)
    x_normalized = (x - x_mean) / np.sqrt(x_var + eps)
    results = gamma * x_normalized + beta

    # At test time single images are processed, so keep running estimates of the
    # training mean and variance for use later at inference
    running_mean = momentum * running_mean + (1 - momentum) * x_mean
    running_var = momentum * running_var + (1 - momentum) * x_var

    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return results, bn_param
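
A quick usage sketch with made-up shapes: gamma and beta are broadcast per channel, and the running statistics can be initialized to zero mean and unit variance.

# Made-up example: a mini-batch of 8 feature maps with 3 channels of size 4x4
x = np.random.randn(8, 3, 4, 4)
gamma = np.ones((1, 3, 1, 1))
beta = np.zeros((1, 3, 1, 1))
bn_param = {
    'running_mean': np.zeros((1, 3, 1, 1)),
    'running_var': np.ones((1, 3, 1, 1)),
    'momentum': 0.9,
}

out, bn_param = Batchnorm(x, gamma, beta, bn_param)
print(out.shape)                 # (8, 3, 4, 4)
print(out.mean(axis=(0, 2, 3)))  # per-channel means, all close to 0
print(out.std(axis=(0, 2, 3)))   # per-channel standard deviations, all close to 1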

Calling it from Keras: BatchNormalization is placed after each layer and before the activation function.

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation

input_dim = 784  # number of input features (example value)

model = Sequential()
model.add(Dense(64, input_shape=(input_dim,)))  # fully connected input layer
model.add(BatchNormalization())  # BatchNormalization layer
model.add(Activation('relu'))  # activation function (ReLU in this case)
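
The same pattern (layer, then BatchNormalization, then activation) applies to convolutional layers as well; a minimal sketch with made-up layer sizes:

from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Activation

conv_model = Sequential()
conv_model.add(Conv2D(32, (3, 3), padding='same', input_shape=(28, 28, 1)))  # convolution, no activation yet
conv_model.add(BatchNormalization())  # normalizes along the channel axis by default
conv_model.add(Activation('relu'))  # activation applied after BN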


Origin: blog.csdn.net/keeppractice/article/details/132094207