The principle and function of Batch Norm

What does Batch Normalization do?

When data first enters the network, we hope it is independent and identically distributed (IID).

However, the authors of Batch Normalization argue that this is not enough: the input to every layer of a deep network should also be processed so that each layer sees the same distribution.

The reasoning goes like this: suppose the network has n layers and is still being trained, i.e., it has not yet converged. An input X_{1} passes through the first layer, but the first layer has not yet learned the correct weights, so after the weight-matrix multiplication the values arriving at the second layer can be all over the place: some nodes may hold single-digit values while others jump into the hundreds. This is quite plausible, since the internal parameters are randomly initialized, so it is hard to say what the result will be. Worse, those essentially random numbers from the second layer become the input to the third layer, whose output is then also unreliable, and so on through the network.

This leads to two main problems:

1. While the earlier layers have not converged, the later layers do not actually learn much. (If the bottom of a building is swaying, the upper floors cannot be built properly: the later layers only train effectively once the earlier layers have converged.)

2. Each layer is generally followed by an activation function to add nonlinearity. If the incoming values are large in magnitude, an S-shaped activation pushes them toward 0 or 1, where the gradient is tiny, so convergence becomes very slow (see the small sketch below).
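As a quick, self-contained illustration of this saturation effect (my own sketch, not from the original post), the gradient of the sigmoid shrinks rapidly as the input grows:

import torch

# gradient of sigmoid at a few input magnitudes
x = torch.tensor([0.0, 2.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)  # roughly [0.25, 0.105, 0.00005]: large inputs receive almost no gradient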

Therefore, Batch Normalization adds a normalization step to each layer so that the values in every layer follow the same distribution: a standard distribution with mean 0 and variance 1. The normalization is the \hat{x} part of the formula below: subtracting the mean \mu and dividing by the standard deviation \sigma yields a standard normal distribution with mean 0 and variance 1 (\epsilon is a small constant for numerical stability).

\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}}, \qquad y = \gamma \hat{x} + \beta

As for \gamma and \beta, they are two parameters that need to be learned: \gamma rescales the variance of the data, and \beta shifts its mean.
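To make the formula concrete, here is a minimal sketch (my own illustration, not from the original post) that normalizes one feature across a mini-batch of four values and then applies the gamma/beta scale-and-shift:

import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])             # one feature across a mini-batch of 4
mu, var = x.mean(), x.var(unbiased=False)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)           # normalized: mean 0, variance 1
gamma, beta = torch.tensor(2.0), torch.tensor(0.5)  # learnable in a real BN layer
y = gamma * x_hat + beta                            # rescaled and shifted
print(x_hat.mean(), x_hat.var(unbiased=False))      # ~0 and ~1
print(y.mean(), y.var(unbiased=False))              # ~0.5 and ~4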

Why do we then modify the mean and variance again after normalizing to mean 0 and variance 1? Doesn't that defeat the purpose of normalizing?

This is because we cannot guarantee what features this layer of the network has learned, and simply normalizing them could destroy them. For example, with an S-shaped activation function, if the features learned at this layer sit near the top of the S, normalization forcibly drags them back to the middle of the S and the features are destroyed. Note that γ and β are trained parameters that differ from layer to layer, so, according to the actual situation of each layer, they can try to restore the features that the layer has learned.
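In the extreme case, if γ learns the batch standard deviation and β learns the batch mean, the scale-and-shift exactly undoes the normalization and the original activations are recovered. A tiny self-contained sketch of this (my own illustration, ignoring the small \epsilon term):

import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
mu, var = x.mean(), x.var(unbiased=False)
x_hat = (x - mu) / torch.sqrt(var)
# gamma = std and beta = mean undo the normalization completely
y = torch.sqrt(var) * x_hat + mu
print(torch.allclose(y, x))  # True: the layer can fall back to the identity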

In practice:

Batch Norm is just another network layer, inserted between one hidden layer and the next. Its job is to take the outputs of the preceding hidden layer and normalize them before passing them on as input to the next hidden layer.

Parameters: two learnable parameters, beta and gamma.
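In PyTorch (my own illustrative note, not from the original post), these correspond to the Batch Norm layer's weight (gamma) and bias (beta) tensors, both trainable:

import torch.nn as nn

bn = nn.BatchNorm1d(4)
print(bn.weight)   # gamma, initialized to ones, requires_grad=True
print(bn.bias)     # beta, initialized to zeros, requires_grad=True
print(bn.running_mean, bn.running_var)  # buffers maintained by the moving average (step 5 below)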

Calculation of the Batch Norm layer:

(It is commonly recommended to add Batch Norm before the activation function, to prevent the gradient from vanishing in the saturated regions of the activation.)
        1. Activations: the activations from the previous layer are passed to Batch Norm as input; each feature in the data has its own activation vector.
        2. Calculate the mean and variance: for each activation vector (feature), compute the mean and variance over all values in the mini-batch.
        3. Normalize: using the corresponding mean and variance, compute the normalized value of each activation. These normalized values now have zero mean and unit variance.
        4. Scale and shift: this step is the innovation introduced by Batch Norm. Unlike the input layer, which is required to keep zero mean and unit variance, Batch Norm allows the normalized values to be shifted (to a different mean) and scaled (to a different variance). It does this by multiplying the normalized values by a factor gamma and adding a factor beta; this is element-wise multiplication, not matrix multiplication. The novelty is that these factors are not hyperparameters (i.e., constants supplied by the model designer) but trainable parameters that the network learns. Each Batch Norm layer can therefore find the best factors for itself, shifting and scaling the normalized values to get the best predictions.
        5. Moving average: Batch Norm also keeps an exponential moving average (EMA) of the mean and the variance. During training it only updates these EMAs and does not otherwise use them; at the end of training they are saved as part of the layer state for use during the inference phase. The moving-average update uses a scalar "momentum", denoted alpha in the sketch below. This is a hyperparameter used only for the Batch Norm moving average and should not be confused with the momentum used in the optimizer.
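Putting the five steps together, here is a minimal hand-rolled sketch of the training-time computation (my own illustration of the idea, with alpha as the moving-average momentum; it mirrors what nn.BatchNorm1d does internally but is not its actual implementation):

import torch

def batch_norm_train(x, gamma, beta, running_mean, running_var, alpha=0.1, eps=1e-5):
    # x: (N, C) mini-batch; gamma, beta: (C,) learnable parameters
    mean = x.mean(dim=0)                        # step 2: per-feature mean over the batch
    var = x.var(dim=0, unbiased=False)          # step 2: per-feature variance over the batch
    x_hat = (x - mean) / torch.sqrt(var + eps)  # step 3: normalize to mean 0, variance 1
    out = gamma * x_hat + beta                  # step 4: element-wise scale and shift
    # step 5: update the exponential moving averages used at inference time
    running_mean = (1 - alpha) * running_mean + alpha * mean
    running_var = (1 - alpha) * running_var + alpha * var
    return out, running_mean, running_var

x = torch.randn(8, 4)
gamma, beta = torch.ones(4), torch.zeros(4)
running_mean, running_var = torch.zeros(4), torch.ones(4)
out, running_mean, running_var = batch_norm_train(x, gamma, beta, running_mean, running_var)
print(out.mean(dim=0), out.var(dim=0, unbiased=False))  # ~0 and ~1 per feature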

Code:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer1=nn.Sequential(
            nn.Conv2d(3,3,3,1,1,bias=False),
            # BatchNorm already adds a learnable shift (beta), so the conv's own bias is redundant: bias=False
            nn.BatchNorm2d(3),  # BatchNorm2d for conv feature maps; the argument must match the conv's output channels
            nn.ReLU(),
            nn.Conv2d(3,3,3,1,1)
        )
        self.layer2=nn.Sequential(
            nn.Linear(3*5*5,20,bias=False),
            nn.BatchNorm1d(20),  # BatchNorm1d for fully connected outputs; in training mode the batch size must be greater than 1
            nn.Linear(20,1),
        )
    def forward(self,x):
        OUT=self.layer1(x)
        OUT=OUT.reshape(-1,3*5*5)  # flatten NCHW -> (N, C*H*W) for the linear layers
        return self.layer2(OUT)
if __name__ == '__main__':
    net=Net()
    x=torch.randn(2,3,5,5)
    y=net(x)
    print(y)
    print(y.shape)
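One practical note on the example above (my own addition, reusing the Net class defined there): the BatchNorm layers use the mini-batch statistics in training mode and the stored moving averages in evaluation mode, so switch the mode explicitly before inference:

net = Net()
net.train()                               # uses mini-batch mean/variance and updates the running averages
y_train = net(torch.randn(2, 3, 5, 5))

net.eval()                                # uses the stored running_mean / running_var instead
y_infer = net(torch.randn(1, 3, 5, 5))    # a batch of 1 is fine here because no batch statistics are needed
print(y_infer.shape)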

Origin blog.csdn.net/GWENGJING/article/details/127245058