Deep Learning: BN (Batch Normalization)

1. Introduction
Batch Normalization was proposed in a 2015 paper by Ioffe and Szegedy. Data normalization methods of this kind are often applied before activation layers in deep neural networks. BN speeds up the convergence of model training, makes the training process more stable, and helps avoid exploding or vanishing gradients. It also provides a degree of regularization, to the point that it can largely replace Dropout.
2. Formula
$$\text{Input: } B = \{x_{1\dots m}\};\ \gamma, \beta \ \text{(parameters to be learned)}$$
$$\text{Output: } \{y_i = BN_{\gamma,\beta}(x_i)\}$$
$$\mu_B \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i$$
$$\sigma_B^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$$
$$\overline{x_i} \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
$$y_i \leftarrow \gamma\,\overline{x_i} + \beta$$

The specific operation of BN is: first compute the mean and variance of $B$; then transform $B$ so that its mean is 0 and its variance is 1; finally, multiply each element of $B$ by $\gamma$ and add $\beta$ to produce the output. $\gamma$ and $\beta$ are trainable parameters that participate in backpropagation through the entire network.
The purpose of normalization is to map the data into a uniform range, reducing its spread and thus the learning difficulty of the network. The essence of BN is that, after normalization, $\gamma$ and $\beta$ act as restoring parameters that preserve the distribution of the original data to a certain extent.
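As a minimal illustration of the four steps above, here is a sketch in NumPy (the function name `batch_norm` and the sample values are ours, not from the original post):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Apply the BN transform above to a 1-D batch x."""
    mu = x.mean()                          # mu_B: batch mean
    var = x.var()                          # sigma_B^2: (biased) batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta            # scale and shift with trainable gamma, beta

x = np.array([1.0, 2.0, 3.0, 4.0])
print(batch_norm(x, gamma=1.0, beta=0.0))  # ~[-1.34, -0.45, 0.45, 1.34]
```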
3. Composition of $B$

Tensor data passed through a neural network is usually laid out as [N, H, W, C], where N is the batch size, H and W are the height and width, and C is the number of channels. The input set $B$ of BN in the formula above consists of all the values of a single channel across the whole batch.
The mean is computed by summing, for each channel separately, all the values in the batch and dividing by $N \times H \times W$. For example: if the batch contains 10 images, each with three RGB channels of height H and width W, then the mean of the R channel is the sum of the R-channel pixel values of all 10 images divided by $10 \times H \times W$; the means of the G and B channels are computed the same way. The variance is computed similarly.
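As a sketch, per-channel mean and variance for an [N, H, W, C] tensor can be computed by reducing over the N, H, W axes (the shapes below are illustrative):

```python
import numpy as np

# A batch of 10 images, height H, width W, 3 channels (NHWC layout)
N, H, W, C = 10, 32, 32, 3
x = np.random.randn(N, H, W, C)

# Reduce over N, H, W: each channel's statistics use N*H*W values
mean = x.mean(axis=(0, 1, 2))   # shape (3,): one mean per channel
var = x.var(axis=(0, 1, 2))     # shape (3,): one variance per channel
print(mean.shape, var.shape)    # (3,) (3,)
```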
The trainable parameters $\gamma$ and $\beta$ each have dimension equal to the number of channels of the tensor. In the example above, each of the three RGB channels needs its own $\gamma$ and $\beta$, so $\gamma$ and $\beta$ each have dimension 3.
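This is visible in PyTorch, for example, where `nn.BatchNorm2d` stores exactly one $\gamma$ (`weight`) and one $\beta$ (`bias`) per channel:

```python
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=3)  # 3 channels, e.g. RGB
print(bn.weight.shape)  # torch.Size([3])  -> gamma, one per channel
print(bn.bias.shape)    # torch.Size([3])  -> beta, one per channel
```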
4. Mean and variance in BN during training and inference
During training, the mean and variance are the mean and variance of the current batch along the corresponding dimensions.
During inference, the mean and variance are estimated from the expectations of the batch statistics over all training batches, using the following formulas:
$$E[x] \leftarrow E_B[\mu_B]$$
$$Var[x] \leftarrow \frac{m}{m-1} E_B[\sigma_B^2]$$
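In practice, frameworks typically approximate these expectations with an exponential moving average maintained during training. A hedged sketch, following the PyTorch convention where `running = (1 - momentum) * running + momentum * batch` (variable names are ours):

```python
import numpy as np

momentum = 0.1                # PyTorch-style momentum convention
running_mean = np.zeros(3)    # one entry per channel
running_var = np.ones(3)

def update_running_stats(batch_mean, batch_var, m):
    """Update inference-time statistics after one training step.

    m is the number of values per channel (N*H*W); the m/(m-1) factor
    turns the biased batch variance into an unbiased estimate, matching
    the Var[x] formula above.
    """
    global running_mean, running_var
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var * m / (m - 1)
```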
Reference:
https://zhuanlan.zhihu.com/p/93643523
