The Principle of Batch Normalization, Vanishing Gradients, and Exploding Gradients

Analysis of Batch Normalization Principle

Foreword

This article collects Batch Normalization-related material that I compiled from several books and blog posts. The whole article is organized according to my own understanding, for future reference. References are listed at the end of the text.

Batch Normalization is often used to address vanishing gradients and exploding gradients, as well as the internal covariate shift described in the original paper. This article therefore first goes over the principles behind vanishing gradients, exploding gradients, and internal covariate shift, and then analyzes the principle of Batch Normalization.

1.1 Vanishing gradients and exploding gradients

Some papers (for example, the ResNet paper) and technical books mention that Batch Normalization can be used to alleviate vanishing and exploding gradients. Here, following the book "PyTorch in Simple Terms", we outline how vanishing and exploding gradients arise.

[Figure: forward computation of one layer, with input h_j, weight W_j, and output h_{j+1}]
where $\mathbf{h}_j$ is the input to the neurons of layer $j$, $\mathbf{W}_j$ is the weight of layer $j$, and $\mathbf{h}_{j+1}$ is the output of this layer, which in turn serves as the input to the next layer. Without an activation function, $\mathbf{h}_{j+1} = \mathbf{W}_j \mathbf{h}_j$; after adding the activation function, $\mathbf{h}_{j+1} = f_j(\mathbf{W}_j \mathbf{h}_j)$.
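As a minimal sketch of this forward computation (the layer sizes and the sigmoid activation here are made-up choices for illustration):

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: layer j has 4 inputs and 3 output neurons.
h_j = torch.randn(4)      # input of layer j
W_j = torch.randn(3, 4)   # weight of layer j
f_j = torch.sigmoid       # activation function of layer j

h_j1 = f_j(W_j @ h_j)     # h_{j+1} = f_j(W_j h_j)
print(h_j1.shape)         # torch.Size([3])
```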

According to the chain rule in calculus, if $f(\mathbf{x})$ depends on $\mathbf{x}$ through an intermediate variable $\mathbf{y}$, its derivative with respect to $\mathbf{x}$ is:
$$\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = \frac{\partial f(\mathbf{x})}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$$
Assume the final loss is $L = f_n(\mathbf{h}_n)$, a function of the output-layer neurons. Differentiating with respect to the weights and the inputs of layer $j$ via the chain rule gives:
$$\frac{\partial L}{\partial \mathbf{W}_j} = \frac{\partial L}{\partial \mathbf{h}_{j+1}} \frac{\partial \mathbf{h}_{j+1}}{\partial \mathbf{W}_j} = \left( \frac{\partial L}{\partial \mathbf{h}_{j+1}} \odot \frac{\partial f_j(\mathbf{W}_j \mathbf{h}_j)}{\partial \mathbf{W}_j \mathbf{h}_j} \right) \mathbf{h}_j^T$$
$$\frac{\partial L}{\partial \mathbf{h}_j} = \frac{\partial L}{\partial \mathbf{h}_{j+1}} \frac{\partial \mathbf{h}_{j+1}}{\partial \mathbf{h}_j} = \mathbf{W}_j^T \left( \frac{\partial L}{\partial \mathbf{h}_{j+1}} \odot \frac{\partial f_j(\mathbf{W}_j \mathbf{h}_j)}{\partial \mathbf{W}_j \mathbf{h}_j} \right)$$
Here $\frac{\partial L}{\partial \mathbf{h}_j}$ can be regarded as the derivative of the loss with respect to the data, i.e. the data gradient, and $\frac{\partial L}{\partial \mathbf{W}_j}$ is the derivative of the loss with respect to the weights, i.e. the weight gradient. From these formulas we can see that the data gradient depends on the weights, the weight gradient depends on the data, and both the data gradient and the weight gradient of an earlier layer depend on the data gradient of the layer after it.
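To sanity-check the second formula, here is a minimal autograd sketch (toy sizes, a sigmoid activation, and a made-up loss $L = \sum h_{j+1}$) that compares PyTorch's gradient for $\mathbf{h}_j$ with the hand-derived expression $\mathbf{W}_j^T\left(\frac{\partial L}{\partial \mathbf{h}_{j+1}} \odot f_j'(\mathbf{W}_j\mathbf{h}_j)\right)$:

```python
import torch

torch.manual_seed(0)

h_j = torch.randn(4, requires_grad=True)     # input of layer j
W_j = torch.randn(3, 4, requires_grad=True)  # weight of layer j

z = W_j @ h_j                  # W_j h_j
h_j1 = torch.sigmoid(z)        # h_{j+1} = f_j(W_j h_j)
L = h_j1.sum()                 # a toy loss built from h_{j+1}
L.backward()

# Hand-computed data gradient: W_j^T (dL/dh_{j+1} ⊙ f_j'(W_j h_j)).
dL_dh_j1 = torch.ones(3)                             # dL/dh_{j+1} for L = sum(h_{j+1})
f_prime = torch.sigmoid(z) * (1 - torch.sigmoid(z))  # sigmoid derivative
manual = W_j.T @ (dL_dh_j1 * f_prime)

print(torch.allclose(h_j.grad, manual.detach()))     # True
```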

With this, vanishing and exploding gradients can be explained:

  • Vanishing gradients: when the network is very deep, different layers learn at very different speeds. In practice this shows up as the layers near the output learning well while the layers near the input learn very slowly. There are several causes, such as improper weight initialization or an unsuitable activation function; the activation function is the easier one to illustrate.
    [Figure: the Sigmoid and Tanh activation functions]
    If Sigmoid or Tanh is used as the activation function, their derivatives are at most 1 (for Sigmoid, at most 0.25). This means that during backpropagation, every time the gradient passes back through a layer it is multiplied by the derivative of the activation, $\frac{\partial f(\mathbf{h}_j)}{\partial \mathbf{h}_j}$, which is less than 1. So the more layers the gradient propagates back through, the smaller the data gradient becomes, and the smaller the corresponding weight gradient, which causes the gradient to vanish. This is one reason ReLU is commonly used as the activation function: its derivative is 1 for positive inputs. Improper weight initialization, such as weights that are too small, can cause the same problem.
  • Exploding gradients: if weight initialization makes some weight values too large, then during backpropagation the data gradient grows each time it passes back through a layer, and the corresponding weight gradients grow with it, so the weight gradients become excessively large.

In summary, weight initialization and the choice of activation function are the main causes of vanishing and exploding gradients, so when initializing weights we should try to keep the scale of the values propagated through each layer close to 1, so that gradients neither shrink nor grow as they travel backwards; the sketch below illustrates both effects.
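A minimal sketch of both effects (toy layer sizes, made-up weight scales, and a dummy loss; not taken from the referenced book):

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(depth, act, weight_std):
    """Build a deep MLP, run one backward pass, and return the
    gradient norm of the first layer's weights."""
    torch.manual_seed(0)
    layers = []
    for _ in range(depth):
        linear = nn.Linear(64, 64, bias=False)
        nn.init.normal_(linear.weight, std=weight_std)
        layers += [linear, act()]
    net = nn.Sequential(*layers)

    x = torch.randn(8, 64)
    net(x).pow(2).mean().backward()
    return net[0].weight.grad.norm().item()

for depth in (5, 15, 25):
    vanishing = first_layer_grad_norm(depth, nn.Sigmoid, weight_std=0.1)  # small weights + saturating activation
    exploding = first_layer_grad_norm(depth, nn.ReLU, weight_std=0.3)     # overly large weights
    print(f"depth={depth:2d}  sigmoid/small init: {vanishing:.2e}  relu/large init: {exploding:.2e}")
```

As depth grows, the first-layer gradient of the sigmoid network shrinks towards zero, while that of the over-large initialization grows rapidly, matching the two cases described above.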

2.1 Internal covariate shift

Internal covariate shift is the problem raised in the Batch Normalization paper. As mentioned above, a deep neural network stacks many layers, and each layer's parameter update changes the distribution of the inputs seen by the layer above it. As these changes accumulate layer by layer, the input distribution of the higher layers can shift drastically, forcing the higher layers to constantly re-adapt to the parameter updates of the lower layers.

In other words, the input data undergoes a nonlinear transformation at every layer of the network until the last layer, so by then the distribution of the data has changed while the ground truth has not. As a result, the neurons in later layers must keep adjusting their parameters to fit the new data distribution, and each layer's update affects the next layer, so the optimizer's hyperparameters have to be set very cautiously.
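A rough way to see this shift in code (a toy sketch with made-up layer sizes and an intentionally large learning rate, just to make the effect visible): freeze a batch, record the statistics of the input that a later layer receives, take one optimizer step, and look again.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer1 = nn.Linear(20, 20)
layer2 = nn.Linear(20, 1)
x = torch.randn(128, 20)   # a fixed batch
y = torch.randn(128, 1)    # dummy targets

def layer2_input_stats():
    """Mean/std of the activations that layer2 receives for the fixed batch."""
    with torch.no_grad():
        h = torch.relu(layer1(x))
    return h.mean().item(), h.std().item()

print("before update:", layer2_input_stats())

# One (deliberately large) SGD step updates layer1's parameters ...
opt = torch.optim.SGD(list(layer1.parameters()) + list(layer2.parameters()), lr=0.5)
loss = nn.functional.mse_loss(layer2(torch.relu(layer1(x))), y)
loss.backward()
opt.step()

# ... so the distribution that layer2 sees has shifted, even though x did not change.
print("after update: ", layer2_input_stats())
```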

3.1 Batch Normalization principle

The following is an analysis of the principle of Batch Normalization. To mitigate internal covariate shift, we would like the input of each layer of the network to follow a stable, roughly fixed distribution (ideally independent and identically distributed across training), and this is what Batch Normalization sets out to achieve.

Take convolutional neural networks as an example. Suppose a certain layer of our network has $k$ neurons and its previous layer has $j$ neurons. The output of layer $j$ then has shape $[B, j, H_1, W_1]$, where $B$ is the batch size and $j$ is the number of channels output by that layer. This output is passed as input to layer $k$, which has $k$ neurons, i.e. $k$ output channels. The weight of each neuron in layer $k$ has shape $[j, S, S]$, where $S$ is the size of the convolution kernel. Convolving each neuron's weight with the input of shape $[B, j, H_1, W_1]$ gives a result of shape $[B, 1, H_2, W_2]$, and since there are $k$ such neurons, the overall output of layer $k$ has shape $[B, k, H_2, W_2]$. The figure below illustrates this:
[Figure: layer j's output [B, 4, H1, W1] fed into layer k, which has 2 neurons with weights of shape [4, S, S] and produces output [B, 2, H2, W2]]
In the figure above, the output of layer $j$ is $[B, 4, H_1, W_1]$ and is passed as input to layer $k$. Layer $k$ has two neurons, each with a weight of shape $[4, S, S]$. Each neuron's weight is convolved with the input to produce a result of shape $[B, 1, H_2, W_2]$; concatenating the results of the two neurons gives the overall output $[B, 2, H_2, W_2]$.
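These shapes can be checked with a short sketch (the batch size, spatial size, kernel size, and padding below are made-up values):

```python
import torch
import torch.nn as nn

B, H1, W1, S = 8, 32, 32, 3    # made-up batch size, spatial size, kernel size

x = torch.randn(B, 4, H1, W1)  # output of layer j: 4 channels
layer_k = nn.Conv2d(in_channels=4, out_channels=2, kernel_size=S, padding=1)

print(layer_k.weight.shape)  # torch.Size([2, 4, 3, 3]): 2 neurons, each with weight [4, S, S]
print(layer_k(x).shape)      # torch.Size([8, 2, 32, 32]): [B, 2, H2, W2]
```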

Batch Normalization acts on the output of layer $k$. Continuing to assume that layer $k$ has $k$ neurons and the batch size is $m$, i.e. $m$ data samples, the output of layer $k$ has shape $[m, k, H, W]$: $m$ samples, each with $k$ channels, and each channel an $[H, W]$ matrix. Batch Normalization normalizes each channel over the $m$ samples, as shown in the figure below:
[Figure: Batch Normalization normalizing each channel of the [m, k, H, W] output across the batch]
This is where BN is applied when we use it: generally a Conv layer is followed by a BN layer and then an activation layer such as ReLU.
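A minimal PyTorch sketch of this Conv → BN → ReLU ordering (the channel counts and spatial size are arbitrary, carried over from the example above):

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(4, 2, kernel_size=3, padding=1, bias=False),  # Conv layer
    nn.BatchNorm2d(2),                                       # BN over the 2 output channels
    nn.ReLU(inplace=True),                                   # activation after BN
)

x = torch.randn(8, 4, 32, 32)
print(block(x).shape)   # torch.Size([8, 2, 32, 32])
```

The Conv layer's bias is often omitted here because BN's shift parameter β makes it redundant. Now let's look at the specific formula of BN.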

Continuing the example above, the $m$ data samples pass through layer $k$ and produce an output of shape $[m, k, H, W]$: $m$ samples, each with $k$ channels, each channel an $[H, W]$ matrix. Batch Normalization is applied to this output; taking the first channel across the $m$ samples as an example, it computes:
$$\mu_1 = \frac{1}{m}\sum_{i=1}^{m} x_{1i}$$
$$\sigma_1^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_{1i}-\mu_1\right)^2$$
$$\hat{x}_1 \leftarrow \frac{x_{1i}-\mu_1}{\sqrt{\sigma_1^2+\epsilon}}$$
$$y_1 \leftarrow \gamma_1 \hat{x}_1 + \beta_1 \equiv \mathrm{BN}_{\gamma_1,\beta_1}(x_1)$$
where $x_1$ denotes the first channel of the entire batch and $x_{1i}$ denotes the first channel of the $i$-th sample. This operation can be divided into two steps:

  • Standardization: first standardize the $m$ values $x_{1i}$ to obtain a zero-mean, unit-variance distribution $\hat{x}_1$;
  • Scale and shift: then scale and shift $\hat{x}_1$ into a new distribution $y_1$ with a new mean and variance.

$\gamma_1$ and $\beta_1$ are learnable scale and shift parameters that control the variance and mean of $y_1$.
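A sketch of this per-channel computation in PyTorch, checked against nn.BatchNorm2d in training mode (shapes are made up; a fresh BatchNorm2d has γ = 1 and β = 0, matching the manual values below):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, k, H, W = 16, 2, 8, 8             # made-up batch size and output shape
x = torch.randn(m, k, H, W)          # output of layer k for a batch of m samples

# Manual BN: mean/variance of each channel over the batch and spatial dimensions.
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)
gamma = torch.ones(1, k, 1, 1)       # learnable scale, initialized to 1
beta = torch.zeros(1, k, 1, 1)       # learnable shift, initialized to 0
y_manual = gamma * x_hat + beta

bn = nn.BatchNorm2d(k, eps=1e-5)     # freshly initialized: gamma = 1, beta = 0
y_torch = bn(x)                      # training mode by default

print(torch.allclose(y_manual, y_torch, atol=1e-5))  # True
```

Note that for convolutional outputs the mean and variance of each channel are taken over both the batch and the spatial dimensions, which is what the sketch above does.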
