TensorFlow Notes (X): Introduction to Batch Normalization

The basic idea of BatchNorm: keep the distribution of the activation inputs of every hidden-layer node fixed, thereby avoiding the "Internal Covariate Shift" problem.

The basic idea of BN is actually quite straightforward. In a DNN, the input to the nonlinear transform at each layer (that is, x = Wu + b, where u is the layer input) gradually shifts its distribution during training as the network gets deeper. The reason training converges slowly is that, overall, this distribution drifts toward the saturated ends of the nonlinear function's input range (for the sigmoid, the activation input Wu + b takes large negative or positive values), which makes gradients vanish in the lower layers during back-propagation. This is the essential reason why training a deep DNN converges more and more slowly. BN uses a form of standardization to force the distribution of every neuron's input in every layer back toward a standard normal distribution with mean 0 and variance 1; in effect it pulls an increasingly skewed distribution back to a standard one, so that activation inputs fall in the region where the nonlinear function is sensitive to its input. Small changes in the input then produce larger changes in the loss, which means larger gradients, avoiding the vanishing-gradient problem; and larger gradients mean faster learning and much faster convergence of training [1].
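To make the "sensitive region" point concrete, here is a tiny NumPy sketch (the input values are purely illustrative) showing how the sigmoid gradient almost vanishes for large pre-activations and is largest near zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of the sigmoid: s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Pre-activations pushed into the saturated tails give almost no gradient;
# pulling them back toward a zero-mean, unit-variance range restores it.
z = np.array([-8.0, -1.0, 0.0, 1.0, 8.0])
print(sigmoid_grad(z))   # roughly [0.0003, 0.20, 0.25, 0.20, 0.0003]
```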

Below, Batch Normalization (BN) is briefly explained in five parts [2].

1. What is BN?

      As the name implies, batch normalization is a "standardize per batch" strategy. BN essentially addresses problems in the gradient back-propagation process. Google's ICML paper describes it very clearly: at each SGD (stochastic gradient descent) step, standardize the corresponding activations over the mini-batch, so that each dimension of the output signal has mean 0 and variance 1. This approach splits the data into groups and updates the parameters group by group; the samples within a group jointly determine the direction of this gradient step, which reduces randomness. And because the batch size is much smaller than the whole data set, the amount of computation also drops considerably. The final "scale and shift" operation is deliberately added so that, if training requires it, BN can recover the original input (i.e. when γ equals the mini-batch standard deviation and β equals the mini-batch mean), thereby preserving the capacity of the network. (About capacity: adding BN can be seen as introducing a "new operation" into the original model; this new operation may change a layer's original input, or it may not, and not changing it amounts to "restoring the original input". In this way the input can be changed while the option of keeping the original input is retained, so the model's capacity is preserved or even improved.)

      Normalizing and standardizing the data accelerates gradient descent, which is one reason Batch Normalization and similar techniques are so popular: it makes it possible to use a larger learning rate, keeps gradient propagation more stable, and can even increase the generalization ability of the network.

    A BN layer is typically used after a convolution layer to re-adjust the data distribution. Assume the input to a layer of the neural network is a batch X = [x1, x2, ..., xn], where xi is one sample and n is the batch size.

First, we compute the mean of the elements in the mini-batch:

                    μ_B = (1/n) Σ_{i=1}^{n} x_i

Next, we compute the variance of the mini-batch:

                    σ_B² = (1/n) Σ_{i=1}^{n} (x_i − μ_B)²

With these, each element can be normalized (ε is a small constant for numerical stability):

                    x̂_i = (x_i − μ_B) / √(σ_B² + ε)

Finally, a scale-and-shift operation makes it possible to transform back to the original distribution and realize an identity transformation; its purpose is to compensate for the network's nonlinear expressive power, part of which is lost by the standardization. Concretely, with yi the final output of the layer:

                    y_i = γ · x̂_i + β ≡ BN_{γ,β}(x_i)

These four steps can be summarized as the following figure (the Batch Normalizing Transform, Algorithm 1 in the BN paper):

[Figure: the Batch Normalizing Transform applied to activation x over a mini-batch.]

If γ equals the standard deviation √(σ_B² + ε) and β equals the mean μ_B, the transform recovers the identity mapping.
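Putting the four steps together, here is a minimal NumPy sketch of the BN forward pass; the 2-D mini-batch shape (n, d) and the names are assumptions for illustration:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (n, d) mini-batch; gamma, beta: (d,) learnable scale and shift."""
    mu = x.mean(axis=0)                       # per-feature mini-batch mean
    var = x.var(axis=0)                       # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize to ~zero mean, unit variance
    y = gamma * x_hat + beta                  # scale and shift
    return y, (x_hat, mu, var)

# If gamma = sqrt(var + eps) and beta = mu, the output recovers x (identity transform).
x = np.random.randn(32, 4) * 3.0 + 5.0
y, _ = batchnorm_forward(x, np.sqrt(x.var(axis=0) + 1e-5), x.mean(axis=0))
print(np.allclose(y, x))                      # True
```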

      In a sense, γ and β represent the variance and the offset of the input data distribution. In a network without BN, these two quantities depend on the nonlinear behavior of all the previous layers; after the transform, they no longer depend on the previous layers and become learnable parameters of the current layer, which makes optimization easier and does not reduce the capacity of the network.

      For CNNs, BN is carried out separately along the feature (channel) dimension, i.e. each channel gets its own Batch Normalization. If the output blob has shape (N, C, H, W), then for each channel the mean and variance are computed over the N * H * W values; keep this in mind for later comparison.
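For the CNN case, a minimal NumPy sketch of per-channel normalization over the N * H * W values (the shapes used here are illustrative assumptions):

```python
import numpy as np

def batchnorm_cnn(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); gamma, beta: (C,). Statistics per channel over N*H*W values."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)       # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 16, 28, 28)
y = batchnorm_cnn(x, np.ones(16), np.zeros(16))
print(y.mean(axis=(0, 2, 3)).round(6))               # ~0 for every channel
```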

        Regarding normalization for DNNs, the well-known approach is whitening, but whitening brings high computational cost and long computation time during model training. The paper therefore proposes two simplifications:

  • 1) Normalize each dimension of the input signal directly and independently ("normalize each scalar feature independently");
  • 2) Compute the mean and variance of each mini-batch and use them in place of the mean and variance of the entire training set. This gives Algorithm 1 (the transform written out above).

2. How to Batch Normalize?

      We will not derive the gradients of the BN parameters in detail here; they follow from the classic chain rule:

          ∂L/∂x̂_i = ∂L/∂y_i · γ
          ∂L/∂σ_B² = Σ_{i=1}^{n} ∂L/∂x̂_i · (x_i − μ_B) · (−1/2)(σ_B² + ε)^{−3/2}
          ∂L/∂μ_B = Σ_{i=1}^{n} ∂L/∂x̂_i · (−1/√(σ_B² + ε)) + ∂L/∂σ_B² · (1/n) Σ_{i=1}^{n} (−2)(x_i − μ_B)
          ∂L/∂x_i = ∂L/∂x̂_i · 1/√(σ_B² + ε) + ∂L/∂σ_B² · 2(x_i − μ_B)/n + ∂L/∂μ_B · 1/n
          ∂L/∂γ = Σ_{i=1}^{n} ∂L/∂y_i · x̂_i
          ∂L/∂β = Σ_{i=1}^{n} ∂L/∂y_i
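As a cross-check of the chain rule above, a minimal NumPy sketch of the backward pass, assuming a 2-D mini-batch x of shape (n, d) and upstream gradient dy (a sketch, not an optimized implementation):

```python
import numpy as np

def batchnorm_backward(dy, x, gamma, eps=1e-5):
    """Gradients of the loss w.r.t. x, gamma and beta for an (n, d) mini-batch."""
    n = x.shape[0]
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)

    dgamma = np.sum(dy * x_hat, axis=0)
    dbeta = np.sum(dy, axis=0)

    dx_hat = dy * gamma
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * (var + eps) ** -1.5, axis=0)
    dmu = np.sum(-dx_hat / np.sqrt(var + eps), axis=0) + dvar * np.mean(-2.0 * (x - mu), axis=0)
    dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / n + dmu / n
    return dx, dgamma, dbeta
```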

3. Where to use BN?

      BN can be applied to any set of activations in the network. The paper specifically points out that in a CNN, BN should act before the nonlinear mapping, i.e. the normalization is applied to x = Wu + b. The paper also gives a corresponding treatment of BN for the CNN "weight sharing" strategy (see Section 3.2 of the paper).
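For example, a minimal TensorFlow/Keras sketch of this placement, with BN inserted between the convolution's linear output and the nonlinearity (the layer sizes are illustrative assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", use_bias=False,
                           input_shape=(28, 28, 1)),   # linear part Wu; the bias is absorbed by BN's beta
    tf.keras.layers.BatchNormalization(),              # normalize per channel, before the activation
    tf.keras.layers.ReLU(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
model.summary()
```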

4. Why BN?

  • The benefits BN brings:
  • (1) It reduces the dependence on parameter initialization, which is good news for anyone tuning hyperparameters.
  • (2) Training is faster, and a higher learning rate can be used.
  • (3) BN increases generalization ability to some extent, so techniques such as dropout can be removed.
  • The drawbacks of BN:
  • As seen above, batch normalization depends on the batch size: when the batch is very small, the computed mean and variance are unstable. Studies show that for ResNet-style models on ImageNet, performance starts to degrade noticeably when the batch size drops from 16 to 8, because the mean and variance computed during training become inaccurate, and it is exactly the statistics accumulated during training that are used at test time (see the sketch after this list).
  • This property makes batch normalization unsuitable for the following scenarios.
  • (1) The batch is very small, for example when training resources are limited and a large batch cannot be used, or in online learning where model parameters are updated from a single example at a time.
  • (2) RNNs, because an RNN is a dynamic network structure: training examples in the same batch have different lengths, so each time step has to maintain its own statistics, which prevents BN from being used correctly. Adapting BN for RNNs is also quite difficult. Difficult does not mean nobody has done it; usable variants do exist today, but they are beyond the scope of this introductory note.
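To illustrate the test-time behavior mentioned above, here is a minimal sketch of how running statistics are kept during training and reused at inference (the momentum value is an illustrative assumption):

```python
import numpy as np

class SimpleBatchNorm:
    """Minimal sketch: mini-batch statistics in training, moving averages at test time."""

    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)
        self.running_mean, self.running_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # exponential moving averages, later reused for inference;
            # with very small batches these estimates become noisy and unstable
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mean, self.running_var
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta
```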

Let us first talk about "Internal Covariate Shift". Besides BN, the other keyword in the paper's title is "ICS". A classic assumption in statistical machine learning is that the data distributions of the source domain and the target domain are consistent. If they are not, new machine learning problems arise, such as transfer learning / domain adaptation. Covariate shift is a sub-problem under this inconsistent-distribution setting: the conditional probabilities of the source and target domains are the same, but their marginal probabilities differ, i.e. for all x ∈ X, P_s(Y | X = x) = P_t(Y | X = x), but P_s(X) ≠ P_t(X). If you think about it, this is indeed the case for the outputs of each layer of a neural network: because they have passed through the layer's internal operations, their distribution clearly differs from the distribution of that layer's input signal, and the difference grows with network depth, yet the sample labels they "indicate" remain unchanged, which matches the definition of covariate shift. Since the analysis concerns the signals between layers, it is called "internal".
Now, why did I say earlier that Google complicated the matter? If one strictly followed the covariate-shift line of attack, one would probably use machine learning methods such as "importance weighting" (ref). But here Google simply says that "normalizing the inputs of some or all layers via the mini-batch, so as to fix the mean and variance of each layer's input signal" solves the problem. If covariate shift could really be solved this simply, previous research on it would have been in vain. Besides, is a distribution with the same mean and variance necessarily the same distribution? Of course not. Clearly, ICS is just the "wrapping paper" of the problem, merely a high-level demonstration.
So what is the real principle behind BN? In the end it is to prevent "gradient diffusion" (vanishing gradients). For gradient diffusion, everyone knows the simple example: 0.9^30 ≈ 0.04. In BN, by normalizing activations so that their mean and variance stay consistent, the scale of activations that would otherwise keep shrinking is made larger again. It can be regarded as a more effective local response normalization method (see Section 4.2.1 of the paper).

5. When to use BN?

      OK, having covered the advantages of BN, we naturally want to know when using BN works best. For example, when neural network training converges slowly, or gradients explode and the network cannot be trained at all, you can try BN to solve the problem. In addition, in ordinary situations BN can also be added to speed up training and improve model accuracy.

 

References:

[1] [Deep Learning] An in-depth understanding of Batch Normalization

[2] Why is Batch Normalization effective in deep learning? https://www.zhihu.com/question/38102762
