In-depth understanding of the role of Batch Normalization

Reprint:

https://www.cnblogs.com/wmr95/articles/9450252.html

That article explains the topic in an easy-to-understand way; it is reposted here to make it easy to review later.

 

Batch Normalization, one of the important achievements in deep learning in recent years, has been widely proven effective and important. Although some details of how it works still lack a theoretical explanation, practice has shown that it simply works very well. Do not forget that deep learning started from Hinton's pre-training of deep networks, an empirical technique that ran ahead of its theoretical analysis. This article introduces the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".

Machine learning rests on a very important assumption: the IID (independent and identically distributed) assumption, namely that the training data and the test data follow the same distribution. This is the basic guarantee that a model trained on the training data will also perform well on the test set. So what is the role of BatchNorm? BatchNorm keeps the input distribution of each layer of a deep neural network stable during training.

Next, let us understand step by step what BN is.

Why do deep neural networks become harder to train and slower to converge as the network gets deeper? This is a good question that touches on the essence of deep learning. Many papers try to solve this problem, for example the ReLU activation function and Residual Networks; BN essentially explains and tackles the same problem from a different angle.

One, the "Internal Covariate Shift" problem

As the title of the paper suggests, BN is meant to solve the "Internal Covariate Shift" problem, so first we need to understand what "Internal Covariate Shift" is.

The paper first describes two advantages of Mini-Batch SGD over one-example SGD: the direction of the gradient update is more accurate, and parallel computation is faster. (Why mention this? Because BatchNorm is built on Mini-Batch SGD, so the authors first praise Mini-Batch SGD, which is of course fair.) It then complains about a drawback of SGD training: tuning the hyperparameters is a lot of trouble. (The implied message of the BN authors is that BN can remedy many of the shortcomings of SGD.)

It then introduces the concept of covariate shift: if the distribution of the input values X in the instance collection <X, Y> of an ML system keeps changing, the IID assumption is violated and it becomes hard for the network model to learn stable regularities. This is essentially the problem that transfer learning is brought in to deal with: our ML system has to learn how to cope with this changing distribution. For a deep network with many hidden layers, the parameters of every layer keep changing during training, so each hidden layer faces its own covariate shift problem; in other words, the input distribution of each hidden layer keeps drifting during training. This is called "Internal Covariate Shift": "internal" means it happens inside the deep network, at the hidden layers, rather than being a covariate shift that occurs only at the input layer.

This leads to the basic idea of BatchNorm: can we fix the distribution of the activation input of each hidden-layer node? That would avoid the "Internal Covariate Shift" problem.

BN is not a good idea conjured up out of thin air; it has an instructive source. Previous work in image processing showed that if the input images are whitened, where whitening means transforming the input data so that it follows a normal distribution with zero mean and unit variance, then the neural network converges faster. The BN authors then reasoned: the image is the input layer of a deep network, and whitening it speeds up convergence; but in a deep network, the output of each hidden layer is the input of the next layer, which means every hidden layer is in effect an input layer for the layers above it. So can we whiten every hidden layer as well? This is the original inspiration behind BN, and that is indeed what BN does: it can be understood as a simplified whitening operation applied to the activation value of every hidden-layer neuron in a DNN.

Two, the essential idea of BatchNorm

The basic idea of BN is actually quite straightforward. As a DNN gets deeper, the distribution of the activation input value before the nonlinear transformation (that is, x = WU + B, where U is the input) gradually shifts and drifts during training. The reason training converges slowly is that the overall distribution gradually moves toward the upper and lower ends of the nonlinear function's input range (for the sigmoid function, this means the mean of the activation input WU + B becomes a large negative or positive value), which makes the gradients of the lower layers vanish during back-propagation. This is the essential reason why training a DNN converges more and more slowly. BN, through a normalization step, forces the input distribution of every neuron in every layer back to a standard normal distribution with mean 0 and variance 1; in effect it pulls an increasingly skewed distribution back to a standard one, so that the activation input falls into the region where the nonlinear function is sensitive to its input. A small change in the input then produces a large change in the loss function, which means larger gradients, the vanishing-gradient problem is avoided, and larger gradients mean faster learning and faster convergence, greatly speeding up training.

THAT'S IT. In one sentence: for each hidden-layer neuron, the input distribution, which has gradually drifted toward the saturated ends of the nonlinear function's range, is forced back to a standard normal distribution with mean 0 and variance 1, so that the input to the nonlinear transformation falls into its sensitive region and the vanishing-gradient problem is avoided. Because the gradient stays relatively large, the parameters of the network are adjusted efficiently: big steps toward the optimum of the loss function, and therefore fast convergence. In the final analysis BN is just such a mechanism, very simple, yet its rationale runs deep.
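To make this concrete, below is a minimal NumPy sketch (an illustration, not the paper's code) of that normalization step applied to a mini-batch of pre-activation values; the function name and shapes are my own choices:

```python
import numpy as np

def normalize_batch(x, eps=1e-5):
    """Normalize each feature of a mini-batch to zero mean and unit variance.

    x: array of shape (batch_size, num_features), the pre-activations x = WU + B.
    eps: small constant for numerical stability.
    """
    mean = x.mean(axis=0)                  # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    return (x - mean) / np.sqrt(var + eps)

# Example: activations that have drifted toward a large negative mean.
x = np.random.randn(32, 4) * 0.5 - 6.0
x_hat = normalize_batch(x)
print(x_hat.mean(axis=0))  # approximately 0
print(x_hat.std(axis=0))   # approximately 1
```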

The above may still seem abstract, so the following explains more vividly what this adjustment actually means.

Figure 1: Several normal distributions

Suppose the activation input x of some hidden-layer neuron originally follows a normal distribution with mean -2 and variance 0.5, corresponding to the light-blue curve at the far left of the figure above. After BN it is transformed into a normal distribution with mean 0 and variance 1 (the dark-blue curve in the figure). What does this mean? It means the normal distribution of the input x is shifted right by 2 as a whole (the change in mean), and the curve becomes flatter (the change due to the increased variance). In other words, BN takes the activation input distribution of each hidden-layer neuron, which has deviated from the mean-0, variance-1 normal distribution, and adjusts it back to a mean-0, variance-1 normal distribution by shifting the mean and compressing or stretching the sharpness of the curve.

So what is the use of adjusting the activation input x to this normal distribution? First let us look at what a standard normal distribution with mean 0 and variance 1 actually means:

 

Figure 2: The standard normal distribution with mean 0 and variance 1

 

This means that within one standard deviation, i.e., with about 68% probability, the value of x falls in the range [-1, 1], and within two standard deviations, i.e., with about 95% probability, x falls in [-2, 2]. So what does that imply? We know the activation value is x = WU + B, where U is the real input and x is the activation input of some neuron. Suppose the nonlinear function is sigmoid; then look at the graph of sigmoid(x):

Figure 3: Sigmoid(x)

The derivative of sigmoid(x) is G' = f(x) * (1 - f(x)). Since f(x) = sigmoid(x) lies between 0 and 1, G' lies between 0 and 0.25; its graph is shown below:

Figure 4: The derivative of Sigmoid(x)

Suppose that before the BN adjustment, x followed a normal distribution with mean -6 and variance 1. That means 95% of its values fell in [-8, -4], where the value of Sigmoid(x) is obviously close to 0. This is the typical gradient-saturation region, where the gradient changes very slowly. Why is it the saturation region? Look at sigmoid(x): when its value is close to 0 or close to 1, the corresponding derivative is close to 0, meaning the gradient becomes very small or even vanishes. Now suppose that after BN the mean is 0 and the variance is 1; then 95% of the x values fall in [-2, 2], and this segment is clearly the region where sigmoid(x) is close to a linear transformation. A small change in x leads to a relatively large change in the nonlinear function value, that is, a relatively large gradient, corresponding to the region in the derivative plot that is clearly greater than 0, the non-saturated gradient region.
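The following small Python sketch (illustrative numbers, not from the paper) makes the saturation argument concrete by comparing the sigmoid derivative on inputs around mean -6 with inputs concentrated in [-2, 2]:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # G' = f(x) * (1 - f(x)), at most 0.25 (at x = 0)

x_saturated = np.linspace(-8, -4, 5)    # roughly where 95% of values fall before BN (mean -6, var 1)
x_normalized = np.linspace(-2, 2, 5)    # roughly where 95% of values fall after BN (mean 0, var 1)

print(sigmoid_grad(x_saturated))   # on the order of 1e-4 to 1e-2: the gradient nearly vanishes
print(sigmoid_grad(x_normalized))  # roughly 0.10 to 0.25: the gradient stays large
```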

From the figures above you should now be able to see what BN is doing: it pulls the hidden-layer activation input x = WU + B from a normal distribution with an arbitrary mean and variance back to a normal distribution with mean 0 and variance 1; that is, the center of the original distribution is shifted left or right to a mean of 0, and the shape is stretched or compressed to a variance of 1. What does that mean? It means that after BN, most activation values fall into the linear region of the nonlinear function, where the derivative is far from the saturation region, and this speeds up the convergence of training.

But clearly, at this point a reader with even a little knowledge of neural networks will raise a question: if everything goes through BN, isn't that the same as replacing the nonlinear function with a linear one? What would that imply? We know that stacking multiple layers of linear transformations makes depth meaningless, because a multi-layer linear network is equivalent to a single-layer linear network.

The paper gives an example: in the middle part of the sigmoid activation function, the function is approximately linear (as shown in the figure below); with BN, the normalized data would use only this linear segment.

As can be seen, within the range [0.2, 0.8] the sigmoid function increases essentially linearly; even within [0.1, 0.9] it behaves like a linear function. If only this segment is used, the network effectively degenerates into a linear network, which is obviously not what anyone wants.

This means the expressive power of the network decreases, and the meaning of depth is lost. So, to preserve nonlinearity, BN applies a scale-and-shift operation (y = scale * x + shift) to the transformed x that now has mean 0 and variance 1. Each neuron gains two parameters, scale and shift, which are learned during training. The idea is that scale and shift move the value a little to the left or right of the standard normal distribution and make the distribution a bit fatter or thinner, with each instance moved by a different amount; this is equivalent to moving the nonlinear function's input from the linear region around the center somewhat toward the nonlinear region. The core idea should be to find a good balance between linearity and nonlinearity: enjoying the stronger expressive power of nonlinearity while avoiding the two saturated ends, which would make the network converge too slowly. Of course, this is my own understanding; the authors do not state it explicitly. But clearly the scale-and-shift operation is debatable: in the ideal case described in the paper, scale and shift could simply transform x back to its un-normalized state, which would take us full circle, right back to the original "Internal Covariate Shift" problem. It feels as if the authors do not clearly explain the theoretical reason for the scale-and-shift operation.
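To make the "full circle" concern concrete, here is a small NumPy sketch (my own illustration, not from the paper) showing that if the learned scale equals the batch standard deviation and the learned shift equals the batch mean, the scale-and-shift step exactly undoes the normalization:

```python
import numpy as np

x = np.random.randn(32, 4) * 3.0 + 5.0      # activations with a non-standard mean and variance
mean, var = x.mean(axis=0), x.var(axis=0)
eps = 1e-5
x_hat = (x - mean) / np.sqrt(var + eps)     # normalized: mean 0, variance 1

scale = np.sqrt(var + eps)                  # a scale the network could in principle learn
shift = mean                                # a shift the network could in principle learn
y = scale * x_hat + shift                   # y = scale * x_hat + shift recovers the original x

print(np.allclose(y, x))                    # True: the normalization has been undone
```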

Three, how to do BatchNorm during training

The above is an abstract analysis and explanation of BN. Concretely, how is BN done under Mini-Batch SGD? The paper actually describes this part very clearly and it is easy to understand. For the completeness of this article, it is briefly explained here.

Suppose that for a deep neural network, two of its layers look as follows:

Figure 5: Two layers of a DNN

To apply BN to the activation value of each hidden-layer neuron, you can imagine that each hidden layer gains an extra BN layer, placed after the activation value X = WU + B is computed and before the nonlinear transformation, as illustrated below:

Figure 6: The BN operation

For Mini-Batch SGD, one training step involves m training instances. The concrete BN operation applies the following transformation to the activation value of each neuron inside a hidden layer:

x̂^(k) = ( x^(k) - E[x^(k)] ) / sqrt( Var[x^(k)] )

Note that here x^(k) for some neuron in layer t does not refer to the original input, i.e., it is not the output of each neuron in layer t-1, but the linear activation x = WU + B of this neuron in layer t; the U here is the output of the layer t-1 neurons. The transformation means: the original activation x of a given neuron is converted by subtracting the mean E(x) computed from the m activations obtained from the m instances in the mini-batch, and dividing by the square root of the variance Var(x) computed the same way.

As stated above, after this transformation the activation x of a given neuron has mean 0 and variance 1. The purpose is to pull the values toward the linear region of the nonlinear transformation that follows, enlarging the derivatives, strengthening the flow of information in back-propagation, and speeding up convergence. But this reduces the expressive power of the network, so to prevent that, each neuron gains two adjustable parameters (scale and shift), which are learned during training and used to apply an inverse-like transform to the normalized activation, restoring the network's expressive power. That is, the normalized activation goes through the following scale-and-shift operation, which is in effect the inverse of the normalization:

y^(k) = scale * x̂^(k) + shift

The concrete BN procedure is exactly as described in the paper:

 

The procedure is very clear; it is just a step-by-step description of the formulas above, so it is not explained further here.
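For concreteness, here is a minimal NumPy sketch of the training-time forward pass described above (an illustrative re-implementation, not the paper's or any framework's code); gamma and beta are the learnable scale and shift:

```python
import numpy as np

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    """Training-time BN over a mini-batch of pre-activations x = WU + B.

    x: shape (m, num_features), where m is the mini-batch size.
    gamma, beta: learnable per-feature scale and shift, shape (num_features,).
    Returns the transformed activations and the mini-batch statistics.
    """
    mu = x.mean(axis=0)                      # mini-batch mean E[x]
    var = x.var(axis=0)                      # mini-batch variance Var[x]
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to mean 0, variance 1
    y = gamma * x_hat + beta                 # scale and shift
    return y, mu, var

# Example usage
m, num_features = 64, 8
x = np.random.randn(m, num_features) * 2.0 - 3.0
gamma, beta = np.ones(num_features), np.zeros(num_features)
y, mu, var = batchnorm_forward_train(x, gamma, beta)
```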

Four, the inference process of BatchNorm

During training, BN can adjust the activation values using the several training instances in a mini-batch, but during inference there is obviously only one input instance and no other mini-batch instances to look at. How do we apply BN to the input then? Clearly a single instance cannot yield the mean and variance of a set of instances. What to do?

Since the statistics cannot be obtained from mini-batch data at inference time, we obtain them another way: we use the mean and variance computed from all training instances in place of the statistics from the m instances of a mini-batch. After all, the original intention was to use global statistics; mini-batches were only a simplification to keep the computation manageable, so at inference time we can simply use the global statistics directly.

Having decided the data range from which to obtain the statistics, the next question is how to compute the mean and variance. It is simple: every mini-batch training step already produces the mean and variance of its m training instances. To get the global statistics, just record the mean and variance of every mini-batch and take the expectation over them, i.e.:

E[x] = E_B[μ_B],   Var[x] = (m / (m - 1)) * E_B[σ_B²]

(At test time, the mean and variance used are those of the entire training set. In practice, the mean and variance of the whole training set are usually computed with a moving average maintained during training.)

For how to compute the moving average, see: https://blog.csdn.net/whitesilence/article/details/75667002
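In practice the moving average is usually an exponential one maintained alongside training. A minimal sketch (the momentum value 0.9 is an illustrative choice, not prescribed by the paper):

```python
import numpy as np

def update_running_stats(running_mean, running_var, batch_mean, batch_var, momentum=0.9):
    """Exponential moving average of the per-feature mini-batch statistics.

    At inference time these running estimates stand in for the global
    training-set mean and variance.
    """
    running_mean = momentum * running_mean + (1.0 - momentum) * batch_mean
    running_var = momentum * running_var + (1.0 - momentum) * batch_var
    return running_mean, running_var

# Example: accumulate statistics over simulated mini-batches.
num_features = 8
running_mean, running_var = np.zeros(num_features), np.ones(num_features)
for _ in range(200):
    x = np.random.randn(64, num_features) * 2.0 - 3.0
    running_mean, running_var = update_running_stats(
        running_mean, running_var, x.mean(axis=0), x.var(axis=0))
print(running_mean)  # approaches about -3
print(running_var)   # approaches about 4
```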

With the mean and variance in hand, and since every hidden-layer neuron already has its trained scale and shift parameters, the BN transform can be applied to every neuron's activation at inference time. During inference, BN is computed as follows:

y = ( scale / sqrt(Var[x] + eps) ) * x + ( shift - scale * E[x] / sqrt(Var[x] + eps) )

This formula is in fact equivalent to the training-time transform; a simple rearrangement of terms shows it. So why write it in this variant form? I guess the authors' intention is that at actual run time this form reduces computation. Why? Because for each hidden-layer node, the two quantities

scale / sqrt(Var[x] + eps)    and    shift - scale * E[x] / sqrt(Var[x] + eps)

are fixed values, so they can be computed once in advance and stored, then simply looked up during inference. Compared with evaluating the original formula from scratch at every step, this removes a division; at first glance the savings look small, but with many hidden-layer nodes they add up.

This is in fact consistent with training, because in both cases we first subtract the training-set mean and then divide by the standard deviation. Substitute the training-time x^(k) into the formula and work it out, and the test-time version becomes clear.
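A minimal sketch of this inference-time folding (my own illustration; the running statistics here stand in for the global training-set statistics):

```python
import numpy as np

def fold_bn_params(scale, shift, global_mean, global_var, eps=1e-5):
    """Precompute the two fixed per-feature values used at inference time."""
    inv_std = 1.0 / np.sqrt(global_var + eps)
    a = scale * inv_std                          # scale / sqrt(Var[x] + eps)
    b = shift - scale * global_mean * inv_std    # shift - scale * E[x] / sqrt(Var[x] + eps)
    return a, b

def batchnorm_inference(x, a, b):
    """Inference-time BN reduces to y = a * x + b with precomputed a and b."""
    return a * x + b

# Example: a single input instance at inference time.
num_features = 8
scale, shift = np.ones(num_features), np.zeros(num_features)
global_mean = np.full(num_features, -3.0)
global_var = np.full(num_features, 4.0)
a, b = fold_bn_params(scale, shift, global_mean, global_var)
x_single = np.random.randn(1, num_features) * 2.0 - 3.0
print(batchnorm_inference(x_single, a, b))
```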

Five, the benefits of BatchNorm

Why is BatchNorm so great? The key is simply that it works well. ① It dramatically speeds up training and makes convergence much faster; ② it also improves classification performance: one explanation is that it acts as a regularizer similar to Dropout and prevents overfitting, so comparable results can be achieved even without Dropout; ③ in addition, hyperparameter tuning becomes much easier, initialization is less critical, and larger learning rates can be used. In short, such a simple transformation brings a great many benefits, which is why BN became popular so quickly.
