Deep Learning Algorithms (No. 4) -- Advanced DNN Training with TensorFlow

Welcome to follow the WeChat public account "智能算法" (Smart Algorithm). Original link (for a better reading experience):

Deep Learning Algorithms (No. 4) -- Advanced DNN Training with TensorFlow

In Chapter 10 we gave a brief introduction to ANNs (artificial neural networks) and trained our first DNN (deep neural network), but it was a very shallow one with only two hidden layers. If you need to solve a very complex problem, such as distinguishing hundreds of different classes of objects in high-resolution images, you will need to train a much deeper DNN, perhaps 10 layers, each containing hundreds of neurons linked by hundreds of thousands of connections. You will then face the following problems:

1. You will face the vanishing or exploding gradients problem, which makes the lower layers of a deep network very hard to train.

2. Training such a large network from scratch is extremely slow.

3. A model with this many parameters is prone to overfitting.

Solving these three problems is what this article is about.

 

1. Vanishing or Exploding Gradients

 

The backpropagation algorithm propagates the error gradient from the output layer back to the input layer. Once the algorithm has computed the gradient of the loss function with respect to each parameter, it uses these gradients to update the parameters with gradient descent. However, gradients often get smaller and smaller as they propagate back to the shallow layers, so the gradient descent updates leave the lower-layer connection weights virtually unchanged, and training never converges to a good solution. This is the vanishing gradients problem.

Common remedies for this problem are a good weight initialization strategy, a better activation function, and batch normalization.

1.1 Initialization Strategy

In the paper Understanding the Difficulty of Training Deep Feedforward Neural Networks, Xavier Glorot et al. point out that for the signal to flow properly through the neurons, the variance of each layer's outputs should stay roughly equal to the variance of its inputs. In practice this is hard to achieve because the numbers of input and output connections of a layer differ, so the paper proposes an initialization strategy, now called Xavier initialization, to address this. For different activation functions, the parameters are initialized as follows:
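For reference, the values commonly cited from the Glorot and He papers are, roughly:

Logistic activation: uniform on $[-r, r]$ with $r = \sqrt{6/(n_{inputs}+n_{outputs})}$, or normal with $\sigma = \sqrt{2/(n_{inputs}+n_{outputs})}$
Hyperbolic tangent: the same expressions scaled by 4
ReLU and its variants (He initialization): the same expressions scaled by $\sqrt{2}$ (the fan-in-only form often quoted is $\sigma = \sqrt{2/n_{inputs}}$)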

Here n_inputs and n_outputs are the number of input and output connections of the layer (its fan-in and fan-out). The variant of this scheme for the ReLU activation function is sometimes called He initialization. In TensorFlow you can select this strategy with variance_scaling_initializer(); by default, fully connected layers use Xavier initialization with a uniform distribution.
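As a minimal sketch (assuming an input tensor X and a layer size n_hidden1 are already defined), switching a fully connected layer to He initialization in TensorFlow 1.x could look like this:

import tensorflow as tf

# scale=2.0 with mode="fan_in" gives He initialization, suited to ReLU
he_init = tf.variance_scaling_initializer(scale=2.0, mode="fan_in")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")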

1.2 Activation Functions

Earlier we mostly used the logistic (sigmoid) function as the activation function. Because its gradient differs so much between the central region and the two saturated ends, if the weights are initialized too large the activation values end up mostly on the saturated sides of the sigmoid, where the gradient is almost zero, so hardly any gradient propagates back through the layers. Even with a good initialization algorithm that keeps the activations in a reasonable range, after a few optimization steps some neurons still drift to the saturated sides; once there, the gradient is too small for the updates to pull them back. This is exactly what causes the vanishing gradient problem, so choosing a different activation function can help.

ReLU activation function

The ReLU activation function, ReLU(z) = max(0, z), is a good choice: it does not saturate for positive values and is fast to compute. Unfortunately it has a problem known as dying ReLUs: during training, if the weighted sum of a neuron's inputs becomes negative, the neuron outputs only 0, and once it is dead it is hard for it to come back to life. In some cases, especially with a large learning rate, you may find that half of the network's neurons have died.

To keep using ReLU, you can turn to one of its variants, such as leaky ReLU: LeakyReLU(z) = max(αz, z).

The hyperparameter α defines how much the function "leaks": it is the slope of the function for z < 0, and is typically set to 0.01. This small slope in the negative region ensures that leaky ReLUs never die and always have a chance to wake up again.

There are also RReLU and PReLU. With RReLU (randomized leaky ReLU), α is picked randomly within a given range during training and fixed to the average of those values at test time. It performs quite well and acts as a regularizer, reducing the risk of overfitting. With PReLU (parametric leaky ReLU), α is learned during training: instead of being a hyperparameter, it becomes a parameter updated by backpropagation. PReLU performs very well on large image datasets but tends to overfit on small datasets. In TensorFlow, a leaky ReLU can be defined as follows:
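As a minimal sketch (X and n_hidden1 are assumed to exist), a leaky ReLU can be written with tf.maximum(); recent TF 1.x versions also ship tf.nn.leaky_relu:

def leaky_relu(z, alpha=0.01, name=None):
    # slope alpha for z < 0, identity for z >= 0
    return tf.maximum(alpha * z, z, name=name)

hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name="hidden1")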

ELU activation function

Another activation function is ELU (exponential linear unit). In experiments, ELU reduced training time compared with ReLU and performed better on the test set than all the ReLU variants.

It is defined as ELU(z) = α(exp(z) − 1) for z < 0, and z for z ≥ 0.

Compared with the ReLU function, it differs in the following ways:

  1. For z < 0 it returns negative values close to zero rather than exactly zero, which, like the ReLU variants, helps alleviate the vanishing gradients problem; the hyperparameter α controls the value the function approaches for large negative z, and is usually set to 1.
  2. For z < 0 the gradient is nonzero, which avoids the dead-neuron problem during training.
  3. The ELU function is smooth everywhere, including at z = 0, which avoids bouncing around z = 0 and speeds up gradient descent.

Because ELU relies on the exponential function, it is slower to compute than ReLU and its variants. Its faster convergence partly makes up for this during training, but overall an ELU network is still slower than a ReLU network. TensorFlow provides an elu() function; you switch activation functions by setting the activation parameter of tf.layers.dense.
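A minimal sketch (again assuming X and n_hidden1 exist):

hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")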

1.3 Batch Normalization

Batch Normalization (BN), proposed by Sergey Ioffe et al. in 2015, is another way to address the vanishing/exploding gradients problem. During training, whenever the parameters of a layer change, the distribution of the inputs to the following layer changes as well. BN adds a normalization step just before the activation function of each layer: it first zero-centers and normalizes the inputs, which speeds up convergence even when the features are uncorrelated. However, simply normalizing a layer's inputs may change what the layer can represent; for instance, normalizing the inputs of a sigmoid would confine them to the linear part of the s-shaped curve. Two extra parameters are therefore introduced to scale and shift the normalized values. The BN algorithm is as follows:
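In its standard form, for a mini-batch B of size $m_B$:

$\mu_B = \frac{1}{m_B} \sum_{i=1}^{m_B} x^{(i)}$
$\sigma_B^2 = \frac{1}{m_B} \sum_{i=1}^{m_B} (x^{(i)} - \mu_B)^2$
$\hat{x}^{(i)} = (x^{(i)} - \mu_B) / \sqrt{\sigma_B^2 + \epsilon}$
$z^{(i)} = \gamma \, \hat{x}^{(i)} + \beta$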

Here μB is the mean over the mini-batch, σB its standard deviation, γ the scaling parameter, β the shifting parameter, and ε a tiny smoothing term that avoids division by zero. Many activation functions can be used with BN without running into vanishing gradients, and the model also becomes much less sensitive to weight initialization. BN acts somewhat like a regularizer, reducing the need for other regularization techniques.

However, BN does add complexity to the model, since every layer needs its own normalization step. So if you want a lightweight network, the ELU activation function combined with He initialization may be a better choice. Now let's look at how BN is used in TensorFlow (reply with DNN for the complete code).
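A minimal sketch in TensorFlow 1.x using tf.layers.batch_normalization (the input X and the sizes n_hidden1 and n_outputs are assumed to be defined):

import tensorflow as tf

# Tells BN whether to use the current batch statistics (training)
# or the moving averages collected during training (testing).
training = tf.placeholder_with_default(False, shape=(), name="training")

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)          # activation applied after the BN step

logits_before_bn = tf.layers.dense(bn1_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training,
                                       momentum=0.9)

# The ops that update the moving averages live in the UPDATE_OPS collection
# and must be run at every training step, for example:
# sess.run([training_op, extra_update_ops],
#          feed_dict={training: True, X: X_batch, y: y_batch})
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)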

1.4 Gradient Clipping

Gradient clipping is mainly used to avoid exploding gradients: during backpropagation the gradients are simply clipped to a given range. Although most people now prefer BN, gradient clipping is still very useful, especially in RNNs, so it is worth knowing how it is done. In TensorFlow, clipping is performed with the clip_by_value() function:
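A minimal sketch (loss and learning_rate are assumed to be defined elsewhere, and the threshold value is just an example):

threshold = 1.0

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# compute_gradients() returns a list of (gradient, variable) pairs
grads_and_vars = optimizer.compute_gradients(loss)
# clip every gradient to [-threshold, threshold] before applying it
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)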

 

2. Speeding Up Training

 

When training a large neural network, another recurring problem is training time. A common way to speed it up is to reuse parts of an existing network and its parameters.

2.1 Reusing Pretrained Layers

Training a large DNN from scratch is generally very slow. Instead, you can look for an existing network that accomplishes a similar task and reuse some of its lower layers and their parameters to do the basic feature extraction on the inputs. This is transfer learning. It not only speeds up training considerably, it also requires much less training data.

For example, suppose we have already trained a DNN that classifies pictures into 100 different categories, including animals, plants, cars and so on. Now we want to train another DNN that classifies specific kinds of cars. By reusing the lower layers of the first DNN, we can get the model we want very easily.

2.2 Reusing a TensorFlow Model

When we trained models earlier, we saved them with a Saver; they can be loaded back with its restore() method.
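For instance (the checkpoint path is only an assumption):

saver = tf.train.Saver()

with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")   # load all saved variables
    # ... continue training or run inference ...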

However, you usually want to reuse only part of the original model. A simple solution is to configure the Saver to restore only a subset of the original model's variables; for example, the following restores only hidden layers 1, 2 and 3.
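A sketch of what this could look like (the checkpoint paths, and the construction of the new model, are assumptions):

# ... build the new model, defining hidden layers 1-3 the same way as before ...

# match the variables of hidden layers 1 to 3 with a scope regular expression
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]")
# map each variable's name in the original model to the new variable
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
restore_saver = tf.train.Saver(reuse_vars_dict)   # restores only hidden layers 1-3
saver = tf.train.Saver()                          # saves the entire new model

init = tf.global_variables_initializer()
with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")
    # ... train the model on the new task ...
    save_path = saver.save(sess, "./my_new_model_final.ckpt")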

In the code above, we first build the new model and decide that hidden layers 1 to 3 of the original model should be reused; we then match the variables of those layers with the scope regular expression "hidden[123]"; next we create a dictionary mapping each variable's name in the original model to the corresponding new variable; then we create a Saver that restores only the variables of hidden layers 1 to 3, and another Saver that saves the entire new model.

Finally we open a new session, initialize all variables, restore the variables of hidden layers 1 to 3, train the model on the new task using these variables, and save it.

2.3 Reusing Models from Other Frameworks

If the model was trained with another framework, you will have to load the weights manually and then assign them to the right variables. The example below shows how to copy the weights and biases of the first hidden layer from a model trained in another framework.
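A sketch of one way to do this in TensorFlow 1.x (the layer sizes and the arrays original_w / original_b are stand-ins for the values exported from the other framework):

import numpy as np
import tensorflow as tf

n_inputs = 2      # example sizes, for illustration only
n_hidden1 = 3

# pretend these were exported from the other framework as NumPy arrays
original_w = np.array([[1., 2., 3.], [4., 5., 6.]], dtype=np.float32)
original_b = np.array([7., 8., 9.], dtype=np.float32)

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
# ... build the rest of the model ...

# Get handles on the initialization ops of hidden layer 1's variables,
# so we can feed our own values when the variables are initialized.
graph = tf.get_default_graph()
assign_kernel = graph.get_operation_by_name("hidden1/kernel/Assign")
assign_bias = graph.get_operation_by_name("hidden1/bias/Assign")
init_kernel = assign_kernel.inputs[1]
init_bias = assign_bias.inputs[1]

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init, feed_dict={init_kernel: original_w, init_bias: original_b})
    # ... train the model on the new task ...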

2.4 Freezing the Lower Layers

Since the lower layers of the first DNN have probably learned to detect low-level features in pictures, and those features will be useful for other image-classification tasks, you can reuse these layers as they are. When training the new DNN it is generally a good idea to freeze their weights: if the lower-layer weights are fixed, the higher-layer weights become easier to train. The simplest way to freeze the lower layers during training is to give the optimizer a list of variables to train that excludes the variables of the lower layers:
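A sketch consistent with the description that follows (loss and learning_rate are assumed to be defined earlier):

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="hidden[34]|outputs")
    training_op = optimizer.minimize(loss, var_list=train_vars)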

The third line gets the list of all trainable variables in hidden layers 3 and 4 and in the output layer, which leaves out the variables of the lower layers 1 and 2. This list is then passed to the optimizer's minimize() function via its var_list argument. That is all it takes to freeze layers 1 and 2.

2.5 Caching the Frozen Layers

Since the frozen layers cannot change, it is possible to cache the output of the topmost frozen layer for each training instance. Because training goes through the whole dataset many times, this gives a huge speed boost: each training instance only needs to pass through the frozen layers once, instead of once per epoch. For example, you could first run the whole training set through the lower layers:

hidden2_outputs = sess.run(hidden2, feed_dict={X: X_train})

Then, during training, instead of building batches of training instances, you build batches of hidden-layer-2 outputs and feed those to the training operation:
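A sketch of such a training loop (n_epochs, n_batches, y_train and the y placeholder are assumptions):

import numpy as np

n_epochs = 100
n_batches = 500

for epoch in range(n_epochs):
    # shuffle the cached layer-2 outputs and the labels in the same order
    shuffled_idx = np.random.permutation(len(hidden2_outputs))
    hidden2_batches = np.array_split(hidden2_outputs[shuffled_idx], n_batches)
    y_batches = np.array_split(y_train[shuffled_idx], n_batches)
    for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
        sess.run(training_op, feed_dict={hidden2: hidden2_batch, y: y_batch})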

The last line runs the training operation defined earlier (the one with hidden layers 1 and 2 frozen), feeding it the batch of hidden-layer-2 outputs in place of that layer's output in the graph. Since the output of hidden layer 2 is provided directly, TensorFlow will not try to evaluate it, nor any node it depends on (layers 1 and 2).

 

3. Summary

This article discussed the problems you run into when training deeper DNN models, and how to solve them. Training a large DNN commonly hits the following problems: 1. vanishing and exploding gradients; 2. training efficiency and speed; 3. overfitting. For vanishing gradients we presented the following solutions: initialization strategies (Xavier and He initialization), replacing the activation function (ReLU, its variants, and ELU), and BN (batch normalization). For training efficiency we presented various ways of reusing parts of an existing model, plus the faster optimization methods that the next article will cover. For overfitting there are many regularization techniques available, which the next article will also explain.

 

(To learn more, you are welcome to join the Smart Algorithm community: send "社区" to the "智能算法" public account to join the algorithm WeChat and QQ groups.)

For the code in this article, reply with the keyword: DNN


Source: blog.csdn.net/x454045816/article/details/92140792