We summarize five tips for deep learning:
1. Adaptive Learning Rate
Let's start with the adaptive learning rate, picking up from the gradient descent we have already discussed:
AdaGrad:
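AdaGrad divides the learning rate by the square root of the sum of all past squared gradients, so parameters that have seen large gradients take smaller steps. A minimal scalar sketch on f(w) = w² (the function, learning rate, and step count here are illustrative, not from the original post):

```python
import math

def adagrad_step(w, grad, sum_sq, lr=0.1, eps=1e-8):
    """One AdaGrad update: accumulate all past squared gradients,
    then divide the learning rate by their square root."""
    sum_sq += grad ** 2
    w -= lr * grad / (math.sqrt(sum_sq) + eps)
    return w, sum_sq

# Minimize f(w) = w^2 (gradient 2w), starting from w = 1.0
w, acc = 1.0, 0.0
for _ in range(100):
    w, acc = adagrad_step(w, 2 * w, acc)
```

Note that `acc` only ever grows, so AdaGrad's steps shrink monotonically; this is exactly the weakness RMSProp addresses next.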
Following closely on AdaGrad's heels, we look one step further:
RMSProp
When training a neural network, the error surface can be very complex.
RMSProp is in fact the same idea as AdaGrad, but when computing the denominator it weights the historical average against the new gradient g with a coefficient α.
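The weighting α can be seen directly in code: instead of AdaGrad's ever-growing sum, the denominator is an exponential moving average of squared gradients. A minimal sketch on the same toy problem f(w) = w² (hyperparameter values are illustrative):

```python
import math

def rmsprop_step(w, grad, sigma_sq, lr=0.01, alpha=0.9, eps=1e-8):
    """One RMSProp update: the denominator mixes the historical
    average (weight alpha) with the new squared gradient (1 - alpha)."""
    sigma_sq = alpha * sigma_sq + (1 - alpha) * grad ** 2
    w -= lr * grad / (math.sqrt(sigma_sq) + eps)
    return w, sigma_sq

# Minimize f(w) = w^2 (gradient 2w), starting from w = 1.0
w, s = 1.0, 0.0
for _ in range(200):
    w, s = rmsprop_step(w, 2 * w, s)
```

Because old squared gradients decay by α each step, RMSProp can adapt when the error surface changes, which AdaGrad's monotone accumulator cannot.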
Momentum
How to find the optimal network parameters?
When optimizing the loss, we are most likely to run into three problems:
- Plateaus, where progress becomes very slow
- Local minima (local optimum)
- Saddle points
We can borrow a scene from the physical world: a ball rolling down a mountain can carry enough momentum to escape when it reaches a local low point.
Recall gradient descent: we move in the direction opposite to the gradient.
When we add momentum, the movement is based not just on the current gradient, but also on the previous movement:

movement = λ · (movement of the last step) − η · (present gradient)
Momentum does not guarantee escaping these traps, but it is a big step forward.
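The update rule above can be sketched in a few lines; λ and η below are the usual momentum and learning-rate coefficients, with illustrative values:

```python
def momentum_step(w, grad, movement, lr=0.01, lam=0.9):
    """movement = lam * (movement of last step) - lr * (present gradient);
    the parameter then moves by this amount."""
    movement = lam * movement - lr * grad
    w += movement
    return w, movement

# Minimize f(w) = w^2 (gradient 2w), starting from w = 1.0
w, v = 1.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, 2 * w, v)
```

Like the rolling ball, `movement` keeps part of its previous velocity (the λ term), so it can coast through flat regions and small local dips where the raw gradient alone would stall.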
Adam algorithm
The Adam algorithm combines RMSProp and Momentum to find the optimal solution. It looks complicated, but once you understand RMSProp and Momentum, Adam follows quickly.
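The combination is visible line by line in a sketch: the first moment m is the Momentum part, the second moment v is the RMSProp part, and both are bias-corrected because they start at zero (hyperparameters below are the common defaults, used here for illustration):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: Momentum-style first moment m plus an
    RMSProp-style second moment v, with bias correction."""
    m = b1 * m + (1 - b1) * grad           # Momentum part
    v = b2 * v + (1 - b2) * grad ** 2      # RMSProp part
    m_hat = m / (1 - b1 ** t)              # correct the zero initialization
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 (gradient 2w), starting from w = 1.0
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 301):
    w, m, v = adam_step(w, 2 * w, m, v, t)
```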
2. New activation function
Continuing our five deep learning tips: in this section we talk about the new activation function, ReLU.
We know that in neural-network-based deep learning, the activation function turns linear transformations into nonlinear ones; it is an essential part of what lets a neural network learn. Common activation functions include sigmoid and tanh.
Since Hinton's group introduced ReLU in their CNN for ImageNet 2012, this magical, seemingly linear activation function has come into view and gone on to play a very important role.
So why introduce ReLU? What is wrong with sigmoid and tanh?
The main problem is that deep learning cannot truly go deep:
As the figure shows, accuracy drops sharply once training goes past 8 layers. Why is this?
The main reason is the vanishing gradient problem.
As shown in the figure: with traditional activation functions, a change at the input produces a smaller change at the output, and by the chain rule these per-layer gradient factors multiply. The deeper the network, the smaller the product, and once it is close to 0, the network can no longer learn.
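The chain-rule effect is easy to see numerically: the sigmoid derivative is at most 0.25 (at x = 0), so even in the best case each layer shrinks the backpropagated gradient by at least 4x. A small demonstration (the 10-layer depth is illustrative):

```python
import math

def sigmoid_deriv(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), peaking at 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Multiply one best-case derivative factor per layer through 10 layers.
grad = 1.0
for _ in range(10):
    grad *= sigmoid_deriv(0.0)   # 0.25 per layer, the maximum possible

print(grad)  # 0.25 ** 10, roughly 9.5e-07
```

After only 10 layers the gradient is smaller than one millionth of its original size, which is why the early layers of a deep sigmoid network barely learn.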
So we introduce ReLU. Its features:
1. Fast to compute (the derivative is 1)
2. Biologically motivated (apparently related to brain circuits; I don't know the details)
3. Piecewise linear functions can approximate any function (covered later in deep learning theory)
4. Most importantly: it solves the vanishing gradient problem
ReLU can also simplify the neural network:
Although ReLU looks great (there is a rigorous mathematical proof, which we will cover in depth later), its derivative is 0 when the input is below 0, which is bad for parameter learning. So we introduce ReLU variants: Leaky ReLU and Parametric ReLU; later we will also talk about SELU.
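The difference between plain ReLU and its leaky variant fits in a few lines; the 0.01 slope below is the commonly used default, shown here for illustration (Parametric ReLU simply makes this slope a learned parameter):

```python
def relu(x):
    """max(0, x): derivative 1 for x > 0, but 0 for x < 0."""
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Keeps a small gradient (slope) for x < 0, so negative-input
    neurons can still learn instead of going 'dead'."""
    return x if x > 0 else slope * x

print([relu(x) for x in (-2.0, 0.0, 3.0)])        # [0.0, 0.0, 3.0]
print([leaky_relu(x) for x in (-2.0, 0.0, 3.0)])  # [-0.02, 0.0, 3.0]
```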
Many of the figures and formulas in this column come from Prof. Hung-yi Lee of National Taiwan University and from Stanford's CS229, CS231n, and CS224n courses. Our thanks and respect to these classic courses!
About the author: Wu Qiang, Ph.D., Lanzhou University; Google Developer Expert (GDE, Machine Learning).
CSDN:https://me.csdn.net/dukuku5038
知乎:https://www.zhihu.com/people/Dr.Wu/activities
AI comics public account: DayuAI-Founder