Series Notes | Deep Learning Serialized (4): Optimization Techniques (Part 1)


We can summarize five tips for deep learning:

1. Adaptive Learning Rate

We start with the Adaptive Learning Rate, which we already touched on when discussing Gradient Descent:

AdaGrad:
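The original post gives the update rule as a figure; as a stand-in, here is a minimal NumPy sketch of the standard AdaGrad step (the learning rate `eta` and the small constant `eps` are illustrative choices, not values from the original):

```python
import numpy as np

def adagrad_step(theta, grad, accum, eta=0.1, eps=1e-8):
    """One AdaGrad step: each parameter's learning rate is divided by the
    root of the sum of all of its past squared gradients."""
    accum = accum + grad ** 2                          # accumulate squared gradients
    theta = theta - eta / (np.sqrt(accum) + eps) * grad
    return theta, accum
```

Because `accum` only ever grows, the effective learning rate keeps shrinking over time, which is exactly the limitation RMSProp addresses next.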

Following on from AdaGrad, we look one step further:

RMSProp

When training a neural network, the error surface can be very complex.

RMSProp is in fact the same idea as AdaGrad, but when computing the denominator it weights the historical squared gradients and the new gradient g with a factor α.
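A minimal sketch of that idea, assuming the conventional exponential weighting with a factor `alpha` (the hyperparameter names and defaults here are illustrative):

```python
import numpy as np

def rmsprop_step(theta, grad, sigma_sq, eta=0.001, alpha=0.9, eps=1e-8):
    """One RMSProp step: the denominator is an exponentially weighted average
    of past squared gradients, so old history is gradually forgotten."""
    sigma_sq = alpha * sigma_sq + (1 - alpha) * grad ** 2
    theta = theta - eta / (np.sqrt(sigma_sq) + eps) * grad
    return theta, sigma_sq
```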

Momentum

How do we find the optimal network parameters?

When optimizing the loss, we are very likely to run into the following three problems:

  • Plateaus, where progress is very slow

  • Local minima

  • Saddle points

We can map this to a scene from the physical world: a ball rolling down a mountain may, at a local minimum, still carry enough momentum to escape that local valley.

Recall plain gradient descent: the movement is in the direction opposite to the gradient.

When we take momentum into account, the movement changes:

  • Movement is based not just on the current gradient, but also on the previous movement.

where movement = (movement of the last step) − (gradient at present), each term scaled by its own coefficient (the momentum factor λ and the learning rate η).
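As a sketch of this rule, with vanilla gradient descent shown alongside for contrast (`lam` and `eta` stand for the momentum factor and learning rate; the default values are illustrative):

```python
import numpy as np

def vanilla_step(theta, grad, eta=0.01):
    """Plain gradient descent: move opposite to the gradient."""
    return theta - eta * grad

def momentum_step(theta, grad, velocity, eta=0.01, lam=0.9):
    """Momentum: movement = lam * (movement of last step) - eta * (present gradient)."""
    velocity = lam * velocity - eta * grad
    return theta + velocity, velocity
```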

Momentum does not guarantee an escape from these "dilemmas", but it is a big step forward.

Adam algorithm

The Adam algorithm combines RMSProp and Momentum to find the optimal solution. It looks complicated, but once you understand RMSProp and Momentum, Adam follows quickly.
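As a rough sketch of how the two parts combine, following the commonly published form of Adam (the beta1/beta2/eps defaults are the usual illustrative values, not taken from the original figure):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: m is the Momentum-style average of gradients, v is the
    RMSProp-style average of squared gradients; both are bias-corrected."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction for early steps (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```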

2. New activation function

Of the five tips we summarized for deep learning, this section covers the new activation function, ReLU.


We know that in neural-network-based deep learning, the activation function turns a linear transformation into a nonlinear one; it is an essential part of what allows a neural network to learn anything at all. Commonly used activation functions include sigmoid, tanh, and so on.

Since Hinton introduced ReLU in his CNN for ImageNet in 2012, this magical, seemingly linear activation function has entered our field of view and has played a very important role ever since.

So why introduce ReLU? What is wrong with the sigmoid and tanh functions?

The biggest problem is that deep learning cannot truly go deep:

As the figure shows, once we train more than 8 layers, accuracy drops sharply. Why is that?

The main reason is the Vanishing Gradient Problem.

As the figure shows, with traditional activation functions a change at the input produces a smaller change at the output. By the chain rule, the deeper the network, the smaller the product of these gradient factors becomes; once it is close to 0, the network can no longer learn.
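A quick numeric illustration of why the product shrinks (this toy calculation is mine, not from the original; it relies only on the fact that the sigmoid derivative is at most 0.25):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The sigmoid derivative peaks at 0.25 (at x = 0). By the chain rule, the
# gradient reaching the early layers is roughly a product of one such factor
# per layer, so it shrinks geometrically with depth.
d = sigmoid(0.0) * (1.0 - sigmoid(0.0))   # = 0.25
for layers in (2, 4, 8, 16):
    print(layers, "layers -> gradient factor ~", d ** layers)
# 8 layers: ~1.5e-05, 16 layers: ~2.3e-10 -- the signal has all but vanished.
```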

So we introduce ReLU, whose characteristics are:

1. Fast to compute (the derivative is 1 in the positive region)

2. Biologically motivated (apparently related to how brain circuits fire; we will not go into detail)

3. Piecewise linear functions can approximate any function (to be covered later in the deep learning theory posts)

4. Most importantly: it alleviates the vanishing gradient problem

ReLU can also simplify the neural network:

Although ReLU looks very good (there is rigorous mathematical justification, which we will dig into later), its derivative is 0 for inputs below 0, which is bad for learning those parameters. So we introduce variants of ReLU: Leaky ReLU and Parametric ReLU, and later we will also discuss SELU.
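For reference, a minimal NumPy sketch of ReLU and the two variants mentioned (the 0.01 slope in Leaky ReLU is the commonly used default, chosen here purely for illustration):

```python
import numpy as np

def relu(x):
    """ReLU: identity for positive inputs, 0 otherwise (derivative 1 or 0)."""
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: a small fixed slope below 0 keeps the gradient non-zero."""
    return np.where(x > 0, x, slope * x)

def prelu(x, a):
    """Parametric ReLU: like Leaky ReLU, but the slope a is a learned parameter."""
    return np.where(x > 0, x, a * x)
```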

Many of the figures and formulas in this column come from Prof. Hung-yi Lee of National Taiwan University and from Stanford's CS229, CS231n, and CS224n courses. Our thanks and respect go to these classic courses!

About the author: Wu Qiang, Ph.D. from Lanzhou University, Google Developer Expert (GDE, Machine Learning).

CSDN:https://me.csdn.net/dukuku5038 

Zhihu: https://www.zhihu.com/people/Dr.Wu/activities

"AI in Comics" WeChat official account: DayuAI-Founder

Series notes:

Series Notes | Deep Learning Serialized (1): Neural Networks

Series Notes | Deep Learning Serialized (2): Gradient Descent

Series Notes | Deep Learning Serialized (3): Backpropagation



