How to get started with deep learning?


There is a lot of material about deep learning on the Internet, but most of it does not seem suitable for beginners. Here are a few reasons: 1. Deep learning does require some mathematical foundation. If it is not explained in a simple way, some readers are intimidated by the difficulty and give up early. 2. Books and articles written by Chinese or American authors are generally harder to follow. I don't know why, but that is my impression.

Deep learning does require some mathematical foundation, but is it really that difficult? Not really. Don't believe me? Let me walk you through it. After reading this, you will also feel that it is not that difficult.

This article is aimed at beginners; experts can skip it. If anything is wrong, corrections are welcome.

First, I recommend a very good resource: "Understanding Deep Learning in 1 Day", a slide deck of more than 300 pages by Professor Li Hongyi from Taiwan. It is no exaggeration to say that it is the most systematic and easy-to-understand introduction to deep learning I have ever read.

Here is the link to slideshare: http://www.slideshare.net/tw_dsconf/ss-62245351?qid=108adce3-2c3d-4758-a830-95d0a57e46bc&v=&b=&from_search=3

Readers without a VPN can download it from my cloud drive: link: Password: 3mty

As for what to prepare: I personally think you only need to know derivatives and the related concept of a function. Never studied advanced mathematics? Good. I want liberal arts students to be able to follow this too, so junior high school mathematics is enough.

There is really no need to be afraid of the difficulty. Personally, I admire Li Shufu's spirit. In a TV interview he said: Who says Chinese people can't build cars? What's so hard about building a car? Isn't it just four wheels and two rows of sofas? His conclusion is of course one-sided, but his spirit is commendable.

What is a derivative? It is nothing more than a rate of change. Wang Xiaoer sold 100 pigs this year, 90 last year, and 80 the year before. What is the rate of change, or growth rate? Simple: 10 more pigs per year. Note that there is a time variable here, the year. The growth rate of Wang Xiaoer's pig sales is 10 per year, that is, the derivative is 10. As a function, y = f(x) = 10x + 30, where we assume Wang Xiaoer sold 30 pigs in the first year and 10 more in every following year; x is time (in years) and y is the number of pigs. This is the case where the growth rate is fixed. In real life the amount of change is often not fixed, so the growth rate is not constant. For example, the function might be y = f(x) = 5x² + 30, where x and y still stand for time and number of pigs but the growth rate now changes over time. How to calculate that growth rate is something we will talk about later; alternatively, you can simply memorize a few differentiation formulas.
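If you would rather see the growth rate than memorize formulas, you can estimate it numerically. The sketch below is only an illustration: the function names and the step size `h` are my own choices, not something from the slides. It nudges x a tiny bit in both directions and measures how much y changes. For 10x + 30 the rate stays at 10, while for 5x² + 30 it comes out as 10x, so it grows year by year.

```python
def pigs_linear(x):
    """Pigs sold in year x when sales grow by a fixed 10 per year."""
    return 10 * x + 30


def pigs_quadratic(x):
    """Pigs sold in year x when the growth itself speeds up."""
    return 5 * x ** 2 + 30


def growth_rate(f, x, h=1e-6):
    """Approximate the derivative of f at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


for year in (1, 2, 3):
    linear = growth_rate(pigs_linear, year)
    quadratic = growth_rate(pigs_quadratic, year)
    print(f"year {year}: linear rate ~ {linear:.3f}, quadratic rate ~ {quadratic:.3f}")
```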

Deep learning relies on another important mathematical concept: the partial derivative. How should we understand it? (The Chinese term invites puns about one-sided headaches and stubbornness, but it has nothing to do with either.) Let's stay with Wang Xiaoer selling pigs. As mentioned above, the variable x is time (years), but the number of pigs sold does not depend on time alone. As the business grew, Wang Xiaoer expanded the pig farm and hired many employees to raise pigs. So the equation changes again: y = f(x₁, x₂, x₃) = 5x₁² + 8x₂ + 35x₃ + 30, where x₁ is time, x₂ is the farm area, and x₃ is the number of employees. As we said, a derivative is a rate of change. A partial derivative is simply the rate of change with respect to one variable when there are several. Taking the partial derivative with respect to x₃ asks how much each additional employee contributes to the number of pigs sold. Here the answer is 35: each additional employee means 35 more pigs sold. When computing a partial derivative, the other variables are treated as constants; this is very important. The rate of change of a constant is 0, so those terms contribute 0 and only the derivative of 35x₃ remains, which is 35. The partial derivative with respect to x₂ works the same way. We write the partial derivative with the symbol ∂: for example, ∂y/∂x₃ is the partial derivative of y with respect to x₃.
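The "treat the other variables as constants" rule is easy to check by hand. Here is a minimal sketch (the helper `partial` and the sample point are made up for illustration): it changes only one input at a time and holds the others still. The rates with respect to x₂ and x₃ come out as 8 and 35 wherever you measure them, while the rate with respect to x₁ is 10x₁ and so depends on the year.

```python
def pigs(x1, x2, x3):
    """y = 5*x1^2 + 8*x2 + 35*x3 + 30 (time, farm area, employees)."""
    return 5 * x1 ** 2 + 8 * x2 + 35 * x3 + 30


def partial(f, args, index, h=1e-6):
    """Approximate the partial derivative of f with respect to args[index].

    Only the chosen argument is nudged; all others stay fixed.
    """
    up = list(args)
    down = list(args)
    up[index] += h
    down[index] -= h
    return (f(*up) - f(*down)) / (2 * h)


point = (3.0, 10.0, 4.0)  # hypothetical: year 3, area 10, 4 employees
for i, name in enumerate(("x1", "x2", "x3")):
    print(f"dy/d{name} ~ {partial(pigs, point, i):.3f}")
```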

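After all this talk, what does any of it have to do with deep learning? Quite a lot. Deep learning uses neural networks to solve problems that are not linearly separable. We will come back to that point later, and you can also search online for related articles. Here I mainly want to talk about the relationship between mathematics and deep learning. First, a few figures: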

Figure 1. So-called deep learning is a neural network with many hidden layers.


Figure 2. How to compute the partial derivatives when there is a single output.

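Figure 3. How to compute the partial derivatives when there are multiple outputs. The last two figures are in Japanese; they come from a Japanese book on deep learning that I found well written, so I borrowed them. The terms 入力层, 出力层, and 中间层 correspond to the input layer, output layer, and hidden layer.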

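Don't be scared by these figures; it is actually very simple. Let me give another example, this time dating. A romantic relationship can be roughly divided into three stages: 1. First attraction. This corresponds to the input layer of deep learning. Someone attracts you because of many factors, such as height, figure, looks, education, and personality. These are the input-layer parameters, and everyone may weight them differently. 2. The passionate phase. Let's map it to the hidden layers. During this period the two of you go through all kinds of adjustment to each other, down to the daily necessities. 3. The stable phase. This corresponds to the output layer: whether you are a good match depends on how well the adjustment went.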

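Everyone knows that this mutual adjustment matters. How does it work? It is a continuous process of learning, training, and correction. For example, your girlfriend likes strawberry cake and you buy blueberry; her feedback is negative, so next time you skip the blueberry and buy strawberry.

------------------------------------------------------------------------------------------------

Having read this, some guys may be about to start tuning parameters on their own girlfriends. That worries me a little, so here is an addendum. Dating, like deep learning, has to avoid both underfitting and overfitting. In deep learning, underfitting means the model has not been trained enough or there is not enough data. It is like lacking dating experience: you need to learn more. Sending flowers is of course the bare minimum, and you also need to improve in other ways, for example by being funnier in conversation. Since dating is not the focus of this article, I won't go into detail. One point does need mentioning: underfitting is bad, but overfitting is even worse. Overfitting is the opposite of underfitting. For one thing, if you overfit, she will think you have the makings of an Edison Chen (陈冠希). More importantly, every person is different. Just as in deep learning, the results are great on the training set but poor on the test set! In dating terms, she will feel that you are heavily shaped by your ex (the training set), which is a big taboo. If you give her that impression, you will have trouble later. Remember that!

------------------------------------------------------------------------------------------------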

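Deep learning is also a process of continuous adjustment. You start with a set of standard parameters (these are empirical values, much like the rule that you must send flowers on Valentine's Day and birthdays) and then keep correcting them until you obtain the weights between the nodes in Figure 1. Why adjust like this? Imagine deep learning as a child: how do we teach the child to recognize pictures? We have to show the pictures and tell the child the correct answers. It takes many pictures and repeated teaching and training, and this training process is essentially the process of solving for the weights of the neural network. Later, at test time, you only need to show the child a picture and the child knows what is in it.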

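So the training set is the set of pictures, with correct answers attached, that we show the child. For deep learning, the training set is used to solve for the weights of the neural network, which finally gives the model; the test set is used to verify the accuracy of the model.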

For a model that has already been trained, as in the figures below, the weights (w1, w2, ...) are all known.

Figure 4

Figure 5

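As shown above, computing from left to right is easy. But what about the reverse? As mentioned, the training set contains pictures together with the expected correct answers. How do we work backwards from them to find w1, w2, ...?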
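To make the "left to right" direction concrete, here is a minimal sketch of a forward pass. The network shape (2 inputs, 2 hidden nodes, 1 output) and every weight value are invented for illustration and are not the numbers from Figures 4 and 5. Activation functions are left out at this point, so each node is just a weighted sum plus a bias.

```python
def layer(inputs, weights, biases):
    """Each output node is a weighted sum of all inputs plus its bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical trained weights: one row of weights per node.
hidden_w = [[0.5, -1.0], [1.5, 2.0]]
hidden_b = [0.0, 1.0]
output_w = [[1.0, -0.5]]
output_b = [0.5]


def forward(x):
    """Compute the network output from left to right."""
    hidden = layer(x, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


print(forward([1.0, 2.0]))  # [-4.25]
```

With the weights known, the output is plain arithmetic. The hard part, discussed next, is finding the weights when only the inputs and the expected answers are given.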

After this long detour, it is finally time for partial derivatives to take the stage. The current situation is:

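1. We assume a neural network has already been defined: how many layers it has, what type each is, how many nodes each layer has, which activation function (discussed later) is used, and so on. There is no way around this; there has to be an initial setup (most frameworks are define-and-run, and some are define-by-run). The woman you fall for did not just come out of the womb either; she comes with all sorts of default settings. As for how to tune them, that takes partial derivatives.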

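2. We know the correct answer, for example r in Figures 2 and 3. During training we compute from left to right and obtain the result y, and r and y generally differ. The gap between them is E in Figures 2 and 3. How is this gap computed? Direct subtraction is one way, especially when there is only one output, as in Figure 2. Often, though, there are several outputs, as in Figure 3, and the gap is then generally computed with an error function over all of them. Other evaluation methods exist as well; they use different functions but serve a similar purpose.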
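The formula from the original post was an image and is not reproduced here. One common choice, which I am assuming rather than quoting from the figures, is half the sum of squared differences between the expected answers r_k and the outputs y_k over all output nodes k:

```latex
E = \frac{1}{2} \sum_{k} \left( r_k - y_k \right)^2
```

The factor 1/2 is only a convenience: it cancels the 2 that appears when the square is differentiated, so that ∂E/∂y_k = y_k − r_k.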

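Admittedly, there is always a gap between the ideal and reality. We naturally want the gap to be as small as possible. How do we make it smaller and smaller? By adjusting the parameters, because once the input (the image) is fixed, only adjusting the parameters can change the output. How do we adjust? As mentioned, each parameter has a default value. We add a small amount ∆ to each parameter and look at the result. If increasing the parameter makes the gap larger, the parameter has to be decreased instead, because our goal is to shrink the gap; and vice versa. So to tune the parameters to their best values, we need to know the rate of change of the error with respect to each parameter, which is exactly the partial derivative of the error with respect to that parameter.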
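The "add a little ∆ and see what happens" idea can be tried on the smallest possible network: one weight, one input, one output. Everything in this sketch is an assumption made for illustration, including the squared-error gap, the sample values x = 2 and r = 10, and the function names.

```python
def error(w, x=2.0, r=10.0):
    """Gap between the expected answer r and the output y = w * x."""
    y = w * x
    return 0.5 * (r - y) ** 2


def nudge_direction(w, delta=0.01):
    """Return +1 if increasing w by delta shrinks the error, otherwise -1."""
    if error(w + delta) < error(w):
        return 1
    return -1


def error_rate(w, delta=0.01):
    """Rough rate of change of the error with respect to w."""
    return (error(w + delta) - error(w)) / delta


for w in (3.0, 7.0):
    print(f"w = {w}: direction {nudge_direction(w):+d}, rate ~ {error_rate(w):.2f}")
```

When w is too small the rate is negative, so w should grow; when w is too large the rate is positive, so w should shrink. The sign of the rate already tells you which way to move, which is why the partial derivative is the quantity we are after.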

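The key is how to compute the partial derivatives. Figures 2 and 3 each show the derivation, and it is simple: take the partial derivatives one by one from right to left. Between adjacent layers this is easy because the relationship is linear, so the partial derivative is just the weight itself, much like the partial derivative with respect to x₃ earlier. Then multiply the individual partial derivatives together (this is the chain rule).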
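Here is a minimal sketch of "from right to left, then multiply" on a chain of two weights with no activation function. The squared-error gap E = 0.5 (r − y)² and all names and numbers are assumptions for illustration; this is not the derivation from Figures 2 and 3. The result is compared against a brute-force nudge to check that the multiplication gives the right answer.

```python
def forward(x, w1, w2):
    """Input -> hidden node -> output node, both purely linear."""
    h = w1 * x
    y = w2 * h
    return h, y


def error(x, r, w1, w2):
    """Assumed gap: E = 0.5 * (r - y)^2."""
    _, y = forward(x, w1, w2)
    return 0.5 * (r - y) ** 2


def gradients(x, r, w1, w2):
    """Partial derivatives of E, built from right to left."""
    h, y = forward(x, w1, w2)
    dE_dy = y - r        # derivative of 0.5 * (r - y)^2 with respect to y
    dE_dw2 = dE_dy * h   # y = w2 * h, so dy/dw2 = h
    dE_dh = dE_dy * w2   # dy/dh is just the weight w2 itself
    dE_dw1 = dE_dh * x   # h = w1 * x, so dh/dw1 = x
    return dE_dw1, dE_dw2


def numeric_gradients(x, r, w1, w2, eps=1e-6):
    """Brute-force check: nudge each weight and measure the change in E."""
    d1 = (error(x, r, w1 + eps, w2) - error(x, r, w1 - eps, w2)) / (2 * eps)
    d2 = (error(x, r, w1, w2 + eps) - error(x, r, w1, w2 - eps)) / (2 * eps)
    return d1, d2


print(gradients(2.0, 1.0, 0.5, 3.0))          # (12.0, 2.0)
print(numeric_gradients(2.0, 1.0, 0.5, 3.0))  # approximately the same
```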

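There are two points to note here. The first is the activation function. Its main purpose is to give the whole network nonlinear behavior. As mentioned earlier, in many cases a linear function cannot classify the inputs properly (recognition is often mostly classification), so the network has to learn a nonlinear function. That requires an activation function: because it is itself nonlinear, it makes the whole network nonlinear. In addition, activation functions such as sigmoid keep each node's output within a bounded range, which also makes computation easier.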

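That explanation may still not be very accessible, so let me use the dating analogy again. Women don't like a life as plain as boiled water, because that is linear; life needs some romance. To me the activation function is like the little romantic gestures and surprises in life. At every stage of a relationship you need to activate things from time to time with a little romance or a small surprise. For example, many women can't walk past a cute little cup or a piece of porcelain, so give her one in a special design on her birthday and move her to tears. Earlier I said a man should be humorous, which is to make her laugh; at the right moments you should also move her to tears. A few rounds of laughing and crying and she won't be able to leave you, because your nonlinearity is just too strong.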

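Of course, too much is as bad as too little. More surprises are not always better, but with none at all life turns into plain water. Likewise, every layer can have an activation function, and not every layer necessarily needs one, but having none at all will not work.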

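Because of the activation function, it also has to be included when computing the partial derivatives. The activation function is usually sigmoid, though ReLU and others can also be used. Differentiating the activation function is also very simple: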

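For sigmoid, f(x) = 1/(1 + e^(-x)), the derivative is f'(x) = f(x)[1 - f(x)]. If you have time, look it up in a calculus textbook; if not, just memorize it. ReLU is even simpler: y = 0 when x < 0, and y = x otherwise. You can also define your own ReLU variant, for example one with y = 0.01x when x < 0.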
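These functions are short enough to write out in full. The sketch below uses only the standard `math` module; the function names and the default slope of 0.01 are my own choices. The "leaky" variant here applies the small slope to negative inputs and leaves positive inputs unchanged.

```python
import math


def sigmoid(x):
    """f(x) = 1 / (1 + e^(-x)); the output always lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """f'(x) = f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    """0 for negative inputs, x otherwise."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of 0."""
    return x if x >= 0 else slope * x


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), sigmoid_derivative(x), relu(x), leaky_relu(x))
```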

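The second is the learning coefficient (learning rate). Why is it called that? We talked about the increment ∆ above: how large should each adjustment be? Should it simply equal the partial derivative (the rate of change)? Experience tells us that the partial derivative should be multiplied by a percentage, and that percentage is the learning coefficient. Moreover, the coefficient can change as training progresses.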
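Putting the pieces together, here is a minimal sketch of training a single weight. It assumes a one-weight model y = w * x and a squared-error gap, and the parameter names (`learning_rate`, `decay`, `steps`) and their values are illustrative rather than recommended settings. Each step moves w against the partial derivative, scaled by the learning coefficient.

```python
def train(x, r, w=0.0, learning_rate=0.1, decay=1.0, steps=50):
    """Repeatedly adjust w so that y = w * x approaches the answer r."""
    for _ in range(steps):
        y = w * x
        gradient = (y - r) * x          # dE/dw for E = 0.5 * (r - y)^2
        w -= learning_rate * gradient   # step against the rate of change
        learning_rate *= decay          # decay < 1 shrinks the steps over time
    return w


w = train(x=2.0, r=10.0)
print(w, w * 2.0)  # w ends up very close to 5, so y is very close to 10
```

If the learning coefficient is too large for the problem, the steps overshoot and the error grows instead of shrinking; if it is too small, training takes many more steps. That is why it is treated as something to tune.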

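Of course, there are other important basics, such as SGD (stochastic gradient descent), mini-batches, and epochs (which concern how the training set is fed in). For reasons of space, I'll save them for another time. Li Hongyi's slides cover all of this anyway.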

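This humble article is a supplement to another answer of mine: "Must-read books and papers for getting started with deep learning? What essential skills should one learn?" - answer by jacky yang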

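What I described above is mainly about how to adjust parameters, which is the elementary stage. As mentioned, before tuning there is already a default network model and default parameters. How to define that initial model and those parameters requires deeper study. For ordinary engineering work, though, tuning parameters on a default network is enough, which amounts to using an algorithm. Scholars and scientists invent the algorithms, which is considerably harder. Hats off to them!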

This took a lot of effort to write, so if you found it helpful, please give it a like :)

------------------------------------------------------------------------------------------------

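As for the derivation of the partial derivatives, I will find time as soon as I can to describe the formulas in detail in plain language. I have been busy lately, sorry :)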
