Development History
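Drawbacks of standard gradient descent
If the learning rate is chosen poorly, gradient descent misbehaves: too small a rate converges very slowly, while too large a rate overshoots the minimum and may oscillate or diverge.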
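The effect of the learning rate can be checked on a small case. The sketch below is only an illustration: the function y = x^2, the starting point 4, the 20 steps, and the three rates are assumptions chosen for the demonstration, not values from this article's figure.

```python
def gradient_descent(lr, x0=4.0, steps=20):
    """Plain gradient descent on y = x**2, whose derivative is 2*x."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

x_small = gradient_descent(lr=0.001)  # too small: x has barely moved
x_good = gradient_descent(lr=0.25)    # reasonable: x is very close to 0
x_large = gradient_descent(lr=1.1)    # too large: |x| grows every step
print(x_small, x_good, x_large)
```

With the small rate the iterate is still near the start, with the moderate rate it has essentially reached the minimum at 0, and with the large rate it has moved further away from 0 than where it began.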
This motivates methods that adjust the learning rate automatically. In general, the learning rate should shrink as the number of iterations grows: after many iterations the current solution should already be fairly close to the optimum, so the step size η ought to be reduced. One option is a decay schedule that makes η a decreasing function of the iteration count. Even then, every parameter is still updated with the same learning rate, and a single rate cannot suit all parameters.
The solution: use different learning rates for different parameters.
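To make the limitation of a single decaying rate concrete, here is a minimal sketch. The schedule η0 / sqrt(t + 1) and the two gradient values are assumptions picked for illustration; the article does not preserve the formula it originally used.

```python
import math

def decayed_learning_rate(eta0, t):
    """Hypothetical schedule: eta0 / sqrt(t + 1), shrinking as t grows."""
    return eta0 / math.sqrt(t + 1)

eta0 = 0.25
rates = [decayed_learning_rate(eta0, t) for t in range(5)]
print(rates)

# Two parameters with very different gradient scales still share one rate,
# so their step sizes differ by the same factor as their gradients.
grads = [100.0, 0.01]
steps = [rates[0] * g for g in grads]
print(steps)
```

The rate falls with every iteration, yet at any given iteration both parameters are scaled by the same number, which is the problem that per-parameter learning rates address.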
Adagrad method
Consider a function f(x) of N variables. For a single variable x, the Adagrad gradient descent iteration is
x(n) = x(n-1) - η / sqrt( h(x(0))^2 + h(x(1))^2 + ... + h(x(n-1))^2 + ε ) · h(x(n-1))
where h(x) denotes the partial derivative of f with respect to x.
As can be seen, Adagrad adjusts the step adaptively from the gradient history (hence "adaptive gradient"): the learning rate is divided by the square root of the sum of the squared partial derivatives from all iterations so far, plus a constant ε.
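The per-parameter behaviour can be sketched as follows. This is an illustration under stated assumptions: the two-variable function f = x0^2 + 100·x1^2 and the helper names `adagrad_step` and `grad_f` are hypothetical, while η = 0.25 and ε = 0.1 are taken from the example in this section.

```python
import math

def adagrad_step(x, grad, sum_sq, eta=0.25, eps=0.1):
    """One Adagrad update applied independently to every coordinate."""
    new_x, new_sum_sq = [], []
    for xi, gi, si in zip(x, grad, sum_sq):
        si = si + gi ** 2                      # accumulate squared gradient
        new_sum_sq.append(si)
        new_x.append(xi - eta / math.sqrt(si + eps) * gi)
    return new_x, new_sum_sq

def grad_f(x):
    """Gradient of the hypothetical test function f = x0**2 + 100 * x1**2."""
    return [2.0 * x[0], 200.0 * x[1]]

x = [4.0, 4.0]
sum_sq = [0.0, 0.0]
x, sum_sq = adagrad_step(x, grad_f(x), sum_sq)
effective_rates = [0.25 / math.sqrt(s + 0.1) for s in sum_sq]
print(x)
print(effective_rates)
```

Although the second gradient is 100 times larger than the first, both coordinates move by roughly the same amount on the first step, because each one is divided by its own accumulated gradient magnitude.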
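Example: use the Adagrad algorithm to find the minimum point of y = x^2
Derivative: h(x) = 2x
Initialize x(0) = 4, learning rate η = 0.25, ε = 0.1
Iteration ①: x(1) = 4 - 0.25 / sqrt(8^2 + 0.1) × 8 ≈ 3.750
Iteration ②: x(2) = 3.750 - 0.25 / sqrt(8^2 + 7.500^2 + 0.1) × 7.500 ≈ 3.579
Iteration ③: x(3) = 3.579 - 0.25 / sqrt(8^2 + 7.500^2 + 7.159^2 + 0.1) × 7.159 ≈ 3.443
The solution process is shown in the figure; the corresponding code is: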
```python
from matplotlib import pyplot as plt
import numpy as np

def f(x):
    return x**2

def h(x):
    # Derivative of f
    return 2*x

# Plot the curve y = x^2
fig, ax = plt.subplots()
xs = np.arange(-4, 4, 0.025)
ax.plot(xs, f(xs))
ax.set_title("y = x^2")

η = 0.25              # learning rate
ε = 0.1               # constant added under the square root
x = 4                 # starting point x(0)
sum_square_grad = 0   # running sum of squared gradients
X = []
Y = []

for iters in range(1, 13):
    X.append(x)
    Y.append(f(x))
    sum_square_grad += h(x)**2
    x = x - η/np.sqrt(sum_square_grad + ε)*h(x)
    print(iters, x)

# Mark and label each iterate on the curve
ax.plot(X, Y, "ro")
for xi, yi in zip(X, Y):
    ax.text(xi, yi, "({:.3f},{:.3f})".format(xi, yi), color='red')
plt.show()
```
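RMSprop method
Late in the iteration, Adagrad's learning rate becomes very small, so it may struggle to reach a useful solution. To address this, RMSprop makes a small change to Adagrad: it uses a weighted average, replacing the accumulated squared gradient with an averaged squared gradient, which removes the problem of the learning rate being too small in later iterations (this is similar to momentum gradient descent).
Consider a function f(x) of N variables. For a single variable x, the RMSprop gradient descent iteration is
g(n) = ρ · g(n-1) + (1 - ρ) · h(x(n-1))^2
x(n) = x(n-1) - η / sqrt( g(n) + ε ) · h(x(n-1))
where h(x) denotes the partial derivative of f with respect to x and g is the running average of the squared gradient.
As can be seen, the denominator no longer grows without bound. It gives most weight to the gradients closest to the current step (the effect of exponential decay), so the slow late-stage convergence of Adagrad does not occur.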
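The difference between the two denominators can be seen with a constant gradient. The sketch below is an illustration only: the constant gradient of 2 and the 100 steps are assumptions, while ρ = 0.9, ε = 10^-10 and the initial average of 1 follow the example in this section.

```python
import math

grad = 2.0            # assumed constant gradient, for illustration
rho, eps = 0.9, 1e-10
sum_sq = 0.0          # Adagrad: accumulated squared gradient
avg_sq = 1.0          # RMSprop: averaged squared gradient, g(0) = 1

adagrad_denoms, rmsprop_denoms = [], []
for _ in range(100):
    sum_sq += grad ** 2
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    adagrad_denoms.append(math.sqrt(sum_sq + eps))
    rmsprop_denoms.append(math.sqrt(avg_sq + eps))

print(adagrad_denoms[-1], rmsprop_denoms[-1])
```

After 100 steps the Adagrad denominator has grown to about 20 and keeps growing, whereas the RMSprop denominator has settled near the gradient magnitude of 2, so the effective learning rate stops shrinking.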
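Example: use the RMSprop algorithm to find the minimum point of y = x^2
Derivative: h(x) = 2x
Initialize g(0) = 1, x(0) = 4, ρ = 0.9, η = 0.01, ε = 10^-10
Iteration ①: g(1) = 0.9 × 1 + 0.1 × 8^2 = 7.3, x(1) = 4 - 0.01 / sqrt(7.3 + 10^-10) × 8 ≈ 3.970
Iteration ②: g(2) = 0.9 × 7.3 + 0.1 × 7.941^2 ≈ 12.876, x(2) = 3.970 - 0.01 / sqrt(12.876 + 10^-10) × 7.941 ≈ 3.948
The solution process is shown in the figure below.
The corresponding code is: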
```python
from matplotlib import pyplot as plt
import numpy as np

def f(x):
    return x**2

def h(x):
    # Derivative of f
    return 2*x

# Plot the curve y = x^2
fig, ax = plt.subplots()
xs = np.arange(-4, 4, 0.025)
ax.plot(xs, f(xs))
ax.set_title("y = x^2")

g = 1       # running average of squared gradients, g(0)
x = 4       # starting point x(0)
ρ = 0.9     # decay rate of the average
η = 0.01    # learning rate
ε = 1e-10   # 10^-10, as in the text (10e-10 would be 10^-9)
X = []
Y = []

for iters in range(1, 13):
    X.append(x)
    Y.append(f(x))
    g = ρ*g + (1-ρ)*h(x)**2
    x = x - η/np.sqrt(g + ε)*h(x)
    print(iters, x)

# Mark and label each iterate on the curve
ax.plot(X, Y, "ro")
for xi, yi in zip(X, Y):
    ax.text(xi, yi, "({:.3f},{:.3f})".format(xi, yi), color='red')
plt.show()
```
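Momentum method
Momentum means what the name suggests. Picture a cart rushing down a steep slope: it does not stop at the lowest point, because it still carries momentum and keeps moving forward, and it can even get over some small hills. If it meets a larger slope it may fail to climb it, roll back, swing back and forth a few times, and finally come to rest at the bottom of the valley.
Gradient descent without momentum, by contrast, may stop at the first suboptimal solution it reaches.
The most intuitive reading is this: if the current gradient direction agrees with the accumulated historical gradient direction, the current gradient is reinforced and this step descends further. If the current gradient direction disagrees with the accumulated direction, the size of the current descent step is reduced.
The figure shows this: when the ball reaches point A, the negative gradient direction (red arrow) points toward negative x, but the momentum direction (green arrow) points toward positive x and is longer than the red arrow, so at A the ball still moves toward positive x.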
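The Momentum method is introduced formally below.
Consider a function f(x) of N variables. For a single variable x, the Momentum gradient descent iteration is
v(n) = α · v(n-1) - η · h(x(n-1))
x(n) = x(n-1) + v(n)
where h(x) denotes the partial derivative of f with respect to x.
v is the momentum, initialized to v = 0.
α is a number close to 1, usually set to 0.9, which scales the previous momentum down to 0.9 times its value.
η is the learning rate.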
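How momentum reinforces consistent gradients and damps conflicting ones can be sketched as follows. The helper name `momentum_steps` and the two gradient sequences are assumptions made for illustration; η = 0.05 and α = 0.9 follow the values used in this section.

```python
def momentum_steps(grads, eta=0.05, alpha=0.9):
    """Return the momentum v after each gradient in the sequence."""
    v = 0.0
    history = []
    for g in grads:
        v = alpha * v - eta * g   # decay old momentum, add the new gradient step
        history.append(v)
    return history

same_direction = momentum_steps([1.0, 1.0, 1.0, 1.0, 1.0])
alternating = momentum_steps([1.0, -1.0, 1.0, -1.0, 1.0])
print(same_direction)
print(alternating)
```

When every gradient points the same way, the step size grows from 0.05 to roughly 0.2 over five steps. When the gradient sign flips each step, the contributions largely cancel and no step exceeds the plain gradient step of 0.05.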
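The following example demonstrates this by finding the minimum point of y = 2*x^4-x^3-x^2.
Starting the iteration from x = -0.8, momentum carries the iterate past the first suboptimal solution (the local minimum near x ≈ -0.35). It then overshoots the optimal solution (near x ≈ 0.72) but cannot get out of that valley, turns back, and eventually converges to the optimal solution. The corresponding code is: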
```python
from matplotlib import pyplot as plt
import numpy as np

def f(x):
    return 2*x**4 - x**3 - x**2

def h(x):
    # Derivative of f
    return 8*x**3 - 3*x**2 - 2*x

# Plot the curve
fig, ax = plt.subplots()
xs = np.arange(-0.8, 1.2, 0.025)
ax.plot(xs, f(xs))
ax.set_title("y = 2*x^4-x^3-x^2")

η = 0.05   # learning rate
α = 0.9    # momentum coefficient
v = 0      # momentum, initially 0
x = -0.8   # starting point
X = []
Y = []

for iters in range(1, 13):
    X.append(x)
    Y.append(f(x))
    v = α*v - η*h(x)
    x = x + v
    print(iters, x)

# Draw the path of the iterates on top of the curve
ax.plot(X, Y)
plt.show()
```
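The listing above runs only 12 iterations, so a longer run helps check where each method settles. The sketch below is an assumption-laden check rather than part of the original example: the helper `run`, the 300 steps, and the use of α = 0 to stand for plain gradient descent are all choices made here.

```python
def h(x):
    """Derivative of y = 2*x**4 - x**3 - x**2."""
    return 8 * x**3 - 3 * x**2 - 2 * x

def run(alpha, eta=0.05, x0=-0.8, steps=300):
    """Momentum gradient descent; alpha = 0 reduces to plain gradient descent."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = alpha * v - eta * h(x)
        x = x + v
    return x

x_plain = run(alpha=0.0)      # no momentum
x_momentum = run(alpha=0.9)   # momentum as in the example above
print(x_plain, x_momentum)
```

The stationary points of this function solve x(8x^2 - 3x - 2) = 0, giving minima at x = (3 ± sqrt(73)) / 16, about -0.347 and 0.722. Without momentum the run stays in the left valley; with momentum it ends in the deeper right one.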
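Adam method
Consider a function f(x) of N variables. For a single variable x, the Adam gradient descent iteration is
① m(n) = β1 · m(n-1) + (1 - β1) · h(x(n-1))
② v(n) = β2 · v(n-1) + (1 - β2) · h(x(n-1))^2
③ m_hat(n) = m(n) / (1 - β1^n)
④ v_hat(n) = v(n) / (1 - β2^n)
⑤ x(n) = x(n-1) - η / sqrt( v_hat(n) + ε ) · m_hat(n)
where h(x) denotes the partial derivative of f with respect to x. These equations follow the code in this section, which places ε inside the square root.
The five equations are explained in turn below:
Equation ① keeps an exponentially decaying average of the gradient, in the spirit of Momentum.
Equation ② borrows the exponential decay of RMSprop.
Equations ③ and ④ correct the bias: because m and v start at 0, the early averages are biased toward 0, and dividing by 1 - β^n compensates for this.
Equation ⑤ performs the parameter update.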
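The effect of the bias correction can be checked numerically. The sketch below assumes a constant gradient of 8 (the gradient of y = x^2 at x = 4) purely for illustration; β1 = 0.9 is the value used in this section.

```python
beta1 = 0.9
grad = 8.0   # assumed constant gradient, for illustration

m = 0.0
raw, corrected = [], []
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * grad      # first-moment average, starts at 0
    raw.append(m)
    corrected.append(m / (1 - beta1 ** t))  # bias-corrected estimate
print(raw)
print(corrected)
```

The raw average starts at about 0.8 and only slowly approaches 8, while the corrected estimate equals the true gradient of 8 from the first step onward (up to floating-point rounding).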
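Example: use the Adam algorithm to find the minimum point of y = x^2
Derivative: h(x) = 2x
Initialize x(0) = 4, m(0) = 0, v(0) = 0, β1 = 0.9, β2 = 0.999, ε = 10^-8, η = 0.001
Iteration ①: m(1) = 0.1 × 8 = 0.8, v(1) = 0.001 × 8^2 = 0.064, m_hat(1) = 0.8 / (1 - 0.9) = 8, v_hat(1) = 0.064 / (1 - 0.999) = 64, x(1) = 4 - 0.001 / sqrt(64 + 10^-8) × 8 ≈ 3.999
Iteration ②: m(2) = 0.9 × 0.8 + 0.1 × 7.998 = 1.5198, v(2) = 0.999 × 0.064 + 0.001 × 7.998^2 ≈ 0.127904, m_hat(2) = 1.5198 / (1 - 0.9^2) ≈ 7.999, v_hat(2) = 0.127904 / (1 - 0.999^2) ≈ 63.984, x(2) = 3.999 - 0.001 / sqrt(63.984 + 10^-8) × 7.999 ≈ 3.998
The solution process is shown in the figure below:
The corresponding code is: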
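```python
from matplotlib import pyplot as plt
import numpy as np

def f(x):
    return x**2

def h(x):
    # Derivative of f
    return 2*x

# Plot the curve y = x^2
fig, ax = plt.subplots()
xs = np.arange(-4, 4, 0.025)
ax.plot(xs, f(xs))
ax.set_title("y = x^2")

x = 4        # starting point x(0)
m = 0        # first-moment estimate (average gradient)
v = 0        # second-moment estimate (average squared gradient)
β1 = 0.9
β2 = 0.999
η = 0.001    # learning rate
ε = 1e-8     # 10^-8, as in the text (10e-8 would be 10^-7)
X = []
Y = []

for iters in range(1, 13):
    X.append(x)
    Y.append(f(x))
    m = β1*m + (1-β1)*h(x)
    v = β2*v + (1-β2)*h(x)**2
    # Bias correction for the zero initialization of m and v
    m_hat = m/(1-β1**iters)
    v_hat = v/(1-β2**iters)
    # ε sits inside the square root here; the Adam paper adds it outside
    x = x - η/np.sqrt(v_hat + ε)*m_hat
    print(iters, x)

# Mark and label each iterate on the curve
ax.plot(X, Y, "ro")
for xi, yi in zip(X, Y):
    ax.text(xi, yi, "({:.3f},{:.3f})".format(xi, yi), color='red')
plt.show()
```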
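To compare the three adaptive methods side by side, the sketch below reruns the Adagrad, RMSprop and Adam examples on y = x^2 for 12 iterations without plotting. The function names and the packaging into helpers are assumptions made here; the hyperparameters are the ones listed in each example, so the learning rates differ between methods and the comparison says nothing about which method is better in general.

```python
import math

def h(x):
    """Derivative of y = x**2."""
    return 2 * x

def adagrad(x=4.0, eta=0.25, eps=0.1, steps=12):
    sum_sq = 0.0
    for _ in range(steps):
        sum_sq += h(x) ** 2
        x -= eta / math.sqrt(sum_sq + eps) * h(x)
    return x

def rmsprop(x=4.0, g=1.0, rho=0.9, eta=0.01, eps=1e-10, steps=12):
    for _ in range(steps):
        g = rho * g + (1 - rho) * h(x) ** 2
        x -= eta / math.sqrt(g + eps) * h(x)
    return x

def adam(x=4.0, beta1=0.9, beta2=0.999, eta=0.001, eps=1e-8, steps=12):
    m = v = 0.0
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * h(x)
        v = beta2 * v + (1 - beta2) * h(x) ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= eta / math.sqrt(v_hat + eps) * m_hat
    return x

x_adagrad, x_rmsprop, x_adam = adagrad(), rmsprop(), adam()
print(x_adagrad, x_rmsprop, x_adam)
```

With these settings all three iterates move from 4 toward the minimum at 0, and the distance covered in 12 steps mainly reflects the learning rate chosen for each example (0.25, 0.01 and 0.001).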