Machine Learning Introduction 05 - Gradient Descent

1. Introduction to Gradient Descent

1.1 The Gradient

In multivariable differential calculus, we encounter the concept of the gradient.

To review: what exactly is a gradient?

The gradient is a vector indicating that the directional derivative of a function at a given point attains its maximum value along the direction of this vector; in other words, at that point the function changes fastest along this direction (the direction of the gradient), and the maximum rate of change is the magnitude of the gradient.

This is the explanation given by Baidu Encyclopedia.

In fact, the gradient of a function is simply the vector formed by its partial derivatives. Take the two-variable function f(x, y) as an example: let f_x and f_y denote the partial derivatives of f with respect to x and y. Then the gradient of f at (x, y) is grad f(x, y) = (f_x(x, y), f_y(x, y)).
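As a quick worked example (added here for illustration, not from the original post): for f(x, y) = x^2 + y^2 we have f_x = 2x and f_y = 2y, so grad f(x, y) = (2x, 2y). At the point (1, 2) the gradient is (2, 4), which points straight away from the origin, the direction in which f increases fastest.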

So what does the concept of the gradient have to do with machine learning?

Intuitively, for a loss function, the direction of the gradient points in the direction in which the error grows fastest, and the magnitude of the gradient is the rate at which the error increases at that point. Therefore, to find the point where the loss is smallest, we look for the point where the gradient is 0 (or as close to 0 as possible). This is what the concept of the gradient means for machine learning, and the idea just described is the prototype of the gradient descent method.

1.2 Gradient Descent

The gradient descent method is essentially an iterative algorithm: starting from an arbitrary point, we move a small distance in the direction opposite to the gradient at each step, until the gradient is (close to) 0.

This is similar to iterative methods for solving systems of linear equations: starting from an arbitrary vector, we apply an iterative formula repeatedly until two successive iterates are close enough to each other.

Taking a quadratic function of one variable as an example, the iterative process of gradient descent, as shown in the figure, is the process of starting from θ_0 and gradually approaching the lowest point.
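As a concrete numerical illustration (my own example, not from the original post): for J(θ) = θ^2 the gradient is 2θ. Starting from θ_0 = 1 with step size η = 0.25, the iterates are θ_1 = 1 − 0.25·2 = 0.5, θ_2 = 0.25, θ_3 = 0.125, and so on; each step halves θ, so the sequence converges to the minimum at θ = 0.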

 

 

Below, linear regression is used as an example to explain the gradient descent process. The derivation, taken from https://blog.csdn.net/neuf_soleil/article/details/82285497 , is reproduced at the end of this post.

In addition, the following blog posts are also worth reading:

https://www.jianshu.com/p/424b7b70df7b

https://blog.csdn.net/walilk/article/details/50978864

https://blog.csdn.net/pengchengliu/article/details/80932232

There is also this Zhihu Q&A:

https://www.zhihu.com/question/305638940

2. Implementing Gradient Descent

Having gone through the complicated mathematical derivation, let's now look directly at the code implementation of the gradient descent method in Python.

2.1 Computing Derivatives in Python

There are two common ways to compute derivatives in Python: one is the derivative function from the SciPy library, and the other is the diff function from the SymPy library.

1. SciPy

scipy.misc.derivative(func, x0, dx=1.0, n=1, args=(), order=3)

This finds the n-th derivative of a function at a point: given a function, it uses a central difference formula with spacing dx to compute the n-th derivative at x0.
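As a rough note on the formula (my own addition, not from the original post): with the defaults n = 1 and order = 3, the centered difference amounts to f'(x0) ≈ (f(x0 + dx) − f(x0 − dx)) / (2·dx).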

Parameters:

func: the function to differentiate; pass only the function name, without parentheses, otherwise an error is raised

x0: the point at which to take the derivative, a float

dx (optional): the spacing, which should be a small number, a float

n (optional): the order of the derivative; the default is 1, an int

args (optional): a tuple of extra arguments passed to func

order (optional): the number of points to use, which must be odd, an int

from scipy.misc import derivative

def f(x):
    return x**3 + x**2

# first derivative of f at x0 = 1.0
derivative(f, 1.0, dx=1e-6)
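Since f'(x) = 3x^2 + 2x, the call above should return a value very close to 5.0. (A usage note added here: scipy.misc.derivative is deprecated in newer SciPy releases, so running this snippet may require an older SciPy version.)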

2. SymPy expression differentiation

SymPy is a symbolic computation library that can differentiate expressions. "Symbolic" means that mathematical formulas are output in the form of readable symbols. A few examples below will make this clear.

from sympy import *

# declare a symbolic variable
x = Symbol('x')
func = 1 / (1 + x**2)

print("x:", type(x))
print(func)
print(diff(func, x))                     # symbolic derivative
print(diff(func, x).subs(x, 3))          # derivative at x = 3 (exact value)
print(diff(func, x).subs(x, 3).evalf())  # numeric value
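For reference (my own note on the expected output): diff(func, x) prints -2*x/(x**2 + 1)**2, the substitution at x = 3 gives the exact value -3/50, and evalf() turns that into the decimal -0.06.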
For more details, see: https://www.cnblogs.com/zyg123/

2.2 Implementing the Gradient Descent Method

First we need to define a point θ as the initial value. Normally this should be a random point, but here we simply set it to 0. Then we need to define the learning rate η, that is, the step size of each descent. The point θ then moves, at each step, a distance of η times the gradient in the direction opposite to the gradient, i.e. θ = θ − η·∇J(θ), and this descent step is repeated in a loop.

That leaves one more question: how do we end the loop? The goal of gradient descent is to find a point where the value of the loss function is smallest. Since the loss keeps decreasing during the descent, the loss value at each new point keeps getting smaller, but the difference between successive values also becomes smaller and smaller. We can therefore choose a very small number as a threshold: once the decrease in the loss function becomes smaller than the threshold, we consider the minimum to have been found.
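The implementation below refers to a loss function lossFunction, its derivative dLF, the plotting arrays plot_x and plot_y, and a list theta_history; in the original post these were defined in code screenshots that are not reproduced here. A minimal sketch of plausible definitions (my own assumptions, using a simple one-parameter quadratic loss, not necessarily the original author's exact code):

import numpy as np
import matplotlib.pyplot as plt

# an assumed quadratic loss in a single parameter theta, for illustration only
plot_x = np.linspace(-1., 6., 141)   # range of theta values to plot

def lossFunction(theta):
    return (theta - 2.5) ** 2 - 1.

def dLF(theta):
    # derivative of the loss function above
    return 2 * (theta - 2.5)

plot_y = lossFunction(plot_x)
theta_history = []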

 

 

 

def gradient_descent(initial_theta, eta, epsilon=1e-6):
    theta = initial_theta
    theta_history.append(theta)
    while True:
        # compute the gradient at the current point
        gradient = dLF(theta)
        last_theta = theta
        # move the point a step of size eta in the direction opposite to the gradient
        theta = theta - eta * gradient
        theta_history.append(theta)
        # check whether theta has reached the position of the minimum of the loss function
        if abs(lossFunction(theta) - lossFunction(last_theta)) < epsilon:
            break

def plot_theta_history():
    plt.plot(plot_x, plot_y)
    plt.plot(np.array(theta_history), lossFunction(np.array(theta_history)), color='red', marker='o')
    plt.show()

 

That is the Python implementation of the gradient descent method.
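To actually run it, a minimal usage sketch (assuming the helper definitions sketched above; this is not code from the original post):

theta_history = []
gradient_descent(initial_theta=0., eta=0.1)
plot_theta_history()
print("number of descent steps:", len(theta_history) - 1)
print("theta found:", theta_history[-1])   # close to 2.5 for the assumed loss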

 

First, the linear regression model is

h_\theta(x) = \theta_0 + \theta_1 x

Suppose the data we want to fit consists of m groups. According to the least squares method, what we actually need to find is the value of \theta that minimizes

\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2

where the superscript (i) indicates the i-th group of data. Define the loss function with respect to \theta as

J(\theta) = \frac{1}{2} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2

Here \theta can be viewed as a parameter vector, namely \{\theta_0, \theta_1\}^T. The factor 1/2 in front of the sum is there to simplify the later calculations.

According to the concept of the gradient introduced earlier, we obtain

\nabla J(\theta) = \left( \frac{\partial J(\theta)}{\partial \theta_0}, \frac{\partial J(\theta)}{\partial \theta_1} \right)

In other words, to bring the loss function to a local minimum, we only need to iterate along the direction opposite to this vector.

How much should the parameter values change in each iteration? We usually denote this amount by \alpha and call it the "step size"; its value has to be set by hand. Clearly, if the step size is too small, the iteration becomes slow, while if it is too large, the descent may take a detour or accidentally jump over the optimal solution. We should therefore set \alpha reasonably according to the actual situation.

So, in each iteration, we let

\theta_0 = \theta_0 - \alpha \frac{\partial J(\theta)}{\partial \theta_0}, \quad \theta_1 = \theta_1 - \alpha \frac{\partial J(\theta)}{\partial \theta_1}

and the loss function will eventually converge to a local minimum, giving us the parameter values we want. This process is shown in the figure in the original post.

(Copyright notice: the passage above is an original article by the CSDN blogger "Evan-Nightly", licensed under CC 4.0 BY-SA; please include the original link and this notice when reposting. Original link: https://blog.csdn.net/neuf_soleil/article/details/82285497)
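To tie the quoted derivation back to code, here is a minimal sketch of batch gradient descent for the model h_\theta(x) = \theta_0 + \theta_1 x (my own illustration with made-up data, not code from either post):

import numpy as np

# made-up data generated from theta0 = 3, theta1 = 2
x = np.array([1., 2., 3., 4., 5.])
y = 3. + 2. * x

theta0, theta1 = 0., 0.
alpha = 0.01                      # step size
for _ in range(10000):
    h = theta0 + theta1 * x       # current predictions h_theta(x)
    # partial derivatives of J(theta) = 1/2 * sum((h - y)^2)
    grad0 = np.sum(h - y)
    grad1 = np.sum((h - y) * x)
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

print(theta0, theta1)             # converges to roughly (3.0, 2.0)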
