On the Gradient Descent Method

Gradient descent is one of the more fundamental and important minimization algorithms in machine learning. The gradient descent algorithm is as follows:

1) Choose a random initial value x0.

2) Iterate x_{k+1} = x_k - α·∇f(x_k), k = 0, 1, 2, ..., where α is the learning rate, until the iterates converge.
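
As a minimal sketch of these two steps (assuming a user-supplied gradient function grad_f; the learning rate alpha, the tolerance, and the iteration cap are illustrative choices, not values from the original post), gradient descent can be written in Python roughly as follows:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a differentiable function, given its gradient grad_f, from start x0."""
    x = np.asarray(x0, dtype=float)      # 1) random (or user-chosen) initial value
    for _ in range(max_iter):
        step = alpha * grad_f(x)         # 2) move a small step against the gradient
        x = x - step
        if np.linalg.norm(step) < tol:   # stop when the update becomes negligible
            break
    return x

# Example: minimize f(x, y) = x^2 + y^2, whose gradient is (2x, 2y)
print(gradient_descent(lambda p: 2 * p, x0=[3.0, -4.0]))   # approaches [0, 0]
```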

Here I will briefly describe my understanding of the gradient descent method.

First, define the gradient vector: the gradient of a function f of n variables is the vector of its n partial derivatives. For example, the gradient of a three-variable function f is (fx, fy, fz), the gradient of a two-variable function f is (fx, fy), and the gradient of a one-variable function f is simply fx. (For instance, the gradient of f(x, y) = x^2 + y^2 is (2x, 2y).) The key points are that the gradient direction is the direction in which f grows fastest, and the opposite of the gradient direction is the direction in which f decreases fastest.

Let us use a function of one variable as an example to introduce gradient descent.

[Figure: graph of f(x) with the initial value x0 marked to the right of the minimum]

The figure shows the graph of the function f and the initial value x0. We want to find the minimum of f. Since moving a small step in the negative gradient direction decreases the value of f, we simply move x0 a small step along the negative gradient direction.

 

The derivative of f at the point x0 is greater than 0, so the gradient of f at x0 is positive, i.e. the gradient direction is that of f'(x0), pointing to the right. Applying the gradient descent iteration, the next iterate is

x1 = x0 - α·f'(x0),

that is, x0 moves a small step to the left to reach x1. By the same reasoning, the derivative at x1 is also greater than zero, so the next iteration moves x1 a small step to the left to reach x2, and so on. As long as the step taken at each move is not too large, the iterates converge to the minimum.
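
For instance (an illustrative function of my own choosing, not the one in the original figure), take f(x) = (x - 1)^2, whose minimum is at x = 1. Starting from x0 = 3, where f'(x0) > 0, each update moves the point a small step to the left:

```python
f_prime = lambda x: 2 * (x - 1)   # derivative of f(x) = (x - 1)^2
x, alpha = 3.0, 0.1               # initial value chosen to the right of the minimum
for k in range(5):
    x = x - alpha * f_prime(x)    # x1 = 2.6, x2 = 2.28, x3 = 2.024, ...
    print(k + 1, x)
# with enough steps the sequence converges to the minimizer x = 1
```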

The verification shown below confirms our analysis (the blue italic text).

[Figure: verification of the iterates approaching the minimum]

Similarly, if the initial value is chosen to the left of the minimum, as shown:

[Figure: graph of f(x) with the initial value x0 to the left of the minimum]

Because f'(x0) < 0, the gradient direction is negative and the negative gradient direction is positive, so x0 needs to move a small step in the negative gradient direction, i.e. a small step to the right, which makes the value of f smaller. Still using the gradient descent iteration formula

x_{k+1} = x_k - α·f'(x_k),

we obtain x1, x2, ..., xk, ... in turn, as shown, until they converge to the minimum.

For a function of two variables, we can check the soundness of the gradient descent method with an example:

Each time we obtain a point (xk, yk), we compute the gradient (fx(xk, yk), fy(xk, yk)). The gradient is the direction in which f grows fastest, so -(fx(xk, yk), fy(xk, yk)) is the direction in which f decreases fastest. Simply moving (xk, yk) a small step along -(fx(xk, yk), fy(xk, yk)) reduces the value of f, and repeating this until convergence reaches a minimum, as shown in the figure.
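
As a small two-variable sketch (using the illustrative function f(x, y) = x^2 + 2y^2 with gradient (2x, 4y); the function in the original example is not given), each step subtracts a small multiple of the gradient evaluated at the current point (xk, yk):

```python
def grad(x, y):
    return 2 * x, 4 * y                      # (fx, fy) for f(x, y) = x^2 + 2*y^2

x, y, alpha = 3.0, -2.0, 0.1
for _ in range(200):
    gx, gy = grad(x, y)                      # gradient at the current point (xk, yk)
    x, y = x - alpha * gx, y - alpha * gy    # step along -(fx, fy)
print(x, y)                                  # both coordinates approach the minimizer (0, 0)
```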

 

A few points about gradient descent that are worth noting, based on my understanding of the method:

1) Gradient descent does not necessarily converge to the global minimum.

Gradient descent converges to a local minimum, which is not necessarily the global minimum.

For example:

[Figure: a non-convex function with both a local and a global minimum, with the initial value x0 to the right and f'(x0) > 0]

We chose the initial value x0 shown in the figure. Since the derivative of f at x0 is greater than 0, the gradient direction points to the right and the negative gradient direction points to the left, so x0 moves to the left and gradually converges to a local minimum; it cannot converge to the global minimum.
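
A small sketch of this behaviour (with an illustrative non-convex function f(x) = x^4 - 3x^2 + x; the function in the figure is not specified): started to the right of the central hump, where f'(x0) > 0, the iterates drift left only as far as the nearby local minimum and never reach the global one.

```python
f_prime = lambda x: 4 * x**3 - 6 * x + 1   # derivative of f(x) = x^4 - 3x^2 + x

def descend(x, alpha=0.01, steps=5000):
    for _ in range(steps):
        x = x - alpha * f_prime(x)
    return x

print(descend(2.0))    # stops at the local minimum near x ≈ 1.13
print(descend(-2.0))   # a start on the other side finds the global minimum near x ≈ -1.30
```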

2) The learning rate should be moderate.

If the learning rate is too small, each step is too small and convergence is too slow; this is easy to understand.

If the learning rate is too large, each step is large and the iteration may fail to converge. A figure illustrates this:

[Figure: with a large learning rate, the iterates overshoot the minimum and swing farther away at each step]

The farther a point is from the minimum, the larger the derivative, so the step size grows larger and larger and the iteration does not converge.
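
A quick numerical sketch of this effect (illustrative function f(x) = x^2; the exact threshold depends on the function): with a small learning rate the iterates shrink toward the minimum, while with a rate that is too large each step overshoots farther than the last and the sequence blows up.

```python
f_prime = lambda x: 2 * x         # derivative of f(x) = x^2

def iterate(alpha, x=5.0, steps=10):
    xs = [x]
    for _ in range(steps):
        x = x - alpha * f_prime(x)
        xs.append(round(x, 3))
    return xs

print(iterate(0.1))   # |x| shrinks by a factor of 0.8 each step: converges
print(iterate(1.1))   # |x| grows by a factor of 1.2 each step, oscillating: diverges
```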

3) The direction need not be the negative gradient; any direction along which the function value decreases will do.

When choosing a direction at each iteration, we only need to pick the opposite of any vector whose angle with the gradient direction is less than 90 degrees; it does not have to be the negative gradient direction. However, since vectors satisfying this condition are not that easy to compute, we choose the opposite of the vector at 0 degrees to the gradient direction (the negative gradient direction). Moreover, the function value decreases fastest along this direction, giving faster convergence, so it is a good choice.
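
A small numerical check of this point (illustrative function f(x, y) = x^2 + y^2): any direction whose dot product with the gradient is negative, i.e. the opposite of a vector within 90 degrees of the gradient, lowers f for a small enough step.

```python
import numpy as np

f = lambda p: p[0]**2 + p[1]**2
grad = lambda p: np.array([2 * p[0], 2 * p[1]])

p = np.array([3.0, 4.0])
g = grad(p)
step = 0.1
for d in [-g, np.array([-1.0, 0.0]), np.array([0.0, -1.0])]:
    assert d @ g < 0                  # each is a descent direction: negative dot product with the gradient
    print(d, f(p + step * d) - f(p))  # the change in f is negative along every such direction
```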

4) Gradient ascent for finding a maximum.

The gradient direction of f is the direction in which the value of f increases fastest. Just as moving a small step along the negative gradient direction at each iteration gradually converges to a local minimum, moving a small step along the gradient direction at each iteration converges to a local maximum of f. The iteration formula is:

x_{k+1} = x_k + α·∇f(x_k)
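
A minimal sketch of gradient ascent (illustrative concave function f(x) = -(x - 2)^2, maximized at x = 2): the only change from gradient descent is the sign of the update.

```python
f_prime = lambda x: -2 * (x - 2)   # derivative of f(x) = -(x - 2)^2

x, alpha = 0.0, 0.1
for _ in range(100):
    x = x + alpha * f_prime(x)     # move along the gradient instead of against it
print(x)                           # approaches the maximizer x = 2
```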

 


Original article: https://blog.csdn.net/zhulf0804/article/details/52250220
